Neuchips Tapes Out Recommendation Accelerator for World-Beating Accuracy


Taiwanese startup Neuchips has taped out its AI accelerator designed specifically for data center recommendation models. Emulation of the chip suggests it will be the only solution on the market to achieve one million DLRM inferences per joule of energy (or 20 million inferences per second per 20-watt chip). The company has already demonstrated that its software can achieve world-beating INT8 DLRM accuracy at 99.97% of FP32 accuracy.
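The two headline figures are the same claim in different units, since one watt is one joule per second. A small sketch of the conversion follows; the function names are chosen purely for illustration and the inputs are the figures quoted above.

```python
def inferences_per_second(inferences_per_joule: float, power_watts: float) -> float:
    """Throughput implied by an energy efficiency at a given power draw (1 W = 1 J/s)."""
    return inferences_per_joule * power_watts


def joules_per_inference(inferences_per_joule: float) -> float:
    """Energy spent on a single inference."""
    return 1.0 / inferences_per_joule


throughput = inferences_per_second(1_000_000, 20)   # figures quoted in the article
energy = joules_per_inference(1_000_000)
print(f"{throughput:,.0f} inferences/s, {energy * 1e6:.1f} microjoules per inference")
```

At one million inferences per joule, each inference costs one microjoule, so a 20-watt budget works out to 20 million inferences per second.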

Neuchips was founded in response to a call by Facebook (now Meta) in 2019 for the industry to work on hardware acceleration for recommendation inference. The Taiwanese startup set out to do exactly this, and the company is one of only two startup entrants specifically targeting recommendation (the other is Esperanto with its 1000-core RISC-V design).

Neuchips CEO Youn-Long Lin (Source: Neuchips)

“According to many reports, most of the AI inference cycles in the data center are actually for recommendation models, not vision or language… so we think recommendation is an important market,” Neuchips CEO Youn-Long Lin told EE Times, adding that the number of recommendation inferences required is growing steadily. “The power consumption is fixed, so the essential issue is that we have to do as much as possible within an energy budget in order to increase prediction accuracy.”

Prediction accuracy matters a great deal in recommendation applications such as online shopping, where any loss in accuracy means a corresponding loss in revenue for the platform.

DLRM (deep learning recommendation model), Meta’s open-source recommendation model, has quite different characteristics from the CNNs widely used for computer vision. Dense features, those with continuous values such as customer age or income, are processed by a multilayer perceptron (MLP, a type of neural network), while sparse features (categorical data, often encoded as yes-or-no indicators) are looked up in embedding tables. There may be many hundreds of features or more, and embedding tables can be gigabytes in size. Interactions between these features indicate the relationship between products and users for online shopping platforms. These interactions are computed explicitly (DLRM uses a dot product), and the results then pass through another neural network.

Structure of the DLRM recommendation network. Neural networks are marked in orange, embedding tables in purple and dot product in green (Source: Meta)
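The data flow described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than Meta's reference code: it assumes one lookup per sparse feature (production models typically pool several lookups per table), uses random weights, and every size and name below is made up for the example.

```python
import numpy as np


def mlp(x, layers):
    """Apply a stack of (weight, bias) layers, each followed by ReLU."""
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x


def dlrm_forward(dense, sparse_ids, bottom, tables, top):
    d = mlp(dense, bottom)                                  # dense features -> (batch, dim)
    embs = [t[ids] for t, ids in zip(tables, sparse_ids)]   # one embedding row per sparse feature
    feats = np.stack([d] + embs, axis=1)                    # (batch, n_feat, dim)
    inter = feats @ feats.transpose(0, 2, 1)                # all pairwise dot products
    rows, cols = np.triu_indices(feats.shape[1], k=1)       # keep each distinct pair once
    z = np.concatenate([d, inter[:, rows, cols]], axis=1)   # top-network input
    hidden = mlp(z, top[:-1])
    w, b = top[-1]
    return 1.0 / (1.0 + np.exp(-(hidden @ w + b)))          # predicted click probability


rng = np.random.default_rng(0)
dim, n_dense, n_tables, n_rows, batch = 8, 4, 3, 100, 5


def init(sizes):
    return [(rng.normal(0, 0.1, (i, o)), np.zeros(o)) for i, o in zip(sizes[:-1], sizes[1:])]


n_feat = n_tables + 1
n_pairs = n_feat * (n_feat - 1) // 2
bottom = init([n_dense, 16, dim])
top = init([dim + n_pairs, 16, 1])
tables = [rng.normal(0, 0.1, (n_rows, dim)) for _ in range(n_tables)]

dense = rng.normal(size=(batch, n_dense))
sparse_ids = [rng.integers(0, n_rows, size=batch) for _ in range(n_tables)]
probs = dlrm_forward(dense, sparse_ids, bottom, tables, top)
print(probs.shape)
```

Even in this toy version the three kinds of work are visible: matrix multiplies in the two networks, row gathers from the tables, and a batch of small dot products in between.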

While neural network computation may be compute-bound, the other operations required for DLRM may be bound by memory capacity, memory bandwidth, or communication. This makes DLRM a very hard model to accelerate with general-purpose AI accelerators, including those developed for applications such as image processing.
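One rough way to see the difference is arithmetic intensity, the number of arithmetic operations performed per byte moved from memory. The sketch below is a back-of-envelope estimate under simplifying assumptions (FP32 values, every operand read or written exactly once, no caching), and the layer sizes are illustrative rather than taken from any real model.

```python
def dense_layer_intensity(batch, n_in, n_out, bytes_per_value=4):
    """FLOPs per byte for a fully connected layer (one multiply and one add per weight use)."""
    flops = 2 * batch * n_in * n_out
    traffic = bytes_per_value * (batch * n_in + n_in * n_out + batch * n_out)
    return flops / traffic


def embedding_sum_intensity(lookups, dim, bytes_per_value=4):
    """FLOPs per byte for fetching `lookups` embedding rows and summing them."""
    flops = lookups * dim                       # one add per fetched element
    traffic = bytes_per_value * lookups * dim   # every fetched element comes from memory
    return flops / traffic


print(dense_layer_intensity(batch=128, n_in=1024, n_out=1024))
print(embedding_sum_intensity(lookups=80, dim=64))
```

Under these assumptions the dense layer does tens of operations for every byte it moves, while the embedding lookup does a fraction of one, which is why the lookup side tends to be limited by memory rather than by arithmetic.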

Neuchips’ ASIC solution, RecAccel, includes specially designed engines to accelerate embeddings (marked purple in the diagram below), matrix multiplication (orange) and feature interaction (green).

Neuchips’ recommendation inference accelerator chip includes hardware engines designed for the key parts of the recommendation workload (Source: Neuchips)

“In the embedding engine, mostly the issue is to look up multiple tables simultaneously and very fast,” Lin said. “Recommendation model sizes vary a lot: some are very small, some are very large. The important issue is how to allocate tables to both off-chip and on-chip memory appropriately.”

Neuchips’ embedding engine reduces access to off-chip memory by 50% and increases bandwidth utilization by 30%, the company said, via a novel cache design and DRAM traffic optimization techniques.
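Neuchips has not described how its allocation works, but the placement problem Lin outlines can be illustrated with a simple greedy heuristic: keep on-chip the tables that save the most off-chip lookups per byte of on-chip memory. Everything in the sketch below, including the table names, sizes, lookup rates and the heuristic itself, is a hypothetical example and not the company's method.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    size_bytes: int
    lookups_per_inference: float


def place_tables(tables, on_chip_bytes):
    """Greedily fill on-chip memory with the tables that have the most lookups per byte."""
    on_chip, off_chip = [], []
    remaining = on_chip_bytes
    by_density = sorted(
        tables, key=lambda t: t.lookups_per_inference / t.size_bytes, reverse=True
    )
    for t in by_density:
        if t.size_bytes <= remaining:
            on_chip.append(t.name)
            remaining -= t.size_bytes
        else:
            off_chip.append(t.name)
    return on_chip, off_chip


# Hypothetical model: two small tables and two very large ones.
tables = [
    Table("country", 64_000, 1),
    Table("category", 512_000, 4),
    Table("user_id", 4_000_000_000, 1),
    Table("item_id", 1_000_000_000, 20),
]
on_chip, off_chip = place_tables(tables, on_chip_bytes=1_000_000)
print("on-chip:", on_chip)
print("off-chip:", off_chip)
```

A real allocator would also have to weigh caching part of a large table, since a small share of its rows often accounts for most of the lookups.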

Different recommendation models use different operations for feature interaction — DLRM uses dot product, but there are others. Lin said Neuchips’ feature interaction engine supports this kind of flexibility.

The chip has 10 compute engines with 16K multiply-accumulate (MAC) units per engine.

“The important issue here is how to implement this compute engine with low power consumption and so it can handle sparse matrices efficiently,” Lin said. The compute engines consume 1 microjoule per inference at the SoC level.

Lin added that hardware features can also terminate computation when a certain level of accuracy is reached, to save power.

Software stack

Neuchips already has a complete software stack up and running, including compiler, runtime, and toolchain, as evidenced by two successful MLPerf submissions.

The SDK supports both splitting big models across multiple chips/cards and running multiple smaller inferences per chip (Lin said that Meta has several hundred DLRM models in production with vastly different sizes and characteristics).

Neuchips’ software development kit (SDK) includes compiler, runtime and toolchain and has already been demonstrated successfully in previous MLPerf rounds (Source: Neuchips)

Neuchips’ secret weapon is the new 8-bit number format it invented, and patented, called flexible floating point or FFP8.

“[FFP8] means our circuit can be more adaptive to the model, and that’s how we achieve high accuracy,” Lin said. “The training part is always in 32-bit, and you can use 32-bit to inference, if you don’t care about the energy consumption, but with 8-bit, the energy consumption is one-sixteenth… The problem is the trade-off between how much accuracy loss you are willing to suffer to gain the computing efficiency.”

Companies such as Nvidia and Tesla are moving towards 8-bit floating point formats where possible, pointing towards a consensus on 8-bit computation for inference, Lin said. Neuchips’ FFP8 is a superset of these formats, with configurable exponent and mantissa widths. There is also an unsigned version which uses the extra bit to increase accuracy of stored activations after ReLU operations.
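The idea of an 8-bit float with adjustable field widths can be illustrated with a generic "minifloat" rounding routine. The sketch below is not Neuchips' patented FFP8 definition, which the article does not spell out. It assumes an IEEE-style exponent bias, subnormal numbers, saturation at the largest value, and no bit patterns reserved for infinity or NaN; the function name is likewise invented for the example.

```python
import numpy as np


def quantize_minifloat(x, exp_bits, man_bits, signed=True):
    """Round x to the nearest value of a hypothetical 8-bit float format."""
    assert exp_bits >= 1 and man_bits >= 0
    assert exp_bits + man_bits + (1 if signed else 0) == 8
    x = np.asarray(x, dtype=np.float64)
    if signed:
        sign = np.sign(x)
    else:
        sign = 1.0
        x = np.maximum(x, 0.0)                 # no sign bit: negative inputs clamp to zero
    mag = np.abs(x)

    bias = 2 ** (exp_bits - 1) - 1             # IEEE-style bias (assumption)
    min_exp = 1 - bias                         # exponent of the smallest normal number
    max_exp = (2 ** exp_bits - 1) - bias       # all exponent codes hold finite values (assumption)
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp

    exp = np.frexp(mag)[1] - 1                 # floor(log2(mag)) for nonzero mag
    exp = np.clip(exp, min_exp, max_exp)       # below min_exp, fall back to subnormal spacing
    step = 2.0 ** (exp - man_bits)             # gap between neighbouring representable values
    return sign * np.minimum(np.round(mag / step) * step, max_val)


# Post-ReLU activations are never negative, so the sign bit can become a mantissa bit.
rng = np.random.default_rng(0)
activations = np.maximum(rng.normal(size=10_000), 0.0)
err_signed = np.mean(np.abs(activations - quantize_minifloat(activations, 4, 3)))
err_unsigned = np.mean(
    np.abs(activations - quantize_minifloat(activations, 4, 4, signed=False))
)
print(err_signed, err_unsigned)
```

With the same exponent width, the unsigned variant has one more mantissa bit and therefore half the spacing between representable values, so its rounding error on non-negative data can only be equal or smaller.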

Neuchips’ calibrator block (part of the compiler) “defines the quantization and representation format according to model and data characteristics,” said Lin. This calibrator was able to achieve what Neuchips says is the world’s best DLRM accuracy at INT8 — 99.97% of the accuracy of an FP32 version of the model. Use calibration in combination with FFP8 (to determine the exact format used for different parts of the model), and accuracy improves to 99.996%, close to what can be achieved with bigger formats like BF16.
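Neuchips has not published how its calibrator works. As a general illustration of what calibration means for INT8, the sketch below picks a quantization scale by trying several clipping thresholds on sample data and keeping the one with the lowest mean squared error. This is a common textbook approach, assumed here for illustration; the function names, the candidate grid and the synthetic data are all hypothetical.

```python
import numpy as np


def quantize_int8(x, scale):
    """Symmetric INT8 quantization: round to a multiple of scale, clipped to [-127, 127]."""
    return np.clip(np.round(x / scale), -127, 127) * scale


def calibrate_scale(samples, candidates=64):
    """Pick the scale whose clipping threshold gives the lowest mean squared error."""
    max_abs = float(np.max(np.abs(samples)))
    if max_abs == 0.0:
        return 1.0, 0.0                          # all-zero data quantizes exactly
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.1, 1.0, candidates):
        scale = frac * max_abs / 127             # clip at a fraction of the observed maximum
        err = float(np.mean((samples - quantize_int8(samples, scale)) ** 2))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


# Synthetic heavy-tailed data: a few outliers stretch the naive max-based scale.
rng = np.random.default_rng(0)
samples = rng.standard_t(df=3, size=10_000)
naive_scale = float(np.max(np.abs(samples))) / 127
naive_err = float(np.mean((samples - quantize_int8(samples, naive_scale)) ** 2))
best_scale, best_err = calibrate_scale(samples)
print(naive_err, best_err)
```

Because the search includes the naive max-based scale as its last candidate, the calibrated error can never be worse than the naive one. A per-tensor search of the same kind could in principle also choose between number formats, which is the role the article attributes to the calibrator when it is combined with FFP8.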

Neuchips’ FFP8 format has configurable exponent and mantissa widths, and the option to use the sign bit for data to improve accuracy (Source: Neuchips)

Neuchips’ accuracy results for its calibration process, and for calibration plus FFP8 format, normalized to FP32 accuracy (Source: Neuchips)

Patents filed

Neuchips was founded in 2019 by Lin, a computer science professor at the National Tsing Hua University in Taiwan and previously co-founder and CTO of design services company Global Unichip Corp (now part of TSMC), along with an experienced team from Mediatek, Novatek, Realtek, GUC, and TSMC.

The company employs 38 people in Taiwan, of whom 30 are engineers, including many former students of Lin’s. The company has filed 30 patents so far, and received 8 U.S. and 12 Taiwan patents.

Neuchips’ RecAccel chip has taped out and will be manufactured on TSMC’s 7-nm process, occupying 400 mm². The chip will be available on dual M.2 modules ready to go onto Glacier Point carrier cards (6 modules per Glacier Point) and on PCIe Gen 5 cards. Both cards will begin sampling in Q4 ’22.
