SANTA CLARA, CALIF. — AI chip startup Lemurian Labs invented a new logarithmic number format designed for AI acceleration and is building a chip to take advantage of it for data center AI workloads.
“In 2018, I was training models for robotics, and the models were part convolution, part transformer and part reinforcement learning,” Lemurian CEO Jay Dawani told EE Times. “Training this on 10,000 [Nvidia] V100 GPUs would have taken six months…models have grown exponentially but very few people have the compute to even attempt [training], and a lot of ideas are just getting abandoned. I am trying to build for the everyday ML engineer who has great ideas but is compute-starved.”
Simulations of Lemurian’s first chip, which has yet to tape out, show that the combination of its new number system and specially designed silicon will outperform Nvidia’s H100, judged against the H100’s most recent MLPerf inference benchmark results. In simulation, Lemurian’s chip handles 17.54 inferences per second per chip on the MLPerf version of GPT-J in offline mode, versus 13.07 inferences per second for the Nvidia H100 in the same mode. Dawani said Lemurian’s simulations are likely within 10% of true silicon performance, and that his team intends to squeeze more performance from software going forward. Software optimizations plus sparsity could improve performance a further 3-5×, he said.
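For readers who want the quoted figures as a ratio, the short Python sketch below works through the arithmetic. The two throughput numbers come from the article; applying the 10% simulation margin symmetrically in both directions is an assumption made here for illustration, not something Dawani specified.

```python
# Throughput figures quoted in the article (MLPerf GPT-J, offline mode, per chip).
LEMURIAN_SIMULATED = 17.54  # inferences/s, simulated, chip not yet taped out
NVIDIA_H100 = 13.07         # inferences/s

def speedup(new: float, baseline: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    return new / baseline

ratio = speedup(LEMURIAN_SIMULATED, NVIDIA_H100)

# Assumption: the "within 10%" margin is applied symmetrically to the ratio.
low, high = ratio * 0.9, ratio * 1.1

print(f"simulated speedup over H100: {ratio:.2f}x")
print(f"with a +/-10% simulation margin: {low:.2f}x to {high:.2f}x")
```

On these numbers the simulated advantage is roughly 1.34×, and it stays above 1× even at the pessimistic end of the assumed margin.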
Logarithmic number system
Lemurian’s secret sauce is based on the new number format the company has come up with, which it calls PAL (parallel adaptive logarithms).
“As an industry we started rushing towards 8-bit integer quantization because that’s the most efficient thing we have, from a hardware perspective,” Dawani said. “No software engineer ever said: I want 8-bit integers!”
For today’s large language model inference, INT8’s precision has proved to be insufficient, and the industry has moved towards FP8. But Dawani explained that the nature of AI workloads means numbers are frequently in the subnormal range—the area close to zero, where FP8 can represent fewer numbers and is therefore less precise. FP8’s gap in coverage in the subnormal range is the reason many training schemes require higher precision datatypes like BF16 and FP32.
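The loss of precision near zero can be seen by enumerating an 8-bit float directly. The Python sketch below assumes the E4M3 layout (4 exponent bits, 3 mantissa bits, bias 7, one bit pattern reserved for NaN); the article does not say which FP8 variant Dawani had in mind, so treat the layout as an assumption. It compares the worst relative gap between neighbouring values in the subnormal range with the worst gap in the normal range.

```python
def e4m3_positive_values() -> list[float]:
    """Positive finite values of an 8-bit float with 4 exponent bits,
    3 mantissa bits and bias 7 (E4M3 layout, assumed for illustration)."""
    bias = 7
    values = []
    for exp_field in range(16):
        for mant in range(8):
            if exp_field == 0:
                if mant == 0:
                    continue  # the zero encoding
                # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
                values.append((mant / 8) * 2.0 ** (1 - bias))
            elif exp_field == 15 and mant == 7:
                continue  # this bit pattern encodes NaN in E4M3
            else:
                values.append((1 + mant / 8) * 2.0 ** (exp_field - bias))
    return sorted(values)

def worst_relative_step(points: list[float]) -> float:
    """Largest relative gap between neighbouring representable values."""
    return max((b - a) / a for a, b in zip(points, points[1:]))

values = e4m3_positive_values()
smallest_normal = 2.0 ** -6
subnormals = [v for v in values if v < smallest_normal]
normals = [v for v in values if v >= smallest_normal]

print(len(subnormals), "subnormal values, smallest is", subnormals[0])
# Include the first normal value so the last subnormal gap is counted too.
print("worst relative step, subnormal:", worst_relative_step(subnormals + normals[:1]))
print("worst relative step, normal:   ", worst_relative_step(normals))
```

Under this layout only seven positive values sit below the smallest normal number, and they are evenly spaced, so the relative gap between neighbours grows as values approach zero while it stays bounded in the normal range.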
Dawani’s co-founder, Vassil Dimitrov, came up with the idea of extending the existing logarithmic number system (LNS), used for decades in digital signal processors (DSPs), by using multiple bases and multiple exponents.
“We interleave the representation of multiple exponents to recreate the precision and range of floating point,” Dawani said. “This gives you better coverage…it naturally creates a tapered profile with very high bands of precision where it counts, in the subnormal range.”
This band of precision can be biased to cover the area required, similar to how it works in floating point, but Dawani said it allows for finer grained control over biasing than floating point does.
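The article does not describe PAL's actual encoding, so the following Python sketch illustrates only the general idea of using more than one base. It compares a grid of values of the form 2^a with a grid of the form 2^a × 3^b, using small integer exponents. The bases, exponent ranges and function names are all hypothetical choices made for illustration.

```python
def grid(bases, exponent_ranges):
    """All products of bases[i] ** e_i, for integer e_i in each inclusive range."""
    values = {1.0}
    for base, (lo, hi) in zip(bases, exponent_ranges):
        values = {v * float(base) ** e for v in values for e in range(lo, hi + 1)}
    return sorted(values)

def worst_ratio(values, lo, hi):
    """Largest ratio between neighbouring grid points that fall inside [lo, hi]."""
    inside = [v for v in values if lo <= v <= hi]
    return max(b / a for a, b in zip(inside, inside[1:]))

# Hypothetical exponent ranges, chosen only to keep the example small.
single_base = grid([2], [(-4, 4)])
double_base = grid([2, 3], [(-4, 4), (-2, 2)])

print("single base, worst neighbour ratio:", worst_ratio(single_base, 1 / 16, 16))
print("two bases,   worst neighbour ratio:", worst_ratio(double_base, 1 / 16, 16))
```

With one base, neighbouring values in the interval always differ by a factor of 2. Adding a second base fills in points between the powers of two, so the worst-case gap in the same interval shrinks to a factor of about 1.33 in this toy configuration.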
Lemurian developed PAL formats from PAL2 to PAL64, with a 14-bit format that’s comparable to BF16. PAL8 gets around an extra bit’s worth of precision compared to FP8 and is about 1.2× the size of INT8. Dawani expects other companies to also adopt these formats going forward.
“I want more people to be using this, because I think it’s time we got rid of floating point,” he said. “[PAL] can be applied to any application that floating point is currently used for, from DSP to HPC and in between, not just AI, though that is our current focus. We are more likely to work with other companies building silicon for these applications to help them adopt our format.”
LNS has been used for a long time in DSP workloads where most of the operations are multiplies, since it simplifies multiplication: the product of two numbers represented in LNS is simply the sum of their logarithms. However, adding two LNS numbers is harder. DSPs traditionally used large lookup tables (LUTs) to achieve addition, which, while relatively inefficient, was good enough if most of the operations required were multiplies.
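The asymmetry between multiplication and addition in a textbook LNS can be sketched in a few lines of Python. This is a generic illustration using floating-point logarithms, not Lemurian's PAL format or its adder; the function names are made up for the example. Multiplication is a single addition of exponents, while addition needs the nonlinear term log2(1 ± 2^d), which is the function that DSPs have traditionally stored in lookup tables.

```python
import math

# A value is stored as (sign, log2 of its magnitude). Zero is not representable here.
def to_lns(x: float) -> tuple[int, float]:
    if x == 0:
        raise ValueError("zero has no finite logarithm")
    return (1 if x > 0 else -1, math.log2(abs(x)))

def from_lns(v: tuple[int, float]) -> float:
    sign, exponent = v
    return sign * 2.0 ** exponent

def lns_mul(a, b):
    """Multiply: signs multiply, log-magnitudes simply add."""
    return (a[0] * b[0], a[1] + b[1])

def lns_add(a, b):
    """Add: needs log2(1 + 2^d) or log2(1 - 2^d), with d <= 0."""
    (sa, ea), (sb, eb) = a, b
    if ea < eb:  # make `a` the operand with the larger magnitude
        (sa, ea), (sb, eb) = (sb, eb), (sa, ea)
    d = eb - ea  # difference of log-magnitudes, always <= 0
    if sa == sb:
        return (sa, ea + math.log2(1.0 + 2.0 ** d))
    if d == 0:
        raise ValueError("exact cancellation: result is zero")
    return (sa, ea + math.log2(1.0 - 2.0 ** d))

six, seven = to_lns(6.0), to_lns(7.0)
print(from_lns(lns_mul(six, seven)))          # approximately 42
print(from_lns(lns_add(six, seven)))          # approximately 13
print(from_lns(lns_add(six, to_lns(-7.0))))   # approximately -1
```

The correction term in `lns_add` depends only on the difference `d`, which is why a table indexed by `d` has been the traditional hardware approach.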
For AI workloads, matrix multiplication requires both multiply and accumulate. Part of Lemurian’s secret sauce is that it has “solved logarithmic addition in hardware,” Dawani said.
“We have done away with LUTs entirely and created a purely logarithmic adder,” he said. “We have an exact one that is much more accurate than floating point. We’re still making more optimizations to see if we can make it cheaper and faster. It’s already more than two times better in PPA [power, performance, area] than FP8.”
Lemurian has filed several patents on this adder.
“The DSP world is famous for looking at a workload and understanding what it’s looking for, numerically, and then exploiting that and bringing it to silicon,” he said. “That’s no different from what we’re doing—instead of building an ASIC that just does one thing, we’ve looked at the numerics of the entire neural network space and built a domain-specific architecture that has the right amount of flexibility.”
Implementing the PAL format efficiently requires both hardware and software.
“It took a lot of work trying to think about how to make [the hardware] easier to program, because no architecture is going to fly unless you can make engineer productivity the first thing you accelerate,” Dawani said. “I would rather have a [terrible] hardware architecture and a great software stack than the other way around.”
Lemurian built around 40% of its compiler before it even started thinking about its hardware architecture, he said. Today, Lemurian’s software stack is up and running, and Dawani wants to keep it fully open so users can write their own kernels and fusings.
The stack includes Paladynn, Lemurian’s mixed-precision logarithmic quantizer that can map floating point and integer workloads to PAL formats while retaining accuracy.
“We took a lot of the ideas that existed in neural architecture search and applied them to quantization, because we want to make that part easy,” he said.
While convolutional neural networks are relatively easy to quantize, Dawani said, transformers aren’t—there are outliers in the activation functions that require higher precision, so transformers will likely require more complicated mixed precision approaches overall. However, Dawani said he’s following multiple research efforts, which indicate transformers won’t be around by the time Lemurian’s silicon hits the market.
Future AI workloads could follow the path set by Google’s Gemini and others, which will run for a non-deterministic number of steps. This breaks the assumptions of most hardware and software stacks, he said.
“If you don’t know a priori how many steps your model needs to run, how do you schedule it and how much compute do you need to schedule it on?” he said. “You need something that’s more dynamic in nature, and that influenced a lot of our thinking.”
The chip will be a 300 W data center accelerator with 128 GB of HBM3 offering 3.5 POPS of dense compute (sparsity will come later). Overall, Dawani’s aim is to build a chip with better performance than the H100 and make it price-comparable with Nvidia’s previous-generation A100. Target applications include on-prem AI servers (in any sector) and some tier 2 or specialty cloud companies (not hyperscalers).
The Lemurian team is currently 27 people in the U.S. and Canada and the company recently raised a seed round of $9 million. Dawani aims to tape out Lemurian’s first chip in Q3 ’24, with the first production software stack release coming in Q2 ’24. Today, a virtual dev kit is available for customers who want to “kick the tires,” Dawani said.