SANTA CLARA, CALIF. — Nvidia has doubled large language model (LLM) inference performance on its H100, A100 and L4 GPUs with a new open-source software library called TensorRT-LLM.
As evidenced by benchmark results that improve round after round for the same hardware, software is often as important as the hardware when it comes to squeezing the best possible performance out of specialized AI chips.
“A huge part of what we do is a combination of hardware and software, and today Nvidia has more software engineers than hardware engineers,” Ian Buck, VP and general manager of Nvidia’s hyperscale and HPC computing business, told EE Times. “This is part of a decision going back to the original CUDA and the motivation around delivering not just a chip with an instruction set, but a complete stack to meet developers where they are.
“This offers an opportunity to innovate at all the levels: change the hardware architecture, change the instruction set, change the compilers, change the drivers, change the tools, the libraries, everything, so we can move the whole platform forward,” he said. “That’s played itself out multiple times in the last 20 years of doing accelerated computing, and it’s true for AI inference too.”
TensorRT-LLM is an evolution of Nvidia’s original deep learning software library with optimizations for LLM inference. It’s designed to support H100, but can also be applied to A100 and L4 deployments.
“[In TensorRT-LLM, we] made sure we have the best possible tensor core optimizations for large language models,” Buck said. “This allows people to take any large language model and pass it through TensorRT-LLM to get the benefit of Hopper’s transformer engine, which enables the FP8 compute capabilities of Hopper…but without any loss of accuracy in the production workflow.”
Nvidia’s Hopper architecture introduced the transformer engine, a software library that intelligently manages precision for training and inference workloads for optimal performance. Building the transformer engine required a deep understanding of the mathematics, statistics and data involved, plus a lot of work on Nvidia’s compiler, Buck said. It helps maintain prediction accuracy for models once they reach production, which can be a challenge.
“You can easily just take a 32- or 16-bit calculation and cram it into FP8, but chances are you’re going to get the wrong answer, because it won’t have the production-level accuracy you want,” Buck said. “Doing that thoughtfully and carefully, maintaining scale and bias to keep the calculations in the range of only 8 bits in some cases—keeping FP16 for some parts of the model—this is something Nvidia has been working on for some time.”
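The scale-and-bias idea Buck describes can be sketched in a few lines. The toy code below (a loose illustration, not Nvidia’s implementation; the rounding scheme and function names are invented for this sketch) rescales a tensor so its values fit the narrow dynamic range of an FP8-like format before rounding, then restores the original range afterward—without the rescaling step, large values would overflow and small ones would round to zero:

```python
import numpy as np

# Illustrative sketch only: per-tensor scaling in the spirit of FP8 inference.
# The real transformer engine tracks scales per tensor and per layer and uses
# true FP8 hardware formats; this toy version mimics the effect in float64.

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def quantize_fp8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Scale x into FP8 range, then round to a coarse 8-bit-like grid."""
    scale = FP8_E4M3_MAX / np.abs(x).max()
    scaled = x * scale
    # Crude stand-in for FP8 rounding: keep roughly 3 mantissa bits.
    exp = np.floor(np.log2(np.abs(scaled) + 1e-30))
    step = 2.0 ** (exp - 3)  # grid spacing: 8 levels per power of two
    return np.round(scaled / step) * step, scale

def dequantize_fp8(q: np.ndarray, scale: float) -> np.ndarray:
    """Undo the scaling to recover values in the original range."""
    return q / scale

x = np.array([0.001, -0.5, 3.2, 120.0])  # values spanning a wide range
q, s = quantize_fp8(x)
x_hat = dequantize_fp8(q, s)
rel_err = np.abs(x_hat - x) / np.abs(x)
print(rel_err.max())  # relative error stays small despite only ~8 bits
```

Because the scale is chosen from the tensor’s own maximum, every value lands inside the representable range, which is the “keeping the calculations in the range of only 8 bits” that Buck refers to.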
TensorRT-LLM also includes a new feature called in-flight batching.
Buck explained that LLM workloads, even inference workloads for the same model, are diverse. LLMs started with simpler use cases like sentiment analysis, but today’s LLMs might be answering questions, reading long texts and summarizing them, or generating long or short texts for emails, articles, presentations and more. Data centers serving LLM inference may also offer many different services to many different users.
Compared to existing AI workloads, which are more likely to be similar in size and therefore easy to batch, Buck said LLM queries coming in for the same model can differ by orders of magnitude in terms of size, ranging from those that take milliseconds to complete to those that need a couple of seconds. Models can also be stacked, making things more complicated.
“Our standard batching approaches would always wait for the longest query to complete,” he said. “Image queries all roughly took the same time—that wasn’t a problem from an efficiency standpoint, and queries could be padded out, so it wasn’t a big deal.”
With the new in-flight batching feature, once queries complete, they can retire and the software can insert another query—all while a longer query is still in flight. This helps improve GPU utilization for LLMs with diverse query lengths.
“Frankly, the result surprised even me,” Buck said. “It doubled the performance of Hopper. Hopper is such a powerful GPU, it can handle lots of queries in the same GPU in parallel, but without the in-flight batching, if you gave it diverse queries, it would wait for the longest one and not be fully utilized.”
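The scheduling difference Buck describes can be modeled in a short simulation. The sketch below is a simplified illustration, not TensorRT-LLM’s actual scheduler (which interleaves work at the granularity of generated tokens); slot counts and query durations are invented for the example. Static batching pays the cost of the longest query in every batch, while in-flight batching lets a finished query retire and hands its slot to the next waiting query:

```python
import heapq

# Toy scheduling model comparing static batching with in-flight batching
# on a "GPU" that can run a fixed number of queries concurrently.

def static_batching(durations: list[float], slots: int) -> float:
    """Each batch of `slots` queries waits for its longest member to finish."""
    total = 0.0
    for i in range(0, len(durations), slots):
        total += max(durations[i:i + slots])
    return total

def in_flight_batching(durations: list[float], slots: int) -> float:
    """A finished query retires immediately, freeing its slot for the next."""
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)  # earliest-free slot takes the query
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

# Diverse query lengths: a couple of long requests mixed with many short ones.
durations = [2.0, 0.05, 0.1, 0.05, 1.5, 0.05, 0.1, 0.05]
print(static_batching(durations, slots=4))     # every batch waits on its longest query
print(in_flight_batching(durations, slots=4))  # short queries back-fill freed slots
```

With these made-up numbers, the static scheduler takes 3.5 time units while the in-flight scheduler finishes in 2.0, mirroring the utilization gap Buck describes when query lengths vary widely.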
TensorRT-LLM is open source, as is all of Nvidia’s LLM work, including many LLM models, such as GPT, Bloom and Falcon, that have been optimized with techniques like kernel fusion, faster attention and multi-head attention. Kernels for all these operations have been open sourced as part of TensorRT-LLM.
“This allows researchers who are interested in performance to have a starting point to make it even faster,” Buck said. “Our customers and users appreciate that they have something they can optimize further for their use case, if they have a specific idea they want to deploy.”
Innovation is coming from the academic world but also from companies like Meta, Microsoft and Google. Nvidia works with them to optimize inference, and those optimizations may make it into an academic paper, but “there wasn’t a good place for the world to go to get those optimizations, and the work that Nvidia engineers are doing wasn’t getting shared in a place that could help the rest of the world,” Buck said.
The performance boost from TensorRT-LLM should be obvious in the next round of MLPerf inference scores, Buck added, which are due next spring.