The latest round of MLPerf AI inference benchmark scores is in. Nvidia has dominated both MLPerf training and inference results since the beginning, but this round Qualcomm appears to be close on Nvidia's tail when it comes to data center and edge server inference.
Qualcomm submitted MLPerf inference scores for a system with 16 of its Cloud AI100 accelerators, which won the ResNet-50 and SSD-Large benchmarks for data center inference in the closed division. Nvidia's biggest A100 system in this division had only eight A100s, however, and at matched accelerator counts, the 8x A100 systems comfortably outperformed Qualcomm's 8x Cloud AI100 system on both ResNet-50 and SSD-Large.
Notably, on some power efficiency metrics, Qualcomm's Cloud AI100 can be interpreted as outperforming some systems based on Nvidia's A100. The Cloud AI100 has a TDP around 75 W, far below the hundreds of watts its GPU competitors typically need. While its raw performance figures trailed Nvidia's A100 overall, once divided by power consumption, the Qualcomm part takes the lead in certain cases.
As well as many iterations of A10-, A30-, A100-, AGX Xavier- and Xavier NX-accelerated systems, Nvidia also submitted scores for a system using an Arm-based server CPU (in this case, an Ampere Altra), alongside scores for a system with the same accelerator setup but an AMD Epyc x86 server CPU. This allowed a direct comparison between an Arm-based CPU and its x86 equivalent for the first time. The results put the two systems roughly at parity on performance. Nvidia says this proves not only that Arm CPUs are ready for the data center, but that Nvidia's own software is ready for that eventuality.
Here’s a deeper dive into the results. The full spreadsheet of MLPerf Inference 1.1 scores can be viewed here.
Nvidia's take on the closed-division data center inference results is summarized in the graph below. The figures are normalized per accelerator chip, then to the performance of the A30. The two tallest bars on each workload compare the Arm-based server with the x86-based server using the same Nvidia accelerators. While performance was similar, the x86 system was marginally better in almost all cases.
Arm-based CPUs have been touted as a power-efficient solution for the data center. However, Nvidia did not submit power measurements for the Arm-based system this round. It remains to be seen whether benchmark scores on par with, or slightly below, a similar x86-based system will be enough to persuade data center operators to make the switch.
Nvidia's graph has Qualcomm's Cloud AI100 outperforming the Nvidia A30, Nvidia's mainstream inference GPU with a TDP around 165 W, on the ResNet-50 benchmark only.
Qualcomm won the ResNet-50 benchmark overall in the closed data center division, with scores around 310,000 inferences per second in server mode (with a latency target) and 342,000 inferences per second in offline mode (no latency target) for a system with 16 Cloud AI100s. It doesn't appear that way in the graph above because Nvidia has normalized per accelerator, but Qualcomm's John Kehrli, senior director of product management, noted that normalizing per accelerator isn't the only way to compare the scores.
“The data center submission for Nvidia had eight A100 cards at 500 Watts each — [total] 4 kilowatts,” he said. “We submitted 16 cards at 75 Watts each, [total] 1.2 kilowatts, so we are leading performance with a fraction of the power… we think that’s a very, very compelling story.”
(Note that the 1.2 kW in Kehrli's example covers the accelerators only; the whole 2U system in this case consumes "under 1.84 kW," according to Qualcomm.)
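Kehrli's point about normalization can be made concrete with a little arithmetic. The sketch below is a rough illustration using only the accelerator TDPs and the offline ResNet-50 figure quoted above (not measured wall power); it shows how the same submission looks very different when divided by accelerator count versus by power:

```python
# Qualcomm's offline ResNet-50 figure quoted above: ~342,000 inf/s on 16 cards.
QC_THROUGHPUT = 342_000      # inferences per second, offline mode
QC_CARDS = 16
QC_TDP_PER_CARD = 75         # watts per Cloud AI100 card (TDP, not measured power)

# Per-accelerator normalization -- how Nvidia's chart slices the data:
per_card = QC_THROUGHPUT / QC_CARDS
# Per-watt normalization -- how Qualcomm prefers to slice it.
# Accelerator TDP only; whole-system power would be higher (~1.84 kW).
per_watt = QC_THROUGHPUT / (QC_CARDS * QC_TDP_PER_CARD)

print(f"{per_card:,.0f} inf/s per card")  # prints 21,375 inf/s per card
print(f"{per_watt:,.0f} inf/s per watt")  # prints 285 inf/s per watt
```

Per card, the Cloud AI100 trails an A100; per watt of accelerator TDP, it leads, which is exactly the framing dispute between the two companies.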
Nvidia also presented a graph of the best scores from the closed edge division, again, normalized per accelerator chip, this time normalized to the performance of the Nvidia Jetson Xavier NX (below). Broken down this way, Qualcomm’s Cloud AI100 results place it somewhere between the Jetson Xavier NX and the AGX Xavier for image processing, though it easily outperformed both those systems on the NLP benchmark Bert.
The data center inference power results are where things get more interesting. Qualcomm showed off its inferences-per-second-per-watt score for ResNet-50, comparing it against various Nvidia accelerator scores submitted by Nvidia and its partner Dell, all of which it soundly beats (graph below; note this is normalized by power consumption, not by number of accelerators).
Here is Nvidia's version of the same graph, also normalized by power rather than by accelerator. It shows Qualcomm's lead in performance per watt on ResNet-50 and SSD-Large, Nvidia's performance-per-watt win on the Bert-Large benchmarks, and the benchmarks Qualcomm did not submit results for.
“Our main message here is relating to the versatility of being able to run every workload,” Nvidia’s Dave Salvator told EE Times. “On ResNet, we’re certainly very efficient, but not quite the most efficient. When you look at things like Bert, which is a more recent vintage workload, frankly, we lead on both performance and efficiency. We are able to run everything and be performant on everything.”
In the edge power results, Qualcomm again claimed victory. The company put its Cloud AI100 development kit (which comes in 15 W and 20 W TDP-constrained versions) up against Nvidia AGX Xavier and Xavier NX systems.
For edge servers, Qualcomm entered a five-accelerator Cloud AI100 system against single-A10 and dual-A100 systems entered by Nvidia partners Dell and Inspur.
“We are 50% better than the competitor on ResNet-50,” said Qualcomm’s John Kehrli. “This is a five-card, 75-Watt solution for us, the Cloud AI100, versus two 300-Watt cards for the competitor. So again, we’re a little bit over half the power, with 50% better performance.”
Nvidia’s Dave Salvator again pointed out that Qualcomm appeared to cherry-pick its results, whereas Nvidia submitted scores for all the workloads for each system it entered.
“There’s the matter of being able to do one or two things well versus being able to do everything, either very well or at least well,” he said.
Elsewhere in the results
Nvidia also submitted results for otherwise identical systems running either Nvidia's custom, heavily optimized code or its Triton inference server software. For an 8x Nvidia A10-accelerated system, the custom code was slightly better on ResNet-50 in the server setup, while the Triton-enabled system was slightly ahead on offline DLRM, for example. Overall, there was little to choose between the two.
This strategy is intended to answer detractors who have previously complained that Nvidia spends its massive resources on optimizing its systems for submitting benchmark scores. Pointing out that Triton is not only for GPU-accelerated systems (it works with CPU-only systems, for example), Nvidia’s Dave Salvator said Triton is designed to make it easier to deploy AI models at scale.
“Triton is easing deployment for infrastructure managers bringing high integration for Kubernetes support, for managed Kubernetes services within [cloud services providers], [giving them] the ability to do auto load balancing and auto scaling,” he said.
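To give a flavor of what that deployment story looks like in practice: with Triton, a model is dropped into a repository directory alongside a small configuration file. The fragment below is an illustrative sketch of a Triton `config.pbtxt`; the model name, batch size, instance count and queue delay are invented for the example, while the field names follow Triton's model-configuration schema:

```
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 32
instance_group [
  {
    count: 2          # run two copies of the model per GPU
    kind: KIND_GPU
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

From there, scaling out is left to the orchestration layer (for instance, the Kubernetes integration Salvator describes) rather than to hand-written serving code.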
Nvidia also submitted several results using MIG, its multi-instance GPU capability which splits a large GPU into several smaller instances to run smaller workloads simultaneously. For example, an A100-80GB can be split into 7 smaller accelerators, each with 10 GB memory (the eighth partition is used for control and state management). An A100 MIG submission, loaded with the benchmarked workload on one partition, then concurrently running the rest of the benchmark suite in the others, could achieve around 95% of the performance of an A100 running the benchmarked workload alone. This is useful, said Salvator, when running applications like conversational AI which require running multiple neural networks at once. The A30 also has MIG capabilities (it can be split into 4 smaller GPUs).
Intel Xeon scores
While Nvidia and Qualcomm battled it out for the overall crown, Intel also submitted scores for its 3rd-generation Ice Lake and Cooper Lake Xeon CPUs in the closed data center performance division, with the intention of proving it is practical to run a broad range of AI workloads on Xeon CPUs.
The CPU leader highlighted its improved results versus inference scores for the same systems submitted in the last round back in April, which it put down to both hardware and software improvements.
While Intel's submissions this round don't allow a direct Ice Lake versus Cooper Lake comparison, it is possible to compare 2nd-generation Xeon (Cascade Lake) scores from the last round against these most recent results. Ice Lake offers more compute, memory capacity and memory bandwidth than its predecessor Cascade Lake, and this is reflected in a 1.5x performance boost, said Intel.
Ice Lake scores this round used DL Boost acceleration (which includes the vector neural network instructions, VNNI) for INT8 inference workloads in either the PyTorch or OpenVINO framework. OpenVINO scores came in up to 1.3x faster than PyTorch. Compared with the Cascade Lake submissions last round, Ice Lake's DLRM scores improved 2.2x.
Cooper Lake, which adds BF16 support on top of DL Boost, achieved a 1.8x performance improvement on the RNN-T workload from software improvements alone (compared with its previous scores). This was achieved using a mix of BF16 and INT8 precision.
“As buoyant as all these data scientists are supposed to be, at [our hyperscale customers], it turns out that actually getting stated accuracy on all models at 8-bit is really, really hard,” Jordan Plawner, director of AI products and business at Intel, told EE Times. “And if you’re not one of those state-of-the-art companies, it’s even harder. Mixed precision is a way to get half the efficiency [but] get closer to that state-of-the-art accuracy [that] we tend to need in the data center.”
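Plawner's point, that hitting stated accuracy at 8 bits is hard while mixed precision is a pragmatic middle ground, comes down to dynamic range. A minimal, standard-library-only sketch (the tensor values are invented for illustration) contrasts symmetric INT8 quantization, where one scale must cover the whole tensor, with BF16, which keeps a per-value exponent:

```python
import struct

def to_bf16(x: float) -> float:
    """Simulate BF16 by truncating a float32 to its top 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def quantize_int8(xs):
    """Symmetric INT8 quantization: a single scale covers the whole tensor."""
    scale = max(abs(v) for v in xs) / 127.0
    return [round(v / scale) for v in xs], scale

# Invented tensor with wide dynamic range: a couple of large outliers
# plus many small values, a shape common in real weight tensors.
xs = [12.5, -9.3] + [0.01 * i for i in range(1, 21)]

q, scale = quantize_int8(xs)
int8_rel = max(abs(v - qi * scale) / abs(v) for v, qi in zip(xs, q))
bf16_rel = max(abs(v - to_bf16(v)) / abs(v) for v in xs)

# The shared INT8 scale crushes the small values to zero; BF16's per-value
# exponent keeps relative error bounded by its 8-bit mantissa.
print(f"worst INT8 relative error: {int8_rel:.0%}")   # prints 100%
print(f"worst BF16 relative error: {bf16_rel:.2%}")
```

Per-channel scales, calibration and mixed BF16/INT8 execution, the kind of approach Intel describes, are all ways of clawing that accuracy back without paying full FP32 cost.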
Intel’s DLRM inference score for its 2-CPU Ice Lake system reached around 20,000-23,000 inferences per second. While this might have doubled since the last round, it’s still an order of magnitude below a dual Nvidia A10-accelerated system, and another order of magnitude below some of the bigger Nvidia A100-enabled systems entered.
Does Intel still count this as a win, and how does Intel use these low scores to sell Xeon CPUs?
“One of the adverse impacts of something like MLPerf is it reduces the value of a chip to just peak performance,” Plawner said. “We’re privileged that 100% of our customers already use Xeon today, so they value many aspects or attributes of the Xeon processor. One is that they have them. Two is that they’re general purpose, and three is that they’re incredibly scalable in terms of adding and subtracting compute and dynamically creating clusters.”
Plawner added that recommendation, a workload especially valuable to hyperscalers, is scaling out to other types of customers with similar goals as it matures, and pointed out that beyond DLRM, customers' real-world recommendation models are actually very diverse.
“What we’ve seen is that on many internal models, CPUs actually perform equal to or even better than the competition, because they are doing a massive amount of memory lookup, thousands and thousands of embedding attributes,” Plawner said. “The amount of memory bandwidth is massive and that is all in balance with the compute. When you’re hitting those embedding tables in memory all the time, having a massive amount of FLOPS means [the compute] just sits idle.”
According to Plawner, Intel is not trying to win business where an accelerator would do better – Intel has its own AI accelerators to offer, such as Ponte Vecchio and the Habana Labs parts – instead, the company is trying to gain mindshare.
“Our only goal [with MLPerf] is to show that we’re improving gen to gen, and that we can support a wide number of models, because we have a competitor that has massive mindshare,” he said. “A lot of it is just signalling to the market that out of the box, anyone can do AI on Xeon.”
Virtualization vs bare metal
VMware, in partnership with Dell, submitted scores with and without VMware vSphere virtualization on otherwise identical systems. The Dell PowerEdge R7525 host has two AMD Epyc 7502 processors with 128 logical cores. VMware achieved 96% or better of the equivalent bare metal performance using only 24 CPU cores (plus the system's three Nvidia A100 GPUs), leaving the remaining 104 logical CPU cores available for additional tasks. Note, however, that scores for the virtualized system were submitted for the SSD-Large and Bert workloads only.
“The general perception is ML requires a lot of performance, so people run it in the bare metal environment,” said Uday Kurkure, lead staff engineer with VMware Performance Engineering. “You can get the benefits of virtualization in the data center as well as at the edge if you run virtualized hypervisors. But you can still get the performance close to the bare metal, and that’s why we submitted results, to show that running the ML workload in the virtualised environment is beneficial to customers and can bring down the cost of data center operations.”
Startup OctoML is developing a deployment platform to automatically optimize, benchmark and deploy AI models from any deep learning framework to any target hardware based on user requirements and constraints.
OctoML's Apache TVM compiler was used for two edge inference benchmark scores in the closed division for ResNet-50 (single stream and offline modes), one on a Raspberry Pi and one on AWS EC2 (Graviton 2) hardware. Apache TVM is designed to compile deep learning models from any framework for diverse hardware backends. OctoML also has some scores in the open edge inference division using the compiler.
“When we work with customers, we get lots of questions about why it is so difficult to optimize models for new hardware,” said OctoML vice president of MLOps, Grigory Fursin. “Current frameworks are difficult to extend and it’s a time-consuming process… We are focused on making the process simpler.”
OctoML has also been working on the Collective Knowledge (CK) framework, an open-source framework which automates the process of preparing, assembling, running, validating and reproducing MLPerf inference benchmarks. OctoML hopes that by combining Apache TVM with CK, it can lower the barrier to entry for MLPerf scores, thereby democratizing the submission of quality MLPerf results. CK has been donated to MLCommons, the organization responsible for MLPerf. It is already in use by other organizations; Qualcomm’s benchmark partner Krai used the CK automation suite in this round, for example.
Other notable entries this time around came from Furiosa AI and Neuchips.
Furiosa AI, a Korean startup founded in 2017, submitted its second score, this time for its Warboy inference accelerator chip (following an FPGA prototype submission in 2019). Warboy is intended for data center and enterprise applications. The chip has a peak performance of 64 TOPS (INT8). While this may seem underpowered for a data center chip, it came in ahead of an Nvidia T4-accelerated system (130 TOPS, previous-generation technology) on the ResNet and SSD-Small workloads (latency and samples/s, closed edge inference division). Furiosa claims a win here at low batch size, adding that its chip is priced much lower than the T4. Warboy production will ramp next year.
Taiwanese startup Neuchips again submitted a single score in the closed data center inference category, research/development class (for systems not expected to be commercially available within the next six months). The company submitted a DLRM score for a system accelerated by three of its RecAccel accelerators, which are tailored specifically for DLRM. Scores were an order of magnitude below the 2x Intel CPU scores, though the company says they are improving round over round.