Long Live the Inference King – EE Times



Crowned “The Inference King” by SemiAnalysis, His Royal Highness Jensen Huang addressed his faithful subjects for more than two hours, as has become traditional on the first day of Nvidia’s developer conference, GTC 2026.

With fewer theatrics than in recent years—no T-shirts fired into the crowd and only one robot—the Nvidia CEO nevertheless bragged about the company’s inference capabilities, showed off Vera Rubin (its next-generation CPU-GPU combo), and unveiled dramatic architectural changes for the Rubin generation based on its recent all-but-acquisition of AI accelerator startup Groq.

Huang started by paying tribute to what many feel is Nvidia’s moat: the installed base for its low-level GPU software stack, CUDA, whose early success he credited to the adoption of GeForce GPUs in gaming PCs.

“I know how many of you grew up with GeForce; GeForce is Nvidia’s greatest marketing campaign,” Huang joked. “We attract future customers, starting long before you could afford to pay for it yourself… Your parents paid for you to be Nvidia customers. And every single year, they paid up. Year after year after year, until someday you became an amazing computer scientist and became a proper customer.”


Referring to Nvidia as “the house that GeForce made,” Huang acknowledged GeForce’s role in the development of CUDA over the last 25 years.

“One of the biggest investments that we made—we couldn’t afford it at the time, it consumed the vast majority of our company’s profits—was to take CUDA on the back of GeForce to every single computer,” Huang said. “We dedicated ourselves to creating this platform because we felt so strongly about its potential. But ultimately, the company’s dedication to it, despite the hardships in the beginning, believing in it every single day for 13 generations, 20 years, we now have CUDA installed everywhere.”

Huang mimes holding up the InferenceMAX title belt at GTC. (Source: Nvidia)

Tokenomics

Huang focused on inference performance and token economics for a large part of his keynote, saying the company has made great progress since GTC 2025 through innovations like the proprietary connectivity protocol NVLink, the NVFP4 numerical format, and software like TensorRT-LLM. Nvidia’s DGX Cloud supercomputer has been put to use writing CUDA kernels for inference, too.

The results have come together to drastically affect the economics of serving tokens, Huang said.

“Every CEO in the world will study their business from now on in the way I’m about to describe,” he said.

Tokens are the “product” of the AI factory, and production is limited by power, making tokens per watt the critical throughput metric, Huang said. Interactivity, another critical metric, is the speed of inference in tokens per second per user. Because interactivity determines how large models can be and how much content can be processed, it is equivalent to the “smartness” of the AI, he said. The third key factor is cost per token.
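The three metrics Huang describes can be expressed in a few lines of arithmetic. The sketch below is illustrative only—the function name and all input figures are hypothetical, not Nvidia’s published numbers:

```python
# Hypothetical AI-factory metrics; all input values below are illustrative.

def factory_metrics(tokens_per_s: float, power_w: float,
                    cost_per_hour: float, users: int):
    """Compute the three token-factory metrics described in the keynote."""
    tokens_per_watt = tokens_per_s / power_w        # throughput per watt
    interactivity = tokens_per_s / users            # tokens/s per user
    cost_per_token = cost_per_hour / (tokens_per_s * 3600)
    return tokens_per_watt, interactivity, cost_per_token

# Example: a rack serving 1,000 concurrent users (made-up figures)
tpw, inter, cpt = factory_metrics(
    tokens_per_s=1_000_000, power_w=120_000,
    cost_per_hour=300.0, users=1_000)
print(f"{tpw:.2f} tokens/W, {inter:.0f} tokens/s/user, ${cpt:.2e}/token")
```

The tension Huang points to is visible here: raising interactivity (tokens/s per user) with a fixed power budget eats into aggregate throughput, so all three metrics must be optimized together.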

“Nvidia’s token cost is world class—basically, untouchable at the moment,” Huang said. “The reason that’s true is because of extreme co-design.”

Nvidia has the highest performance in the world, he said: where Moore’s Law might have been expected to merely double performance, Grace Blackwell is advertised at 35× the performance of the previous-generation Hopper H200.

“[SemiAnalysis chief analyst] Dylan Patel accused me of sandbagging—it’s actually 50×,” he said. “He’s not wrong.”

GB300 can offer up to 50× higher performance per watt versus Hopper. (Source: Nvidia)

With power-limited infrastructure, Huang said, companies had better make sure that their architecture is as optimized as possible to produce their new commodity: tokens.

“In the future, every single computer company, every single cloud company, every single AI company, and every single company, period, will be thinking about their token factory’s effectiveness,” he said. “The reason I know this is because everyone in this room is powered by intelligence, and in the future, that intelligence will be augmented by tokens.”

Groq LPX

The reveal we had all been waiting for came about an hour in. Following Nvidia’s all-but-acquisition of Groq three months ago, Huang unveiled a brand-new architecture heavily featuring a next-generation Groq LPU, repositioned as a token-generation accelerator and paired with the new Vera Rubin CPU-GPU. Together, the Vera-Rubin-Groq combination will offer 35× more throughput per watt at the highest interactivity levels, Huang said, noting that Groq chips could make up 25% of the future AI factory.

“The [Groq] architecture is designed with massive amounts of SRAM,” Huang said. “It is designed just for inference; it’s one workload. Now this one workload, as it turns out, is the workload of AI factories. And as the world continues to increase the amount of high-speed tokens it wants to generate, which is super-smart tokens it wants to generate, the value of this integration is going to get even higher.”

Nvidia has kept the parts of the workload that make sense there on Vera Rubin, including the memory-capacity-limited KV cache, and offloaded token generation (the low-latency, bandwidth-limited part) to Groq. The attention part of decode stays on Vera Rubin, with only the token-generation part of decode going to Groq.

“We united, unified, two processors of extreme differences—one for high throughput, one for low latency,” Huang said.

Dynamo, Nvidia’s inference serving software, orchestrates the workload across both types of chips. The result is a 35× increase in throughput at the highest interactivity levels possible today, and it extends the interactivity levels possible with the Rubin architecture.
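The split the article describes can be sketched as a two-stage decode loop: attention reads the KV cache on the HBM-rich GPU, then token generation runs on the SRAM-based accelerator. All names and bodies below are hypothetical placeholders, not Dynamo’s actual API:

```python
# Hypothetical sketch of the disaggregated decode path described above.
# Function names and internals are illustrative, not Nvidia/Dynamo code.

from dataclasses import dataclass

@dataclass
class DecodeStep:
    prompt_id: int   # which request this token belongs to
    position: int    # position of the token being decoded

def run_attention_on_gpu(step: DecodeStep) -> list[float]:
    # Placeholder: read the KV cache from HBM and produce attention output.
    # This is the memory-capacity-limited part that stays on Vera Rubin.
    return [0.0] * 8

def generate_token_on_lpu(hidden: list[float]) -> int:
    # Placeholder: bandwidth-bound sampling over SRAM-resident weights.
    # This is the low-latency part offloaded to the Groq LPU.
    return max(range(len(hidden)), key=lambda i: hidden[i])

def decode_one_token(step: DecodeStep) -> int:
    hidden = run_attention_on_gpu(step)    # stays on the GPU
    return generate_token_on_lpu(hidden)   # handed off to the LPU

token = decode_one_token(DecodeStep(prompt_id=0, position=1))
```

The design point is that each processor runs only the phase it is built for; the orchestrator’s job is moving the small activation tensor between them rather than the large KV cache.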

Groq chips, 256 to a rack, were shown next to a rack of Vera Rubins. Groq LPUs will ship from Nvidia in Q3.

[Stay tuned for more details on Nvidia’s plans for Groq in another article coming shortly to EE Times]

The new Groq 3 LPU contrasted with the Rubin GPU. (Source: Nvidia)
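The slide’s bandwidth figures explain why the LPU wins on token generation: decode is roughly memory-bandwidth-bound, so a per-device throughput ceiling is just bytes read per token divided by bandwidth. The model size below is hypothetical, and in practice a model this large would be sharded across many LPUs (the article mentions 256 per rack), since one LPU holds only 500 MB of SRAM:

```python
# Back-of-envelope ceiling using the slide's bandwidth figures.
# Decode is assumed purely bandwidth-bound; model size is illustrative.

RUBIN_HBM_BW = 22e12    # 22 TB/s HBM4 bandwidth (slide)
GROQ_SRAM_BW = 150e12   # 150 TB/s SRAM bandwidth (slide)

model_bytes = 100e9     # hypothetical 100 GB of weights read per token

t_rubin = model_bytes / RUBIN_HBM_BW   # seconds per token, Rubin ceiling
t_groq = model_bytes / GROQ_SRAM_BW    # seconds per token, Groq ceiling

print(f"Rubin ceiling: {1/t_rubin:,.0f} tok/s; "
      f"Groq ceiling: {1/t_groq:,.0f} tok/s per user")
```

Under these assumptions the SRAM part delivers roughly 7× lower per-token latency per device, which is the “interactivity” axis Huang says the integration extends.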

Feynman preview

After Vera Rubin and Rubin Ultra will come Feynman, Nvidia’s next generation of technology, with new GPUs, a new Groq chip, a new CPU called Rosa, and a new BlueField networking chip. It will use the Khyber rack design, originally created for Rubin Ultra, which allows 144 GPUs to be connected in a single NVLink domain.

Khyber will use both copper and optics for scale-up, Huang said.

“For the first time, we will scale-up with both copper and co-packaged optics,” Huang said. “A lot of people have been asking, ‘Jensen, is copper going to still be important?’ The answer is yes. ‘Jensen, are you going to scale up optically?’ Yes. ‘Are you going to scale out optically?’ Yes.”

More capacity will be needed from the ecosystem for both copper and optics, he said.

NemoClaw

One of the biggest parts of Huang’s presentation concerned OpenClaw, the agentic AI framework that has seen a meteoric rise in the last few months.

“OpenClaw has open-sourced the operating system of agent computers,” Huang said. “It’s no different than how Windows made it possible for us to create personal computers. Now, OpenClaw has made it possible for us to create personal agents. The implications are incredible.”

CEOs of all companies should be considering their OpenClaw strategies, he said, noting that all SaaS companies will effectively become agents as a service companies.

OpenClaw’s meteoric rise. (Source: Nvidia)

“OpenClaw gave us exactly what we needed at exactly the right time,” Huang said. “There’s just one catch—agentic systems in the corporate network can have access to sensitive information, it can execute code, and it can communicate externally…obviously, this can’t possibly be allowed.”

Nvidia has taken the OpenClaw open-source stack and added its safety and security guardrails to make it enterprise-ready. The reference stack is called NemoClaw.

“The OpenClaw event cannot be understated,” Huang said. “This is as big of a deal as HTML. This is as big of a deal as Linux. We now have a world-class open agentic framework that all of us can use to build our OpenClaw strategy.”

Humanoid robotics

Unlike last year, when Huang was joined by a dozen humanoid robots on stage, this year the robot count was limited to one: a short, waddling characterization of Olaf the snowman from Disney’s Frozen. “Olaf” had a brief, live conversation on stage with Huang in which Huang said, “I thought you’d be tall.”

The show ended with an animation of various Nvidia-powered humanoid robots singing a country song about the keynote content around the campfire, accompanied by Toy Jensen on harmonica, only slightly less surreal than the visuals of racks of Groq chips in the Nvidia AI factory of the future.


See also:

Fallout From Nvidia-Groq Deal Validates AI Chip Startup Landscape

What Is Groq-Nvidia Deal Really About?

Synopsys Shows Off First Synopsys-Ansys Products


