How ‘Why Not’ Led to a $20 Billion Deal For Groq


SAN JOSE, Calif. — At GTC 2026, former Groq CEO Jonathan Ross, now Nvidia’s chief software architect, said the two companies had been working together behind the scenes for almost a year. That project, which disaggregated LLM inference workloads between Nvidia GPUs and Groq LPUs, led to one of the biggest deals in semiconductor history: Nvidia licensing Groq’s technology and hiring its technical team for a reported $20 billion.

“It all started early in 2025 when Nvidia released NVLink and was going to allow partners to connect to it,” Ross said.

Groq COO Sunny Madra connected with Nvidia to ask whether the GPU giant would allow its communication protocol, originally intended for GPU-GPU connectivity but licensed to third-party CPU makers, to be used by another AI accelerator company. The answer from Nvidia—in Ross’s retelling, directly from Nvidia CEO Jensen Huang—was “why not.”

“We got some GPUs […] and we started trying to get something working across GPUs and LPUs where we took different portions of the workload that ran better on each different chip,” Ross said. “It worked.”

“We presented [our demo] to Jensen. Three days later, Jensen called up and said, ‘Why don’t we work more closely together?’ Three weeks later, the deal was done. And one day after that, I was at Nvidia working full time, and that was actually December 25th, Christmas. That’s when I got my laptop and started working [at Nvidia].”

Groq’s selling point for its tokens was per-user token speed, but the company’s SRAM-based architecture meant many racks of chips were needed to hold a single model, which was uneconomical.

“Sunny proposed [disaggregation with Nvidia GPUs] to me, and at first I was against it,” Ross said. “Not because I didn’t think it was a good idea, I just didn’t know if it was going to work, and there were a whole bunch of other things that were going to work.”  

“We almost didn’t do it,” he said, noting that a year on, being able to use AI to perform experiments like these would have made the decision to run those experiments an easier “yes.”

“If that happened today? It would have been no question,” he said. “The only question was opportunity cost. We had a finite amount of engineers. We were trying to deliver things for customers. Those customers told us how many dollars they were going to give us if we deliver, and so we were on a path. Sunny advocated for this, and I said, fine, take a small contingent and do it.”

“Imagine if [I had] said no,” he added.

Groq LP30

Announced with much fanfare during Huang’s GTC keynote speech, the new Groq chip will join Nvidia’s Rubin-generation lineup.

Nvidia has productized racks of Groq LP30 chips as the Groq 3 LPX Rack, which will sit shoulder to shoulder with racks of Vera Rubins in the AI factory of the future. Together, they offer 35× the token throughput of Vera Rubin alone for workloads that need high interactivity (high tokens per second per user), Huang said.

The business case is simple: people will pay for speed. Low-interactivity tokens (those the user experiences as “slow”) can be free or low value. The fastest tokens (200 and 400 tokens per second per user, in Huang’s example) will be charged at a “premium” tier; that is, they are more valuable per token because of their speed. It is these premium tokens that Groq’s chip and software make possible. GPUs, even Rubin-generation GPUs, have high throughput but cannot reach the highest interactivity levels. Groq’s chip bends the downward slope of Rubin’s interactivity curve upwards at the right-hand side of the graph (below, upper-right beige line).
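
As a toy illustration of that pricing logic (a rough Python sketch; the per-million-token prices are invented, and only the interactivity thresholds echo Huang’s example):

```python
# Toy illustration of tiered token pricing. The per-million-token prices are
# invented for this sketch; only the interactivity thresholds echo Huang's
# 200 and 400 tokens-per-second-per-user example.

def token_revenue(tokens: int, tokens_per_sec_per_user: float) -> float:
    """Price a batch of tokens by the interactivity tier they were served at."""
    if tokens_per_sec_per_user >= 200:      # "premium" high-interactivity tokens
        price_per_million = 15.00           # hypothetical price
    elif tokens_per_sec_per_user >= 50:     # standard interactive tokens
        price_per_million = 5.00            # hypothetical price
    else:                                   # slow / batch tokens
        price_per_million = 0.50            # hypothetical price
    return tokens / 1_000_000 * price_per_million

# The same million tokens earn very different revenue depending on speed.
print(token_revenue(1_000_000, 400))   # 15.0  (premium)
print(token_revenue(1_000_000, 20))    # 0.5   (low interactivity)
```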

Nvidia’s graph of throughput (tokens per second) versus interactivity (tokens per second per user). Hopper performance is the grey line, Blackwell the dark green line, and Rubin the light green line; the Groq extension to Rubin’s performance at high interactivity is the beige line. (Source: Nvidia)

“This is probably the single most important chart for the future of AI factories,” Huang said, referring to the graph above. “Every CEO in the world will be studying this, will be studying it very deeply.”

This graph, he said, will lead directly to revenues for AI factories.

“If most of your workload is high-throughput, I would stick with just 100% Vera Rubin,” Huang said. “If a lot of your workload is coding and very high-value engineering token generation, I would add Groq to it. I would add Groq to maybe 25% of my total data center. The rest of my data center is 100% Vera Rubin.”

Groq chips have a large amount of SRAM (500 MB on the Groq v3), and the compiler schedules all the computation at compile time. This architecture is suited to inference.
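
A back-of-envelope sketch shows why an SRAM-only design spreads a single model’s weights across many chips, the point made earlier about uneconomical racks. The 500 MB per chip and 256 chips per LPX rack come from this article; the model sizes and FP8 precision are illustrative assumptions:

```python
import math

# Back-of-envelope: chips and racks needed just to hold a model's weights in
# on-chip SRAM. The 500 MB per chip and 256 chips per LPX rack come from the
# article; model sizes and FP8 (1 byte/parameter) are illustrative assumptions.

SRAM_PER_CHIP_GB = 0.5      # 500 MB of SRAM on the Groq v3 (per article)
CHIPS_PER_RACK = 256        # LPUs per LPX rack (per article)

def chips_and_racks(params_billion: float, bytes_per_param: float = 1.0):
    """Chips/racks to hold the weights alone, ignoring KV cache and activations."""
    weight_gb = params_billion * bytes_per_param
    chips = math.ceil(weight_gb / SRAM_PER_CHIP_GB)
    racks = math.ceil(chips / CHIPS_PER_RACK)
    return chips, racks

# A hypothetical 70B-parameter model at FP8 already spans ~140 chips;
# a 1T-parameter model needs several racks for the weights alone.
print(chips_and_racks(70))     # (140, 1)
print(chips_and_racks(1000))   # (2000, 8)
```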

“It’s one workload. Now this one workload, as it turns out, is the workload of AI factories,” Huang said. “As the world continues to increase the amount of high-speed tokens it wants to generate, which is, super-smart tokens it wants to generate, the value of this integration is going to get even higher.”

Groq’s ability to reach the mainstream has been limited, Huang said.

“What if we re-architected the way that inference is done in the pipeline so that we could put the work that makes perfect sense on Vera Rubin, and then offload the decode generation, the low latency, bandwidth-limited part of the workload to Groq?” Huang said. “We united two processors of extreme differences. One for high throughput, one for low latency.”

LLM inference workloads are split into prefill and decode stages (Source: Nvidia)
Nvidia’s new system architecture splits decode across Vera Rubin and Groq racks (Source: Nvidia)

In other words, it just so happens that Vera Rubin’s weakness is Groq’s strength, and Groq’s weakness is Vera Rubin’s strength. LLM inference workloads will be split across adjacent racks of heterogeneous hardware. Vera Rubin will handle prefill, which is typically compute-bound, and the attention part of decode, which is memory capacity-bound. Groq 3 LPUs will handle the feed-forward network part of decode (labelled FFN in the slide above), the part used to generate the next token in a sentence, which is typically memory bandwidth-bound. Groq chips cannot handle the whole decode stage on their own because they lack the memory capacity to hold the context (more specifically, the KV cache), but a rack of chips can hold all the weights of a single model, as required for token generation.
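
A minimal sketch of that split, with NumPy stand-ins for the two device pools; the class and function names are invented for illustration and are not Nvidia’s or Groq’s APIs:

```python
import numpy as np

class GpuPool:
    """Stand-in for Vera Rubin: holds the growing KV cache (capacity-bound)
    and runs prefill plus the attention part of decode."""
    def __init__(self, d_model: int):
        self.d_model = d_model
        self.kv_cache = []                          # grows with context length

    def prefill(self, prompt_tokens: np.ndarray) -> np.ndarray:
        hidden = np.random.randn(len(prompt_tokens), self.d_model)  # dummy embed
        self.kv_cache.append(hidden)                # cache K/V for decode steps
        return hidden[-1]                           # last-token hidden state

    def attention(self, query: np.ndarray) -> np.ndarray:
        context = np.concatenate(self.kv_cache)     # attend over cached context
        scores = context @ query / np.sqrt(self.d_model)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ context

class LpuPool:
    """Stand-in for the Groq racks: FFN weights resident in SRAM, running the
    bandwidth-bound feed-forward part of each decode step."""
    def __init__(self, d_model: int, d_ff: int):
        self.w1 = 0.02 * np.random.randn(d_model, d_ff)
        self.w2 = 0.02 * np.random.randn(d_ff, d_model)

    def ffn(self, hidden: np.ndarray) -> np.ndarray:
        return np.maximum(hidden @ self.w1, 0.0) @ self.w2

def decode_one_token(gpu: GpuPool, lpu: LpuPool, hidden: np.ndarray) -> np.ndarray:
    attn_out = gpu.attention(hidden)    # memory-capacity-bound: stays on the GPU
    return lpu.ffn(attn_out)            # bandwidth-bound: offloaded to the LPUs

# Prefill the prompt on the GPU pool, then alternate attention/FFN per token.
# (Real decode would also append new K/V entries each step; omitted here.)
gpu, lpu = GpuPool(d_model=64), LpuPool(d_model=64, d_ff=256)
hidden = gpu.prefill(np.arange(16))
for _ in range(4):
    hidden = decode_one_token(gpu, lpu, hidden)
```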

The sweet spot is one Vera Rubin rack for every one to four Groq LPX racks.

The revenue opportunity unlocked by the combination of Vera Rubin and the Groq 3 LPU is close to $300 billion per gigawatt for Nvidia customers, Huang said, mostly due to the ability to produce higher-value tokens.

Nvidia Vera Rubin and Groq LP30. One’s strength is the other’s weakness. (Source: Nvidia)

Inside the LPX rack, each Groq compute tray contains eight LPUs, for a total of 256 per rack, using the same MGX rack architecture as GB200 and Vera Rubin. The rack will be liquid-cooled and sits in the same power envelope as a full Vera Rubin rack. An FPGA in the system helps synchronize the workload across the chips for accurate execution.

The Groq 3 (LP30) chip uses Groq’s proprietary Ethernet-based, chip-to-chip links, Stuart Pitts, Groq’s head of product and commercial marketing, now in Nvidia’s accelerated computing and inference products group, told EE Times.

“We’ll continue to co-innovate,” Pitts said. “For the Groq 3 chip that’s 96 direct links per chip, which obviously scales out pretty significantly across the fleet.”

NVLink-C2C is due to be added to the next-generation Groq 4 chip. Today, NVLink supports 72 GPUs in a single domain, with Rubin Ultra due to increase the domain size to 576 using co-packaged optics; there are 256 Groq chips in each LPX rack.

Groq’s entire platform had been built on the company’s first-generation chip, available since 2019; a trailed second-generation chip never materialized. The LP30 is being billed as third-generation.

“We skipped V2,” Pitts said. “We had been working with Nvidia for a little bit before the licensing agreement was signed, and then once it was signed, Jensen said, let’s go, I’ll take V3 please, and I’ll take it tomorrow. We have literally accelerated a multi-generational leap in what this platform is capable of doing.”

The Groq LPX rack with 256 liquid-cooled Groq chips (Source: Nvidia)

Software stack

Performance is needed for the next phase of generative AI, where models will approach one trillion parameters with half a million tokens of context and require 1,000 tokens per second.
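
A rough back-of-envelope calculation, using only the numbers above plus loudly assumed precision and attention dimensions, suggests why this kind of decode workload is memory-bandwidth- and capacity-bound:

```python
# Rough arithmetic with the numbers above (1T parameters, 500K-token context,
# 1,000 tokens/s), plus assumed FP8 precision and attention dimensions, to show
# why such decode workloads are memory-bandwidth- and capacity-bound.

PARAMS = 1.0e12            # ~one trillion parameters (per article)
BYTES_PER_PARAM = 1        # assume FP8 weights
CONTEXT = 500_000          # tokens of context (per article)
TOKENS_PER_SEC = 1_000     # target speed (per article)

# Weight traffic: a dense model streams roughly all its weights once per token
# (batching and MoE sparsity would reduce this, but the scale is the point).
weight_bytes = PARAMS * BYTES_PER_PARAM
print(f"Weights read per token: {weight_bytes / 1e12:.1f} TB")
print(f"Bandwidth for one 1,000-tok/s stream: {weight_bytes * TOKENS_PER_SEC / 1e15:.0f} PB/s")

# KV cache for one session, assuming 100 layers, 128 KV heads of dim 128, FP8.
LAYERS, KV_HEADS, HEAD_DIM = 100, 128, 128
kv_bytes = CONTEXT * LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_PARAM
print(f"KV cache for one 500K-token context: {kv_bytes / 1e12:.1f} TB")
```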

“The LPX rack with 256 Groq chips, or multiple LPX racks combined with Vera Rubin can make these next-generation workloads possible, economical and performant so that agents can talk to other agents with trillion-parameter intelligence,” said Ian Buck, VP hyperscale and HPC at Nvidia. “[But] the reality is the chips do not work without enormous amounts of software.”

There are X-factors to be had from software, Buck said. Nvidia’s Dynamo inference cluster orchestration software increased Blackwell’s performance by 7×.

“It’s not about who has the faster chip, it’s who has the integration and the software to actually execute and run, and we’re not done. The models are just getting faster,” Buck said.

Groq’s compiler, which orchestrates the entire execution of the workload on the chip, is a big part of Groq’s IP, and the company had also developed software for sharding inference across the extreme number of chips it needed for big models. How much of Groq’s software stack will Nvidia use?

“We licensed all of it, we’re going to use all of it,” Buck said. “[Groq] has a really impressive compiler to target this processor, and a very impressive technology to split the model and compile it to execute across all of those chips. You’re going to need that. The disaggregation software is vital.”

Groq’s LP30 (Source: EE Times)

Groq’s engineers have joined Nvidia’s Dynamo team, Buck said.

“We’re now integrating all [Groq’s stack] so that the existing LPU interconnect, between the LPUs, we’re accelerating it a bit, throwing more people at it, and accelerating their own LPU software roadmap, and adding the GPU into that stack,” he said.

Nvidia is investing heavily in Groq’s hardware and software, Buck confirmed.

Groq was serving tokens to customers via an API. That is, the lower levels of its software stack were internal only. Does Buck, the inventor of CUDA, expect to open up Groq’s software stack in the style of CUDA to enable users to write lower-level code?

“Step one on LPX is working with our biggest customers,” Buck said. “We will eventually open up that programming environment to [foundation model builders, then] the rest of the world, but for the first generation it will be very similar to the Groq model.”

Before the Groq deal, Nvidia’s idea for inference disaggregation was Rubin CPX, a smaller GPU with a different balance of compute and memory designed for faster prefill (faster time to first token). As a result of the Groq deal, Rubin CPX has been put on the back burner, Buck said.

“We decided to focus [on decode] to improve the dollars per token, and the token rate,” Buck said. “We can still do pre-fill with [Vera Rubin]—the CPX was going to lower the cost…but it wasn’t a big part of the workload. It was only going to improve the time to first token versus the actual token speed.”

“Rubin CPX is still a good idea; I think we’ll revisit it in the Feynman generation,” he added.


See also:

Groq: Nvidia’s $20 Billion Bet on AI Inference

Fallout From Nvidia-Groq Deal Validates AI Chip Startup Landscape

What Is Groq-Nvidia Deal Really About?


