Cirrascale CEO: ‘More Specialized Hardware Will Be Needed’



SANTA CLARA, Calif. – At the AI Infra Summit, in the wake of Nvidia’s Rubin-CPX announcement, Cirrascale CEO Dave Driggers told EE Times that the AI neocloud company has always been a proponent of heterogeneous compute in the data center.

“We’re big believers in horses for courses, and that we’re going to see more and more specialized hardware,” Driggers said. “As inference workloads change, and they will change dramatically as we move to MoE use cases with richer data, everything’s going to have to change. We’re going to need more specialized processors versus just pure general-purpose for doing inference.”

Cirrascale customers today run inference on models ranging from roughly 1B to 400B parameters. Driggers said no single product on the market can serve both ends of that scale economically.

“What processor you use radically changes the cost per token,” he said. “No single product today fits all models – some products don’t fit any model – and that’s just with simple LLMs. As we move into LLMs that handle video and audio in multi-modal scenarios, we’re going to need products that don’t exist today.”
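
That sensitivity is easy to see with back-of-the-envelope arithmetic: cost per token is roughly the hardware's hourly cost divided by its sustained token throughput. The Python sketch below illustrates the calculation; the SKU names, hourly rates, and throughput figures are hypothetical placeholders, not Cirrascale pricing or benchmark data.

```python
# Back-of-the-envelope cost-per-token arithmetic. The SKU names, hourly
# rates, and throughput numbers are hypothetical placeholders used only to
# illustrate the calculation; they are not Cirrascale pricing or benchmarks.

def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens on a given accelerator."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# The same model served on two hypothetical SKUs: the pricier part can still
# win on cost per token if its throughput advantage is large enough.
for name, rate_usd_per_hr, tokens_per_s in [
    ("big-HBM-GPU", 4.00, 900.0),
    ("graphics-GPU", 1.20, 150.0),
]:
    print(f"{name}: ${cost_per_million_tokens(rate_usd_per_hr, tokens_per_s):.2f} per 1M tokens")
```

Under these illustrative numbers, the more expensive accelerator delivers the cheaper token, which is exactly why the choice of processor changes the economics model by model.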


Hardware is lagging

Cirrascale has deployed Nvidia hardware at scale and has used RTX-series graphics GPUs extensively for inference to date. The flexibility to deploy different workloads (training, inference, and graphics) on the same hardware is valuable, Driggers said, especially as LLMs evolve into multi-modal models that must process graphics at the same time (Nvidia's Rubin-CPX will introduce graphics codecs in hardware).

Dave Driggers (Source: Cirrascale)

“One of the reasons why you need a much bigger context window is because you’re putting in richer data,” Driggers said. “That richer data is typically going to be images or video. The fourth wave of AI is going to be where we get into true augmented reality.”

AR glasses will feed images, audio, and video into large models, which requires a considerably larger context window than text does.

“We can come up with the applications, but the challenge is the hardware is still lagging,” Driggers said. “We need hardware that can handle large context windows, we need hardware that can do pre-processing of video in a near real-time or real-time fashion.”
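
Driggers' point about context size can be made concrete with a rough estimate. The sketch below assumes a ViT-style tokenizer in which every 16x16-pixel patch becomes one token; real multimodal models tokenize images and video very differently, so treat the numbers as illustrative only.

```python
# Rough estimate of how visual inputs inflate an LLM's context window.
# Assumes a ViT-style tokenizer where every 16x16-pixel patch becomes one
# token; real multimodal models vary widely, so these figures are illustrative.

PATCH_PX = 16  # assumed patch edge length in pixels

def image_tokens(width_px: int, height_px: int) -> int:
    """Token count for a single image under the patch assumption."""
    return (width_px // PATCH_PX) * (height_px // PATCH_PX)

def video_tokens(width_px: int, height_px: int, fps: float, seconds: float) -> int:
    """Token count for video sampled at a given frame rate."""
    return int(image_tokens(width_px, height_px) * fps * seconds)

print(image_tokens(1024, 1024))        # 4096 tokens for one square image
print(video_tokens(1280, 720, 2, 60))  # 432000 tokens for 1 min of 720p at 2 fps
```

Under these assumptions, a single minute of sparsely sampled 720p video already consumes more context than most text-only deployments are provisioned for.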

In addition to video codecs, embedded DSPs can be useful for pre-processing incoming data for inference, Driggers said.

“More and more processors are multi-core, with diverse cores,” he said. “In the next generation, I think we’ll see AMD is extremely well positioned to put a mixture of different cores into a [large] inference processor.”

Cirrascale has more than 1,000 AMD MI300s up and running in clusters, with some MI350s and MI355s also used for inference (training remains a challenge for AMD's ROCm software stack, Driggers said).
Network switches, NICs, and GPUs must be tightly integrated to move data efficiently, so Driggers is eagerly anticipating closer integration between AMD's big GPUs and its Pensando NICs. Until then, “the jury is still out” on how well AMD can compete with Nvidia's already vertically integrated offerings, he said.

Startup hardware

Cirrascale’s heterogeneous vision means the company has teamed with several other AI accelerator companies over the years, including Graphcore, Cerebras, and Qualcomm (the company currently runs Qualcomm’s Cloud AI Playground in its cloud on AI100s).

“We’ll certainly try to work with any new startup that’s got a product that we think is interesting,” Driggers said, noting that Cirrascale is working with SambaNova, Tenstorrent, and others.

Some of these AI chip companies are shifting their own business models closer to Cirrascale's as they begin to sell tokens directly to end users via an API, rather than selling or renting hardware. Working with such companies can be a challenge, Driggers said.

“If a company goes down that road, [inference] as a service, how can I compete with them? I don’t want to compete against my suppliers,” he said.

Driggers noted that many chip startups' immature software stacks mean custom workloads often require significant manual tuning to run efficiently, or in some cases to run at all. Startups frequently invest time and resources in that tuning, and are understandably reluctant to hand it off to a third party like Cirrascale, because “I'm not going to do that at zero margin,” he said.

“I can understand why these companies are doing this right now,” Driggers said. “If [their tech] becomes easy to deploy so they don’t need to hold every customer’s hand, maybe they’ll go back to a model where they want to have this running all over the globe and they don’t want to be in the data center business or the services business.”

A Cirrascale data center (Source: Cirrascale)

Software remains the primary challenge for AI chip startups. For example, Cirrascale has tested Tenstorrent’s previous-gen Wormhole and current-gen Blackhole products, which only became competitive with the latest software release, Driggers said. Cirrascale plans to deploy Tenstorrent hardware for inference of specialized models that can take advantage of the hardware’s features, including MoE models and diverse workloads where the hardware needs to run more than one thing at a time.

“We like the [Tenstorrent] product, we’re excited about it, and we anticipate we’ll be productizing it fairly soon,” he added.

Workload orchestration

With an increasingly heterogeneous cloud offering to manage, Cirrascale is developing an in-house workload orchestration stack that assigns jobs to specific hardware, and to specific geographies, based on workload characteristics and customer requirements such as latency and the number of concurrent sessions.

Currently, customer models are initially tested on a “landing SKU” – usually an Nvidia B200 or B300, “because we know it’s going to run [on Nvidia],” Driggers said. This allows Cirrascale to profile the workload, taking the customer’s use case and requirements into account, to assign hardware and determine the customer’s cost per token.

“Depending on what they’re doing, it takes more or less compute,” he said. “Everybody has a price out there for what a standard Llama model costs, but as soon as you do anything to it, all bets are off – we don’t know, we’ve got to run it. The profiling process is partially automated today, but eventually it will be completely automated.”
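
A minimal sketch of the placement logic Driggers describes might look like the following: profile the workload on the landing SKU, then route it to the cheapest hardware pool that satisfies the customer's latency and concurrency requirements. Every class name, pool name, and figure here is hypothetical; Cirrascale has not published its orchestration internals.

```python
# Minimal sketch of landing-SKU-style placement: profile a workload, then
# route it to the cheapest hardware pool that satisfies the customer's
# latency and concurrency requirements. All names and figures here are
# hypothetical; Cirrascale has not published its orchestration internals.

from __future__ import annotations
from dataclasses import dataclass

@dataclass
class HardwareProfile:
    name: str
    max_sessions: int             # concurrent sessions the pool can sustain
    p99_latency_ms: float         # latency measured during profiling
    usd_per_million_tokens: float

def place(sessions: int, latency_budget_ms: float,
          pools: list[HardwareProfile]) -> HardwareProfile | None:
    """Pick the cheapest pool meeting the session and latency constraints."""
    feasible = [p for p in pools
                if p.max_sessions >= sessions and p.p99_latency_ms <= latency_budget_ms]
    return min(feasible, key=lambda p: p.usd_per_million_tokens, default=None)

pools = [
    HardwareProfile("large-HBM-pool", 512, 180.0, 0.60),  # illustrative only
    HardwareProfile("graphics-pool", 64, 90.0, 0.35),
]
print(place(sessions=32, latency_budget_ms=120.0, pools=pools))  # graphics-pool wins
```

In this toy example the big-memory pool is infeasible for a tight latency budget, so the smaller graphics pool wins despite its lower session ceiling, mirroring the trade-offs Driggers describes below.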

For example, AMD GPUs, with their substantial memory footprint, often suit large models and use cases that require a high number of sessions, while Nvidia RTX 6000 Pros are “phenomenal” for mid-sized models with a limited number of sessions, Driggers said.

“The key is being in the sweet spot,” he said. Economical tokens for tomorrow's workloads will depend on heterogeneous hardware, and on finding that sweet spot for every workload, every use case, and every server.



