Nvidia rival Cerebras says it's revived Moore's Law with third-gen waferscale chips

Startup is also working with Qualcomm on optimized models for its Cloud AI 100 Ultra inference chips

Cerebras revealed its latest dinner-plate-sized AI chip on Wednesday, claiming twice the performance per watt of its predecessor, alongside a collaboration with Qualcomm aimed at accelerating machine learning inferencing.

The chip, dubbed the WSE-3, is Cerebras' third-gen waferscale processor and measures in at a whopping 46,225mm2 (that's about 71.6 square inches in freedom units). The four-trillion-transistor part is fabbed on TSMC's 5nm process and is imprinted with 900,000 cores and 44GB of SRAM, good for 125 AI petaFLOPS of performance, which in this case refers to highly sparse FP16 — more on that in a minute.

Cerebras claims its CS-3 Systems are twice as fast as its predecessor (click to enlarge)

A single WSE-3 forms the basis of Cerebras' new CS-3 platform, which it claims offers 2x higher performance, while consuming the same 23kW as the older CS-2 platform. "So, this would be a true Moore's Law step," CEO Andrew Feldman boasted during a press briefing Tuesday. "We haven't seen that in a long time in our industry."

Compared to Nvidia's H100, the WSE-3 is roughly 57x larger and boasts roughly 62x the sparse FP16 performance. But considering the CS-3's size and power consumption, it might be more accurate to compare it to a pair of 8U DGX systems [PDF] with a total of 16 H100s inside. In this comparison, the CS-3 is still about 4x faster, but that's only when looking at sparse FP16 performance.

The lead over the two DGX H100 systems is even smaller – at 2x – when you take into account that Nvidia's chips support FP8. Though this wouldn't exactly be an apples to apples comparison.
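For the numerically inclined, those ratios fall straight out of the spec sheets. Here's a quick back-of-the-envelope check using Nvidia's published H100 SXM Tensor Core figures (treat the exact numbers as approximate):

```python
# Sanity-checking the comparisons above against published spec-sheet
# numbers. H100 SXM figures: ~1,979 teraFLOPS sparse FP16, ~3,958
# teraFLOPS sparse FP8 (approximate, per Nvidia's datasheet).
WSE3_SPARSE_FP16_PFLOPS = 125
H100_SPARSE_FP16_PFLOPS = 1.979
H100_SPARSE_FP8_PFLOPS = 3.958

vs_one_h100 = WSE3_SPARSE_FP16_PFLOPS / H100_SPARSE_FP16_PFLOPS
vs_16_h100_fp16 = WSE3_SPARSE_FP16_PFLOPS / (16 * H100_SPARSE_FP16_PFLOPS)
vs_16_h100_fp8 = WSE3_SPARSE_FP16_PFLOPS / (16 * H100_SPARSE_FP8_PFLOPS)

print(f"{vs_one_h100:.0f}x one H100 (sparse FP16)")       # ~63x
print(f"{vs_16_h100_fp16:.1f}x sixteen H100s (FP16)")     # ~3.9x
print(f"{vs_16_h100_fp8:.1f}x sixteen H100s (FP8)")       # ~2.0x
```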

One major advantage Cerebras has is memory bandwidth. Thanks to the 44GB of onboard SRAM — yes, you read that correctly — Cerebras' latest accelerator boasts 21PBps of memory bandwidth, compared to the 3.9TBps the H100's HBM3 maxes out at.

That's not to say Cerebras' systems are faster in every scenario. The company's performance claims rely heavily on sparsity.

While Nvidia is able to achieve a doubling in floating point operations using sparsity, Cerebras claims to have achieved a roughly 8x improvement.
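To see why pruning weights translates into fewer operations, consider a toy dot product that simply skips zeroed weights. This is a conceptual sketch only; Cerebras' cores harvest sparsity in hardware at the dataflow level, not like this:

```python
# Toy illustration of weight sparsity: with 7 out of every 8 weights
# zeroed (roughly the ratio implied by an ~8x claim), a dot product
# only needs to multiply-accumulate the surviving eighth of the terms.
import random

def sparse_dot(weights, activations):
    """Skip zero weights entirely; count the MACs actually performed."""
    total, macs = 0.0, 0
    for w, a in zip(weights, activations):
        if w != 0.0:
            total += w * a
            macs += 1
    return total, macs

random.seed(0)
n = 1024
# Keep one weight in eight; zero the rest.
weights = [random.gauss(0, 1) if i % 8 == 0 else 0.0 for i in range(n)]
acts = [random.gauss(0, 1) for _ in range(n)]

_, macs = sparse_dot(weights, acts)
print(f"{macs} MACs instead of {n}: {n // macs}x fewer")  # 128 vs 1024, 8x
```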

That means Cerebras' new CS-3 systems should be a little slower in dense FP16 workloads than a pair of DGX H100 servers consuming roughly the same amount of energy and space: somewhere around 15 petaFLOPS versus 15.8 petaFLOPS (16x the H100's 989 dense teraFLOPS). We've asked Cerebras for clarification on the CS-3's dense floating point performance; we'll let you know if we hear anything back.
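Our dense estimate is simple division, and it assumes Cerebras' roughly 8x sparsity figure applies uniformly to the 125 petaFLOPS headline number (an assumption on our part, since the company hasn't published an official dense figure):

```python
# Rough dense-FP16 arithmetic behind the estimate above.
H100_DENSE_FP16_TFLOPS = 989
dgx_pair_pflops = 16 * H100_DENSE_FP16_TFLOPS / 1000  # two DGX H100s
cs3_dense_pflops = 125 / 8  # assumes the ~8x sparsity gain applies uniformly

print(f"Two DGX H100s: {dgx_pair_pflops} PFLOPS")        # 15.824
print(f"CS-3 (estimated): {cs3_dense_pflops:.1f} PFLOPS")  # ~15.6
```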

Given how heavily those numbers lean on sparsity, we have a hard time imagining anyone opting for Cerebras' infrastructure if they couldn't take advantage of it. But even if you can't, the dense performance is pretty dang close.

Cerebras is already working to put its new systems to work in the third stage of its Condor Galaxy AI supercluster. Announced last year, Condor Galaxy is being developed in collaboration with G42 and will eventually span nine sites around the globe.

Cerebras' Condor Galaxy 1 system installed at Colovore's Santa Clara datacenter (click to enlarge)

The first two systems — CG-1 and CG-2 — were installed last year, each featuring 64 of Cerebras' CS-2 machines and delivering 4 AI exaFLOPS apiece.

On Wednesday, Cerebras revealed that CG-3 is destined for Dallas, Texas, and will implement the newer CS-3 platform, boosting the site's performance to 8 AI exaFLOPS. Assuming the remaining six sites also feature 64 CS-3s, the nine-site cluster would actually boast 64 AI exaFLOPS of collective compute rather than the 36 exaFLOPS of sparse FP16 initially promised.
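That revised total is straightforward arithmetic: two CS-2-based sites at 4 AI exaFLOPS each, plus seven CS-3-based sites at 8 apiece, assuming the remaining builds match CG-3:

```python
# Condor Galaxy totals, before and after the CS-3 upgrade.
originally_promised = 9 * 4          # nine sites of CS-2s at 4 exaFLOPS
revised = 2 * 4 + 7 * 8              # CG-1/CG-2 stay on CS-2s; the rest get CS-3s

print(originally_promised, revised)  # 36 vs 64 AI exaFLOPS
```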

However, it's worth noting that Cerebras' CS-3 isn't limited to clusters of 64. The company claims that it can now scale to up to 2,048 systems capable of pushing 256 AI exaFLOPS.

According to Feldman, such a system would be capable of training Meta's Llama 70B model in about a day.

Qualcomm, Cerebras collab on optimized inference

Alongside its next-gen accelerators, Cerebras also revealed it's working with Qualcomm to build optimized models for the Arm SoC giant's datacenter inference chips.

The two companies have been teasing the prospect of a collab since at least November, when a release revealing Qualcomm's Cloud AI 100 Ultra accelerator included a rather peculiar quote from Feldman praising the chip.

If you missed its launch, the 140W single-slot accelerator boasts 64 AI cores and 128GB of LPDDR4x memory capable of pushing 870 TOPS at Int8 precision and 548GB/s of memory bandwidth.

A few months later, a Cerebras blog post highlighted how Qualcomm was able to get a 10 billion parameter model running on a Snapdragon SoC.

The partnership, now official, will see the two companies work to optimize models for the AI 100 Ultra that take advantage of techniques like sparsity, speculative decoding, MX6, and network architecture search.

Under the partnership, Cerebras and Qualcomm will develop optimized models for the latter's AI 100 Ultra inference chips (click to enlarge)

As we've already established, sparsity, when properly implemented, has the potential to more than double an accelerator's performance. Speculative decoding, Feldman explains, is a process of improving the efficiency of the model in deployment by using a small, lightweight model to generate the initial response, and then using a larger model to check the accuracy of that response.

"It turns out, to generate text is more compute intensive than to check text," he said. "By using the big model to check it's faster and uses less compute."

The two companies are looking at MX6 to help reduce the memory footprint of models. MX6 is a form of quantization that can be used to shrink a model by compressing its weights to a lower precision. Meanwhile, network architecture search is a process of automating the design of neural networks for specific tasks in order to boost their performance.
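The general idea behind MX-style formats is shared-scale block quantization: a group of weights is stored as small integers plus one common scale factor. The sketch below illustrates that concept only; it is not the actual MX6 bit layout:

```python
# Conceptual shared-scale block quantization (NOT the real MX6 format):
# each block of weights keeps one scale and a list of low-bit integers.
def quantize_block(block, bits=4):
    scale = max(abs(x) for x in block) or 1.0
    qmax = 2 ** (bits - 1) - 1
    q = [round(x / scale * qmax) for x in block]   # small signed ints
    deq = [v * scale / qmax for v in q]            # reconstructed weights
    return q, scale, deq

weights = [0.8, -0.5, 0.1, 0.02, -0.9, 0.4, 0.05, -0.3]
q, scale, deq = quantize_block(weights)
err = max(abs(w - d) for w, d in zip(weights, deq))

# Eight 4-bit ints plus one shared scale, versus eight FP16 values.
print(q, round(scale, 2), round(err, 3))
```

Storing one scale per block instead of a full-precision value per weight is what shrinks the memory footprint, at the cost of a small, bounded reconstruction error.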

Combined, Cerebras claims these techniques contribute to a 10x improvement in performance per dollar. ®

Don't miss The Next Platform's breakdown of Cerebras' technologies right here.
