On-Prem

Cerebras to light up datacenters in North America and France packed with AI accelerators

Plus, startup's inference service makes debut on Hugging Face


Cerebras has begun deploying more than a thousand of its dinner-plate-sized accelerators across North America and parts of France as the startup looks to establish itself as one of the largest and fastest suppliers of AI inference services.

The expansion, confirmed at the HumanX AI conference in Las Vegas, will see Cerebras bring online new datacenters in Texas, Minnesota, Oklahoma, and Georgia by the end of this year, along with its first facilities in Montreal, Canada, and France.

Of these facilities, Cerebras will maintain full ownership of the Oklahoma City and Montreal sites, while the remainder are jointly operated under an agreement with Emirati financier G42 Cloud.

The largest of the US facilities will be located in Minneapolis, Minnesota, and will feature 512 of Cerebras's CS-3 AI accelerators totaling 64 exaFLOPS of sparse FP16 compute when it comes online in the second quarter of 2025.

Unlike many of the large-scale AI supercomputers and datacenter buildouts announced over the past year, Cerebras's facilities will be powered by its own in-house accelerators.

Announced a year ago this week, Cerebras's CS-3 systems feature a wafer-scale processor measuring 46,225 mm², which contains four trillion transistors spread across 900,000 cores and 44 GB of SRAM.

Next to the hundreds of thousands of GPUs that hyperscalers and cloud providers are already deploying, a thousand-plus CS-3s might not sound like much compute. But each one is capable of 125 petaFLOPS of highly sparse FP16 performance, compared with roughly 2 petaFLOPS for an H100 or H200 and 5 petaFLOPS for Nvidia's most powerful Blackwell GPUs.

When the CS-3 made its debut, Cerebras was still focused exclusively on model training. Since then, however, the company has expanded its offering to include inference, and it claims it can serve Llama 3.1 70B at up to 2,100 tokens a second.

This is possible, in part, because large language model (LLM) inference is primarily memory-bound, and while a single CS-3 doesn't offer much in the way of capacity, it makes up for that with memory bandwidth, which peaks at 21 petabytes per second. An H100, for reference, offers nearly twice the memory capacity but just 3.35 TBps of memory bandwidth. However, bandwidth alone only gets Cerebras to around 450 tokens a second.
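To illustrate why bandwidth dominates here, a back-of-envelope calculation (ours, not Cerebras's) treats per-user decode speed as memory bandwidth divided by the bytes of weights streamed per token. Real systems land well below this ceiling once compute, interconnect, and batching overheads are counted, which is why Cerebras's unassisted figure sits around 450 tokens a second.

```python
# Back-of-envelope ceiling for memory-bound decoding: bandwidth / bytes streamed
# per token. Figures are illustrative; real-world throughput is far lower once
# compute, interconnect, and scheduling overheads are included.
weights_gb = 70e9 * 2 / 1e9        # Llama 3.1 70B at FP16: ~140 GB of weights

chips = {
    "H100 (3.35 TB/s HBM3)": 3.35e3,   # bandwidth in GB/s
    "CS-3 (21 PB/s SRAM)": 21e6,       # bandwidth in GB/s
}

for name, bw_gbs in chips.items():
    ceiling = bw_gbs / weights_gb      # tokens per second for a single user
    print(f"{name}: ~{ceiling:,.0f} tokens/s theoretical ceiling")
```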

As we've previously discussed, the remaining performance comes from a technique called speculative decoding, which uses a small draft model to generate the initial output while a larger model acts as a fact-checker to preserve accuracy. So long as the draft model doesn't make too many mistakes, the speedup can be dramatic - up to a 6x increase in tokens per second, according to some reports.
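For readers who want to see the mechanics, the sketch below uses Hugging Face Transformers' assisted-generation path, where a small draft model proposes tokens and the larger target model verifies them. The model pairing is purely illustrative and says nothing about how Cerebras implements the technique on its own hardware.

```python
# Minimal sketch of speculative (assisted) decoding with Hugging Face Transformers.
# The draft/target pairing is illustrative only; any pair sharing a tokenizer works.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-70B-Instruct"   # large "fact-checker" model
draft_id = "meta-llama/Llama-3.2-1B-Instruct"     # small draft model

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one sentence.",
                   return_tensors="pt").to(target.device)

# The draft model proposes a short run of tokens; the target model scores them in
# a single forward pass and keeps the longest prefix it agrees with, so output
# quality matches the target while most tokens come from the cheap draft model.
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```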

Amid a sea of GPU bit barns peddling managed inference services, Cerebras is leaning heavily on its accelerator's massive bandwidth advantage and experience with speculative decoding to differentiate itself, especially as "reasoning" models like DeepSeek-R1 and QwQ become more common.

Because these models rely on chain-of-thought reasoning, a response could potentially require thousands of tokens of "thought" to reach a final answer depending on its complexity. So the faster you can churn out tokens, the less time folks are left waiting for a response, and, presumably, the more folks are willing to pay for the privilege.

Of course, with just 44 GB of memory per accelerator, supporting larger models remains Cerebras's sore spot. Llama 3.3 70B, for instance, requires at least four of Cerebras's CS-3s to run at full 16-bit precision. A model like Llama 3.1 405B – which Cerebras has demoed – would need more than 20 to run with a meaningful context size. As fast as Cerebras's SRAM might be, the company is still some way from serving multi-trillion-parameter models at anything close to the speeds it's advertising.
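To put numbers on that, here's our own weights-only arithmetic: the FP16 footprint divided by 44 GB of SRAM per CS-3, ignoring the KV cache and activations that push real deployments higher.

```python
# Weights-only sizing estimate (ours): FP16 footprint divided by 44 GB of SRAM
# per CS-3. KV cache and activations are ignored, which is why a 405B model with
# a usable context window needs more systems than this lower bound suggests.
import math

SRAM_GB_PER_CS3 = 44

for name, params_billion in [("Llama 3.3 70B", 70), ("Llama 3.1 405B", 405)]:
    weights_gb = params_billion * 2                    # 2 bytes per FP16 parameter
    systems = math.ceil(weights_gb / SRAM_GB_PER_CS3)
    print(f"{name}: ~{weights_gb} GB of weights -> at least {systems} CS-3s")
```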

With that said, the speed of Cerebras's inference service has already helped it win contracts with Mistral AI and, most recently, Perplexity. This week, the company announced yet another customer win with market intelligence platform AlphaSense, which, we're told, plans to swap three closed-source model providers for an open model running on Cerebras's CS-3s.

Finally, as part of its infrastructure buildout, Cerebras aims to extend API access to its accelerators to more developers through an agreement with model repo Hugging Face.

Cerebras's inference service is now available as part of Hugging Face's Inference Providers line-up, which provides access to a variety of inference-as-a-service providers, including SambaNova, TogetherAI, Replicate, and others, via a common interface and API. ®
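In practice, calling Cerebras through that line-up should look something like the snippet below, which uses the huggingface_hub client's provider routing. The provider string and model ID are our assumptions about how the integration is exposed rather than confirmed details.

```python
# Hedged sketch of calling a Cerebras-hosted model via Hugging Face's Inference
# Providers. The provider name ("cerebras") and model ID are assumptions; check
# the Hugging Face docs for what is actually available. Needs huggingface_hub >= 0.28.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="cerebras", api_key="hf_...")  # your HF token

response = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user",
               "content": "Why does memory bandwidth matter for LLM inference?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```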

