Cerebras's Condor Galaxy AI supercomputer takes flight carrying 36 exaFLOPS
Nine-site system built for UAE's G42, but there'll be plenty to spare
AI biz Cerebras has unveiled its Condor Galaxy supercomputer, a distributed cluster that, when complete, will span nine sites capable of 36 exaFLOPS of combined FP16 performance.
The company on Thursday revealed the first phase of the system, which is built from Cerebras's CS-2 accelerators for the United Arab Emirates' G42, a multinational conglomerate with an interest in AI research and development.
Cerebras's accelerators aren't like the GPUs or AI accelerators that you'll find in most AI clusters today. They don't come as PCIe cards or SXM modules like Nvidia's H100.
Instead, the company's WSE-2 chips are massive, dinner-plate-sized affairs, each of which houses 850,000 cores and 40GB of SRAM capable of 20PBps of bandwidth. That's an order of magnitude faster than the HBM typical of other accelerators. Each of these wafers packs a dozen 100Gbps interfaces that allow a cluster to be extended to as many as 192 systems.
Condor Galaxy coalesces
In its current form, Condor Galaxy 1 (CG-1) spans 32 racks, each of which is equipped with one of the chipmaker's waferscale CS-2 accelerators, making it twice the size of Cerebras's Andromeda system we looked at last year.
For the moment, CG-1 has 32 of these systems, which are fed by 36,352 AMD Epyc cores. Assuming Cerebras has stuck with AMD's 64-core CPUs, that works out to 568 sockets. We've asked Cerebras for clarification as this doesn't divide neatly into 32 racks, although some systems in the cluster no doubt fill ancillary roles.
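For anyone who wants to check the socket math, here is a minimal sketch. The core count and rack count are the figures quoted above; the 64-core part is the same assumption we made, not something Cerebras has confirmed.

```python
# Figures quoted in the article
total_cores = 36_352
racks = 32

# Assumption (unconfirmed by Cerebras): every CPU is a 64-core Epyc part
cores_per_socket = 64

# How many sockets that implies, and whether any cores are left over
sockets, leftover_cores = divmod(total_cores, cores_per_socket)

# Average sockets per rack, and the whole-number split with its remainder
sockets_per_rack = sockets / racks
whole_per_rack, extra_sockets = divmod(sockets, racks)

print(sockets, leftover_cores)        # 568 0
print(sockets_per_rack)               # 17.75
print(whole_per_rack, extra_sockets)  # 17 24
```

The fractional 17.75 sockets per rack is the uneven split in question: 17 sockets in each rack would leave 24 unaccounted for, which fits the idea that some hosts serve ancillary roles.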
Put together, the machine packs 41TB of memory – though the WSE-2 wafer's SRAM only accounts for 1.28TB of that – 194 Tbps of internal bandwidth, and peak performance of two exaFLOPS. But before you get too excited, we'll remind you that these aren't the same exaFLOPS we expect to see from Argonne's newly completed Aurora supercomputer.
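The SRAM figure follows directly from the per-wafer capacity. As a rough sketch, assuming decimal units (1TB = 1,000GB) and one WSE-2 per system:

```python
# Figures quoted in the article
systems = 32             # CS-2 systems currently in CG-1
sram_gb_per_wafer = 40   # SRAM on each WSE-2
total_memory_tb = 41     # total memory quoted for the machine

# Assumption: decimal units, 1 TB = 1,000 GB
sram_tb = systems * sram_gb_per_wafer / 1000
sram_share = sram_tb / total_memory_tb

print(sram_tb)              # 1.28
print(f"{sram_share:.1%}")  # 3.1%
```

In other words, on-wafer SRAM is only about three percent of the machine's total memory.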
HPC systems are measured in double precision (FP64), often using the LINPACK benchmark. AI systems, on the other hand, don't benefit from this level of precision and can get away with FP32, FP16, FP8, and sometimes even Int8 calculations. In this case, Cerebras's systems achieve their most flattering figures in FP16 with sparsity.
While two exaFLOPS of FP16 is impressive on its own, this is only half the setup. When complete, the roughly $100 million system will span 64 racks each with a CS-2 accelerator.
We're told the system should scale linearly so that the complete cluster will deliver four exaFLOPS of sparse FP16 performance – four times that of Andromeda. Cerebras expects to complete installation of the final 32 racks within the next three months.
Cerebras's completed Condor Galaxy 1 supercomputer will span 64 racks, each equipped with its waferscale accelerators
A distributed AI supercomputer
Of course, four exaFLOPS of AI performance commands a substantial amount of power and thermal management. Assuming linear scaling from Andromeda, we estimate the system is capable of drawing upwards of two megawatts.
Because of this, Cerebras is housing the system at Colovore's Santa Clara facility. The colocation provider specializes in high-performance compute and AI/ML applications, and recently revealed racks capable of cooling up to 250 kilowatts.
"This is the first of three US-based massive supercomputers that we will build with them in the next year," Cerebras CEO Andrew Feldman told The Register.
Using CG-1 as a template, two more US-based sites will be built: one in Asheville, North Carolina (CG-2), and another in Austin, Texas (CG-3), with completion slated for the first half of 2024. These systems will then be networked to allow the distribution of models across sites, which Feldman insists is possible for certain large, latency-tolerant workloads.
"Latency is a problem for some problems, not all. In the high-performance compute world, it's a giant problem," he said. "I think there are many AI workloads for which it's not a problem. There are some that we won't distribute. I think we'll do this thoughtfully and carefully."
The chipmaker is also careful to note that the system will be operated under US law and will not be made available to adversary states. This is likely a reference to US trade policy governing the export of AI chips to certain countries including Russia, China, and North Korea, among others.
However, Feldman claims the decision to build the systems in the US was motivated by a desire to move quickly. "I think standing the first three in the US was a function of a desire for time to market," he said. "I think it was a desire for G42 to expand beyond the Middle East."
The final stage will see Cerebras construct an additional six sites – the locations of which have not yet been disclosed – using CG-1 as a template. The complete Condor Galaxy system will feature 576 CS-2 accelerators capable of a claimed 36 exaFLOPS of sparse FP16 performance, though we don't expect to see many, if any, workloads spanning the entire nine-site constellation. Cerebras aims to complete installation of all nine sites by the end of 2024.
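The headline numbers can be reproduced from the per-site figures. This is a back-of-the-envelope sketch, not a benchmark: it assumes the perfectly linear scaling Cerebras claims and derives a per-system figure from the current 32-system, two-exaFLOPS configuration.

```python
# Figures quoted in the article
SITES = 9
SYSTEMS_PER_SITE = 64  # a completed CG-1, used as the template for every site

# Assumption: linear scaling, per-system figure derived from 32 systems = 2 exaFLOPS
EXAFLOPS_PER_SYSTEM = 2 / 32  # 0.0625 exaFLOPS of sparse FP16 per CS-2


def cluster_exaflops(systems: int) -> float:
    """Claimed sparse FP16 exaFLOPS for a cluster of the given size."""
    return systems * EXAFLOPS_PER_SYSTEM


total_systems = SITES * SYSTEMS_PER_SITE

print(cluster_exaflops(SYSTEMS_PER_SITE))  # 4.0
print(total_systems)                       # 576
print(cluster_exaflops(total_systems))     # 36.0
```

Whether real workloads see anything like linear scaling across sites is another matter, given the latency caveats Feldman himself raises.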
Availability
While Cerebras will operate and manage the systems, they're owned by G42, which plans to use the systems for its internal workloads. Specifically, Cerebras says it is working with three of the multinational's divisions, including G42 Cloud, the International Institute for AI (IIAI), and G42 Health.
"They partnered with us because we could build and manage big supercomputers, that we could implement massive generative AI models, and that we had a lot of experience cleaning and manipulating very, very large datasets," Feldman said. "They have vast internal demand for compute among their portfolio companies. But with very big models, with very big compute, there's a bin packing problem. There's always an opportunity to slide in other workloads."
And this means that any leftover resources not consumed by G42 will be made available to customers of both G42 and Cerebras. For Cerebras, this is critical as Feldman notes that the company's cloud is already at capacity.
For Feldman and his company, the collaboration with G42 is an opportunity to expose more people to Cerebras's architecture and compete more aggressively with Nvidia, which holds an outsized share of the market for AI accelerators. "Nobody buys your stuff without jumping on your cloud and testing and showing and demonstrating," Feldman added. ®