Amazon built a massive AI supercluster for Anthropic called Project Rainier – here's what we know so far

It's almost like AWS is building its own Stargate

deep dive Amazon Web Services (AWS) is in the process of building out a massive supercomputing cluster containing "hundreds of thousands" of accelerators that promises to give its model building buddies at Anthropic a leg up in the AI arms race.

The system, dubbed Project Rainier, is set to come online later this year with compute spanning multiple sites across the US. Gadi Hutt, the director of product and customer engineering at Amazon's Annapurna Labs, tells El Reg that one site in Indiana will span thirty datacenters at 200,000 square feet apiece. This facility alone was recently reported to consume upwards of 2.2 gigawatts of power.

But unlike OpenAI's Stargate, xAI's Colossus, or AWS' own Project Ceiba, this system isn't using GPUs. Instead, Project Rainier will represent the largest deployment of Amazon's Annapurna AI silicon ever.

"This is the first time we are building such a large-scale training cluster that will allow a customer, in this case Anthropic, to train a single model across all of that infrastructure," Hutt said. "The scale is really unprecedented."

Amazon, in case you've forgotten, is among Anthropic's biggest backers, having already invested $8 billion in the OpenAI rival.

Amazon isn't ready to disclose the full scope of the project, and since it's a multi-site build akin to Stargate rather than a singular AI factory like Colossus, Project Rainier may not have a fixed upper bound. All of this assumes, of course, that the economic conditions that gave rise to the AI boom don't fizzle out.

However, we're told that Anthropic has already managed to get its hands on a sliver of the system's compute.

While we don't know just how many Trainium chips or datacenters will ultimately power Project Rainier and probably won't until re:Invent in November, we do have a pretty good idea of what it'll look like. So, here's everything we know about Project Rainier so far.

The basic unit of compute

Each Trainium2 package features a pair of 5nm compute dies flanked on either side by high-bandwidth memory

The heart of Project Rainier is Annapurna Labs' Trainium2 accelerator, which it let loose on the web back in December.

Despite what its name might suggest, the chip can be used for both training and inference workloads, which will come in handy for customers using reinforcement learning (RL), as we saw with DeepSeek R1 and OpenAI's o1, to imbue their models with reasoning capabilities.

"RL as a workload has a lot of inference built into it because we need to verify the results during the steps of training," Hutt said.

The chip itself features a pair of 5nm compute dies glued together using TSMC's chip-on-wafer-on-substrate (CoWoS) packaging tech and fed by four HBM stacks. Combined, each Trainium2 accelerator offers 1.3 petaFLOPS of dense FP8 performance, 96GB of HBM, and 2.9TB/s of memory bandwidth.

On its own, the chip doesn't look all that competitive. Nvidia's B200, for instance, boasts 4.5 petaFLOPS of dense FP8, 192GB of HBM3e, and 8TB/s of memory bandwidth.

Support for 4x sparsity, which can dramatically speed up AI training workloads, does help Trainium2 close the gap somewhat, boosting its FP8 performance to 5.2 petaFLOPS, but it still falls behind the B200's 9 petaFLOPS of sparse compute at the same precision.
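
For those keeping score, here's a quick back-of-envelope sketch of those per-chip figures in Python. The numbers are the ones quoted above, and the naive 4x sparsity multiplier is our assumption – real-world gains depend on how much of a given model can actually be pruned.

```python
# Back-of-envelope per-chip comparison using the figures quoted above.
# The 4x sparsity multiplier is applied naively; real-world gains depend on
# how much of the model can actually be pruned to the supported pattern.

trainium2 = {"dense_fp8_pflops": 1.3, "hbm_gb": 96, "mem_bw_tbs": 2.9}
b200 = {"dense_fp8_pflops": 4.5, "hbm_gb": 192, "mem_bw_tbs": 8.0}

SPARSITY_SPEEDUP = 4     # structured-sparsity multiplier claimed for Trainium2
B200_SPARSE_FP8 = 9.0    # Nvidia's quoted sparse FP8 figure, petaFLOPS

trn2_sparse = trainium2["dense_fp8_pflops"] * SPARSITY_SPEEDUP  # 5.2 petaFLOPS
print(f"Trainium2 sparse FP8: {trn2_sparse:.1f} vs B200: {B200_SPARSE_FP8:.1f} petaFLOPS")
print(f"Dense FP8 gap: {b200['dense_fp8_pflops'] / trainium2['dense_fp8_pflops']:.1f}x in Nvidia's favor")
```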

Trn2

While Trainium2 may look a little anemic in a chip-for-chip comparison with Nvidia's latest accelerators, that doesn't tell the full story.

Unlike the H100 and H200-series GPUs, Nvidia's B200 only comes in an eight-way HGX form factor. Similarly, AWS' minimum configuration for Trainium2, which it refers to as its Trn2 instances, has 16 accelerators.

Here's what the multi-blade system powering Amazon's Trn2 instances looks like

"When you're talking about large clusters, it's less important what a single chip provides you, it's more of what is called 'good put,'" Hutt explains. "What's your good throughput of training which also takes into account downtime? … I don't see a lot of talk about this in the industry, but this is the metric that customers are looking at."

Compared against Nvidia's HGX B200 systems, the gap is far narrower. The Blackwell-based parts still have an advantage when it comes to memory bandwidth and dense FP8 compute, which are key indicators of inference performance.

For training workloads, Amazon's Trn2 instances do have a bit of an advantage as they — at least on paper — offer higher sparse floating-point performance at FP8. Yes, Nvidia's Blackwell chips do support 4-bit floating point precision, but we've yet to see anyone train a model at that precision. Sparse compute is most useful when large volumes of data are expected to have values of zero. As a result, sparsity isn't usually that helpful for inference, but can make a big difference in training.
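
If structured sparsity is unfamiliar, here's a minimal, purely illustrative NumPy sketch of the idea: keep only the largest-magnitude weights in each small group and zero out the rest so the hardware can skip them. The one-in-four pattern below is an assumption made purely for illustration – Amazon hasn't spelled out which pattern Trainium2's sparsity mode actually requires.

```python
import numpy as np

def prune_groups(weights: np.ndarray, keep: int = 1, group: int = 4) -> np.ndarray:
    """Zero out all but the `keep` largest-magnitude values in each group of
    `group` consecutive weights - the general idea behind structured sparsity.
    The 1-of-4 pattern used here is illustrative only."""
    w = weights.reshape(-1, group).copy()
    # indices of the (group - keep) smallest entries per group, by magnitude
    drop = np.argsort(np.abs(w), axis=1)[:, : group - keep]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(8, 16).astype(np.float32)
w_sparse = prune_groups(w)
print(f"non-zero fraction after pruning: {np.count_nonzero(w_sparse) / w_sparse.size:.2f}")  # 0.25
```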

With that out of the way, here's a quick look at how Nvidia's Blackwell B200 stacks up against AWS' Trn2 instances:

                 Trn2                            DGX B200
CPUs:            2x 48C Intel Sapphire Rapids    2x 56C Intel Emerald Rapids
System Mem:      2TB DDR5                        Up to 4TB
Accelerators:    16x Trainium2                   8x B200 GPUs
HBM:             1536GB                          1440GB
Memory BW:       46.4TB/s                        64TB/s
Interconnect BW: 16TB/s                          14.4TB/s
Scale-out BW:    3.2Tbps EFAv3                   3.2Tbps InfiniBand
Dense FP4:       N/A                             72 petaFLOPS
Dense FP8:       20.8 petaFLOPS                  36 petaFLOPS
Sparse FP8:      83.2 petaFLOPS                  72 petaFLOPS
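
If you want to sanity-check that Trn2 column, the aggregate figures are just the per-chip specs multiplied by 16 – something like this quick sketch. Deliverable throughput in the real world will, of course, depend on interconnect and software efficiency.

```python
# Scale the per-chip Trainium2 figures up to a 16-accelerator Trn2 instance.
# These are simple multiplications of published per-chip numbers; real
# deliverable throughput depends on interconnect and software efficiency.

CHIPS_PER_INSTANCE = 16

per_chip = {
    "hbm_gb": 96,
    "mem_bw_tbs": 2.9,
    "dense_fp8_pflops": 1.3,
    "sparse_fp8_pflops": 5.2,
    "neuronlink_tbs": 1.0,  # chip-to-chip bandwidth per accelerator
}

instance = {key: value * CHIPS_PER_INSTANCE for key, value in per_chip.items()}
for key, value in instance.items():
    print(f"{key}: {value:g}")
# hbm_gb: 1536, mem_bw_tbs: 46.4, dense_fp8_pflops: 20.8,
# sparse_fp8_pflops: 83.2, neuronlink_tbs: 16
```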

Looking closer at each Trn2 instance, the 16 chips are spread across eight compute blades (two Trainium2s apiece), which are managed by a pair of x86 CPUs from Intel. In this respect, the architecture is somewhat reminiscent of Nvidia's NVL72 rack systems.

However, rather than a switched all-to-all topology like we see in the NVL72, the chips in each Trn2 instance are connected in a 4x4 2D torus using AWS' high-speed NeuronLink v3 interconnect. This topology eliminates the need for high-speed switching, but does add an additional hop or two of latency for chip-to-chip communication.

Here's what AWS' Trn2 interconnect topology looks like

This intra-instance interconnect, which you can think of in the same vein as Nvidia's NVLink or AMD's Infinity Fabric, provides 1TB/s of chip-to-chip bandwidth to each accelerator in the Trn2 instance.
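
To get a feel for what that torus means for latency, here's a minimal sketch that counts hops between chips on a 4x4 wraparound grid. The grid coordinates and shortest-path routing are illustrative assumptions on our part.

```python
from itertools import product

DIM = 4  # 4x4 grid of Trainium2 chips in one Trn2 instance

def torus_hops(a, b, dim=DIM):
    """Minimum hop count between two chips on a 2D torus with wraparound links."""
    return sum(min(abs(p - q), dim - abs(p - q)) for p, q in zip(a, b))

chips = list(product(range(DIM), repeat=2))
distances = [torus_hops(a, b) for a in chips for b in chips if a != b]
print(f"max hops: {max(distances)}, average hops: {sum(distances) / len(distances):.2f}")
# max hops: 4, average hops: 2.13 - versus a single hop through a switch,
# which is the latency trade-off mentioned above
```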

Reaching rack scale

Amazon's Trn2 UltraServer meshes together four Trn2 systems into a single 64-chip compute domain

Four Trn2 systems can then be meshed together using NeuronLink to expand the compute domain from 16 chips to 64, in a configuration AWS is calling an UltraServer.

This is achieved by essentially stacking each Trn2 system on top of one another to form a 3D torus, which, if you're having a hard time imagining it, looks a bit like this:

Here's how the Trainium2 accelerators connect to one another in Amazon's Trn2 UltraServer

According to Amazon's docs, the inter-instance bandwidth provided by NeuronLink between the Trn2 instances is a fair bit lower at 256GB/s of bandwidth per chip.
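
Extending the same idea to the UltraServer, each chip keeps its four in-instance torus neighbors and picks up two more in the adjacent instances along the third axis. The sketch below simply tallies those links against the two bandwidth tiers; how the per-chip budget is actually split across individual links isn't something Amazon has detailed.

```python
# Sketch of the UltraServer's 4x4x4 torus: each chip keeps its four 2D-torus
# neighbors inside its own Trn2 instance and gains two more in the adjacent
# instances along the third axis. The bandwidth figures are the per-chip
# aggregates from the article; the per-link split is an assumption.

DIM = 4
INTRA_BW_GBS = 1000  # ~1TB/s of NeuronLink within an instance, per chip
INTER_BW_GBS = 256   # 256GB/s between instances, per chip

def neighbors(chip):
    x, y, z = chip
    intra = [((x + d) % DIM, y, z) for d in (-1, 1)] + [(x, (y + d) % DIM, z) for d in (-1, 1)]
    inter = [(x, y, (z + d) % DIM) for d in (-1, 1)]
    return intra, inter

intra, inter = neighbors((0, 0, 0))
print(f"intra-instance neighbors: {len(intra)}, sharing {INTRA_BW_GBS}GB/s")
print(f"inter-instance neighbors: {len(inter)}, sharing {INTER_BW_GBS}GB/s")
```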

Once again, this chip-to-chip mesh is achieved without switches, which has the benefit of lower power consumption. This, along with the lower compute density afforded by distributing the system across two racks, has allowed AWS to get away with air cooling – something that can't be said of the NVL72 systems it's deploying as part of Project Ceiba.

                 Trn2 UltraServer                DGX GB200 NVL72
CPUs:            8x 48C Intel Sapphire Rapids    36x 72C Nvidia Grace
System Mem:      8TB DDR5                        17TB LPDDR5X
Accelerators:    64x Trainium2                   72x Blackwell GPUs
HBM:             6.1TB                           13.4TB
Memory BW:       186TB/s                         576TB/s
Interconnect BW: 68TB/s                          130TB/s
Scale-out BW:    12.8Tbps EFAv3                  28.8Tbps InfiniBand
Dense FP4:       N/A                             720 petaFLOPS
Dense FP8:       83.2 petaFLOPS                  360 petaFLOPS
Sparse FP8:      332.8 petaFLOPS                 720 petaFLOPS
Rack power:      Unknown                         120kW

As you can see, the NVL72 is still faster than Amazon's Trn2 UltraServer, but as Hutt points out, the cost of that compute also needs to be taken into consideration. "The thing that customers ask us for isn't 'give us the fastest chips or the most complex chips.' Customers care about performance and performance at the lowest cost, and of course it has to be easy to use as well."

At the end of the day, customers consume Trainium as software APIs in the cloud, he added.

These UltraServers are the key unit of compute that Amazon will essentially copy and paste as it builds out the complete Project Rainier "UltraCluster."

This scalability will be achieved using Amazon's custom EFAv3 network, and we're told each accelerator in the cluster will be equipped with 200Gbps of network bandwidth. That means each Trn2 UltraServer will have 12.8Tbps of connectivity, courtesy of Annapurna's custom Nitro data processing units, to keep all those chips fed with training data.

This isn't your typical Ethernet network, either. Amazon has developed a custom fabric that it says will deliver tens of petabits of bandwidth (from what we understand, this is going to vary depending on the number of UltraServers in the cluster) with under 10 microseconds of latency across the network.
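
As a rough back-of-envelope, here's how those EFAv3 numbers roll up. The 4,000-UltraServer count is purely illustrative – Amazon hasn't said how many UltraServers the finished cluster will contain.

```python
# How the EFAv3 numbers roll up. The 4,000-UltraServer count is purely
# illustrative (it works out to 256,000 chips); Amazon hasn't said how many
# UltraServers the finished cluster will contain.

PER_CHIP_GBPS = 200
CHIPS_PER_ULTRASERVER = 64
ILLUSTRATIVE_ULTRASERVERS = 4_000

per_ultraserver_tbps = PER_CHIP_GBPS * CHIPS_PER_ULTRASERVER / 1_000     # 12.8 Tbps
cluster_pbps = per_ultraserver_tbps * ILLUSTRATIVE_ULTRASERVERS / 1_000  # 51.2 Pbps
total_chips = CHIPS_PER_ULTRASERVER * ILLUSTRATIVE_ULTRASERVERS          # 256,000
print(f"per UltraServer: {per_ultraserver_tbps} Tbps of scale-out bandwidth")
print(f"{total_chips:,} chips across {ILLUSTRATIVE_ULTRASERVERS:,} UltraServers: {cluster_pbps} Pbps")
```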

And Amazon is clearly prepared for some seriously crowded network cabinets. At re:Invent last year, the cloud titan detailed the lengths it had gone to in order to keep its network cabinets from turning into a rat's nest of fiber optic cables. This included developing a fiber-optic trunk line that crams hundreds of fiber pairs into what can best be described as a photonic rope.

Scaling out

As we mentioned earlier, Amazon has been rather vague about just how big Project Rainier will ultimately be. It has previously boasted that the system will contain hundreds of thousands of Trainium2 chips.

Amazon is going to need a lot of Trn2 UltraServers to reach the hundreds of thousands of Trainium2 chips promised.

In its most recent blog post it said that "when you connect tens of thousands of these UltraServers and point them all at the same problem, you get Project Rainier."

Even just 10,000 UltraServers would equate to 640,000 accelerators. Considering that a million-accelerator cluster would make for far better headlines, we're going to assume the authors meant to say Trn2 instances, not UltraServers.

With six million square feet of floor space, we don't expect space will be the limiting factor. Having said that, we don't expect Amazon's Indiana campus is being built exclusively for Project Rainier. We have to imagine that a lot of that space will be taken up by conventional IT equipment, like storage arrays, switches, x86 and Graviton CPUs running virtualization and container workloads, and probably a healthy number of GPUs too.

Amazon hasn't said just how much power its chips consume, but assuming it's around 500 watts, we estimate a cluster of 256,000 Trainium2 accelerators would need somewhere between 250 and 300 megawatts of power. For reference, that's roughly on par with xAI's Colossus supercomputer, which contains 200,000 Hopper GPUs.
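
For the curious, here's one way to get from that 500 watt assumption to the 250-300 megawatt range: the accelerators alone work out to roughly 128 MW, and an overhead multiplier for host CPUs, networking, storage, and cooling (our assumption) covers the rest.

```python
# Reconstructing the power estimate above. The 500W-per-chip figure is an
# assumption (Amazon hasn't published a TDP), and the facility overhead
# multiplier for host CPUs, networking, storage, and cooling is ours too.

CHIPS = 256_000
WATTS_PER_CHIP = 500       # assumed accelerator power draw
OVERHEAD = (2.0, 2.3)      # assumed facility-level multiplier over chip power

chip_mw = CHIPS * WATTS_PER_CHIP / 1e6  # 128 MW for the accelerators alone
low, high = (chip_mw * factor for factor in OVERHEAD)
print(f"accelerators: {chip_mw:.0f} MW, facility estimate: {low:.0f}-{high:.0f} MW")
```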

Is a Project Rainier 2.0 on the way?

So far, all of Amazon's messaging has Trainium2 powering Project Rainier, but with its third-gen accelerators just a few months away, we wouldn't be surprised to find out at least some of the sites end up using the newer and much more powerful chip.

Teased by the Annapurna Labs team at re:Invent last year, the chip will be built on TSMC's 3nm process node, and promises to deliver 40 percent better efficiency than the current generation. Amazon also expects its Trainium3-based UltraServers to deliver about 4x the performance of its Trn2-based systems.

That means we can expect each Trn3 UltraServer to deliver about 332.8 petaFLOPS of dense FP8 or about 1.33 exaFLOPS with sparsity enabled. That assumes that Annapurna isn't dipping into lower precision datatypes like FP6 or FP4 to achieve those performance gains. But beyond these performance metrics, details remain rather light.
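
Those projections fall straight out of applying the claimed 4x gain to the Trn2 UltraServer figures above – assuming, as noted, that the uplift is realized at FP8 rather than by dropping to FP6 or FP4.

```python
# Projecting Trainium3 UltraServer throughput from Amazon's 4x claim, assuming
# the gain is realized at FP8 rather than by dropping to FP6 or FP4 precision.

TRN2_ULTRASERVER_DENSE_FP8 = 83.2     # petaFLOPS
TRN2_ULTRASERVER_SPARSE_FP8 = 332.8   # petaFLOPS
CLAIMED_GAIN = 4

dense_pflops = TRN2_ULTRASERVER_DENSE_FP8 * CLAIMED_GAIN           # 332.8 petaFLOPS
sparse_eflops = TRN2_ULTRASERVER_SPARSE_FP8 * CLAIMED_GAIN / 1000  # ~1.33 exaFLOPS
print(f"Trn3 UltraServer (projected): {dense_pflops:.1f} PFLOPS dense, {sparse_eflops:.2f} EFLOPS sparse FP8")
```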

There certainly is precedent to support a last-minute change.

As you may recall, Amazon's Project Ceiba was originally supposed to use Nvidia's Grace Hopper superchips, but ultimately ended up going with the much more powerful Blackwell accelerators instead.

Amazon can only talk about chips and systems that have been released, and while we already know a fair bit about Trainium3, it'll be a while before you can start deploying workloads to them. ®
