AMD's MI355X is a 1.4 kW liquid-cooled monster built to battle Nvidia's Blackwell

And the House of Zen wants to put 128 of them in your rack

Nvidia's Blackwell accelerators have been on the market for just over six months, and AMD says it's already achieved performance parity with the launch of its MI350-series GPUs on Thursday.

Based on all-new CDNA 4 and refined chiplet architectures, the GPUs seek to shake Nvidia's grip on the AI infrastructure market, boasting up to 10 petaFLOPS of dense FP4 (double that if you manage to find a workload that can take advantage of sparsity) on the MI355X, 288 GB of HBM3E, and 8 TBps of memory bandwidth.

For those keeping score, AMD's latest Instincts aim to match Nvidia's most powerful Blackwell GPUs on floating point performance and memory bandwidth – two of the most important metrics when it comes to AI training and inference.

This bears out in AMD's benchmarks, which show a pair of MI355Xs going toe-to-toe with Nvidia's dual-GPU GB200 Superchip in Llama 3.1 405B. As with all vendor-supplied benchmarks, take these with a grain of salt.

Here's a quick rundown of AMD's MI350-series parts

In fact, at least on paper, AMD's latest chips aren't that far behind Nvidia's 288 GB Blackwell Ultra GPUs announced this spring. When those parts start shipping next quarter, they'll not only close the gap on memory capacity, but will also offer up to 50 percent higher performance than AMD's MI350 series, albeit only for dense FP4. At FP8, FP16, and BF16, AMD and Nvidia are in a dead heat.

And speaking of heat, at 1.4 kW, you'll need a liquid-cooling loop to tame the MI355X's matrix cores and unleash its full potential.

For those for whom liquid cooling isn't practical, AMD is also offering the MI350X, which trades about 8 percent of peak performance for a slightly more reasonable 1 kW TDP. However, in the real world, we're told that the performance delta is actually closer to 20 percent as the liquid-cooled part's larger power limit allows it to boost higher for longer.

On that note, let's take a closer look at the silicon powering AMD's latest Instincts.

Picking apart AMD's next-gen silicon sandwich

Peel back the heat spreader of either chip and you'll find a familiar assortment of compute dies surrounded by high-bandwidth memory.

To the untrained eye, the MI350 series' bare silicon looks a heck of a lot like Nvidia's Blackwell or even Intel's Gaudi3. This is just what AI accelerators look like in 2025. However, as is often the case, looks can be misleading, and that's certainly true of AMD's Instinct line.

Rather than the two reticle-sized compute dies we see in Intel's and Nvidia's accelerators, AMD's Instinct accelerators use a combination of TSMC's 2.5D packaging and 3D hybrid bonding tech to stitch multiple smaller compute and I/O chiplets into one big silicon subsystem.

AMD's MI350-series GPUs feature eight XCD GPU tiles stacked atop a pair of I/O dies and fed by eight HBM3E modules totaling 288 GB of capacity

In the case of the MI350 series, the packaging is quite similar to what we saw with the original MI300X back in 2023. It features eight XCD GPU dies, fabbed using TSMC's 3nm process tech, vertically stacked on top of a pair of 6nm I/O dies.

Each compute chiplet now features 36 CDNA 4 compute units (CUs), 32 of which are actually active, backed by 4 MB of shared L2 cache, for a total of 256 active CUs across the eight chiplets. The chip's 288 GB of HBM3E memory, meanwhile, is backed by 256 MB of Infinity Cache.
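The die-level math works out as follows. This is a quick sketch using only the per-chiplet figures AMD quoted; the variable names are our own, not AMD's:

```python
# Per-package totals for one MI350-series GPU, derived from the
# per-chiplet figures above. Names are illustrative, not AMD's.
XCD_COUNT = 8            # compute chiplets per package
ACTIVE_CUS_PER_XCD = 32  # 36 physical CDNA 4 CUs, 32 enabled
L2_PER_XCD_MB = 4        # shared L2 cache per chiplet

total_cus = XCD_COUNT * ACTIVE_CUS_PER_XCD  # 256 active CUs
total_l2_mb = XCD_COUNT * L2_PER_XCD_MB     # 32 MB of L2, alongside
                                            # the 256 MB Infinity Cache
print(total_cus, total_l2_mb)
```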

Here's a closer look at how AMD's MI350-series chips are laid out

Meanwhile, the Infinity Fabric Advanced Package interconnect used to shuttle data between the I/O dies has been upgraded to 5.5 TBps of bisection bandwidth, up from between 2.4 TBps and 3 TBps last gen.

According to AMD Fellow and Instinct SoC Chief Architect Alan Smith, this wider interconnect reduced the amount of energy per bit required for chip-to-chip communications.

Dense scale-out deployments

While AMD's GPUs may have closed the gap in performance with Nvidia's Blackwell accelerators, the company still has a long way to go in system design.

Unlike Nvidia's Blackwell accelerators – which can be bought in rackscale, HGX, and PCIe form factors – AMD's MI350-series will only be offered in an eight-GPU configuration.

"We felt that this direct connected eight GPU architecture was still well positioned for a large set of the models that would be out there in the 2025 to 2026 time frame," Corporate Vice President Josh Friedrich told the press ahead of AMD's Advancing AI event on Thursday. "We felt introducing a more revolutionary change to a proprietary rack type architecture, and the challenges that can come from introducing that prematurely was something that we wanted to avoid."

As you can see from the graphic below, the design features eight MI350-series chips connected via AMD's Infinity Fabric in an all-to-all scale-up topology. The GPUs are then connected to a pair of x86 CPUs along with up to eight 400 Gbps NICs via PCIe 5.0 switches.

AMD's MI350-series GPUs stick with a fairly standard configuration with eight GPUs coupled to an equal number of 400 Gbps NICs and a pair of x86 CPUs

Each system will offer up to 2.25 TB of HBM3E memory, and between 147 and 160 petaFLOPS of sparse FP4 compute, depending on whether you opt for air or liquid cooling.
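Those node-level numbers follow from the per-GPU specs. Here's a rough sketch, assuming 20 petaFLOPS of sparse FP4 per liquid-cooled MI355X and the roughly 8 percent lower peak of the air-cooled MI350X:

```python
# Eight-GPU node totals, assuming the per-GPU figures cited earlier.
GPUS_PER_NODE = 8
HBM_PER_GPU_GB = 288
SPARSE_FP4_PER_GPU_PF = 20.0  # liquid-cooled MI355X peak (assumed)
AIR_COOLED_DERATE = 0.92      # MI350X trades ~8 percent of peak perf

hbm_total_gb = GPUS_PER_NODE * HBM_PER_GPU_GB          # 2304 GB, or 2.25 TB
                                                       # in binary units
fp4_liquid_pf = GPUS_PER_NODE * SPARSE_FP4_PER_GPU_PF  # 160 petaFLOPS
fp4_air_pf = fp4_liquid_pf * AIR_COOLED_DERATE         # ~147 petaFLOPS
print(hbm_total_gb, fp4_liquid_pf, round(fp4_air_pf))
```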

Naturally, AMD would like to see its Instinct accelerators paired with its Epyc CPUs and Pensando Pollara 400 NICs, but there's nothing stopping vendors from building systems around an Intel processor or ConnectX InfiniBand networking. In fact, that's exactly the configuration Microsoft used for its ND-MI300X-v5 instances.

With the launch of the MI350-series, AMD is moving toward denser rack deployments. As GPU power consumption has increased, we've seen a trend toward larger server chassis, with some as large as ten rack units. But, with the move to liquid cooling, AMD now anticipates densities as high as 16 nodes and 128 accelerators per rack.

With the move to liquid cooling, AMD says it's now possible to cram as many as 128 of its MI355X accelerators into a single rack

AMD didn't offer specifics as to system-level power consumption, but based on what we've seen of Nvidia's HGX systems, we anticipate both the air- and liquid-cooled variants will draw somewhere between 14 and 18 kW per node.

Even on the air-cooled side of things, AMD expects to see racks with as many as eight nodes and 64 accelerators, which will almost certainly require the use of rear-door heat exchangers.

These higher rack densities set the tone for AMD's first rack-scale systems scheduled to launch alongside its MI400-series chips next year.

Availability

AMD says its MI350-series accelerators are shipping to customers now, and it expects wide-scale deployments in cloud and hyperscale datacenters, including an AI compute cluster on Oracle Cloud Infrastructure containing 131,072 accelerators.

By our estimate, the completed system will be capable of churning out more than 2.6 zettaFLOPS of the sparsest FP4 compute AMD's MI355Xs can muster.
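That estimate is straightforward arithmetic, assuming each MI355X delivers its full 20 petaFLOPS of sparse FP4:

```python
# Cluster-wide sparse FP4 estimate for the Oracle deployment,
# assuming every GPU hits its peak sparse throughput.
CLUSTER_GPUS = 131_072
SPARSE_FP4_PER_GPU_PF = 20  # MI355X peak sparse FP4, in petaFLOPS
PF_PER_ZF = 1_000_000       # petaFLOPS per zettaFLOPS

cluster_zf = CLUSTER_GPUS * SPARSE_FP4_PER_GPU_PF / PF_PER_ZF
print(round(cluster_zf, 2))  # ~2.62 zettaFLOPS
```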

Meanwhile, for those looking to deploy on-prem, MI350-series systems will be offered by Dell, HPE, and Supermicro. ®
