Rack-scale networks are the new hotness for massive AI training and inference workloads
Terabytes per second of bandwidth, miles of copper cabling, all crammed into the back of a single rack
Analysis If you thought AI networks weren't complicated enough, the rise of rack-scale architectures from the likes of Nvidia, AMD, and soon Intel has introduced a new layer of complexity.
Compared to scale-out networks, which typically use Ethernet or InfiniBand, the scale-up fabrics at the heart of these systems often employ proprietary, or at the very least emerging, interconnect technologies that offer an order of magnitude more bandwidth per accelerator.
For instance, Nvidia's fifth-gen NVLink interconnect delivers between 9x and 18x higher aggregate bandwidth to each accelerator than Ethernet or InfiniBand today.
This bandwidth means that GPU compute and memory can be pooled, even though they're physically distributed across multiple distinct servers. Nvidia CEO Jensen Huang wasn't kidding when he called the GB200 NVL72 "one giant GPU."
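For a rough sense of where that 9x-to-18x figure comes from, here's a back-of-envelope comparison. It's only a sketch: we're assuming the baseline is a single 400Gbps or 800Gbps NIC per GPU, with everything counted bidirectionally.

```python
# Back-of-envelope: NVLink 5 per-GPU bandwidth vs typical scale-out NICs.
# All figures are bidirectional GB/s; the NIC baselines are our assumption, not Nvidia's math.
nvlink5_per_gpu = 1800    # 1.8 TB/s aggregate per Blackwell GPU
nic_400g = 100            # a 400Gbps NIC: 50 GB/s each way
nic_800g = 200            # an 800Gbps NIC: 100 GB/s each way

print(nvlink5_per_gpu // nic_800g)   # 9  -> ~9x an 800Gbps NIC
print(nvlink5_per_gpu // nic_400g)   # 18 -> ~18x a 400Gbps NIC
```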
The transition to these rack-scale architectures is being driven in no small part by the demands of model builders like OpenAI and Meta, and the machines are primarily aimed at hyperscale cloud providers, neo-cloud operators like CoreWeave or Lambda, and large enterprises that need to keep their AI workloads on-prem.
Given this target market, these machines aren't cheap. Our sibling site The Next Platform estimates the cost of a single NVL72 rack at $3.5 million.
To be clear, the scale-up fabrics that make these rack-scale architectures possible aren't exactly new. It's just that until now, they've rarely extended beyond a single node and usually topped out at eight GPUs. For instance, here's a look at the scale-up fabric found within AMD's newly announced MI350-series systems.

AMD's MI350-series systems stick with a fairly standard configuration: eight GPUs coupled to an equal number of 400Gbps NICs and a pair of x86 CPUs
As you can see, each chip connects to the other seven in an all-to-all topology.
Nvidia's HGX design follows the same basic template for its four-GPU H100 systems, but adds four NVLink switches to the mix for its more ubiquitous eight-GPU nodes. While Nvidia says these switches have the benefit of cutting down on communication time, they also add complexity.

Rather than an all-to-all mesh, like we see in AMD's 8-GPU nodes, Nvidia's HGX architecture has employed NVLink switches to mesh its GPUs together going back to the Volta generation
With the move to rack scale, this same basic topology is simply scaled up — at least for Nvidia's NVL systems. For AMD, an all-to-all mesh is no longer enough, and switches become unavoidable.
Diving into Nvidia's NVL72 scale-up architecture
We'll dig into the House of Zen's upcoming Helios racks in a bit, but first let's take a look at Nvidia's NVL72. Since it's been on the market a little while longer, we know a fair bit more about it.
As a quick reminder, the rack-scale system features 72 Blackwell GPUs spread across 18 compute nodes. All those GPUs are connected via eighteen 7.2TB/s NVLink 5 switch chips deployed in pairs across nine switch blades.
From what we understand, each switch ASIC features 72 ports, each offering 800Gbps (100GB/s) of bidirectional bandwidth. Nvidia's Blackwell GPUs, meanwhile, boast 1.8TB/s of aggregate bandwidth spread across 18 ports — one for each switch ASIC in the rack. The result is a topology that looks a bit like this one:

Each GPU in the rack connects to two NVLink ports in each of the rack's nine NVLink 5 switch blades.
This high-speed all-to-all interconnect fabric means that any GPU in the rack can access another's memory.
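The port math checks out, for what it's worth. Here's a quick sanity check of the figures above, a sketch using the numbers as we understand them:

```python
# Sanity-checking the NVL72 figures quoted above (all bandwidth numbers bidirectional).
gpus = 72
switch_asics = 18        # two per switch blade, nine blades
ports_per_asic = 72
port_gb_s = 100          # one NVLink 5 port: 800Gbps, or 100 GB/s bidirectional

print(18 * port_gb_s)                              # 1800 GB/s, i.e. 1.8 TB/s per GPU
print(gpus == ports_per_asic)                      # True: one port from every GPU fills each ASIC
print(ports_per_asic * port_gb_s)                  # 7200 GB/s, the 7.2 TB/s per switch chip
print(switch_asics * ports_per_asic * port_gb_s)   # 129600 GB/s, roughly 130 TB/s across the rack
```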
Why scale-up?
According to Nvidia, these massive compute domains allow the GPUs to run far more efficiently. For AI training workloads, the GPU giant estimates its GB200 NVL72 systems are up to 4x faster than the equivalent number of H100s, even though the component chips only offer 2.5x higher performance at the same precision.
Meanwhile, for inference, Nvidia says its rack scale configuration is up to 30x faster — in part because various degrees of data, pipeline, tensor, and expert parallelism can be employed to take advantage of all that memory bandwidth even if the model doesn't necessarily benefit from all the memory capacity or compute.
That said, with between 13.5TB and 20TB of VRAM in Nvidia's Grace-Blackwell-based racks, and roughly 30TB on AMD's upcoming Helios racks, these systems are clearly designed to serve extremely large models like Meta's (apparently delayed) two-trillion-parameter Llama 4 Behemoth, which will require 4TB of memory just to hold its weights at BF16.
Not only are the models getting larger, but the context windows, which you can think of as an LLM's short-term memory, are too. Meta's Llama 4 Scout, for example, isn't particularly big at 109 billion parameters — requiring just 218GB of GPU memory for its weights at BF16. Its 10-million-token context window, however, will require several times that, especially at higher batch sizes. (We discuss the memory requirements of LLMs in our guide to running LLMs in production here.)
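Here's that back-of-envelope math in code form. The weight figures follow directly from two bytes per parameter at BF16; the KV-cache function uses illustrative architecture numbers that are our placeholders, not Scout's actual configuration:

```python
# Model memory at BF16: two bytes per parameter, weights only (no KV cache or overhead).
def weights_gb(params_billion, bytes_per_param=2):
    return params_billion * bytes_per_param  # billions of params x bytes = GB

print(weights_gb(2000))   # 4000 GB, i.e. ~4 TB for a two-trillion-parameter model
print(weights_gb(109))    # 218 GB for Llama 4 Scout's weights

# KV-cache size grows linearly with context length and batch size:
# 2 (keys and values) x layers x kv_heads x head_dim x tokens x batch x bytes per value.
# The layer/head/dim values below are illustrative placeholders, not Meta's published specs.
def kv_cache_gb(tokens, batch=1, layers=48, kv_heads=8, head_dim=128, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_val / 1e9

print(kv_cache_gb(10_000_000))   # ~1966 GB: roughly 2 TB for a single 10M-token sequence
```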
Speculating on AMD's first scale-up system, Helios
This is no doubt why AMD has also embraced the rack-scale architecture with its MI400-series accelerators.
At its Advancing AI event earlier this month, AMD revealed its Helios reference design. In a nutshell, the system, much like Nvidia's NVL72, will feature 72 MI400-series accelerators, 18 EPYC Venice CPUs, and AMD's Pensando Vulcano NICs, when it arrives next year.
Details on the system remain thin, but we do know its scale-up fabric will offer 260TB/s of aggregate bandwidth, and will tunnel the emerging UALink over Ethernet.
If you're not familiar, the Ultra Accelerator Link standard is an open alternative to NVLink for scale-up networks. The Ultra Accelerator Link Consortium published its first specification in April.
At roughly 3.6TB/s of bidirectional bandwidth per GPU, that'll put Helios on par with Nvidia's first-generation Vera-Rubin rack systems, also due out next year. How AMD intends to do that, we can only speculate — so we did.
Based on what we saw from AMD's keynote, the system rack appears to feature five switch blades, with what looks to be two ASICs apiece. With 72 GPUs per rack, this configuration strikes us as a bit odd.
The simplest explanation is that, despite there being five switch blades, there are actually only nine switch ASICs inside. For this to work, each switch chip would require 144 800Gbps ports. This is a tad unusual for Ethernet, but not far off from what Nvidia did with its NVLink 5 switches, albeit with Nvidia using twice as many ASICs at half the bandwidth apiece.
The result would be a topology that looks quite similar to Nvidia's NVL72.

The simplest way for AMD to connect 72 GPUs together would be to use nine 144-port 800Gbps switches.
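For what it's worth, the arithmetic behind that nine-ASIC guess lines up neatly with AMD's headline number. A sketch, assuming 800GbE ports counted at 200GB/s bidirectionally:

```python
# Our hypothetical Helios scale-up fabric: nine 144-port ASICs at 800Gbps per port.
asics = 9
ports_per_asic = 144
port_gb_s = 200          # 800GbE: 100 GB/s each way, 200 GB/s bidirectional

print(asics * ports_per_asic * port_gb_s)   # 259200 GB/s, a whisker under AMD's 260TB/s claim
print(2 * asics * port_gb_s)                # 3600 GB/s per GPU: two ports into each ASIC
```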
The tricky bit is that, at least to our knowledge, no such switch ASIC capable of delivering that level of bandwidth exists today. Broadcom's Tomahawk 6, which we looked at in depth a few weeks back, comes the closest with up to 128 800Gbps ports and 102.4Tbps of aggregate bandwidth.
For the record, we don't know that AMD is using Broadcom for Helios — it just happens to be one of the few publicly disclosed 102.4Tbps switches not from Nvidia.
But even with 10 of those chips crammed into Helios, you'd still need another 16 ports of 800Gbps Ethernet to reach the 260TB/s bandwidth AMD is claiming. So what gives?
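Before we get to that, here's where the 16-port shortfall comes from, using the same bidirectional accounting as above:

```python
# Ten Tomahawk 6-class chips vs the rack's scale-up bandwidth target (bidirectional GB/s).
target = 72 * 3600          # 259200 GB/s: 72 GPUs at 3.6 TB/s apiece
per_chip = 128 * 200        # 25600 GB/s: 128 ports at 200 GB/s bidirectional

shortfall = target - 10 * per_chip
print(shortfall)            # 3200 GB/s
print(shortfall // 200)     # 16 -> sixteen more 800Gbps ports needed
```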
Our best guess is that Helios is using a different topology from Nvidia's NVL72. In Nvidia's rack-scale architecture, the GPUs connect to one another over the NVLink Switches.
However, it looks like AMD's Helios compute blades will retain the chip-to-chip mesh from the MI300-series, albeit with three mesh links per GPU, one to each of the other three GPUs in the blade.

Assuming that AMD's MI400-series GPUs retain their chip-to-chip mesh within the node, a 10-switch scale-up fabric starts to make more sense.
This is all speculation of course, but the numbers do line up rather nicely.
By our estimate, each GPU dedicates 600GB/s (12x 200Gbps links) of bidirectional bandwidth to the in-node mesh and about 3TB/s (60x 200Gbps links) to the scale-up network. That works out to about 600GB/s from each GPU to each of the five switch blades.

With the four GPUs in each compute blade meshed together, the scale-up topology would look like this.
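Here's how our per-GPU link budget adds up. Every number below is our estimate, counted bidirectionally, not anything AMD has confirmed:

```python
# Speculative per-GPU link budget for a Helios compute blade (all figures bidirectional).
link_gb_s = 50           # one 200Gbps link: 25 GB/s each way, 50 GB/s bidirectional
mesh_links = 12          # links dedicated to the in-node mesh (the other three GPUs)
fabric_links = 60        # links headed for the rack's switch blades
switch_blades = 5

print(mesh_links * link_gb_s)                      # 600 GB/s for the in-node mesh
print(fabric_links * link_gb_s)                    # 3000 GB/s, i.e. 3 TB/s, for the scale-up fabric
print(fabric_links * link_gb_s // switch_blades)   # 600 GB/s from each GPU to each switch blade
```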
If you're thinking that's a lotta ports, we expect they'll get aggregated into around 60 800Gbps ports or potentially even 30 1.6Tbps ports per compute blade. This is somewhat similar to what Intel did with its Gaudi3 systems. From what we understand, the actual cabling will be integrated into a blind-mate backplane just like Nvidia's NVL72 systems. So, if you were sweating the idea of having to network the rack by hand, you can rest easy.
We can see a few benefits to this approach. If we're right, each Helios compute blade will be able to function independently of the others. Nvidia, meanwhile, has a separate SKU called the GB200 NVL4 aimed at HPC applications, which meshes four Blackwell GPUs together, similar to the diagram above, but doesn't use NVLink as a scale-up fabric beyond the node.
But again, there's no guarantee this is what AMD is doing — it's just our best guess.
Scaling up doesn't mean you stop scaling out
You might think the larger compute domains enabled by AMD and Nvidia's rack-scale architectures would mean that Ethernet, InfiniBand, or Omni-Path — yes, it's back! — would take a back seat.
In reality, these scale-up networks can't scale much beyond a rack. The copper flyover cables used in systems like Nvidia's NVL72 and presumably AMD's Helios just can't reach that far.
As we've previously explored, silicon photonics has the potential to change that, but the technology faces its own hurdles with regard to integration. We don’t imagine Nvidia is charting a course toward 600kW racks because it wants to, but rather because it anticipates the photonics tech necessary for these scale-up networks to escape the rack won’t be ready in time.
So if you need more than 72 GPUs — and if you're doing any kind of training, you definitely do — you still need a scale-out fabric. In fact, you need two: one for coordinating compute on the back end and one for data ingest on the front end.
Rack-scale doesn't appear to have reduced the amount of scale-out bandwidth required either. For its NVL72 at least, Nvidia has stuck with a 1:1 ratio of NICs to GPUs this generation. Usually there are another two NIC or data processing unit (DPU) ports per blade for the conventional front-end network to move data in and out of storage and so forth.
This makes sense for training, but may not be strictly necessary for inference if your workload can fit within a single 72-GPU compute and memory domain. Spoiler alert: unless you're running some enormous proprietary model, the details for which aren't known, you probably can.
- Omni-Path is back on the AI and HPC menu in a new challenge to Nvidia's InfiniBand
- Broadcom aims a Tomahawk at Nvidia's AI networking empire with 102.4T photonic switch
- HPE Aruba boasts that when network problems come along, its AI will whip them into shape
- Rack scale is on the rise, but it's not for everyone... yet
The good news is we're about to see some seriously high-radix switches hit the market over the next six to 12 months.
We've already mentioned Broadcom's Tomahawk 6, which will support anywhere from 64 1.6Tbps ports to 1,024 100Gbps ports. But there's also Nvidia's Spectrum-X SN6810 due out next year, which will offer up to 128 800Gbps ports and will use silicon photonics to do it. Nvidia's SN6800, meanwhile, will feature 512 MPO ports good for 800Gbps apiece.
These higher-radix switches dramatically reduce the number of boxes required for large-scale AI deployments. To connect a cluster of 128,000 GPUs at 400Gbps, you'd need about 10,000 Quantum-2 InfiniBand switches. Opting for 51.2Tbps Ethernet switches effectively cuts that in half.
With the move to 102.4Tbps switching, the number shrinks to about 2,500, and if you can live with 200Gbps ports, you'd need just 750, as the radix is large enough that you can get away with a two-tier network as opposed to the three-tier fat-tree topologies we often see in large-scale AI training clusters.
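For the curious, those figures fall out of standard fat-tree sizing. A sketch, assuming non-blocking designs and ignoring real-world details like oversubscription and spare ports:

```python
# Rough switch counts for a 128K-GPU back-end network, using the standard non-blocking
# folded-Clos approximations: ~5H/k switches for a three-tier fat tree and ~3H/k for a
# two-tier leaf-spine, where H is the endpoint count and k is the switch radix.
def three_tier(hosts, radix):
    return 5 * hosts // radix

def two_tier(hosts, radix):
    return 3 * hosts // radix

gpus = 128_000
print(three_tier(gpus, 64))    # 10000 -> 64-port 400G switches (Quantum-2 class)
print(three_tier(gpus, 128))   # 5000  -> 51.2Tbps switches split into 128 x 400G
print(three_tier(gpus, 256))   # 2500  -> 102.4Tbps switches split into 256 x 400G
print(two_tier(gpus, 512))     # 750   -> 102.4Tbps switches split into 512 x 200G
```

Treat these as first-order estimates rather than bills of materials; actual deployments will shift the totals. ®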