The network is indeed trying to become the computer
Masked networking costs are coming to AI systems
Analysis Moore's Law has run out of gas and AI workloads need massive amounts of parallel compute and high bandwidth memory right next to it – both of which have become terribly expensive. If it weren't for this situation, the beancounters of the world might be complaining about the cost of networking in the datacenter.
But luckily – we suppose – with these modern systems, some of the costs of networking are masked as compute. Think of the scale-up networks such as the NVLink ports and NVLink Switch fabrics that are part and parcel of a GPU-accelerated server node – or, these days, a rackscale system like the DGX NVL72 and its OEM and ODM clones.
These memory sharing networks are vital for ever-embiggening AI training and inference workloads. As their parameter counts and token throughput requirements both rise, they need ever-larger memory domains to do their work. Throw in mixture of experts models and the need for larger, fatter, and faster scale-up networks, as they are now called, is obvious even to an AI model with only 7 billion parameters.
But the cost of NVLink fabrics inside of DGX server nodes and their HGX clones has been masked since they were first introduced by Nvidia with the "Pascal" P100 GPU accelerators way back in 2016. Nvidia does not really sell the SXM socket versions of its GPU accelerators without them being embedded on HGX or now MGX system boards that interconnect the GPU memories.
Another masked networking cost that is coming to AI systems is that of the die-to-die and chip-to-chip interconnects used within the GPU sockets to interlink reticle-sized GPU chiplets (in the case of Nvidia) or smaller compute elements (in the case of AMD). You may be asking yourself why you have to spend so much money on an Nvidia GPU accelerator to do AI or HPC work, and these in-socket NVLink C2C and D2D interconnects are part of the reason.
Work in progress
To be fair, a lot of the hardware cost is to support the enormous software development efforts that 75 percent of Nvidia's employees work on every day. Nvidia is a hardware company that supports itself through bundled CUDA-X software that comes "free" with that hardware. It does, however, charge $4,500 per year per GPU for its AI Enterprise stack, so Big Green is gradually starting to extract software revenue streams from its hardware.
Then there is the scale-out network, which is used to link nodes in distributed systems to each other to share work in a less tightly coupled way than the scale-up network affords. This is the normal networking we are familiar with in distributed HPC systems, which is usually Ethernet or InfiniBand and sometimes a proprietary network like those from Cray, SGI, Fujitsu, NEC, and others from days gone by.
On top of this, we have the normal north-south networking stack that allows people to connect to systems and the east-west networks that allow distributed corporate systems running databases, web infrastructure, and other front-office systems to communicate with each other. Further out, there is the "data center interconnect," or DCI, which is used to link datacenters together into regions and the fiber optic networks employed by the hyperscalers, cloud builders, telcos, and service providers to link regions around the globe.
Suffice it to say, there is a lot of networking going on. How much is difficult to say. We have yet to see someone characterize it well.
The conventional wisdom is that the hyperscalers and cloud builders do not like to spend anything north of 10 percent of their datacenter budgets on networking. When that figure was pushing up against 15 percent because the early implementations of 100 Gbps Ethernet ran too hot and cost too much, Arista Networks, Broadcom, Google, Mellanox Technologies (now part of Nvidia), and Microsoft banded together in July 2014 to create a better 100 Gbps standard based on faster signaling rates and fewer lanes per port – four lanes running at 25 Gb/sec instead of ten lanes at 10 Gb/sec – and basically forced the IEEE to adopt it. That got networking costs back down below 10 percent.
And then AI happened, and more recently, GenAI happened. And the visible scale-out network costs have been soaring. It is hard to say by how much, but we cooked up a chart to show it:
| | 2020 | 2021 | 2022 | 2023 | 2024 |
|---|---|---|---|---|---|
| Gartner Datacenter Systems | $178.6 billion | $190.7 billion | $227.1 billion | $236.1 billion | $329.1 billion |
| Growth | | 6.3% | 16.0% | 3.8% | 28.3% |
| IDC Datacenter Ethernet Switch Revenues | $12.2 billion | $13.3 billion | $15.9 billion | $18.0 billion | $20.7 billion |
| Growth | | 8.6% | 16.3% | 11.9% | 13.1% |
| TNP Nvidia InfiniBand Switch Revenues | $0.7 billion | $0.9 billion | $1.3 billion | $5.2 billion | $5.8 billion |
| Growth | | 23.3% | 26.8% | 75.4% | 10.6% |
| Ethernet + InfiniBand Switch Revenues | $12.9 billion | $14.2 billion | $17.2 billion | $23.2 billion | $26.6 billion |
| Growth | | 9.6% | 17.0% | 26.1% | 12.6% |
| Ethernet + InfiniBand Switch Share of Systems | 7.2% | 7.5% | 7.6% | 9.8% | 8.1% |
Obviously, with AI systems comprising about half of total server spending in 2024 according to various market researchers, the overall datacenter systems market – which means servers, switches, and storage together – has exploded.
According to Gartner's data, system sales have increased by a factor of 1.84X between 2020 and 2024, and IDC says that datacenter Ethernet switch sales have almost kept pace, growing by a factor of 1.71X over the same time. We have calendarized Nvidia's networking sales (the company has a fiscal year that is one month out of phase with reality) and estimated its InfiniBand switch sales over the same time. InfiniBand revenues have grown by a factor of more than 8X over the five years shown above. That is almost entirely due to AI scale-out networking. (There is a little HPC in there.)
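For those who want to check the arithmetic, here is a minimal sketch in Python that recomputes the 2020 to 2024 growth factors and the switch share of systems spending from the rounded figures in the table above. Because the inputs are rounded to one decimal place, the results will wobble slightly against figures computed from unrounded data.

```python
# Rounded revenue figures from the table above, in billions of US dollars
years = [2020, 2021, 2022, 2023, 2024]

gartner_systems = [178.6, 190.7, 227.1, 236.1, 329.1]  # Gartner datacenter systems
idc_ethernet    = [12.2, 13.3, 15.9, 18.0, 20.7]       # IDC datacenter Ethernet switches
tnp_infiniband  = [0.7, 0.9, 1.3, 5.2, 5.8]            # TNP estimate of Nvidia InfiniBand switches

def growth_factor(series):
    """Ratio of the last year in the series to the first year."""
    return series[-1] / series[0]

print(f"Systems growth 2020-2024:    {growth_factor(gartner_systems):.2f}X")
print(f"Ethernet growth 2020-2024:   {growth_factor(idc_ethernet):.2f}X")
print(f"InfiniBand growth 2020-2024: {growth_factor(tnp_infiniband):.2f}X")

# Ethernet plus InfiniBand switch revenue as a share of total systems spending
for year, sys, eth, ib in zip(years, gartner_systems, idc_ethernet, tnp_infiniband):
    share = (eth + ib) / sys * 100
    print(f"{year}: switch share of systems spending = {share:.1f}%")
```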
Other networking that is not broken out above is Fibre Channel networking for storage area networks – yes, enterprises are still doing that sort of thing – and any proprietary networks for HPC systems.
This analysis also does not include the cost of DPUs, which are doing a lot of the heavy lifting for adding security and multitenancy to clouds in general and to AI systems both in the cloud and on premises. These DPUs are also being given a lot of work reassembling packets after they have been sprayed across all of the links in a cluster (rather than being routed point to point around congested links in the network fabric). Going forward, DPUs will be either in the server nodes or in the switches – and maybe both. And they ain't cheap, but they should be considered extensions of the switching infrastructure, not offloads from the servers.
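As a rough illustration of why that reassembly is worth offloading, here is a toy Python sketch of the reordering job: when a flow is sprayed across many links, packets can arrive out of order, and something has to buffer the early arrivals and release everything in sequence. This is purely illustrative and does not model any particular vendor's transport – real DPUs do this in hardware at line rate.

```python
# Toy model of in-order delivery for a flow that has been sprayed across links.

def reassemble(packets):
    """Yield payloads in sequence order, buffering anything that arrives early.

    `packets` is an iterable of (sequence_number, payload) tuples in arrival order.
    """
    expected = 0
    buffered = {}
    for seq, payload in packets:
        buffered[seq] = payload
        # Drain the buffer as long as the next expected packet is on hand
        while expected in buffered:
            yield buffered.pop(expected)
            expected += 1

# Packets from one flow, sprayed across four links and arriving out of order
arrivals = [(2, "C"), (0, "A"), (3, "D"), (1, "B"), (5, "F"), (4, "E")]
print("".join(reassemble(arrivals)))  # prints "ABCDEF"
```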
- The SmartNIC revolution fell flat, but AI might change that
- Rack-scale networks are the new hotness for massive AI training and inference workloads
- PCIe 7.0 specs finalized at 512 GBps bandwidth, PCIe 8.0 in the pipeline
- Omni-Path is back on the AI and HPC menu in a new challenge to Nvidia's InfiniBand
Given all of this, it is not hard to imagine that networking, in the aggregate, will be a lot more than 10 percent of the system cost when properly allocated, and may even be higher than the 20 percent we hear people talking about with rackscale AI systems. Add it all up, and networking in its many guises might actually be as high as 30 percent of the real cost of an AI cluster.
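To make that back-of-the-envelope concrete, here is a minimal sketch with purely hypothetical cost shares – illustrative placeholders, not measured data – showing how the scale-up, scale-out, front-end, and DPU slices of an AI cluster budget could stack up to something in the neighborhood of 30 percent once the masked pieces are allocated as networking rather than compute.

```python
# Hypothetical allocation of an AI cluster budget to networking, expressed as a
# percentage of total cluster cost. These numbers are illustrative only.
network_cost_shares = {
    "scale-up fabric (NVLink/UALink class, usually booked as compute)": 12.0,
    "scale-out backend fabric (InfiniBand or Ethernet)": 10.0,
    "front-end, north-south, and storage networking": 4.0,
    "DPUs and NICs": 4.0,
}

for item, share in network_cost_shares.items():
    print(f"{share:5.1f}%  {item}")
print(f"{sum(network_cost_shares.values()):5.1f}%  total networking share of cluster cost (hypothetical)")
```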
Hence AMD and the rebel alliance are putting together Infinity Fabric and UALink analogs to NVLink and NVSwitch to bring some competitive pressure to the AI systems space. Moreover, everyone thinks that InfiniBand's days are numbered once the Ultra Ethernet standard is adopted into products, perhaps starting in late 2025 for commercial systems in 2026. This will bring competition to both the scale-up and scale-out networks for AI systems, which in theory will push the true network costs back down below 20 percent. ®