Rack scale is on the rise, but it's not for everyone... yet
Still buying B200s and MI300Xs? Don't feel bad, Nvidia and AMD's NVL72 and Helios rack systems aren't really for the enterprise anyway
Analysis With all the hype around Nvidia's NVL72, AMD's newly announced Helios, and Intel's upcoming Jaguar Shores rack systems, you'd be forgiven for thinking the days of eight-way HGX servers are numbered.
Spoiler alert: they're probably not going anywhere anytime soon, the EVP of AMD's datacenter solutions group told press at the House of Zen's Advancing AI event last week.
It doesn't help that these rack-scale architectures are big, complex, and power-hungry – not to mention expensive. It's estimated Nvidia's GB200 NVL72s are selling for nearly $3.5 million a pop. Even if AMD manages to undercut its much larger rival, that's a lot for any enterprise to swallow at a time when most C-suites are still struggling to find an application for AI that'll actually pay for itself.
Then again, AMD's 72-GPU Helios reference design — check out our day one coverage here — wasn't exactly designed with enterprise demands in mind. "Helios started out life as a specific design for two hyperscale customers, driven directly by their requirements," the House of Zen's Forrest Norrod explained.
"We think Helios or derivatives thereof are a good solution for hyperscalers and a lot of the tier-two and neo-clouds, and some enterprises as well. But again, this is not the only thing we're doing," Norrod added.
The shift toward rack-scale architecture underscores a change in appetites among model devs. Up to this point, the vast majority of foundation models have been trained on eight-way GPU systems like Nvidia's DGX H100, Norrod noted.
"I do think, going forward, for the big training machines, they're going to want a big, scale-up domain — almost the larger, the better," he said. "Seventy two [GPUs] is an interesting waypoint; I think a lot of people would love to see 256, 512, 1K."
These larger compute domains offer a number of advantages for compute- and memory-hungry training workloads.
The network is one of the biggest bottlenecks in training. Compared to 800 Gbps Ethernet, the scale-up interconnects found in AMD's Helios or Nvidia's Vera Rubin NVL144 systems are roughly 18x faster.
Case in point: Nvidia has previously estimated that its 72-GPU GB200 NVL72 systems are up to 4x faster than an equivalent number of H100s, despite offering only 2.5x higher floating-point performance at a given precision.
In other words, the more GPUs you can fit on the scale-up network the better.
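For a rough sense of what that gap looks like in consistent units, here's a quick back-of-envelope check; the resulting ~1.8 TB/s lines up with the per-GPU, per-direction bandwidth quoted for this class of scale-up fabric (figures are approximate):

```python
# Quick units check on the ~18x figure (per GPU, approximate).
ethernet_gbps = 800                     # 800 Gbps scale-out NIC
ethernet_gb_per_s = ethernet_gbps / 8   # bits to bytes: 100 GB/s

scale_up_gb_per_s = 18 * ethernet_gb_per_s
print(f"{scale_up_gb_per_s:.0f} GB/s")  # 1800 GB/s, i.e. ~1.8 TB/s
```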
When it comes to inference, things aren't nearly as clear cut. Depending on how big the model is, its underlying architecture, and whether you're optimizing for throughput or latency, you may only need eight GPUs to run it.
This is no doubt why AMD and Nvidia continue to invest in the form factor, with systems like the MI355X or B200, even as they march toward ever larger and more power-hungry rack-scale systems, including at least one rated for 600 kilowatts.
Some of this is down to the fact that AI startups have, for the most part, built their models to the lowest common denominator. For example, when Meta launched Llama 3.1 405B last summer, that was about as big as you could go and still run on a typical H100 system — then amongst the most commonly deployed GPU servers on the market.
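The arithmetic behind that ceiling is simple enough. Here's a weights-only sketch; real deployments also need headroom for KV cache and activations:

```python
# Weights-only memory math for Llama 3.1 405B on an eight-way H100 box.
params = 405e9
hgx_h100_bytes = 8 * 80e9   # eight H100s x 80 GB of HBM = 640 GB

bf16_bytes = params * 2     # two bytes per parameter: ~810 GB
fp8_bytes = params * 1      # one byte per parameter:  ~405 GB

print(f"BF16: {bf16_bytes / hgx_h100_bytes:.2f}x capacity")  # 1.27x: too big
print(f"FP8:  {fp8_bytes / hgx_h100_bytes:.2f}x capacity")   # 0.63x: fits
```

Which is why FP8 quantization was typically the route to serving the 405B model on a single eight-way node.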
"I think, because of the familiarity of the installed base, a hive of eight is going to be super popular for a long time," Norrod said. "That's what people know and that's what people have done a lot of development on."
But as rack-scale systems extend the compute domain from eight accelerators to 72 or more, there's no reason to think parameter counts won't grow to fill them as well.
You probably won't be running a 10 trillion parameter model at FP8 on eight GPUs anytime soon, but you could on a GB300 NVL72 or Helios rack.
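A minimal weights-only sketch of that memory math, using the vendors' published HBM capacities (192 GB per B200, 288 GB per GB300-generation GPU, and 432 GB per MI400, per AMD's Advancing AI figures; KV cache and activations would need still more):

```python
# Weights-only check: a 10-trillion-parameter model at FP8 (one byte
# per parameter). Capacities are published figures; treat as approximate.
weights_bytes = 10e12   # 10 TB of weights

systems = {
    "8x B200 (192 GB each)":      8 * 192e9,    # ~1.5 TB
    "GB300 NVL72 (288 GB each)":  72 * 288e9,   # ~20.7 TB
    "Helios, 72x MI400 (432 GB)": 72 * 432e9,   # ~31 TB
}

for name, capacity in systems.items():
    verdict = "fits" if capacity > weights_bytes else "doesn't fit"
    print(f"{name}: weights {verdict}")
```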
"As [Nvidia's] NVL72 rolls out — if they get it to work — there'll be a bunch of guys inferring on that size as well," Norrod said. "Over time lots of guys will find ways to do lots of innovative things with that pod size for inference."
Even so, eight-way systems are likely to remain popular among enterprise customers. Compared to either GPU slinger's rack-scale reference designs, these eight-GPU boxes may be less powerful, but they're also far less complex, nowhere near as expensive, and don't require facility water cooling to deploy.
"We're gonna have to cover multiple, multiple bets," Norrod said. "Because Nvidia is the de facto standard right now, our general belief is there's a vector that's as big as possible for the really big guys, 72ish [GPUs] for a bunch of guys, and eight for a bunch of guys. That's our supposition." ®