What Nvidia's Blackwell efficiency gains mean for DC operators
Air cooling's diminishing returns on full display with Nv's B-series silicon
Analysis Hotter and more power-hungry CPUs and GPUs were already causing headaches for datacenter operators before Nvidia unveiled its 1,200W Blackwell GPUs at GTC last week.
Over the past year, datacenter operators and colocation providers have expanded support for high-density deployments through the use of rear-door heat exchangers (RDHX) and in some cases direct-to-chip (DTC) liquid cooling, in anticipation of rising chip temps.
Looking at Nvidia's Blackwell lineup, it appears these modifications have been warranted. At roughly 60kW per rack — 14.3kW per node — a stack of four DGX B200 systems is already pushing the limits of standard air cooled racks in Digital Realty's facilities.
And that's not even Nvidia's most powerful system. Its latest GB200 NVL72 rack-scale systems, which we looked at in detail last week, are rated for 120kW and - to no one's surprise - absolutely demand liquid cooling.
This is a lot of heat to wick in a rack but there's more to the story. Let's take a look at Blackwell's power and efficiency gains.
Performance per watt
During the Blackwell launch, Nvidia made bold claims about the performance and efficiency of its chips. We'll get to those a bit later, but for now let's take a look at how these chips stack up in terms of raw floating point operations per watt (FLOPS/W).
GPU Perf/W | B100 (SXM) | B200 (SXM) | GB200 (GPU only) | H100 (SXM) | A100 (SXM) |
---|---|---|---|---|---|
TDP | 700W | 1,000W | 2,400W | 700W | 400W |
TF32 | 2.6 TFLOPS/W | 2.2 TFLOPS/W | 2.08 TFLOPS/W | 1.41 TFLOPS/W | 0.78 TFLOPS/W |
FP16 | 5 TFLOPS/W | 4.5 TFLOPS/W | 4.16 TFLOPS/W | 2.82 TFLOPS/W | 1.56 TFLOPS/W |
FP8/INT8 | 10 T(FL)OPS/W | 9 T(FL)OPS/W | 8.33 T(FL)OPS/W | 5.65 T(FL)OPS/W | 3.12 TOPS/W |
FP4 | 20 TFLOPS/W | 18 TFLOPS/W | 16.66 TFLOPS/W | NA | NA |
Note: We didn't include FP64 performance in this lineup as Blackwell actually performs worse than Hopper in double precision workloads.
Looking solely at the GPU efficiency, Blackwell shows strong gains, offering about 1.7x higher efficiency compared to Hopper and 3.2x that of Ampere when normalized to FP16. Obviously, if your workload can take advantage of lower precision then you can expect to see even stronger gains, but the conclusions remain about the same.
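For anyone who wants to check the ratios, here is a minimal sketch that recomputes them from the FP16 row of the table above. The dictionary and function names are our own shorthand, not anything from Nvidia.

```python
# FP16 efficiency in TFLOPS/W, copied from the GPU table above.
FP16_TFLOPS_PER_WATT = {
    "B100": 5.0,
    "B200": 4.5,
    "GB200": 4.16,
    "H100": 2.82,
    "A100": 1.56,
}


def efficiency_ratio(newer: str, older: str) -> float:
    """Return the FP16 perf-per-watt of `newer` as a multiple of `older`."""
    return FP16_TFLOPS_PER_WATT[newer] / FP16_TFLOPS_PER_WATT[older]


if __name__ == "__main__":
    for older in ("H100", "A100"):
        print(f"B100 vs {older}: {efficiency_ratio('B100', older):.2f}x")
    # Within Blackwell, a ratio below 1.0 means efficiency falls as TDP rises.
    for part in ("B200", "GB200"):
        print(f"{part} vs B100: {efficiency_ratio(part, 'B100'):.2f}x")
```

The B100 is the best case for Blackwell here. Swap in the B200 or GB200 as the `newer` part and the lead over Hopper narrows.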
But when we compare the Blackwell GPU SKUs, we start to see diminishing returns on performance past the 700W mark. While it might look like we're just trading power for FLOPS with the 1,000W B200 and the GB200's twin 1,200W accelerators, that's not quite accurate.
Unlike the H100, none of the Blackwell parts are available as a standalone PCIe card yet.
That means you're going to be buying them as part of an HGX, DGX, or Superchip-derived configuration, so the minimum configuration is going to be two GPUs with the GB200, or eight with the HGX B100 or B200-based systems.
System Perf/W | HGX B100* | DGX B200 | GB200 NVL72 | DGX H100 | DGX A100 |
---|---|---|---|---|---|
TDP | 10.2kW | 14.3kW | 120kW | 10.2kW | 6.5kW |
TF32 | 1.41 TFLOPS/W | 1.23 TFLOPS/W | 1.5 TFLOPS/W | 0.77 TFLOPS/W | 0.38 TFLOPS/W |
FP16 | 2.74 TFLOPS/W | 2.51 TFLOPS/W | 3 TFLOPS/W | 1.55 TFLOPS/W | 0.76 TFLOPS/W |
FP8/INT8 | 5.49 T(FL)OPS/W | 5.03 T(FL)OPS/W | 6 T(FL)OPS/W | 3.10 T(FL)OPS/W | 1.53 T(FL)OPS/W |
FP4 | 10.98 TFLOPS/W | 10.06 TFLOPS/W | 12 TFLOPS/W | NA | NA |
Note: Because there is no DGX B100 configuration, our "HGX B100" figures are based around the DGX H100's 10.2kW max power draw, as the B100 is a drop-in replacement designed to work within the same thermal and power constraints.
Looking at the efficiency of the air-cooled systems fully loaded with CPUs, memory, networking, and storage, we see that even with a larger 10U chassis to accommodate a larger stack, the DGX B200 appears to be less efficient than the HGX B100.
So what's going on? As you might already suspect, 1,000W is one heck of a lot harder to cool than 700W, especially since the fans have to spin faster to push more air through the heat sinks.
We get a better view of this when we add in Nvidia's power hungry GB200 NVL72 with its 120kW appetite.
At the rack scale, we're comparing four DGX-style systems per rack against a single GB200 NVL72 setup. Again, we see a familiar trend. Despite the water-cooled system's GPUs running 200W hotter than those in the DGX B200, the rack-scale system manages to deliver 2.5x the performance while consuming a little over twice the power.
From the graph, you can also see that the liquid cooled NVL system is actually the most efficient of the bunch, no doubt owing to the fact it isn't dumping 15-20 percent of its power into fans.
Granted, you will still need to power facility equipment like coolant distribution units (CDUs), which aren't accounted for in these figures, but neither are the air handlers required to cool the conventional systems.
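The rack-level comparison can be rebuilt from the system table with a few lines of arithmetic. This is a rough sketch using the FP16 row and the four-boxes-per-rack layout assumed throughout; the names are ours, and facility overheads such as CDUs and air handlers are left out, just as they are in the table.

```python
# (TDP in kW, FP16 efficiency in TFLOPS/W) from the system table above.
DGX_B200 = (14.3, 2.51)
GB200_NVL72 = (120.0, 3.0)


def rack_totals(system, count):
    """Return (power in kW, FP16 throughput in PFLOPS) for `count` systems."""
    tdp_kw, tflops_per_w = system
    power_kw = tdp_kw * count
    # kW x TFLOPS/W = 1,000 W x TFLOPS/W = 1,000 TFLOPS = 1 PFLOPS
    pflops = power_kw * tflops_per_w
    return power_kw, pflops


air_kw, air_pflops = rack_totals(DGX_B200, 4)        # four DGX B200 boxes
liquid_kw, liquid_pflops = rack_totals(GB200_NVL72, 1)  # one NVL72 rack

power_ratio = liquid_kw / air_kw
perf_ratio = liquid_pflops / air_pflops

if __name__ == "__main__":
    print(f"Air-cooled rack:    {air_kw:.1f} kW, {air_pflops:.0f} PFLOPS")
    print(f"Liquid-cooled rack: {liquid_kw:.1f} kW, {liquid_pflops:.0f} PFLOPS")
    print(f"NVL72 draws {power_ratio:.1f}x the power for {perf_ratio:.1f}x the FLOPS")
```

Because the performance ratio grows faster than the power ratio, the liquid-cooled rack comes out ahead on FLOPS per watt.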
And here is where we can start drawing some conclusions about what Blackwell will mean from a practical standpoint for datacenter operators.
Nvidia's HGX is still a safe bet
One of the biggest takeaways from Nvidia's Blackwell generation is that power and thermals matter. The more power you give these chips and the cooler you can keep them, the better they perform — up to a point.
If your facility is right on the edge of being able to support Nvidia's DGX H100, B100 shouldn't be any harder to manage, and, of the air-cooled systems, it looks to be the more efficient option, at least based on our estimates.
While the DGX B200 might not be as efficient at full load, it is still 28 percent faster than the B100 box. In the real world, where chips are seldom running right up to the redline 24/7, the two may be closer than they look on paper.
In either case, you're still looking at a considerable improvement in compute density over Hopper. Four DGX B200 boxes are able to replace 9-18 H100 systems depending on whether or not you can take advantage of Blackwell's FP4 precision.
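Both the 28 percent figure and the 9-18 range fall out of the system table if you multiply each box's TDP by its efficiency. The sketch below does exactly that; the helper names are hypothetical, and the HGX B100 numbers inherit our 10.2kW estimate.

```python
# TDP in kW and efficiency in T(FL)OPS/W, from the system table above.
SYSTEMS = {
    "HGX B100": {"tdp_kw": 10.2, "fp16": 2.74, "fp8": 5.49, "fp4": 10.98},
    "DGX B200": {"tdp_kw": 14.3, "fp16": 2.51, "fp8": 5.03, "fp4": 10.06},
    "DGX H100": {"tdp_kw": 10.2, "fp16": 1.55, "fp8": 3.10},  # no FP4 on Hopper
}


def system_pflops(name, precision):
    """Peak throughput in PFLOPS: kW x TFLOPS/W works out to PFLOPS."""
    spec = SYSTEMS[name]
    return spec["tdp_kw"] * spec[precision]


def speedup(faster, slower, precision):
    """Throughput of one box as a multiple of another at the same precision."""
    return system_pflops(faster, precision) / system_pflops(slower, precision)


def h100_boxes_replaced(blackwell_precision, b200_count=4):
    """DGX H100 boxes matched by `b200_count` DGX B200s.

    Hopper tops out at FP8, so FP8 is the baseline in both cases.
    """
    blackwell = b200_count * system_pflops("DGX B200", blackwell_precision)
    return blackwell / system_pflops("DGX H100", "fp8")


if __name__ == "__main__":
    print(f"B200 vs B100 box: {speedup('DGX B200', 'HGX B100', 'fp16'):.2f}x")
    print(f"Four B200s at FP8: {h100_boxes_replaced('fp8'):.1f} H100 boxes")
    print(f"Four B200s at FP4: {h100_boxes_replaced('fp4'):.1f} H100 boxes")
```

These are peak numbers, so treat them as a ceiling on what a consolidation exercise might achieve.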
Fewer, denser racks point the way to a liquid cooled future
One of the bigger challenges datacenter operators are likely to face in accommodating the DGX B200 is the higher rack power density. With four boxes in a rack, we're looking at roughly 40 percent higher power and cooling requirements than the H100 systems, or 57.2kW versus 40.8kW going by the TDPs above.
If your datacenter can't support these denser configurations, then you may be forced to opt for two-node racks, effectively eliminating any space savings Blackwell might have bought you. This might not be a big deal if your models haven't gotten any bigger, or if you can accommodate a longer training time and take advantage of Blackwell's capacious 192GB of HBM3e. But if your models have grown or your training or fine-tuning timetables have shrunk, this could prove something of a headache.
The GB200 NVL72 is a rackscale system that uses NVLink switch appliances to stitch together 36 Grace-Blackwell Superchips into a single system.
The situation is a little different for the GB200 NVL72 range. More than 22 HGX H100 systems can be condensed into just one of these liquid cooled systems. Or put another way, in the space required to support one model, you can now support one 5.5x larger.
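As a sanity check on that consolidation claim, here is a small sketch using the FP8 row of the system table. It uses the DGX H100 figures as a stand-in for an HGX H100 box and assumes the same four-per-rack layout as before, so the output lands slightly above the rounded numbers quoted above.

```python
# (TDP in kW, FP8 efficiency in T(FL)OPS/W) from the system table above.
NVL72 = (120.0, 6.0)
DGX_H100 = (10.2, 3.10)  # stand-in for an HGX H100 system
NODES_PER_AIR_COOLED_RACK = 4  # assumption: the four-box rack used throughout


def fp8_pflops(system):
    """Peak FP8 throughput in PFLOPS (kW x TFLOPS/W)."""
    tdp_kw, tflops_per_w = system
    return tdp_kw * tflops_per_w


def h100_nodes_replaced():
    """H100 boxes whose combined FP8 throughput matches one NVL72."""
    return fp8_pflops(NVL72) / fp8_pflops(DGX_H100)


def h100_racks_replaced():
    """The same figure expressed in four-node H100 racks."""
    return h100_nodes_replaced() / NODES_PER_AIR_COOLED_RACK


if __name__ == "__main__":
    print(f"H100 nodes per NVL72: {h100_nodes_replaced():.1f}")
    print(f"H100 racks per NVL72: {h100_racks_replaced():.1f}")
```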
Having said that, doing so is going to require liquid cooling if you want to push Blackwell to its full potential.
The good news is many of the bit barns we've seen announce support for Nvidia's DGX H100 systems, including Equinix and Digital Realty, are already using a form of liquid cooling — usually using rear door heat exchangers — but DTC is becoming more common.
Some of these rear door configurations claim to support 100 or more kilowatts of heat rejection, so theoretically you could strap one of these to the NVL72 and dump that heat into the hot aisle. Whether or not your facility air handlers can cope with that is another matter entirely.
As such, we suspect that liquid-to-liquid CDUs are going to be the preferred means of cooling racks this dense.
It's not just about FLOPS
In Jensen Huang's keynote, he made more bold claims regarding Blackwell's inference performance, saying it is 30x faster than the Hopper generation when inferencing a 1.8 trillion parameter mixture-of-experts model.
Nvidia says its NVL72 is up to 30x more performant in inference workloads next to a comparable H100 setup
Looking at the fine print, we see there are a number of factors that play into these gains. In terms of raw FLOPS, the drop to FP4 nets Nvidia's best specced Blackwell parts a 5x performance boost over the H100 running at FP8.
Blackwell also boasts 2.4x as much HBM, which happens to offer roughly 2.4x the memory bandwidth, clocking in at 8TB/s per GPU compared to the H100's 3.35TB/s.
However, the additional FLOPS and memory bandwidth on their own aren't enough to explain a 30x boost in inference performance. Here the footnotes offer some clues.
Results: based on token-to-token latency = 50 ms; real-time, first token latency = 5,000 ms; input sequence length = 32,768; output sequence length = 1,024 output, 8x eight-way HGX H100 air-cooled: 400 GB IB Network vs 18 GB200 Superchip liquid-cooled: NVL36, per GPU performance comparison. Projected performance is subject to change.
In the Hopper configuration, each server has eight H100s that can talk to one another over a speedy 900GB/s NVLink switch fabric. However, a 1.8 trillion parameter model isn't going to fit into one server. At FP8, such a model is going to require at minimum 1.8TB of memory, plus some additional capacity for the key value cache. So we're going to need more boxes, which have to communicate with each other over a 400Gb/s InfiniBand network. That translates to 100GB/s of total bandwidth per GPU, a rather substantial bottleneck compared to NVLink.
By comparison, in Nvidia's NVL systems every GPU is connected to each other at 1.8TB/s. What's more, lower precision FP4 mathematics cuts the required memory in half from 1.8TB to 900GB and also reduces the bandwidth requirements which should theoretically bolster throughput.
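To put numbers on the memory and interconnect argument, here is a back-of-the-envelope sketch. It counts weights only and ignores the key value cache. The 80GB of HBM per H100 is our assumption rather than a figure from Nvidia's footnote, and "total bandwidth" is read as both directions of the 400Gb/s link combined.

```python
import math

PARAMS = 1.8e12        # 1.8 trillion parameter model from the keynote claim
H100_HBM_GB = 80       # assumption: 80GB H100 SXM, not stated in the footnote
GPUS_PER_SERVER = 8    # eight-way HGX H100


def weights_tb(params, bits_per_param):
    """Memory for the weights alone in TB (1TB = 1e12 bytes), no KV cache."""
    return params * bits_per_param / 8 / 1e12


def min_h100_servers(params, bits_per_param):
    """Fewest eight-GPU servers whose HBM can hold the weights."""
    weight_gb = params * bits_per_param / 8 / 1e9
    gpus = math.ceil(weight_gb / H100_HBM_GB)
    return math.ceil(gpus / GPUS_PER_SERVER)


def infiniband_gbytes_per_s(link_gbit_per_s, bidirectional=True):
    """Convert a link rate in Gb/s to GB/s, optionally counting both directions."""
    per_direction = link_gbit_per_s / 8
    return per_direction * 2 if bidirectional else per_direction


if __name__ == "__main__":
    print(f"FP8 weights: {weights_tb(PARAMS, 8):.1f}TB")
    print(f"FP4 weights: {weights_tb(PARAMS, 4):.1f}TB")
    print(f"Minimum H100 servers at FP8: {min_h100_servers(PARAMS, 8)}")
    ib = infiniband_gbytes_per_s(400)
    print(f"InfiniBand per GPU: {ib:.0f}GB/s")
    print(f"NVLink (900GB/s) advantage: {900 / ib:.0f}x")
    print(f"NVL72 NVLink (1,800GB/s) advantage: {1800 / ib:.0f}x")
```

Even in the best case, the weights spill across at least three Hopper boxes, so some GPU-to-GPU traffic has to cross the slower InfiniBand links.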
While Nvidia's NVL systems may have an advantage when running massive trillion-plus parameter models, it appears that Blackwell's inferencing lead over Hopper would be considerably smaller for models that can fit within a single box.
It remains to be seen just how reproducible Nvidia's inference performance results will ultimately be, but the sales pitch is clear. One NVL rack could replace a lot more H100 nodes than the system's floating point performance might lead you to believe. That is, of course, if you happen to be inferencing trillion-plus parameter models at scale. ®