Nvidia's MLPerf submission shows B200 offers up to 2.2x training performance of H100

Is Huang leaving even more juice on the table by opting for a mid-tier Blackwell part? Signs point to yes


Analysis Nvidia offered the first look at how its upcoming Blackwell accelerators stack up against the venerable H100 in real-world training workloads, claiming up to 2.2x higher performance.

The benchmarks, released as part of this week's MLPerf results, are in line with what we expected from Blackwell at this stage. The DGX B200 systems – used in Nvidia's Nyx supercomputer – boast roughly 2.27x the peak floating-point performance of last gen's H100 systems across FP8, FP16, BF16, and TF32 precisions.
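As a sanity check, that 2.27x figure falls straight out of the peak sparse FP8 specs. A minimal back-of-envelope sketch, using the 9 petaFLOPS B200 figure cited later in this piece and the H100 SXM's published 3,958 teraFLOPS – vendor peak numbers, not measured throughput:

```python
# Back-of-envelope: peak sparse FP8, per Nvidia's published spec sheets.
# Vendor peak figures (with sparsity), not measured MLPerf throughput.
H100_FP8_TFLOPS = 3_958   # H100 SXM datasheet figure
B200_FP8_TFLOPS = 9_000   # 9 petaFLOPS, as cited later in this piece

print(f"Peak FP8 ratio, B200 vs H100: {B200_FP8_TFLOPS / H100_FP8_TFLOPS:.2f}x")
# Peak FP8 ratio, B200 vs H100: 2.27x
```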

And this is borne out in the results. Against the H100, the B200 managed 2.2x higher performance when fine-tuning Llama 2 70B and twice the performance when pre-training GPT-3 175B.

However, it's not just raw FLOPS at play here. According to Nvidia, Blackwell's substantially higher memory bandwidth – up to 8 TBps on the flagship parts – also contributed to the gains.

"Taking advantage of higher-bandwidth HBM3e memory, just 64 Blackwell GPUs were run in the GPT-3 LLM benchmark without compromising per-GPU performance. The same benchmark run using Hopper needed 256 GPUs to achieve the same performance," the acceleration champ explained in a blog post.

That benchmark was conducted using just 64 GPUs across eight nodes, but it's not clear whether that was one partition of a larger system or whether Nyx is "super" in terms of performance rather than scale – in fact, details regarding the system are quite sparse. From what we've gathered from images and past DGX configurations, we're looking at a modular system consisting of three, maybe four, eight-GPU nodes per rack, with the number of racks and the interconnect bandwidth the two major question marks.

Nyx cluster

The Register reached out to Nvidia for clarification on Nyx, and we'll let you know if we hear anything back.

The fact that Nvidia is using the B200 as the basis for its first training submissions tells us that there's still a good amount of performance on the table.

On paper, the B200 is capable of churning out 9 petaFLOPS of sparse FP8 performance, and is rated for a kilowatt of power and heat. The 1.2 kW GPUs found in Nvidia's flagship GB200, by comparison, each deliver 10 petaFLOPS at the same precision.
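Run the napkin math on those figures and the flagship part actually gives up a little efficiency in exchange for its extra peak throughput – a rough sketch using only the numbers above:

```python
# Napkin math: peak sparse FP8 per watt, using the figures cited above.
# Vendor peak numbers only - real workloads won't sustain these.
parts = {
    "B200":      {"pflops": 9.0,  "watts": 1_000},
    "GB200 GPU": {"pflops": 10.0, "watts": 1_200},
}
for name, p in parts.items():
    print(f"{name}: {p['pflops'] * 1_000 / p['watts']:.1f} FP8 TFLOPS/W")
# B200: 9.0 FP8 TFLOPS/W
# GB200 GPU: 8.3 FP8 TFLOPS/W
```

In other words, the GB200's GPUs buy roughly 11 percent more peak FLOPS apiece for 20 percent more power.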

However, it's not just that GB200 systems have higher peak performance – the GPU domain is also considerably larger. Traditionally, DGX systems have housed eight GPUs interconnected by a high-speed NVLink switch fabric, with additional scale achieved using multiple InfiniBand links between the nodes.

With Blackwell, Nvidia has expanded the NVLink domain from eight to 72 accelerators with its NVL72 reference designs.

How big a difference this actually makes to time to train is hard to say, but with training runs often limited by data movement and NVLink several times faster than InfiniBand, we could see a sizable uplift by the time MLCommons releases its next batch of training results.
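To put that bandwidth gap in perspective, here's a rough per-GPU comparison using published figures – the NVLink number is total bidirectional bandwidth per Blackwell GPU, and we're assuming one 400 Gbps NDR InfiniBand port per GPU, as in DGX-class nodes:

```python
# Rough per-GPU fabric bandwidth comparison, from published figures.
# Assumes one 400 Gbps NDR InfiniBand port per GPU (DGX-class nodes);
# NVLink figure is total bidirectional bandwidth per Blackwell GPU.
NVLINK5_GBYTES_S = 1_800        # fifth-gen NVLink: 1.8 TB/s per GPU
IB_NDR_GBYTES_S  = 400 / 8      # 400 Gbps -> 50 GB/s per port

print(f"NVLink vs one NDR port: {NVLINK5_GBYTES_S / IB_NDR_GBYTES_S:.0f}x")
# NVLink vs one NDR port: 36x
```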

Even if the next training submissions we see still come from Nvidia's B200-based systems, improvements in software and networking infrastructure could still drive gains.

Next-gen ConnectX-8 SuperNICs are set to double InfiniBand bandwidth to 800 Gbps. Meanwhile, software optimizations and other upgrades have driven considerable performance improvements since Hopper made its debut in the MLPerf rankings.

Blackwell's training results come just months after Nvidia shared the first MLPerf inference benchmarks for the compute platform. In those tests, Blackwell achieved a 4x uplift over Hopper.

In addition to Blackwell, Nvidia also shared large-scale training results for the GPT-3 175B benchmark using 11,616 Hopper GPUs. That may sound like a lot, but it's not uncommon to see clusters several times that size deployed to support model development. ®
