Nvidia's MLPerf submission shows B200 offers up to 2.2x training performance of H100

Is Huang leaving even more juice on the table by opting for a mid-tier Blackwell part? Signs point to yes


Analysis Nvidia offered the first look at how its upcoming Blackwell accelerators stack up against the venerable H100 in real-world training workloads, claiming up to 2.2x higher performance.

The benchmarks, released as part of this week's MLPerf results, are in line with what we expected from Blackwell at this stage. The DGX B200 systems – used in Nvidia's Nyx supercomputer – boast about 2.27x higher peak floating point performance across FP8, FP16, BF16, and TF32 precisions than last gen's H100 systems.
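For the napkin-math inclined, that 2.27x figure falls straight out of Nvidia's published spec-sheet peaks. Here's a minimal sanity check in Python using the sparse FP8 numbers (the other precisions scale by roughly the same factor):

```python
# Sanity check on the ~2.27x peak-FLOPS claim, using Nvidia's
# published spec-sheet figures for FP8 with sparsity (petaFLOPS).
h100_pflops = 3.958   # H100 SXM, sparse FP8 peak
b200_pflops = 9.0     # B200, sparse FP8 peak

print(f"B200 / H100 peak FP8: {b200_pflops / h100_pflops:.2f}x")  # ~2.27x
```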

And this is borne out in the results. Against the H100, the B200 managed 2.2x higher performance when fine-tuning Llama 2 70B and twice the performance when pre-training GPT-3 175B.

However, it's not just raw FLOPS at play here. According to Nvidia, Blackwell's substantially higher memory bandwidth – up to 8 TBps on the flagship parts – also came into play.

"Taking advantage of higher-bandwidth HBM3e memory, just 64 Blackwell GPUs were run in the GPT-3 LLM benchmark without compromising per-GPU performance. The same benchmark run using Hopper needed 256 GPUs to achieve the same performance," the acceleration champ explained in a blog post.

That benchmark was conducted using just 64 GPUs across eight nodes, and it's not clear whether that represents a single partition of a larger system or whether Nyx is "super" in terms of performance rather than scale; details regarding the machine are sparse. From what we've gathered from images and past DGX configurations, we're looking at a modular system of three, maybe four, eight-GPU nodes per rack, with the number of racks and the interconnect bandwidth the two major question marks.

Nyx cluster

The Register reached out to Nvidia for clarification on Nyx, and we'll let you know if we hear anything back.

The fact that Nvidia is using the B200 as the basis for its first training submissions tells us that there's still a good amount of performance on the table.

On paper, the B200 is capable of churning out 9 petaFLOPS of sparse FP8 performance, and is rated for a kilowatt of power draw and heat output. The 1.2 kW GPUs found in Nvidia's flagship GB200, on the other hand, each deliver 10 petaFLOPS at the same precision.
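Run those numbers through a perf-per-watt lens and the "mid-tier" part actually looks marginally more efficient on paper; the GB200's advantage is absolute throughput and, as we'll get to, its bigger NVLink domain. A quick sketch using only the peak sparse FP8 figures quoted above (delivered efficiency will depend on workload, clocks, and cooling):

```python
# Paper perf-per-watt from the figures above (peak sparse FP8 only;
# real-world efficiency depends on workload, clocks, and cooling).
gpus = {
    "B200":      {"pflops": 9.0,  "kw": 1.0},   # standalone part
    "GB200 GPU": {"pflops": 10.0, "kw": 1.2},   # per GPU in the superchip
}

for name, g in gpus.items():
    print(f"{name}: {g['pflops'] / g['kw']:.2f} petaFLOPS per kW")
# B200: 9.00; GB200 GPU: 8.33
```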

However, it's not just that GB200 systems have higher peak performance – the GPU domain is also considerably larger. Traditionally, DGX systems have housed eight GPUs interconnected by a high-speed NVLink switch fabric, with additional scale achieved using multiple InfiniBand links between the nodes.

With Blackwell, Nvidia has expanded the NVLink domain from eight to 72 accelerators with its NVL72 reference designs.

How large a difference this actually makes in terms of time to train is hard to say, but we could see a sizable uplift by the time MLCommons releases its next batch of training results: time to train is often limited by data movement, and NVLink is several times faster than InfiniBand.

Even if the next training submission we see still comes from Nvidia's B200-based systems, improvements in software and networking infrastructure could still drive gains.

Next-gen ConnectX-8 SuperNICs are set to double InfiniBand bandwidth to 800 Gbps. Meanwhile, software optimizations and other upgrades have driven considerable performance improvements since Hopper made its debut on the MLPerf ranking.
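To put "several times faster" into rough numbers: NVLink 5 is quoted at 1.8 TB/s per Blackwell GPU (900 GB/s in each direction), against 400 Gbps per port for current-generation InfiniBand and 800 Gbps for ConnectX-8. The per-GPU comparison below assumes one NIC per GPU, which mirrors typical DGX-style designs but is our assumption, not a published Nyx detail:

```python
# Per-GPU fabric bandwidth, compared per direction. NVLink 5's published
# figure is 1.8 TB/s bidirectional per Blackwell GPU; one InfiniBand NIC
# per GPU is an assumption mirroring typical DGX-style designs.
nvlink5_gbs = 900.0                                # GB/s, one direction

for name, gbps in [("400G InfiniBand", 400), ("800G ConnectX-8", 800)]:
    nic_gbs = gbps / 8                             # Gb/s -> GB/s
    print(f"NVLink 5 vs {name}: {nvlink5_gbs / nic_gbs:.0f}x per GPU")
# -> 18x and 9x respectively
```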

Blackwell's training results come just months after Nvidia shared the first MLPerf inference benchmarks for the compute platform. In those tests, Nvidia achieved a 4x uplift over Hopper.

In addition to Blackwell, Nvidia also shared large-scale training results for the GPT-3 175B benchmark using 11,616 Hopper GPUs. Demonstrating scaling at that size matters: it's not uncommon to see clusters several times larger deployed to support model development. ®
