How Nvidia is overcoming slowdown issues in GPU clusters
Claims it has cracked the scalability problem with incoming H100
GTC If you're training a large deep learning model and you want to train it faster, you should just throw more GPUs at it, right? Well, that works for a while, but there is a limit to how much proportional performance you can expect from adding GPUs to a server cluster.
"A key challenge to reducing this time to train is the performance gains start to decline as you increase the number of GPUs in a data center," said Paresh Kharya, Nvidia's director of datacenter computing, at the GPU giant's virtual GTC 2022 event this week.
Nvidia claims it has cracked the case on the so-called scaling issue faced by large GPU clusters thanks to its upcoming H100 data center GPU along with associated software and interconnect technologies, which were revealed at GTC.
This is important because large, so-called transformer models for popular AI applications used by the world's biggest internet companies are getting huge and the darned things just won't stop growing, as evidenced by the 530 billion parameters of the Megatron-Turing conversational AI model.
To illustrate the scaling abilities of Nvidia's upcoming GPU and interconnect technologies, Kharya said that when Nvidia's new Hopper-based H100 GPU comes out in the third quarter, an H100-based cluster of 8,000 GPUs will be able to train the 395 billion-parameter Mixture of Experts transformer model nine times faster than an equivalent cluster using Nvidia's A100 GPU, which came out in 2020. That means it would take only about 20 hours to train the model rather than the seven days needed with the A100.
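For readers who want to check that claim, here is a quick back-of-the-envelope sketch. It uses only the figures quoted above (seven days, nine times faster, 20 hours); the variable names are ours, and it says nothing about how Nvidia actually measured the result.

```python
# Sanity check of the quoted time-to-train figures (illustrative only).
HOURS_PER_DAY = 24

a100_days = 7          # quoted A100 training time
claimed_speedup = 9    # quoted H100 speedup over A100
quoted_h100_hours = 20 # quoted H100 training time

a100_hours = a100_days * HOURS_PER_DAY
h100_hours = a100_hours / claimed_speedup

# Speedup implied if the H100 run took exactly the quoted 20 hours.
implied_speedup = a100_hours / quoted_h100_hours

print(f"A100: {a100_hours} hours")
print(f"H100 at 9x: {h100_hours:.1f} hours")
print(f"Speedup implied by 20 hours: {implied_speedup:.1f}x")
```

So a 9x speedup on a seven-day job lands a little under 19 hours, which squares with the roughly 20 hours quoted.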
"It'll also enable next generation of advanced AI models to be created, because they'll be within the reach for practical amount of time it takes to train them," he said.
The H100's high level of scalability also has big implications for inference, according to Kharya. When using the 530 billion-parameter Megatron model for real-time chatbots, the H100 can offer 30x more throughput than the A100. This is notable, he added, because chatbots often require a latency threshold of 1 second (you know how people get waiting for a response).
Beyond the massive performance boost the H100 provides over the A100, Kharya said there are a few main factors that are allowing Nvidia to make large GPU clusters more efficient.
First, there is the fourth-generation NVLink interconnect, which Nvidia is using to connect every H100 GPU in its new DGX H100 system. This interconnect tech can provide 900GBps of throughput, 50 percent faster than the previous NVLink generation. This NVLink technology is the basis for Nvidia's new NVSwitch, which enables GPU-to-GPU communication between the eight GPUs within the DGX.
Just as important is Nvidia's new external NVLink Switch, which can connect up to 32 DGX H100 systems for a total of 256 GPUs to form the company's new DGX SuperPOD supercomputer. Kharya said the new switch system provides nine times higher bandwidth than Nvidia's Quantum-1 InfiniBand interconnect, which comes from the company's 2020 acquisition of Mellanox Technologies.
Kharya said organizations can also use Nvidia's new Quantum-2 InfiniBand interconnect for large-scale GPU clusters, whether it's for connecting 256 GPUs or more. He said the new generation of high-speed interconnect technology has two times higher bandwidth than Quantum-1.
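The interconnect numbers in the last few paragraphs can be lined up with some simple arithmetic. This is only a sketch built from the figures quoted in this article; in particular, it assumes the "nine times" and "two times" comparisons with Quantum-1 were made on the same basis, which the article does not confirm.

```python
# Arithmetic on the interconnect figures quoted in the article (illustrative only).
gpus_per_dgx = 8           # eight GPUs within a DGX H100
dgx_per_superpod = 32      # up to 32 systems on the NVLink Switch
total_gpus = gpus_per_dgx * dgx_per_superpod

nvlink4_gbps = 900         # fourth-generation NVLink, in GBps
nvlink_gain = 1.5          # "50 percent faster" than the previous generation
previous_nvlink_gbps = nvlink4_gbps / nvlink_gain

# Both expressed as multiples of Quantum-1 InfiniBand bandwidth.
nvlink_switch_vs_q1 = 9
quantum2_vs_q1 = 2
# Assumption: both multiples use the same baseline measurement.
nvlink_switch_vs_q2 = nvlink_switch_vs_q1 / quantum2_vs_q1

print(total_gpus, previous_nvlink_gbps, nvlink_switch_vs_q2)
```

Under those assumptions, the 256-GPU total checks out, the previous NVLink generation works out to 600GBps, and the NVLink Switch would offer roughly 4.5 times the bandwidth of Quantum-2.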
One key enabling technology for the NVLink Switch and Quantum-2 InfiniBand interconnect, according to Kharya, is the company's in-network computing software, SHARP, which stands for Scalable Hierarchical Aggregation and Reduction Protocol.
Now in its third generation, SHARP's main goal is to speed up the Message Passing Interface API that is used to send data across a cluster of multiple nodes. Kharya said SHARP does this by offloading messaging operations from CPUs in clusters to the network switch, which would be the NVLink Switch or the Quantum-2 InfiniBand switch in this case.
"What that does is it eliminates the need for sending data back and forth between the different endpoints," Kharya said, which makes the overall network more efficient. It also frees up CPU resources, improving the overall performance of the cluster.
"The net effect is you get 15 times more in-network computing in a cluster of 256 GPUs," he added.
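To get a feel for why doing the math in the switch cuts down on traffic, here is a toy model. It is emphatically not SHARP's actual protocol: the function names and message counts are our own invention, and real host-based libraries use smarter ring or tree schemes than the naive all-to-all exchange shown here. It only contrasts "every endpoint talks to every other endpoint" with "every endpoint talks to the switch once".

```python
from typing import List, Tuple

Vector = List[float]


def endpoint_allreduce(values: List[Vector]) -> Tuple[List[Vector], int]:
    """Toy host-based sum: every endpoint sends its vector to every other one."""
    n = len(values)
    messages = n * (n - 1)
    # Each endpoint adds up all n vectors itself.
    results = [[sum(col) for col in zip(*values)] for _ in range(n)]
    return results, messages


def switch_allreduce(values: List[Vector]) -> Tuple[List[Vector], int]:
    """Toy in-network sum: one upload per endpoint, one result sent back to each."""
    n = len(values)
    reduced = [sum(col) for col in zip(*values)]  # summed "in the switch"
    messages = 2 * n
    return [list(reduced) for _ in range(n)], messages


# Four hypothetical endpoints, each holding a tiny gradient vector.
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

host_result, host_msgs = endpoint_allreduce(gradients)
switch_result, switch_msgs = switch_allreduce(gradients)

print("same answer:", host_result == switch_result)
print("messages, endpoint-to-endpoint:", host_msgs)
print("messages, via switch:", switch_msgs)
```

With four endpoints the difference is small (12 messages versus 8), but in this toy model the first count grows with the square of the cluster size while the second grows linearly, and the endpoints no longer do the additions themselves.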
The last element that helps with transformer models is the new Transformer Engine within the H100. This engine speeds up the training of these large deep learning models by as much as 6x compared to the A100 by "intelligently" managing their precision between 8-bit and 16-bit formats while maintaining accuracy. The hardware does this in conjunction with software.
Nvidia's ability to optimize performance across so many areas – from the GPU, to the system, to the network, to the software – underlines why CEO Jensen Huang believes Nvidia needs to be a "full-stack computing company" rather than just a provider of GPUs.
Which, in turn, is why the company has been so eager to acquire companies in different areas.
"So the net effect of all of these technologies is to improve both the performance and the scalability," Kharya said. ®