Azure-based 'AI supercomputer' to use Nvidia GPUs and software
Is it a super or a cloud service? Users would string together components as required
Microsoft and Nvidia say they are teaming up to build an "AI supercomputer" using Azure infrastructure combined with Nvidia's GPU accelerators, network kit, and its software stack.
The target market will be enterprises looking to train and deploy large state-of-the-art AI models at scale.
According to Nvidia, the project will be a multi-year effort that will aim to deliver "one of the most powerful AI supercomputers in the world," and the move will also make Azure the first public cloud to incorporate Nvidia's AI software stack.
To build this system, the pair will use the Azure cloud platform's GPU-based ND-series and NC-series virtual machines, and the project will involve bringing "tens of thousands" of Nvidia's A100 and newer H100 GPUs to the platform.
Nvidia claims that Microsoft's Azure is the first public cloud to incorporate its Quantum-2 InfiniBand networking switches for its AI-optimized virtual machines. The current Azure instances feature 200Gb/s Quantum InfiniBand and A100 GPUs, but future ones will have 400Gb/s Quantum-2 InfiniBand and the newer H100 GPUs.
The two companies offered no time frame for this project to become operational, but it appears that this so-called AI supercomputer may actually not exist as such.
Reading between the lines, it appears that Microsoft is simply adding to Azure a bunch of AI-optimized instances with Nvidia's latest hardware and software, which customers will string together as required for their specific project.
We asked Nvidia for clarification, and a spokesperson told us: "All new capabilities are in Azure instances, but the setup is such that enterprises will be able to scale those instances all the way up to supercomputing status." So that's an AI cloud service, in that case.
Nvidia's spokesperson added: "Customers can acquire resources as they normally would with a real supercomputer. For both you have a software layer that reserves resources. Just now it is in the cloud and not on a dedicated supercomputer. The most important thing is scalability. The resource on Azure can be scaled up to supercomputer standards with the same AI software, network capability and compute nodes."
The partnership will also see Nvidia make use of Azure resources to conduct research into generative AI models, which the company describes as a "rapidly emerging area" of AI in which foundational models such as Megatron-Turing NLG 530B are the basis for new algorithms capable of synthesizing text, computer code, images, and even video.
The model in question was developed by a team from Microsoft and Nvidia, using Nvidia's Megatron-LM transformer model and Microsoft's DeepSpeed deep learning optimization library, the latter of which the pair will also work on optimizing.
Scott Guthrie, Microsoft EVP for its Cloud + AI Group, said that AI will fuel the next wave of automation in enterprises, enabling organizations to do more with less.
"Our collaboration with Nvidia unlocks the world's most scalable supercomputer platform, which delivers state-of-the-art AI capabilities for every enterprise on Microsoft Azure."
Nvidia's VP of enterprise computing, Manuvir Das, said the partnership aimed to provide researchers and companies with infrastructure and software to exploit AI.
"The breakthrough of foundation models has triggered a tidal wave of research, fostered new startups and enabled new enterprise applications," he said.
To this end, the AI supercomputer/cloud service will support a broad range of applications and services, including Microsoft DeepSpeed and Nvidia's AI Enterprise software suite. The latter is already certified and supported on Azure instances with A100 GPUs, while support for Azure instances with the newer H100 GPUs will be added in a future software release, Nvidia said. ®