Build Microsoft has spun up a custom AI supercomputer in its Azure cloud for OpenAI that ranks within the top five publicly known most powerful supers on Earth, it is claimed.
The behemoth is said to contain “more than 285,000 CPU cores and 10,000 GPUs,” with up to 400Gbps of network connectivity for each GPU server.
If this beast is in the top five, that would mean – judging from the latest list of the world's most powerful publicly known supercomputers – it is somewhere between the 450,000-core 38-PFLOPS Frontera system at the University of Texas, and Uncle Sam's 2.4-million-core 200-PFLOPS Summit.
We have a feeling it's somewhere around fifth place and fourth, which is China's 4.9-million-core 100PFLOPS Tianhe-2A, but we're just guessing.
Microsoft's brag was made during its annual Build conference, a virtual developer-focused event taking place this week.
AI research is computationally intensive, and OpenAI needs copious amounts of compute to its train giant machine-learning models from massive blocks of data. Some of its most ambitious projects include GPT-2, a text-generating system with up to a billion parameters that required 256 Google TPU3 cores to train it on 40GB of text scraped from Reddit; and the OpenAI Five, a bot capable of playing Dota 2, which needed more than 128,000 CPU cores and 256 Nvidia P100 GPUs to school.
Google: We've achieved quantum supremacy! IBM: Nope. And stop using that word, pleaseREAD MORE
“As we’ve learned more and more about what we need and the different limits of all the components that make up a supercomputer, we were really able to say, ‘If we could design our dream system, what would it look like?” said OpenAI CEO Sam Altman. “And then Microsoft was able to build it.”
It’s difficult to know how brawny the cluster really is, however, since the Windows giant declined to disclose any further technical details beyond the above numbers. When pressed, a spokesperson told The Register: “We were not able to provide the system’s benchmark processing speed, only that it would rank among the top five on the TOP500 list of the fastest supercomputers in the world,” adding that it had “no information to share on types [of chips] used at the moment.” OpenAI also said it had “no additional details to share beyond that at this time.”
The custom Azure-hosted cluster allocated exclusively for OpenAI is part Microsoft’s $1bn investment in the San Francisco-based research lab. Last year, it pledged to support OpenAI’s efforts to build “a hardware and software platform within Microsoft Azure which will scale to artificial general intelligence." As part of the deal, OpenAI promised to make MICROS~1 its “preferred partner,” if and when it decides to commercialize any of its machine-learning projects.
Microsoft said its partnership with OpenAI was a “first step toward making the next generation of very large AI models and the infrastructure needed to train them available as a platform for other organizations and developers to build upon.”
The Windows giant is particularly interested in natural language models that it believes will improve search, and generate and summarize text. Earlier this year, it built Microsoft Turing, the world’s largest language model with a whopping 17 billion parameters. The model has been used to boost Microsoft’s Bing, Office, and Dynamics software, the US super-corp said. Now, it’s planning to open source its various Microsoft Turing models along with instructions on how to train it all for specific tasks in its Azure cloud platform.
“The exciting thing about these models is the breadth of things they’re going to enable,” gushed Microsoft Chief Technical Officer Kevin Scott.
“This is about being able to do a hundred exciting things in natural language processing at once and a hundred exciting things in computer vision, and when you start to see combinations of these perceptual domains, you’re going to have new applications that are hard to even imagine right now."
Other flashy AI announcements from Microsoft Build include updated versions of DeepSpeed and the ONNX Runtime. Deepspeed is a Python library that speeds up the process of training large machine-learning models across multiple compute nodes. ONNX Runtime is similar, and it’s more flexible in that it can optimize models for training and inference for a number of machine-learning frameworks. ®
Updated to add
We understand the supercomputer uses AMD second-generation Rome Epyc server processors, and Nvidia V100 GPUs, from sources in the industry.