Meta says it's building world's largest AI supercomputer out of Nvidia, AMD chips
Facebook owner needs 16,000 GPUs, 4,000 Epyc processors – good luck, everyone else
Facebook owner Meta is building the world's largest AI supercomputer to power machine-learning research that will bring the metaverse to life in the future, it claimed on Monday.
The new super – dubbed the AI Research SuperCluster, or RSC – will contain 16,000 Nvidia A100 GPUs and 4,000 AMD Epyc Rome 7742 processors. It is built from 2,000 compute nodes, each an Nvidia DGX-A100 system containing eight GPUs and two Epyc processors.
It's expected to hit a peak performance of 5 exaFLOPS at mixed precision – FP16 and FP32 – and can feed in 16 terabytes of training information per second from up to 1EB of cache-based storage, we're told.
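For readers who want to check the sums, here is a minimal back-of-the-envelope sketch in Python using only the figures quoted above. The function names are ours and purely illustrative; nothing here comes from Meta or Nvidia tooling. It assumes the 5 exaFLOPS peak is spread evenly across every GPU, which is a simplification.

```python
# Figures quoted in the article for the finished RSC.
NODES = 2_000          # Nvidia DGX-A100 systems
GPUS_PER_NODE = 8      # A100 GPUs per system
CPUS_PER_NODE = 2      # AMD Epyc 7742 processors per system
PEAK_EXAFLOPS = 5      # claimed mixed-precision peak


def cluster_totals(nodes, gpus_per_node, cpus_per_node):
    """Return (total GPUs, total CPUs) for a cluster of identical nodes."""
    return nodes * gpus_per_node, nodes * cpus_per_node


def per_gpu_tflops(peak_exaflops, gpus):
    """Implied peak per GPU, assuming the total is split evenly.

    1 exaFLOPS = 1,000,000 teraFLOPS.
    """
    return peak_exaflops * 1_000_000 / gpus


gpus, cpus = cluster_totals(NODES, GPUS_PER_NODE, CPUS_PER_NODE)
print(gpus, cpus)                           # 16000 4000
print(per_gpu_tflops(PEAK_EXAFLOPS, gpus))  # 312.5
```

The implied figure of about 312 teraFLOPS per GPU is in line with the peak FP16 tensor-core throughput Nvidia quotes for the A100, which suggests the 5 exaFLOPS headline number is a theoretical peak rather than a measured result.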
The RSC is being built with the help of Penguin Computing, an HPC supplier based in California, which will provide the infrastructure and managed security.
"Meta has developed what we believe is the world's fastest AI supercomputer," CEO Mark Zuckerberg said in a statement to The Register.
"We're calling it RSC for AI Research SuperCluster and it'll be complete later this year. The experiences we're building for the metaverse require enormous compute power – quintillions of operations per second – and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more."
Nvidia confirmed the massive machine is expected to be the largest customer installation of DGX A100 systems once it's fully built and up-and-running by mid-2022. "RSC took just 18 months to go from an idea on paper to a working AI supercomputer," Nvidia said.
The RSC already exists, albeit in a less flashy form, delivering 1,895 PFLOPS of TF32 performance. Right now it's made up of 760 Nvidia DGX-A100 systems containing 1,520 AMD Rome 7742 processors and 6,080 GPUs. Each GPU is connected via Nvidia's Quantum InfiniBand, which is capable of shuttling data back and forth at 200 gigabits per second. More nodes will be added as the year goes on.
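The same sort of rough arithmetic works for the machine as it stands today. The sketch below is our own illustration built from the numbers in this article; it assumes identical nodes and an even split of the quoted TF32 figure across GPUs.

```python
# Figures quoted in the article for the RSC's current phase.
CURRENT_NODES = 760
PLANNED_NODES = 2_000
GPUS_PER_NODE = 8
CPUS_PER_NODE = 2
CURRENT_TF32_PFLOPS = 1_895

current_gpus = CURRENT_NODES * GPUS_PER_NODE
current_cpus = CURRENT_NODES * CPUS_PER_NODE

# Share of the planned node count that is already installed.
fraction_built = CURRENT_NODES / PLANNED_NODES

# Implied TF32 throughput per GPU (1 petaFLOPS = 1,000 teraFLOPS).
tf32_tflops_per_gpu = CURRENT_TF32_PFLOPS * 1_000 / current_gpus

print(current_gpus, current_cpus)     # 6080 1520
print(f"{fraction_built:.0%}")        # 38%
print(round(tf32_tflops_per_gpu, 1))  # 311.7
```

By node count, then, a little over a third of the planned system is in place.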
The RSC can also store up to 175 petabytes in Pure Storage FlashArray hardware, 46 petabytes in cache storage, and 10 petabytes in Pure's FlashBlade object storage equipment. It's hoped that this capacity will grow into exabyte territory, again using Pure products. To put that into perspective, Meta said 1EB could hold 36,000 years of high-quality video.
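As a sanity check on those storage figures, the sketch below totals the three tiers and works out what video bitrate Meta's 36,000-year comparison implies. It is our own illustration and rests on two assumptions: decimal units (1PB = 10^15 bytes, 1EB = 10^18 bytes) and a 365.25-day year. Meta did not say which bitrate it had in mind.

```python
# Storage tiers quoted in the article, in petabytes.
FLASHARRAY_PB = 175
CACHE_PB = 46
FLASHBLADE_PB = 10

# Assumptions: decimal units and a 365.25-day year.
PB_PER_EB = 1_000
EXABYTE_BYTES = 10**18
SECONDS_PER_YEAR = 365.25 * 24 * 3600
VIDEO_YEARS = 36_000

total_pb = FLASHARRAY_PB + CACHE_PB + FLASHBLADE_PB
share_of_exabyte = total_pb / PB_PER_EB

# Bitrate at which 36,000 years of video would fill exactly 1EB.
implied_mbit_per_s = EXABYTE_BYTES * 8 / (VIDEO_YEARS * SECONDS_PER_YEAR) / 1e6

print(total_pb)                      # 231
print(f"{share_of_exabyte:.1%}")     # 23.1%
print(round(implied_mbit_per_s, 2))  # 7.04
```

Under those assumptions, today's 231PB is a bit under a quarter of the exabyte goal, and "high-quality video" works out to roughly 7 megabits per second.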
The supercomputer is estimated to be 9X faster than Meta's previous research cluster, which was made up of 22,000 of Nvidia's older-generation V100 GPUs, and 20X faster than the systems that currently run its AI models in production. That older research cluster could run up to 35,000 training jobs a day, we're told.
"The benefit of this new design is that we are able to scale to many GPUs without performance drops," a Meta spokesperson told The Register. "We expect to have a smaller number of training jobs running than our previous AI research infrastructure but each job would train larger models to fully utilize the design."
Meta is focused on building self-supervised learning and transformer-based models. These complex architectures are easy to scale up. They can handle multiple types of data, such as audio, text, and images, within a single model. The RSC has been planned specifically for training models with more than a trillion parameters.
"We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together," Meta technical program manager Kevin Lee and software engineer Shubho Sengupta explained in a blog post.
"Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role." ®