Microsoft is building datacenter superclusters that span continents
The 100-trillion-parameter models of the near future can't be built in one place
Microsoft believes the next generation of AI models will use hundreds of trillions of parameters. To train them, it's not just building bigger, more efficient datacenters – it's started connecting distant facilities using high-speed networks spanning hundreds or thousands of miles.
The first node of this multi-datacenter cluster came online in October, connecting Microsoft's datacenter campus in Mount Pleasant, Wisconsin, to a facility in Atlanta, Georgia.
The software giant’s goal is to eventually scale AI workloads across datacenters using methods similar to those employed today to distribute high-performance computing and AI workloads across multiple servers.
"To make improvements in the capabilities of the AI, you need to have larger and larger infrastructure to train it," said Microsoft Azure CTO Mark Russinovich in a canned statement. "The amount of infrastructure required now to train these models is not just one datacenter, not two, but multiples of that."
These aren't ordinary datacenters, either. The facilities are the first in a family of bit barns Microsoft is calling its “Fairwater” clusters. These facilities are two stories tall, use direct-to-chip liquid cooling, and consume "almost zero water," Microsoft boasts.
Eventually, Microsoft envisions this network of datacenters will scale to hundreds of thousands of diverse GPUs chosen to match workloads and availability. At its Atlanta facility, Microsoft will deploy Nvidia's GB200 NVL72 rack systems, each rated to host over 120 kilowatts of kit and to offer 720 petaFLOPS of sparse FP8 compute for training, backed by 13 TB of HBM3e memory.
Spreading the load
By connecting its datacenters, Microsoft will be able to train much larger models while gaining flexibility in where it sites its facilities – meaning it can pick places with cheap land, cooler climates, and – perhaps most importantly – access to ample power.
Microsoft doesn't specify what technology it's using to bridge the roughly 1,000-kilometer (as the vulture flies) distance between the two datacenters, but it has plenty of options.
Last month, Cisco revealed the Cisco 8223, a 51.2 Tbps router designed to connect AI datacenters up to 1,000 kilometers apart. Broadcom's Jericho4 hardware, announced in August, is intended to do the same job at similar bandwidth.
Meanwhile, Nvidia, which has quietly become one of the largest networking vendors in the world on the back of the AI boom, has teased its Spectrum-XGS network switches, with crypto-miner-turned-rent-a-GPU outfit CoreWeave signed up as an early adopter.
We've asked Microsoft to comment on which of these technologies it's using at its Fairwater facilities, and will update this story if we hear back. But Redmond’s close ties to Nvidia certainly make Spectrum-XGS a likely contender.
Microsoft is famously one of the few hyperscalers that's standardized on Nvidia's InfiniBand network protocol, rather than Ethernet or a proprietary data fabric like Amazon Web Services' EFA, for its high-performance compute environments.
While Microsoft has no shortage of options for stitching datacenters together, distributing AI workloads without incurring bandwidth- or latency-related penalties remains a topic of interest to researchers.
They're making good progress: Readers may recall that earlier this year, Google's DeepMind team published a report showing that many of the challenges can be overcome by compressing models during training and strategically scheduling communications between datacenters. ®
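Bootnote: For readers curious what "compressing models during training" buys in practice, here's a minimal Python sketch of top-k gradient sparsification – one common compression trick used to shrink cross-datacenter traffic, not the specific technique in DeepMind's paper. All names and numbers below are illustrative.

```python
# Illustrative sketch only: top-k gradient sparsification, one common way to
# shrink the payload that must cross a long-haul inter-datacenter link when
# synchronizing model replicas. Not DeepMind's actual method, which pairs
# compression with strategic communication scheduling.
import random

def topk_compress(grad, k):
    """Keep only the k largest-magnitude entries as (index, value) pairs."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return [(i, grad[i]) for i in idx]

def decompress(pairs, n):
    """Rebuild a dense vector, zero-filling the dropped entries."""
    out = [0.0] * n
    for i, v in pairs:
        out[i] = v
    return out

random.seed(0)
grad = [random.gauss(0, 1) for _ in range(1000)]    # stand-in gradient vector
compressed = topk_compress(grad, 10)                # only 1% of entries survive
restored = decompress(compressed, len(grad))
```

Shipping ten (index, value) pairs instead of a 1,000-entry dense vector cuts the cross-WAN payload by roughly two orders of magnitude – the basic trade researchers are exploring at vastly larger scale.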