Our AI habit is already changing the way we build datacenters
If you thought a 700W GPU was hot, imagine what it takes to keep racks full of 15kW accelerators cool
Analysis The mad dash to secure and deploy AI infrastructure is forcing datacenter operators to reevaluate the way they build and run their facilities.
In your typical datacenter, cold air is pulled through a rack full of compute, networking, and storage systems. At the back, the heated air is then captured and ejected by the facility's cooling infrastructure.
This paradigm works just fine for 6-10kW racks, but starts to fall apart when you deploy the kinds of systems used to train AI models like GPT-4. Modern GPU nodes can easily consume an entire rack's worth of power, and this is forcing datacenter operators to make some serious design changes.
'Ludicrous mode' for datacenters
Tesla appears to be the latest to realize this. As we reported earlier this week, the US electric vehicle manufacturer is looking for folks to help it build "first of its kind datacenters."
In a recent job posting, the company said it was looking for a senior engineering program manager for datacenters, who will "lead the end-to-end design and engineering of Tesla's first of its kind datacenters and will be one of the key members of its engineering team."
This person would also be responsible for overseeing the construction of a new datacenter, which suggests the role may be unrelated to reports by The Information claiming that Tesla recently took over a datacenter lease in Sacramento that Twitter abandoned following the social network's acquisition by Tesla CEO Elon Musk.
While it's not exactly clear what the company means by "first of its kind datacenters" – we've asked Tesla and have yet to hear back – it may have something to do with the custom Dojo AI accelerator it showed off at Hot Chips last year.
The company plans to dump upwards of $1 billion into the project between now and the end of 2024 to accelerate the development of its autonomous driving software. Speaking in July, Musk revealed the complete system could exceed 100 exaFLOPS, of what we presume to be BF16 performance.
That means Tesla is going to have to find somewhere capable of housing the thing, and someone to keep the lights on and all those points floating. And based on what we know of the Dojo accelerator, architecting and managing a facility capable of delivering adequate power and cooling to keep the AI accelerator humming could be a bit of a nightmare.
Dojo is a composable supercomputer, developed entirely in house by Tesla. Everything from the compute, networking, and IO to the instruction set architecture, power delivery, packaging, and cooling was custom built with the express purpose of accelerating Tesla's machine learning algorithms.
The basic building block of this system is Tesla's D1 chiplet. Twenty-five of these are packaged together using TSMC's system-on-wafer technology into the Dojo Training tile. All told, the half-cubic-foot system features 11GB of SRAM, 9TB/s of fabric connectivity, and can manage 9 petaFLOPS of BF16 performance. You can find a full breakdown of the massive AI accelerator on our sibling site, The Next Platform.
Of course, cramming all that performance into such a compact form factor does present some unique challenges. How do you power and cool a single 15kW accelerator, let alone the six of them packed into each Dojo system tray, or the 100-plus tiles it takes to reach the 1 exaFLOPS of the Dojo V1 system? And that's just the accelerators. You also need to power and cool all the supporting systems used to feed and coordinate the flow of data through the accelerators.
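To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It takes the per-tile figures quoted above (15kW and 9 petaFLOPS of BF16) at face value and assumes, purely for illustration, that a system is built from nothing but identical tiles. The function names are our own, and supporting systems are not counted.

```python
import math

# Per-tile figures as quoted in the article.
TILE_POWER_KW = 15
TILE_PFLOPS_BF16 = 9


def tiles_for_target(target_pflops: float) -> int:
    """Smallest whole number of tiles whose combined BF16 rate meets the target."""
    return math.ceil(target_pflops / TILE_PFLOPS_BF16)


def accelerator_power_kw(tiles: int) -> int:
    """Accelerator-only power draw; supporting systems are not counted."""
    return tiles * TILE_POWER_KW


if __name__ == "__main__":
    # Assumption: every system here is made solely of identical tiles.
    for label, tiles in (
        ("six tiles", 6),
        ("1 exaFLOPS", tiles_for_target(1_000)),
        ("100 exaFLOPS", tiles_for_target(100_000)),
    ):
        pflops = tiles * TILE_PFLOPS_BF16
        power = accelerator_power_kw(tiles)
        print(f"{label}: {tiles} tiles, {pflops} petaFLOPS, {power:,} kW")
```

On these assumptions, six tiles draw 90kW for 54 petaFLOPS, while 1 exaFLOPS takes 112 tiles and roughly 1.7MW for the accelerators alone.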
Then there's the matter of the high-speed mesh, which could prove prohibitive in terms of how these tiles can be deployed. At those speeds, the closer you can pack them, the better, but also the greater the thermal load is going to be. As such, it wouldn't be surprising if Tesla ditched the idea of using traditional racks altogether in favor of something completely unique.
This humble vulture would personally love to see a return to the wild and wacky supercomputing designs of yore. Supercomputers used to be weird and fun. Don't believe me? Just look up Thinking Machines' CM-1 or the Cray-2. Those were some good-looking machines.
Whatever form this system ultimately takes, one thing is for sure: wherever Tesla decides to deploy it is going to need supercomputing levels of water cooling capacity.
AI is already changing the face of datacenters
It's not just Tesla. The cooling and power requirements imposed by AI infrastructure are already driving several large hyperscalers and DC operators to reevaluate how they build their datacenters.
One of the companies driving these changes is Facebook parent company Meta. The company is heavily invested in AI research and development, having commissioned an AI supercomputer composed of 16,000 Nvidia A100 GPUs last year.
This infrastructure has not only helped to fuel development of AI models, like the not-exactly-open-source Llama 2 large language model, but has also served to shape the infrastructure itself. Meta, or rather Facebook, launched the Open Compute Project (OCP) all the way back in 2011 to accelerate the development of datacenter infrastructure.
At the OCP Summit last year, Meta revealed its Grand Teton AI training platform alongside its Open Rack v3 (ORV3) specification, which was designed to accommodate the higher power and thermal loads of the system. For example, under the spec, Meta says a single bus bar can support 30kW racks.
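To see why that power budget matters, the following Python sketch counts how many GPU nodes fit in a rack before either power or space runs out. The 6kW, 10kW, and 30kW budgets come from the figures in this article. The node itself is hypothetical: eight 700W GPUs plus an assumed 2kW for CPUs, memory, fans, and networking, in an assumed 8U chassis and a 42U rack.

```python
def nodes_per_rack(rack_budget_w: int, node_w: int,
                   rack_units: int = 42, node_units: int = 8) -> int:
    """Nodes that fit in one rack, limited by power or space, whichever runs out first.

    rack_units and node_units are hypothetical defaults, not taken from any spec.
    """
    by_power = rack_budget_w // node_w
    by_space = rack_units // node_units
    return min(by_power, by_space)


# Hypothetical node: eight 700W GPUs plus an assumed 2,000W of overhead.
NODE_W = 8 * 700 + 2_000

if __name__ == "__main__":
    for budget_w in (6_000, 10_000, 30_000):
        count = nodes_per_rack(budget_w, NODE_W)
        print(f"{budget_w // 1000}kW rack: {count} node(s)")
```

Under these assumptions a 6kW rack cannot power even one node, a 10kW rack holds one, and a 30kW rack holds three, with most of the rack's physical space still empty.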
"With higher socket power comes increasingly complex thermal management overhead. The ORV3 ecosystem has been designed to accommodate several different forms of liquid cooling strategies, including air-assisted liquid cooling and facility water cooling," Meta's VP of Infrastructure, Alexis Bjorlin, wrote in a blog post last fall. "The power trend increases we are seeing, and the need for liquid cooling advances are forcing us to think differently about all elements of our platform, rack, power, and datacenter design."
That last point on datacenter design is particularly salient, as not long after that blog post was published, Meta canceled two Dutch datacenters and announced it would redesign a third in Huntsville, Alabama, amid what the company described as a "strategic investment in artificial intelligence."
Air-assisted liquid cooling takes center stage
One of the key technologies Meta and others are investing in is something called air-assisted liquid cooling. As its name suggests, the technology is something of a half step toward the kinds of fully liquid cooled infrastructure we've seen in HPE Cray, Atos, and Lenovo supercomputers for years.
The tech makes extensive use of rear-door heat exchangers (RDHx) to reduce the facility-wide infrastructure investments necessary to support hotter running chips. RDHx are really quite simple, amounting to little more than a rack-sized radiator and some big fans. The tech is favored by many because of its flexibility, which allows it to be deployed in facilities with or without the plumbing required to support rack-level liquid cooling.
In Meta's case, the company is looking at RDHx as a means to more efficiently remove heat from the systems. As we understand it, the implementation involves direct liquid cooled (DLC) servers, which are plumbed up to an in-rack reservoir and pump, which propels heated coolant through the RDHx, where heat from the systems is exhausted to the hot aisle.
In this configuration, the RDHx functions a lot like a custom water cooling loop in a gaming PC, but instead of cooling one system, it is designed to cool the entire rack.
However, this isn't the only way we've seen air-assisted liquid cooling done. RDHx can also be used with conventional air-cooled systems. In this configuration, cold facility water is pumped through the RDHx. As hot air is exhausted out the back of the air-cooled systems, that heat is absorbed by the radiator. Meta published an entire paper on the viability of this technology last October [PDF].
Several colocation providers, including Digital Realty, Equinix, Cyxtera, and Colovore, have confirmed support for RDHx cooling in their datacenters, though it's our understanding that this is usually a custom-order sort of thing.
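For a sense of the plumbing involved, the standard heat balance Q = m·c·ΔT gives the water flow a heat exchanger needs to carry away a given load. The Python sketch below is a rough estimate only: the 10°C temperature rise across the radiator is a hypothetical figure, the water properties are textbook approximations, and perfect heat transfer is assumed.

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg*K), approximate value for liquid water
WATER_DENSITY = 1.0           # kg/L, approximate


def water_flow_lpm(heat_kw: float, delta_t_k: float) -> float:
    """Liters per minute of water needed to absorb heat_kw with a delta_t_k rise.

    Rearranges Q = m_dot * c * delta_T for the mass flow m_dot.
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    kg_per_s = heat_kw * 1000.0 / (WATER_SPECIFIC_HEAT * delta_t_k)
    return kg_per_s / WATER_DENSITY * 60.0


if __name__ == "__main__":
    # 15kW and 30kW are figures from the article; the 10K rise is hypothetical.
    for load_kw in (15, 30):
        print(f"{load_kw}kW at a 10K rise: {water_flow_lpm(load_kw, 10):.1f} L/min")
```

With those assumptions, a 30kW rack works out to roughly 43 liters of water per minute. Halving the allowed temperature rise doubles the required flow.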
One of the biggest benefits of this approach, particularly for colos, is it doesn't require customers to embrace DLC before they're ready, and doesn't require them to support the minefield of conflicting standards that pepper the liquid cooling industry.
The benefits of this technology aren't limited to AI or HPC workloads either. As CPUs grow hotter and more core dense, chipmakers – AMD and Ampere in particular – have been selling the prospect of densification. In other words, consolidating multiple servers, potentially racks full of older ones, down into a handful of high-core-count systems.
The problem is these core-dense systems use so much energy that you're likely to run out of power before the rack is close to full. Higher-density rack configurations and rear-door heat exchangers have the potential to allow customers to cram much of their infrastructure into a handful of racks. ®