Schneider Electric warns that existing datacenters aren't buff enough for AI
You're going to need liquid-cooled servers, 415V PDUs, two-ton racks, and plenty of software management
The infrastructure behind popular AI workloads is so demanding that Schneider Electric has suggested it may be time to reevaluate the way we build datacenters.
In a recent white paper [PDF], the French multinational broke down several of the factors that make accommodating AI workloads so challenging and offered its guidance for how future datacenters could be optimized for them. The bad news is some of the recommendations may not make sense for existing facilities.
The problem boils down to the fact that AI workloads often require low-latency, high-bandwidth networking to operate efficiently, which forces densification of racks, and ultimately puts pressure on existing datacenters' power delivery and thermal management systems.
Today it's not uncommon for GPUs to consume upwards of 700W and servers to exceed 10kW. Hundreds of these systems may be required to train a large language model in a reasonable timescale.
According to Schneider, this is already at odds with what most datacenters can manage at 10-20kW per rack. This problem is exacerbated by the fact that training workloads benefit heavily from maximizing the number of systems per rack as it reduces network latency and costs associated with optics.
In other words, spreading the systems out can reduce the load on each rack, but if doing so requires using slower optics, bottlenecks can be introduced that negatively affect cluster performance.
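To make the density trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a round 10kW per server (the article only says servers can exceed that), a hypothetical 200-server cluster standing in for "hundreds," and rack power budgets picked for illustration. The function names are made up for this example.

```python
import math


def servers_per_rack(rack_budget_kw: float, server_kw: float) -> int:
    """How many whole servers fit inside a rack's power budget."""
    return math.floor(rack_budget_kw / server_kw)


def racks_needed(num_servers: int, rack_budget_kw: float, server_kw: float) -> int:
    """Racks required to house num_servers at the given power budget."""
    per_rack = servers_per_rack(rack_budget_kw, server_kw)
    if per_rack == 0:
        raise ValueError("rack budget is too small for even one server")
    return math.ceil(num_servers / per_rack)


if __name__ == "__main__":
    SERVER_KW = 10.0   # assumed round figure per server
    CLUSTER = 200      # hypothetical cluster size
    for budget_kw in (10.0, 20.0, 40.0):
        racks = racks_needed(CLUSTER, budget_kw, SERVER_KW)
        print(f"{budget_kw:.0f}kW racks: {racks} racks for {CLUSTER} servers")
```

Under these assumptions, a 10kW rack budget spreads the cluster over 200 racks, while 40kW racks pack it into 50, which is the kind of consolidation that keeps interconnect runs short.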
"For example, using GPUs that process data from memory at 900GB/s with a 100GB/s compute fabric would decrease the average GPU utilization because it's waiting on the network to orchestrate what the GPUs do next," the report reads. "This is a bit like buying a 500-horsepower autonomous vehicle with an array of fast sensors communicating over a slow network; the car's speed will be limited by the network speed, and therefore won't fully use the engine's power."
The situation isn't nearly as dire for inferencing – the act of putting trained models to work generating text, images, or analyzing mountains of unstructured data – as fewer AI accelerators per task are required compared to training.
The question then becomes: how do you safely and reliably deliver adequate power to these dense 20-plus kilowatt racks, and how do you efficiently reject the heat generated in the process?
"These challenges are not insurmountable but operators should proceed with a full understanding of the requirements, not only with respect to IT, but to physical infrastructure, especially existing datacenter facilities," the report's authors write.
The white paper highlights several changes to datacenter power, cooling, rack configuration, and software management that operators can implement to mitigate the demands of widespread AI adoption.
Needs more power!
The first involves power delivery and calls for replacing 120/208V power distribution with 240/415V systems to reduce the number of circuits within high-density racks. However, this in itself isn't a silver bullet, and Schneider notes that even using the highest rated power distribution units (PDUs) available today, operators will be challenged to deliver adequate power to denser configurations.
As a result, either multiple PDUs may be required per rack or operators may need to source custom PDUs capable of greater than 60-63 amps.
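The arithmetic behind that recommendation can be sketched with the standard three-phase power formula, P = √3 × V × I, where V is the line-to-line voltage. The 80 percent continuous-load derating, the unity power factor, and the 50kW rack below are assumptions for illustration rather than figures from the white paper, and the sketch ignores redundant A/B feeds.

```python
import math


def three_phase_kw(line_to_line_volts: float, amps: float,
                   derate: float = 1.0, power_factor: float = 1.0) -> float:
    """Usable power of a three-phase circuit in kW: sqrt(3) * V * I."""
    watts = math.sqrt(3) * line_to_line_volts * amps * derate * power_factor
    return watts / 1000.0


def pdus_needed(rack_kw: float, pdu_kw: float) -> int:
    """Minimum PDU count to carry the rack load (no redundancy)."""
    return math.ceil(rack_kw / pdu_kw)


if __name__ == "__main__":
    RACK_KW = 50.0  # hypothetical dense AI rack
    DERATE = 0.8    # assumed continuous-load derating
    for volts in (208, 415):
        pdu_kw = three_phase_kw(volts, 60, derate=DERATE)
        count = pdus_needed(RACK_KW, pdu_kw)
        print(f"{volts}V/60A PDU: {pdu_kw:.2f}kW usable, {count} needed")
```

With those assumptions, a 60-amp PDU delivers roughly 17.3kW at 208V and about 34.5kW at 415V, so doubling the voltage roughly halves the circuit count, yet a sufficiently dense rack still needs more than one PDU.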
At the higher voltages and currents, Schneider does warn operators to conduct an arc flash risk assessment and load analysis to ensure the right connectors are used to prevent injuries to personnel. Arc flash isn't to be taken lightly and can result in burns, blindness, electric shock, hearing loss, and/or fractures.
Of course they're fans of liquid cooling
When it comes to thermal management, Schneider's guidance won't surprise anyone: liquid cooling. "Liquid cooling for IT has been around for half a century for specialized high-performance computing," the authors emphasize.
As for when datacenter operators should seriously consider making the switch, Schneider puts that threshold at 20kW per rack. The company argues that for smaller training or inference workloads, air cooling is adequate up to this point, so long as proper airflow management practices like blanking panels and aisle containment are used. Above 20kW, Schneider says "strong consideration should be given to liquid cooled servers."
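Expressed as a simple rule of thumb, that guidance might look something like the sketch below. The 20kW threshold comes from the paper as reported here; the function name, the returned strings, and the single airflow flag are simplifications invented for illustration.

```python
AIR_COOLING_LIMIT_KW = 20.0  # threshold per rack cited from the white paper


def cooling_guidance(rack_kw: float, airflow_managed: bool = True) -> str:
    """Rough rule of thumb for a rack's cooling approach (illustrative only)."""
    if rack_kw > AIR_COOLING_LIMIT_KW:
        return "strongly consider liquid-cooled servers"
    if not airflow_managed:
        # Blanking panels and aisle containment are prerequisites for air cooling.
        return "air cooling, once airflow management is in place"
    return "air cooling"


if __name__ == "__main__":
    for kw in (12.0, 20.0, 35.0):
        print(f"{kw:.0f}kW rack: {cooling_guidance(kw)}")
```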
As for the specific technology to employ, the company favors direct liquid cooling (DLC), which removes heat by passing fluids through cold plates attached to hotspots, like CPUs and GPUs.
The company isn't as keen on immersion cooling systems, particularly those using two-phase coolants. Some of these fluids, including those manufactured by 3M, have been linked to PFAS – AKA forever chemicals – and pulled from the market. For those already sold on dunking their servers in big tanks of coolant, Schneider suggests sticking with single-phase fluids, but warns they tend to be less efficient at heat transfer.
In any case, Schneider warns that care should be taken when selecting liquid-cooled systems due to a general lack of standardization.
Don't forget the supporting infrastructure, software
Of course all of this assumes that liquid cooling is even practical. Depending on facility constraints – a lack of adequate raised floor height for running piping, for example – retrofitting an existing facility may not be viable.
And where these power and thermal mods can be made, Schneider says operators may need to consider heavier-duty racks. The paper calls for 48U, 40-inch deep cabinets that can support static capacities of just under two tons – for reference, that's about 208 adult badgers – to make room for the larger footprint associated with AI systems and PDUs.
Finally, the group recommends employing a variety of datacenter infrastructure (DCIM), electrical power (EPMS), and building management system (BMS) software platforms to identify problems before they take out adjacent systems and negatively impact business-critical workloads. ®