AI giants call for energy grid kumbaya

Microsoft, Nvidia, and OpenAI researchers warn of uneven power usage associated with AI training, and propose possible fixes

Researchers at Microsoft, Nvidia, and OpenAI have issued a call to designers of software, hardware, infrastructure, and utilities for help finding ways to normalize power demand during AI training.

Nearly 60 scientists at the three firms have co-authored a paper about the need to address the power management challenges of AI training workloads. Their concern is that the fluctuating power demand of AI training threatens the electrical grid's ability to handle the variable load.

The paper, "Power Stabilization for AI Training Datacenters," argues that oscillating energy demand between the power-intensive GPU compute phase and the less-taxing communication phase, where parallelized GPU calculations get synchronized, represents a barrier to the development of AI models.

The authors note that the difference in power consumption between the compute and communication phases is extreme: the former approaches the thermal limits of the GPU, while the latter is close to idle energy usage.

This variation in power demand occurs at the node (server) level and, because AI training is synchronous, simultaneously across nodes throughout the data center. The oscillations therefore become visible at the rack, data center, and power grid levels – imagine 50,000 hairdryers (~2,000 watts each) being switched on and off at once.
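The hairdryer analogy is easy to sanity-check with back-of-envelope arithmetic, using the figures above:

```python
# 50,000 hairdryers at ~2,000 W each, switching on simultaneously.
hairdryers = 50_000
watts_each = 2_000

total_mw = hairdryers * watts_each / 1_000_000  # convert watts to megawatts
print(total_mw)  # 100.0 MW - squarely in the "tens or hundreds of megawatts" range
```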

"At scale, these swings can amount to tens or hundreds of megawatts, occurring at frequencies that, if poorly aligned with the resonant characteristics of power grid components (e.g., turbine generators or long transmission lines), can risk grid instability and mechanical failure," the authors observe. 

"These issues are not theoretical – multiple utility providers have now documented the impact of harmonics induced by synchronized computing loads."

Looking beyond just AI training, Schneider Electric expects the US grid will become less stable by the end of the decade due to data center energy demand. A US Department of Energy report published last December said, "data centers consumed about 4.4 percent of total US electricity in 2023 and are expected to consume approximately 6.7 to 12 percent of total US electricity by 2028."

The boffins from Microsoft, Nvidia, and OpenAI have kicked off the power stabilization party with an evaluation of three different strategies, each of which has pros and cons.

There are software-based approaches, which even out power usage by injecting secondary workloads whenever GPU activity falls below a certain threshold. The downsides are performance overhead, the need for collaboration between customers and cloud providers, and limited reliability.
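The idea behind workload injection can be sketched in a few lines. This is a toy simulation over a synthetic utilization trace, not the paper's implementation: the `inject_filler` function and the 0.6 floor are illustrative assumptions.

```python
def inject_filler(trace, floor=0.6):
    """Pad each sample of a GPU-utilization trace (0.0-1.0) up to a floor,
    mimicking a secondary workload that absorbs the communication-phase dip."""
    return [max(u, floor) for u in trace]

# Compute phases near the thermal limit, communication phases near idle:
trace = [0.95, 0.95, 0.10, 0.10, 0.95, 0.95, 0.10]
smoothed = inject_filler(trace)
print(min(trace), min(smoothed))  # the worst-case dip rises from 0.10 to 0.60
```

In a real deployment, the trace would come from GPU telemetry and the filler would be an actual low-priority kernel, which is where the performance overhead and coordination problems arise.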

GPU-level firmware features like power smoothing, supported in the Nvidia GB200, give developers and cloud providers a way to set a power-utilization floor and to control ramp-up and ramp-down rates. But power smoothing imposes an extra energy cost.
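The ramp-rate part of this can be illustrated with a toy simulation; the function and thresholds below are illustrative, not the GB200's actual firmware interface:

```python
def ramp_limit(power_trace, max_step):
    """Clamp the per-step change in a power trace to +/- max_step watts,
    approximating firmware-level ramp-rate control."""
    out = [power_trace[0]]
    for target in power_trace[1:]:
        prev = out[-1]
        step = max(-max_step, min(max_step, target - prev))
        out.append(prev + step)
    return out

# Square wave between 1000 W (compute) and 100 W (communication):
trace = [1000, 1000, 100, 100, 1000, 1000]
smoothed = ramp_limit(trace, max_step=300)
print(smoothed)  # [1000, 1000, 700, 400, 700, 1000]
```

The grid-facing swing per step drops from 900 W to 300 W, but note the extra energy cost: during the communication phase the GPU is held above the ~100 W it actually needs.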

And data center-level capabilities like Battery Energy Storage Systems offer a mechanism for handling power demand spikes locally, without burdening the utility grid. But energy storage hardware can be expensive.
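How a battery buffer flattens the load seen by the utility can be shown with a toy model (an idealized lossless battery, with made-up numbers; real BESS sizing is far more involved):

```python
def bess_grid_draw(load_trace, capacity):
    """Toy BESS model: the grid supplies the average load, while the battery
    discharges on spikes and recharges on dips - valid only while the
    battery's state of charge stays within its limits."""
    target = sum(load_trace) / len(load_trace)
    soc = capacity / 2  # start half charged
    grid = []
    for load in load_trace:
        soc -= load - target  # discharge on spikes, charge on dips
        assert 0 <= soc <= capacity, "battery saturated"
        grid.append(target)
    return grid

trace = [1000, 100, 1000, 100]  # oscillating rack power in watts
grid = bess_grid_draw(trace, capacity=2000)
print(grid)  # [550.0, 550.0, 550.0, 550.0] - a flat draw from the utility
```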

While these options are available today, the researchers argue that an optimal solution involves a combination of all three techniques. And, to make that a viable option, the folks at Microsoft, Nvidia, and OpenAI are asking for more coordination among vendors, so that rack-level energy storage and GPUs can communicate about workload state changes.

Specifically, the researchers want AI framework and system designers to focus on training algorithms that are asynchronous and power-aware; utility and grid operators to share resonance and ramp specifications and to standardize communication channels with data center operators; and for the tech industry to establish interoperable standards for telemetry, load signaling, and sub-synchronous oscillation mitigation.

"Together, we can design for a future where AI training is not only powerful, but also power-aware," the authors conclude. ®
