Can gamers teach us anything about datacenter cooling? Lenovo seems to think so
Taming high heat in compact form factors is old hat in the PC community
Analysis It's no secret that CPUs and GPUs are getting hotter. The thermal design power (TDP) for these components is already approaching 300W for mainstream CPUs, and next-gen datacenter GPUs will suck down anywhere from 600W to 700W when they arrive later this year.
Liquid cooling technologies developed decades ago for use in mainframe computers and now commonplace in supercomputers would seem to be the obvious answer. For some of these components, it may even be a requirement. But for those who aren't ready to commit to rearchitecting their datacenters to support the tech, could cooling kit developed for gaming systems offer a stopgap?
Scott Tease, Lenovo's VP of HPC and AI, seems to think so. In addition to Lenovo's Neptune direct liquid-cooled systems, his team is also working on adapting technologies you might find in gaming PCs and notebooks for use in the datacenter.
"We used to complain about 145W CPUs. This generation we're seeing 300W CPUs and by 2024, there's no doubt in my mind we will have 500W CPUs in market," he says. "We're kind of running out of room for air to do what we need it to do at the densities we're used to seeing."
Inspired by gamers
One of the technologies Lenovo is exploring is the use of closed-loop liquid-cooling systems, not unlike the all-in-one coolers used in high-end gaming systems.
For those who aren't familiar, these coolers are made up of three main components: a cold plate, which attaches to the CPU or GPU; a pump that moves liquid through the system; and a radiator that dissipates heat, usually with the assistance of fans.
It's not unlike how an internal combustion vehicle is cooled, Tease explains. Liquid is pumped through the engine and then cooled by a radiator at the front of the vehicle.
The tech differs from conventional datacenter liquid cooling setups in that the fluid doesn't actually evacuate the heat from the system, but rather relocates it within the chassis where it is easier to deal with. This, he says, means hotter components like 400W+ SXM GPUs can be packed into smaller chassis sizes. And because they're self-contained, these systems don't require any changes to existing rack or datacenter infrastructure.
Lenovo is also exploring the use of dense heat pipe arrays where liquid cooling might not be practical or desired. Heat pipes are sealed tubes in which a working fluid evaporates at the hot end and condenses at the cool end, moving heat passively with no pump required. While the tech is by no means new, heat pipes are particularly popular in gaming systems and notebooks to move heat away from components in space-constrained environments.
Lenovo is applying this same concept in multi-socket blade servers to keep the rear CPU cool. Lenovo calls these "thermal transfer modules," and they are essentially a stretched-out heat sink.
"What it allows us to do is one of two things, either put in a higher-end processor – some HPC clients like that – or if you go with a more standardized processor, we're able to lower our fan speeds and then in turn lower acoustics and power consumption," Tease says.
Fanning the flames
And it's not just the CPUs and GPUs that are getting more power hungry, according to Tease. The percentage of power consumed just by a system's fans has skyrocketed in recent years.
"In the old days – by old days I mean 2014, 2015 – it was a small budget, maybe 5 percent. In failure mode, it was maybe 7 percent," he tells The Reg. "Because parts are so high-power and T-cases are so low, we're pushing 13 percent just in normal operation."
That means for a 1kW, 1U dual-socket system, 130W is taken up by the fans. And in environments where customers are operating at extended temperature ranges, Tease says as much as 20 percent of a system's power consumption can be attributed to the fans.
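To put those percentages side by side, here is a minimal back-of-the-envelope sketch in Python. The function name and the labels are ours, not Lenovo's, and applying the 2014-era percentages to the same 1kW system is an assumption made purely for comparison; only the percentages themselves and the 1kW figure come from Tease.

```python
def fan_power_watts(system_power_w: float, fan_share: float) -> float:
    """Watts drawn by the fans, given total system power and the fans' share (0 to 1)."""
    if not 0.0 <= fan_share <= 1.0:
        raise ValueError("fan_share must be between 0 and 1")
    return system_power_w * fan_share


# Total system power from the article's example: a 1kW, 1U dual-socket box.
SYSTEM_POWER_W = 1000.0

# Fan shares quoted by Tease. Using them all against the same 1kW system
# is an illustrative assumption, not a measured comparison.
FAN_SHARES = {
    "2014-2015, normal operation": 0.05,
    "2014-2015, failure mode": 0.07,
    "today, normal operation": 0.13,
    "today, extended temperature range": 0.20,
}

for label, share in FAN_SHARES.items():
    print(f"{label}: {fan_power_watts(SYSTEM_POWER_W, share):.0f}W")
```

On these assumptions, the jump from 5 percent to 13 percent is the difference between roughly 50W and 130W of fan power per 1kW system.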
"We think that's going to get worse in the 2024 time frame," he says, adding that anything which reduces the number or the speed of the fans can have a major impact on power consumption.
Fans are also one of the reasons that liquid-cooled systems are so comparable in cost to air-cooled systems. According to Tease, Lenovo's liquid and air-cooled systems are usually within two percent of each other in terms of upfront cost.
Transitioning to direct liquid cooling, for instance, means you can ditch the fans entirely, and because you no longer have to power 130W of fans, you can opt for a smaller power supply or achieve greater performance per watt by using higher-power CPUs.
"It's far more cost effective, total picture, to go with liquid cooling," Tease says, adding that there are also opportunities to reuse heat captured from liquid-cooled datacenters for things like district heating.
Microsoft, for example, plans to do just that at its new Helsinki datacenter announced in March.
The crucial point is this: liquid cooling makes a lot of sense, provided the cost is low enough to warrant the switch from air.
The rack power bottleneck
While liquid cooling may have benefits in terms of power draw compared to fan-heavy, air-cooled systems, customers probably won't need to make the switch just yet.
"I would say we're probably five years away from really being extremely limited in what air quality is going to be able to do for us," Tease says.
Part of the problem relates to rack power. Even if liquid cooling allows users to achieve greater compute density in a smaller form factor, they'll still be limited by how much power can be supplied to a single rack.
With a very healthy 42kW budget, even if you could cram a bunch of 2kW 1U chassis into a rack, you'd run out of power after 21 of them, around the halfway mark of a typical 42U rack. And that's for 42kW, which many colocation providers don't even come close to.
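The rack arithmetic is simple enough to sketch in a few lines of Python. The 42kW budget and 2kW 1U chassis come from the example above; the 42U rack height is our assumption (a common full-height rack size), and the function and constant names are hypothetical.

```python
def chassis_by_power(rack_budget_w: int, chassis_w: int) -> int:
    """How many chassis the rack's power budget can feed (whole chassis only)."""
    return rack_budget_w // chassis_w


def chassis_by_space(rack_units: int, chassis_height_u: int) -> int:
    """How many chassis physically fit in the rack."""
    return rack_units // chassis_height_u


RACK_BUDGET_W = 42_000   # 42kW rack budget, from the article
CHASSIS_W = 2_000        # 2kW per 1U chassis, from the article
RACK_UNITS = 42          # assumption: a standard 42U rack
CHASSIS_HEIGHT_U = 1     # 1U chassis, from the article

by_power = chassis_by_power(RACK_BUDGET_W, CHASSIS_W)
by_space = chassis_by_space(RACK_UNITS, CHASSIS_HEIGHT_U)

# The rack holds whichever limit is hit first.
deployable = min(by_power, by_space)
fill_fraction = deployable * CHASSIS_HEIGHT_U / RACK_UNITS

print(f"Power allows {by_power} chassis, space allows {by_space}")
print(f"Deployable: {deployable} chassis, rack {fill_fraction:.0%} full")
```

With these numbers power, not space, is the binding limit. Drop the budget to a more typical colocation figure and the rack empties out further still.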
"You could make the case that a 96-core or 128-core, one-socket system is equivalent to a two-socket system from last generation," Tease adds. "Or maybe you go from 1U to 2U with these giant heat sinks and lose density that way, but maybe one two-socket 2U is better than two one-socket 1Us."
In other words, for customers who aren't running HPC or AI/ML workloads, it may be better to opt for a bigger chassis with larger, slower fans and heat sinks, at least until TDPs reach the point at which datacenter operators are forced to increase their rack power budgets. ®