Alibaba reveals 82 percent GPU resource savings – but this is no DeepSeek moment
Better scheduling and resource-sharing for inferencing workloads that span many models, not a training breakthrough
Chinese tech giant Alibaba has published a paper detailing scheduling tech it has used to achieve impressive utilization improvements across the GPU fleet it uses to power inferencing workloads – which is nice, but not a breakthrough that will worry AI investors.
Titled “Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market”, the paper [PDF] opens by pointing out that model-mart Hugging Face lists over a million AI models, although customers mostly run just a few of them. Alibaba Cloud nonetheless offers many models but found it had to dedicate 17.7 percent of its GPU fleet to serving just 1.35 percent of customer requests.
The reason for that discrepancy is that service providers typically configure each GPU to serve only two or three models, because model weights alone quickly exhaust a GPU’s memory: a 14-billion-parameter model held in 16-bit precision needs roughly 28 GB before any KV cache is allocated. That approach means that an outfit like Alibaba Cloud could have thousands of idle GPUs dedicated to seldom-used models.
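To make the memory arithmetic concrete, here’s a minimal sketch – the model sizes, the 80 GB card, and the KV-cache reserve are illustrative assumptions, not figures from Alibaba’s paper – of why two or three mid-sized models are all one accelerator can hold:

```python
# Illustrative numbers only: an 80 GB accelerator serving 16-bit models.
GPU_MEMORY_GB = 80
BYTES_PER_PARAM = 2        # FP16/BF16 weights
KV_CACHE_RESERVE_GB = 20   # rough headroom for KV caches and activations

def weights_gb(params_billions: float) -> float:
    """Memory for weights alone: 1e9 params x 2 bytes = 2 GB per billion."""
    return params_billions * BYTES_PER_PARAM

budget = GPU_MEMORY_GB - KV_CACHE_RESERVE_GB
used = 0.0
for name, size_b in [("model-A", 7), ("model-B", 14), ("model-C", 32)]:
    need = weights_gb(size_b)
    if used + need <= budget:
        used += need
        print(f"{name} ({size_b}B): {need:.0f} GB -> fits ({used:.0f}/{budget} GB used)")
    else:
        print(f"{name} ({size_b}B): {need:.0f} GB -> does NOT fit")
```

Run that and the 7B and 14B models fit, but the 32B model doesn’t – the card is full after two models, even with most of it sitting idle between requests.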
That’s obviously untenable given the cost of GPUs and, for Alibaba, the difficulty of acquiring kit from Nvidia and AMD due to US sanctions.
The Chinese cloud champ therefore developed GPU pooling and memory management tech that lets it run more models on each GPU by offloading data into a host’s memory or other storage when a model sits idle.
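The paper’s actual machinery is more sophisticated than this, but as a rough sketch of the pooling idea – an LRU-style cache that keeps recently used models resident on the GPU and evicts idle ones to host RAM, with made-up model names and sizes, not Aegaeon’s real scheduler – the core loop might look like:

```python
# Toy illustration of GPU pooling with offload to host memory.
# This is NOT Aegaeon's design; it only sketches the general idea.
from collections import OrderedDict

class GpuModelPool:
    def __init__(self, gpu_capacity_gb: float):
        self.capacity = gpu_capacity_gb
        self.resident = OrderedDict()  # models on the GPU, in LRU order
        self.host = {}                 # models offloaded to host RAM

    def _evict_until_fits(self, need_gb: float) -> None:
        # Push least-recently-used models out to host memory.
        while sum(self.resident.values()) + need_gb > self.capacity:
            name, size = self.resident.popitem(last=False)
            self.host[name] = size
            print(f"  evicted {name} ({size} GB) to host RAM")

    def serve(self, name: str, size_gb: float) -> None:
        if name in self.resident:
            self.resident.move_to_end(name)  # hit: refresh LRU position
        else:
            self._evict_until_fits(size_gb)  # miss: make room, then load
            self.host.pop(name, None)
            self.resident[name] = size_gb
            print(f"  loaded {name} ({size_gb} GB) onto GPU")
        print(f"serving request on {name}")

pool = GpuModelPool(gpu_capacity_gb=60)
for model, size in [("A", 28), ("B", 28), ("A", 28), ("C", 14), ("B", 28)]:
    pool.serve(model, size)
```

On real hardware the expensive part is shuttling tens of gigabytes of weights over PCIe, which is why the scheduling decisions about what to keep resident matter so much for latency.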
The headline figures from that approach are impressive: Alibaba once dedicated 1,192 GPUs to running little-used models for customers’ inferencing workloads. During a three-month beta test of Aegaeon, it served those workloads with just 213 GPUs – an 82 percent GPU resource saving. The company has also managed to get some of its GPUs running “tens” of models.
Alibaba Cloud claims Aegaeon is superior to alternative solutions, and the fact that its paper was accepted for presentation at last week’s ACM SIGOPS 31st Symposium on Operating Systems Principles – a peer-reviewed academic computer science conference – suggests the work is sound.
But it’s not necessarily a breakthrough, because hyperscalers are understandably reluctant to reveal all the tech that powers their platforms. It’s therefore entirely possible that other hyperscalers have already addressed this issue – and perhaps done even better than Alibaba.
Another thing to note is that hyperscalers are past masters at increasing utilization rates for their hardware, as doing so improves their profits. So while this paper describes some clever work by Alibaba, it also reveals the Chinese giant’s previous setup was not efficient.
The Register believes the paper is nonetheless important because we’re often told that, as AI matures, developers will create many industry-specific or scenario-specific models. Clouds need to be able to run them all efficiently, and Alibaba’s approach suggests it’s on the way to making that possible – which should mean the price of running obscure models won’t blow out just because serving them ties up extra GPUs.
That’s welcome. But this paper won’t panic AI investors the way January 2025’s “DeepSeek moment” did, when it looked like Chinese techies had found ways to dramatically reduce the number of GPUs required to train models. ®