Big Cloud deploys thousands of GPUs for AI – yet most appear under-utilized
If AWS, Microsoft, Google were anywhere close to capacity, their revenues would be way higher
Cloud providers have deployed tens of thousands of GPUs and AI accelerators in their race to capitalize on the surge in demand for large language models.
Yet despite these massive deployments, the evidence suggests the majority of these processors are being under-utilized, TechInsights analyst Owen Rogers tells The Register.
By the analyst firm's estimates, in 2023 alone 878,000 accelerators were responsible for turning out seven million GPU-hours of work, which the research outfit estimates adds up to roughly $5.8 billion in revenue.
While cloud providers aren't in the habit of sharing their actual utilization levels, Rogers points out that if GPU clusters were operating at anywhere close to capacity, that revenue figure would be substantially higher.
Take AWS's UltraScale clusters, each of which is composed of 20,000 Nvidia H100 GPUs that can be rented in instances of eight at a rate of $98.32/hr. Rogers says that, assuming one cluster per region running at 100 percent utilization year round, Amazon should be raking in closer to $6.5 billion a year.
"In fact, if each type of accelerator offered by AWS today were situated in a cluster of 20,000 in each region in which they are currently available and sold around the clock, they would generate 50 percent of AWS' revenue for 2023," Rogers wrote in an upcoming report.
And so the only logical explanation is that these accelerators aren't being effectively utilized.
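The back-of-the-envelope math behind that figure can be sketched as follows. This is a rough illustration, not Rogers' actual model: the instance size, hourly rate, and cluster size come from the figures quoted above, while the region count of three is an assumption chosen here because it lands near the $6.5 billion mark. The function and variable names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def annual_cluster_revenue(gpus, gpus_per_instance, rate_per_instance_hour,
                           utilization=1.0):
    """Yearly revenue for one cluster rented out as fixed-size instances."""
    instances = gpus // gpus_per_instance
    return instances * rate_per_instance_hour * HOURS_PER_YEAR * utilization


# Figures quoted in the article: 20,000 H100s, rented eight at a time, $98.32/hr.
per_cluster = annual_cluster_revenue(20_000, 8, 98.32)

# Assumption: the article does not say how many regions are counted.
ASSUMED_REGIONS = 3
total = per_cluster * ASSUMED_REGIONS

print(f"One cluster at full utilization: ${per_cluster / 1e9:.2f}B a year")
print(f"{ASSUMED_REGIONS} such clusters: ${total / 1e9:.2f}B a year")
```

Under those assumptions a single fully booked cluster works out to a little over $2 billion a year, so it takes only a handful of them to outstrip the revenue estimated for the entire accelerator market.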
Rogers acknowledges that many cloud providers utilize accelerators for internal workloads, which may skew this conclusion somewhat, but he argues that these systems still need to generate revenue for the hardware investment to be worthwhile.
The problem, it seems, relates to the way users typically consume cloud services. The cloud offers value in a couple of ways, Rogers opines. The first is that customers can deploy and scale their applications without advance notice. The second is by providing access to leading-edge technologies on a purely consumption-based model.
Accelerators largely fit in the second category, due in part to their high cost compared to contemporary systems. As our sibling site The Next Platform has previously discussed, demand for GPUs to power generative AI workloads is so great that, at one point, folks were scalping H100 PCIe cards for as much as $40,000 a pop on eBay. For customers that aren't running AI workloads constantly, running that job in the cloud is likely to be less expensive than building out a cluster of their own.
But the nature of the cloud means that companies like AWS or Microsoft have to provision far more capacity than they expect to sell. In other words, they have to plan for peak demand, Rogers explains.
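A toy calculation shows why sizing for the peak drags average utilization down. The demand series below is entirely made up for illustration, and the function name is hypothetical; it simply assumes capacity is set equal to the busiest period.

```python
def utilization_when_sized_for_peak(demand):
    """Average utilization if capacity equals the highest demand seen."""
    capacity = max(demand)
    return sum(demand) / (capacity * len(demand))


# Hypothetical GPUs requested per period; one short spike dominates.
hypothetical_demand = [2_000, 3_000, 4_000, 20_000, 5_000, 2_000]

share = utilization_when_sized_for_peak(hypothetical_demand)
print(f"Average utilization: {share:.0%}")
```

In this invented case the hardware has to cover a 20,000-GPU spike, yet it sits mostly idle the rest of the time.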
It's also worth noting that, with a few exceptions, GPUs don't lend themselves to overprovisioning in the same way that CPUs do. Generally speaking, GPUs are passed through to a VM, or an entire server is made available to the customer.
Some cloud providers, particularly smaller niche players, do take advantage of Nvidia's multi-instance GPU tech, which allows the accelerator to be partitioned into multiple GPUs. Others, meanwhile, use a technique called time slicing to run multiple workloads on the same GPU.
Having said that, in the era of large language models, most customers aren't going to be spinning up fractional GPUs, especially for training workloads which might require hundreds or thousands of such systems.
Rogers also strongly suspects that reports of capacity shortages for accelerators have more to do with resource stranding and scheduling than anything else.
"I think what's happening is there's a lot of demand for these accelerators, but perhaps demand is happening at exactly the same time, causing contention."
In other words, if you've got five people who all want 8,000 GPUs to train their model, but you've only got 20,000 GPUs to go around, some of those customers are gonna have to wait.
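That scenario can be sketched in a few lines. This is a deliberately simple model that admits whole requests in arrival order, not a description of how any cloud provider actually schedules jobs, and the function and customer names are hypothetical.

```python
def admit_in_arrival_order(capacity, requests):
    """Admit whole requests in order; any that do not fit must wait."""
    free = capacity
    running, waiting = [], []
    for name, gpus in requests:
        if gpus <= free:
            running.append(name)
            free -= gpus
        else:
            waiting.append(name)
    return running, waiting, free


# Five customers each asking for 8,000 GPUs against a 20,000-GPU pool.
requests = [(f"customer-{i}", 8_000) for i in range(1, 6)]
running, waiting, idle = admit_in_arrival_order(20_000, requests)

print("Running:", running)
print("Waiting:", waiting)
print("Idle GPUs:", idle)
```

Two jobs fit, three wait, and 4,000 GPUs sit stranded because none of the queued requests is small enough to use them. That is how customers can see a shortage while hardware is still idle.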
And as Rogers points out, there's some evidence to support this. Over the past year, AWS and Google Cloud have rolled out scheduling services designed to help optimize for cost and availability, and to improve utilization.
Can abstraction help?
As we mentioned earlier, most GPU instances are offered as VMs and bare metal servers. But as Rogers notes, that isn't the only way to consume AI resources in the cloud, pointing to Amazon's SageMaker platform as an example.
These services take away the complexity of deploying an AI/ML workload. "Their argument would be, if you can't get the capacity; if you're struggling to deal with when to use the capacity; or how to schedule, you could offload that to our platform, and we'll do it all for you," he notes.
High levels of abstraction also mean that customers don't have to think about which accelerator to optimize for. While Nvidia is the dominant player in the AI hardware arena, all of the major cloud providers have developed custom silicon of their own and AMD's recently announced MI300X GPUs are already seeing adoption by Microsoft and others.
Rogers believes that, given time, people's skill sets may shift toward platforms like SageMaker. But, "if you're a coder now who understands machine learning and AI, you're probably a coder who understands GPUs and how to utilize them. Whereas, you probably don't understand SageMaker or the Google or Microsoft equivalents… It's probably easier at this stage to do what you know than to have to learn a whole new platform," he said.
Where does this leave the GPU bit barns?
Of course, the cloud providers aren't the only place where you can rent GPUs. Several colocation and metal-as-a-service vendors, like CoreWeave, have popped up over the past few years to serve demand for large GPU deployments.
These companies often boast far more competitive pricing for GPUs. CoreWeave has H100s for as little as $2.23 an hour - if you're willing to commit to enough of them.
And while Rogers believes companies like CoreWeave have a place in the market, he argues they're best suited to customers looking to run massive training workloads in a short period of time. "I think there are going to be challenges longer term for them."
One of these challenges involves egress fees for those already invested in cloud storage. If you've got your data in AWS, it's going to cost you to move that data to a GPU farm for processing, Rogers explained.
For companies training LLMs from scratch, the cost of data movement is likely trivial, and thus utilizing services from CoreWeave and the like may make sense. On the other hand, if you're a smaller enterprise retraining a Llama 2 7B to serve as a jazzed-up customer service chatbot, it's probably going to make more sense to run that workload in the cloud.
"If you want to build an application that uses all these GPUs, then a hyperscale cloud provider is inevitably going to have more of the services you need," he said.
And while CoreWeave may be less expensive at scale, Rogers emphasized that can, and probably will, change. "The hyperscalers have large enough revenue and large enough buying power that if they really wanted to, they could cut the prices of accelerators really low and undercut them on price," he said.
"They're far bigger; they have bigger purchasing power; they have margins that can take a hit because they can make it up in the margins of other services."
For Rogers, while there is a tremendous amount of hype around AI, in order for it to be useful, it needs to be tied into other services. "We still need CPUs, we need tons of storage, we need loads of memory, and so, I don't think AI is going to consume the cloud." ®