Cloud giants 'ran out' of fast GPUs for AI boffins

Capacity droughts hit just before conference paper deadlines, say researchers


Top cloud providers struggled to provide enough GPUs on-demand last week, AI experts complained to The Register.

As a deadline for research papers loomed for a major conference in the machine-learning world, teams around the globe scrambled to rent cloud-hosted accelerators to run tests and complete their work in time to submit their studies to be included in the event.

That, we're told, sparked a temporary shortage in available GPU capacity.

Graphics processors are well suited to machine learning because they can perform the vector calculations neural networks need in parallel, far faster than general-purpose CPUs.

Using more GPUs means shorter training times and faster results, something wide-eyed researchers desperately craved as they stayed up to the early hours racing to submit their latest efforts to The Conference and Workshop on Neural Information Processing Systems (NIPS) by 1pm Pacific Time on Friday. NIPS is the biggest academic AI conference, drawing over 5,000 attendees.

Conference deadlines tend to put a strain on cloud GPU availability, researchers told The Register, some speaking on condition of anonymity. The run-up to this year's NIPS was particularly bad, we gather, and as AI development ramps up, hosting providers must scale to meet demand. It's something to bear in mind the next time a cloud giant boasts about new features and hardware acceleration on its platform: this tech is not always immediately available to everyone.

It's claimed a drought of GPUs hit Google Cloud and Microsoft's Azure in particular as the pair strained to keep up with demand ahead of the NIPS deadline.

One researcher told The Register his team was able to get one of the last remaining Nvidia DGX-1 boxes, with eight Tesla P100 GPUs, from Nimbix, a smaller rival cloud platform, after being unable to get the resources needed at larger cloud players.

A source familiar with Amazon's operations claimed AWS had no capacity issues. Others speculated, though, that this may have been because AWS's spot pricing put off pressed-for-time AI teams and others who need a stable service.

AWS offers “spot instances,” where customers can bid on spare capacity. Prices fluctuate depending on demand. It can be a cheaper option than on-demand pricing, where users are billed by the hour.

Spot instances only work if users bid more than the current spot price for the CPU or GPU capacity they want. If capacity runs low and the spot price rises above a user's maximum bid, that user is given two minutes' notice before their instance is terminated.
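
As a rough illustration of that rule, here is a minimal Python sketch of when a spot job would get the boot. It is a toy model, not AWS's actual mechanism or API: the function name, the hourly price series, and the bid values are all made up for the example, and the only rule it encodes is the one described above, that an instance is reclaimed once the spot price exceeds the user's maximum bid.

```python
from typing import Optional, Sequence


def first_interruption(max_bid: float, spot_prices: Sequence[float]) -> Optional[int]:
    """Return the index of the first hour whose spot price exceeds max_bid.

    Returns None if the bid stays at or above the spot price throughout.
    Hypothetical helper for illustration only; not part of any cloud SDK.
    """
    for hour, price in enumerate(spot_prices):
        if price > max_bid:
            return hour
    return None


# Made-up hourly spot prices in dollars, with a demand spike in hour 3.
prices = [2.10, 2.40, 3.00, 7.50, 2.20]

print(first_interruption(5.00, prices))   # job is reclaimed in hour 3
print(first_interruption(10.00, prices))  # None: the bid survives the spike
```

A training job that takes days has to ride out every such spike, which is why a higher bid, or on-demand pricing, looks attractive when a deadline is close.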

A graph uploaded by Reza Zadeh, CEO of Matroid, a machine learning startup, and a professor at Stanford, shows that two days before the NIPS deadline, the price to rent a p2.16xlarge instance with 16 GPUs was a whopping $144 per hour, the maximum possible price. Considering that models often take days to train, AI research is not cheap.
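
To put that figure in perspective, the arithmetic can be sketched in a few lines of Python. The $144 hourly rate and the 16 GPUs come from the graph described above; the three-day training run is an assumption picked purely for illustration, as is the idea that the price stays pinned at its peak for the whole run.

```python
PEAK_HOURLY_RATE = 144.0  # dollars per hour, the peak spot price cited above
GPUS_PER_INSTANCE = 16    # GPUs in the instance cited above


def rental_cost(hourly_rate: float, days: float) -> float:
    """Total cost in dollars of renting one instance around the clock."""
    return hourly_rate * 24 * days


# Assumption: a three-day training run billed at the peak price throughout.
print(rental_cost(PEAK_HOURLY_RATE, 3))      # 10368.0
print(PEAK_HOURLY_RATE / GPUS_PER_INSTANCE)  # 9.0 dollars per GPU-hour
```

In practice the spot price moves around, so a real bill would depend on how long the spike lasted.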

Adam Gibson, cofounder and CTO of Skymind, a startup geared towards enterprises implementing AI on a large scale, said: “A lot of companies can’t keep up with the GPU demand. Cloud vendors often have data centers in regions. Most data centers don’t have enough GPUs per region. Azure is particularly bad at this.”

Spot instances aren’t ideal for research as “they are inconsistent and jobs can die at any time.” Google and Azure offer similar spot pricing tiers.

Meanwhile, Xavier Amatriain, VP of engineering at Quora, a question-and-answer site, said on Twitter that a Google engineer had been struggling to nab GPUs due to the major influx of requests.

Google's cloud offers Nvidia Tesla K80 GPUs; Nvidia P100 and AMD chips are coming soon. Google's pricing calculator is broken right now for GPU instances, but we understand it costs from $0.70 an hour per GPU added to a generic compute instance. A bog-standard instance with 16 vCPU cores and 60GB of RAM starts from about $0.65 an hour.

Azure's GPU instances start from $0.90 an hour for one K80, 6 vCPU cores, and 56GB of RAM. Amazon K80 instances start from $0.90 an hour for one GPU, 4 vCPU cores, and 61GB of RAM.
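
For a rough side-by-side, the starting prices quoted above can be totted up in Python. This is a back-of-the-envelope sketch under stated assumptions, not a like-for-like comparison: it assumes the $0.65 Google base figure is hourly, it adds one $0.70 GPU to that particular base instance, and it ignores the differing vCPU and RAM allocations. Prices are held in whole cents to avoid floating-point rounding.

```python
# Starting prices in US cents per hour, as quoted in this article.
GOOGLE_BASE_CENTS = 65     # 16 vCPU cores, 60GB RAM, no GPU (assumed hourly)
GOOGLE_PER_GPU_CENTS = 70  # each GPU added to a compute instance
AZURE_ONE_K80_CENTS = 90   # one K80, 6 vCPU cores, 56GB RAM
AWS_ONE_K80_CENTS = 90     # one K80, 4 vCPU cores, 61GB RAM


def google_hourly_cents(gpus: int) -> int:
    """Hourly price in cents of the quoted Google base instance plus GPUs."""
    return GOOGLE_BASE_CENTS + gpus * GOOGLE_PER_GPU_CENTS


starting = {
    "Google": google_hourly_cents(1),
    "Azure": AZURE_ONE_K80_CENTS,
    "AWS": AWS_ONE_K80_CENTS,
}

# Print the single-GPU starting prices, cheapest first.
for provider, cents in sorted(starting.items(), key=lambda item: item[1]):
    print(f"{provider}: ${cents / 100:.2f} per hour")
```

The Google figure comes out higher largely because the quoted base instance is far beefier on the CPU side than the Azure and AWS entry-level boxes.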

Spokespeople for Microsoft Azure and Amazon Web Services (AWS) declined to comment on the AI researchers' claims. A spokesperson for Google Cloud did not respond to a request for comment. ®


