Microsoft Azure to spin up AMD MI200 GPU clusters for 'large scale' AI training

Windows giant carries a PyTorch for the chip designer and Nvidia rival

Microsoft Build Microsoft Azure on Thursday revealed it will use AMD's top-tier MI200 Instinct GPUs to perform “large-scale” AI training in the cloud.

“Azure will be the first public cloud to deploy clusters of AMD's flagship MI200 GPUs for large-scale AI training,” Microsoft CTO Kevin Scott said during the company’s Build conference this week. “We've already started testing these clusters using some of our own AI workloads with great performance.”

AMD launched its MI200-series GPUs at its Accelerated Datacenter event last fall. The GPUs are based on AMD’s CDNA2 architecture and pack 58 billion transistors and up to 128GB of high-bandwidth memory into a dual-die package.

At launch, AMD’s Forrest Norrod, SVP and GM of datacenter and embedded solutions, claimed the chips were nearly 5X faster than Nvidia’s then top-tier A100 GPU, at least in “highly precise” FP64 calculations. That lead narrows substantially in more common FP16 workloads, where AMD claims the chips are about 20 percent faster than the A100 from Nvidia, the dominant datacenter GPU player.
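Those multipliers can be sanity-checked against the peak-throughput figures on the vendors' public spec sheets. The numbers below are assumed from those datasheets (AMD Instinct MI250X and Nvidia A100) and are theoretical peaks, not measured benchmark results:

```python
# Back-of-the-envelope check of AMD's claimed speedups, using assumed
# peak-throughput figures (TFLOPS) from the vendors' public datasheets.
mi250x = {"fp64": 47.9, "fp16": 383.0}  # AMD Instinct MI250X
a100   = {"fp64": 9.7,  "fp16": 312.0}  # Nvidia A100

# Ratio of AMD peak to Nvidia peak at each precision.
fp64_ratio = mi250x["fp64"] / a100["fp64"]  # roughly matches the "nearly 5X" claim
fp16_ratio = mi250x["fp16"] / a100["fp16"]  # roughly the "20 percent faster" claim

print(f"FP64: {fp64_ratio:.1f}x, FP16: {fp16_ratio:.2f}x")
```

Real-world training throughput depends heavily on memory bandwidth, interconnect, and software maturity, so paper peaks like these rarely translate directly into end-to-end speedups.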

However, it remains to be seen if and when Azure instances based on AMD's aforementioned GPUs will become generally available, or if Microsoft plans to use them only for internal workloads.

What we do know is Microsoft is working closely with AMD to optimize the chipmaker’s GPUs for PyTorch machine learning workloads.

“We're also deepening our investments in the open-source PyTorch framework, working with the PyTorch core team and AMD both to optimize the performance and developer experience for customers running PyTorch on Azure, and to ensure that developers' PyTorch projects work great on AMD hardware,” Scott said.
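Part of what makes that portability possible is that ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` namespace that Nvidia hardware uses, so existing scripts typically need no source changes. A minimal device-agnostic sketch (the `try`/`except` is there only so the snippet degrades gracefully when PyTorch isn't installed):

```python
# Sketch: device-agnostic PyTorch code that runs unchanged on Nvidia (CUDA)
# or AMD (ROCm) GPUs. On ROCm builds, AMD GPUs report through torch.cuda.

def pick_device() -> str:
    """Return 'cuda' when any supported GPU is visible, else 'cpu'."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # PyTorch not installed; fall back to CPU for this sketch.
        return "cpu"

device = pick_device()
print(f"training would run on: {device}")
```

In a real training script, the same string is passed to `model.to(device)` and tensor constructors, which is why a single codebase can target either vendor's hardware.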

The Register reached out to Microsoft about its plans for AMD’s GPUs. We'll let you know if we hear anything.

The enterprise GPU market heats up

PyTorch development was a central component of Microsoft’s strategic partnership with Meta AI, announced earlier this week.

In addition to working with Microsoft to optimize its infrastructure for PyTorch workloads, the social media giant announced it would deploy “cutting-edge ML training workloads” on a dedicated Azure cluster of 5,400 Nvidia A100 GPUs.

The deployment came just as Nvidia's datacenter business became its largest unit, at $3.75 billion in quarterly revenue, outstripping gaming ($3.62 billion) for the first time.

While Nvidia may still command the bulk of the enterprise GPU market, it’s facing its stiffest competition in years, and it's not just AMD knocking at its door.

Intel’s long-hyped Ponte Vecchio GPUs are expected to roll out later this year alongside the chipmaker’s long-delayed Sapphire Rapids Xeon Scalable processors.

And, earlier this month, Intel debuted its second-generation AI training and inference accelerators, which it claims offer A100-beating performance. ®
