Microsoft Azure to spin up AMD MI200 GPU clusters for 'large scale' AI training

Windows giant carries a PyTorch for the chip designer and Nvidia rival

Microsoft Build Microsoft Azure on Thursday revealed it will use AMD's top-tier MI200 Instinct GPUs to perform “large-scale” AI training in the cloud.

“Azure will be the first public cloud to deploy clusters of AMD's flagship MI200 GPUs for large-scale AI training,” Microsoft CTO Kevin Scott said during the company’s Build conference this week. “We've already started testing these clusters using some of our own AI workloads with great performance.”

AMD launched its MI200-series GPUs at its Accelerated Datacenter event last fall. The GPUs are based on AMD’s CDNA2 architecture and pack 58 billion transistors and up to 128GB of high-bandwidth memory into a dual-die package.

At launch, AMD’s Forrest Norrod, SVP and GM of datacenter and embedded solutions, claimed the chips were nearly 5X faster than Nvidia’s then top-tier A100 GPU, at least in “highly precise” FP64 calculations. That lead narrows substantially in more common FP16 workloads, where AMD claims the chips are about 20 percent faster than the A100 from Nvidia, the dominant datacenter GPU player.
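Those multipliers can be sanity-checked against the peak-throughput figures on the vendors' public spec sheets. The numbers below are assumed from those datasheets (AMD Instinct MI250X and Nvidia A100) and are theoretical peaks, not measured benchmark results:

```python
# Back-of-the-envelope check of AMD's claimed speedups, using assumed
# peak-throughput figures (TFLOPS) from the vendors' public datasheets.
mi250x = {"fp64": 47.9, "fp16": 383.0}  # AMD Instinct MI250X
a100   = {"fp64": 9.7,  "fp16": 312.0}  # Nvidia A100

# Ratio of AMD peak to Nvidia peak at each precision.
fp64_ratio = mi250x["fp64"] / a100["fp64"]  # roughly matches the "nearly 5X" claim
fp16_ratio = mi250x["fp16"] / a100["fp16"]  # roughly the "20 percent faster" claim

print(f"FP64: {fp64_ratio:.1f}x, FP16: {fp16_ratio:.2f}x")
```

Real-world training throughput depends heavily on memory bandwidth, interconnect, and software maturity, so paper peaks like these rarely translate directly into end-to-end speedups.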

However, it remains to be seen if and when Azure instances based on AMD's aforementioned GPUs will become generally available, or if Microsoft plans to use them only for internal workloads.

What we do know is Microsoft is working closely with AMD to optimize the chipmaker’s GPUs for PyTorch machine learning workloads.

“We're also deepening our investments in the open-source PyTorch framework, working with the PyTorch core team and AMD both to optimize the performance and developer experience for customers running PyTorch on Azure, and to ensure that developers' PyTorch projects work great on AMD hardware,” Scott said.
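Part of what makes that portability possible is that ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` namespace that Nvidia hardware uses, so existing scripts typically need no source changes. A minimal device-agnostic sketch (the `try`/`except` is there only so the snippet degrades gracefully when PyTorch isn't installed):

```python
# Sketch: device-agnostic PyTorch code that runs unchanged on Nvidia (CUDA)
# or AMD (ROCm) GPUs. On ROCm builds, AMD GPUs report through torch.cuda.

def pick_device() -> str:
    """Return 'cuda' when any supported GPU is visible, else 'cpu'."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # PyTorch not installed; fall back to CPU for this sketch.
        return "cpu"

device = pick_device()
print(f"training would run on: {device}")
```

In a real training script, the same string is passed to `model.to(device)` and tensor constructors, which is why a single codebase can target either vendor's hardware.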

The Register reached out to Microsoft about its plans for AMD’s GPUs. We'll let you know if we hear anything.

The enterprise GPU market heats up

PyTorch development was a central component of Microsoft’s strategic partnership with Meta AI, announced earlier this week.

In addition to working with Microsoft to optimize its infrastructure for PyTorch workloads, the social media giant announced it would deploy “cutting-edge ML training workloads” on a dedicated Azure cluster of 5,400 Nvidia A100 GPUs.

The deployment came just as Nvidia's datacenter business became its largest unit, at $3.75 billion in quarterly revenue, outstripping gaming ($3.62 billion) for the first time.

While Nvidia may still command the bulk of the enterprise GPU market, it’s facing its stiffest competition in years, and it's not just AMD knocking at its door.

Intel’s long-hyped Ponte Vecchio GPUs are expected to roll out later this year alongside the chipmaker’s long-delayed Sapphire Rapids Xeon Scalable processors.

And, earlier this month, Intel debuted its second-generation AI training and inference accelerators, which it claims offer A100-beating performance. ®
