AMD tries to catch CUDA with performance-boosting ROCm 7 software

House of Zen promises 3.5x improvement in inference and 3x uplift in training perf over last-gen software

AMD closed the performance gap with Nvidia's Blackwell accelerators with the launch of the MI355X this spring. Now the company just needs to overcome Nvidia's CUDA software advantage and make that perf more accessible to developers. 

The release of AMD's ROCm 7.0 software platform this week is a step in that direction, promising major improvements in inference and training performance that benefit not only its latest chips but its older MI300-series parts as well. The so-called CUDA moat could be getting narrower.

ROCm, if you're not familiar, is a suite of software libraries and development tools, including the HIP programming framework, that provides developers with a low-level interface for running high-performance computing (HPC) and AI workloads on GPUs. The software stack is reminiscent in many ways of the CUDA runtime, but for AMD GPUs rather than Nvidia's.

Since the launch of the MI300X, its first truly AI-optimized graphics accelerator, back in 2023, AMD has extended support for new datatypes, improved compatibility with popular runtimes and frameworks, and introduced hardware-specific optimizations through its ROCm runtime.

ROCm 7 is arguably AMD's biggest update yet. Compared to ROCm 6, AMD says that customers can expect a roughly 3.5x uplift in inference performance on the MI300X. Meanwhile, the company says it has managed to boost the effective floating point performance achieved in model training by 3x.

AMD claims that these software enhancements combined give its latest and greatest GPU, the MI355X, a 1.3x edge in inference workloads over Nvidia's B200 when running DeepSeek R1 in SGLang. As usual, you should take all vendor performance claims with a grain of salt.

While the MI350X and MI355X are roughly on par with the B200 in terms of floating point performance, achieving 9.2 and 10 petaFLOPS of dense FP4 compute, respectively, to Nvidia's 9 petaFLOPS, the AMD parts boast 108 GB more HBM3e.

The AMD MI355X's main competitor is actually Nvidia's B300, which packs 288 GB of HBM3e and manages 14 petaFLOPS of dense FP4 performance, which on paper could give it an edge in inference workloads.

Speaking of FP4 support, the MI350 series is AMD's first generation of GPUs to offer hardware acceleration for OCP's microscaling datatypes, which we looked at in more detail around OpenAI's gpt-oss launch last month.

These smaller formats have major implications for inference and training performance, boosting throughput and cutting memory requirements by 2-4x. ROCm 7.0.0 broadens support for these low-precision datatypes, with AMD saying its Quark quantization framework is now production-ready.
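To put that 2-4x figure in context, here's a back-of-the-envelope sketch of the weights-only footprint of a model at different precisions. The 70-billion-parameter count is illustrative (not from the article), and the 4.25 bits per parameter for MXFP4 follows from the OCP MX spec's layout of 4-bit elements sharing one 8-bit scale per 32-element block.

```python
# Back-of-the-envelope memory math for the weights of a hypothetical
# 70B-parameter model. MXFP4 stores 4-bit elements plus one shared
# 8-bit scale per 32-element block (per the OCP MX spec), so it
# averages 4.25 bits per parameter.
PARAMS = 70e9  # illustrative model size, not from the article

BITS_PER_PARAM = {
    "FP16": 16.0,
    "FP8": 8.0,
    "MXFP4": 4.0 + 8.0 / 32.0,  # element bits + amortized block scale
}

def weight_gigabytes(bits: float, params: float = PARAMS) -> float:
    """Weights-only footprint in GB (ignores KV cache and activations)."""
    return params * bits / 8 / 1e9

for fmt, bits in BITS_PER_PARAM.items():
    gb = weight_gigabytes(bits)
    print(f"{fmt:>6}: {gb:6.1f} GB ({16.0 / bits:.2f}x smaller than FP16)")
```

For the 70B example that works out to roughly 140 GB at FP16, 70 GB at FP8, and about 37 GB at MXFP4, a 3.8x reduction, squarely in the 2-4x range and small enough to fit comfortably in a single accelerator's HBM.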

That's a marked improvement over the MI300 generation, whose FP8 support trailed the chip's release by the better part of a year.

Alongside the datatypes, ROCm 7.0.0 also introduces AMD's AI Tensor Engine, or AITER for short, which features specialized operators tuned for maximum GenAI performance.

For inference, AMD says AITER can boost MLA decode operations by 17x and MHA prefill ops by 14x. When applied to models like DeepSeek R1, the GPU slinger says AITER can boost throughput by more than 2x.

More importantly, AITER and the MXFP4 datatype have already been merged into popular inference serving engines like vLLM and SGLang. AMD tells us that enabling the feature is as simple as installing the dependencies and setting the appropriate environment variables.
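As a rough sketch of what that looks like in practice: recent vLLM releases expose a `VLLM_ROCM_USE_AITER` environment variable as the toggle for AITER kernels, though you should verify the exact variable names and the model identifier against your vLLM version's documentation before relying on them.

```python
# Hypothetical sketch of enabling AITER for a vLLM server on a ROCm
# box via environment variables. VLLM_ROCM_USE_AITER is the toggle
# exposed by recent vLLM releases; check your version's docs for the
# exact names, as they may change between releases.
import os
import subprocess

aiter_env = {
    "VLLM_ROCM_USE_AITER": "1",  # switch on AITER-backed kernels
}

env = {**os.environ, **aiter_env}

# Launching the server itself needs a ROCm system with vLLM installed,
# so the call is shown here but left commented out:
launch_cmd = ["vllm", "serve", "deepseek-ai/DeepSeek-R1"]
# subprocess.run(launch_cmd, env=env, check=True)
```

The same environment-variable approach applies whether you launch vLLM from a shell script, a container entrypoint, or a Python wrapper like this one.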

Other improvements include support for the latest Ubuntu 24.04.3 LTS release and Rocky Linux 9, as well as KVM passthrough for those who want to add GPU acceleration to virtual machines.

ROCm 7 also adds native support for PyTorch 2.7 and 2.9, TensorFlow 2.19.1, and JAX 0.6.

Finally, for those deploying large quantities of Instinct accelerators in production, AMD is rolling out a pair of new dashboards designed to make managing large clusters of GPUs easier. AMD's Resource Manager provides detailed telemetry on the performance and utilization of the cluster, as well as access controls and the ability to set project quotas so that one team doesn't end up hogging all the compute.

Alongside the resource manager, AMD is also rolling out an AI Workbench designed to streamline the process of training or fine-tuning popular foundation models.

ROCm 7.0 is available for download from AMD's support site, as well as in pre-baked container images on Docker Hub. ®
