AMD hopes to unlock MI300’s full potential with fresh code
Devs invited to ROCm out with FP8 precision, quantize to their heart's delight
AMD today released the latest version of ROCm, claiming the improved software will bring about strong performance boosts for its Instinct GPU family.
If you're not familiar, ROCm is to AMD's GPUs what CUDA is to Nvidia's. The open source stack encompasses the various drivers, development tools, libraries, and APIs necessary for running compute workloads on AMD accelerators.
With the launch of ROCm 6.2, AMD has finally announced broader support for 8-bit floating point data types, including in vLLM, a framework for scaling large language models (LLMs), such as Llama 3, across multiple GPUs or systems.
That FP8 support is a big deal since it was one of the flagship specs when AMD's MI300A and X APUs and GPUs arrived in December. The AI-centric MI300X boasted 2,614 teraFLOPS of dense FP8 performance (double that with sparsity) putting it well ahead of Nvidia's H100 and H200 at 1,979 teraFLOPS.
Unfortunately, tapping into that performance wasn't exactly easy as, at least at the time, vLLM, AMD's preferred model runner, didn't support FP8 data types. This has since changed, at least in the ROCm branch of vLLM, and with the launch of ROCm 6.2, Instinct developers can now take advantage of the data type.
Other optimizations include support for multi-GPU execution and 8-bit key-value caches.
On top of vLLM, ROCm 6.2 also extends support for FP8 GEMMs to a variety of frameworks and libraries including PyTorch and JAX via hipBLASLt, JAX and Flax via XLA, as well as RCCL and MIOpen.
Why FP8 matters
While FP8 may not sound like a significant addition, lower precision data types have major implications for running generative AI models. Compared to 16-bit floating point or brain float (BF16) precision, FP8 occupies half the space in memory and substantially reduces memory pressure, allowing for lower second token latencies.
At a system level this means that a single 8x MI300X machine with 1.5TB of HBM3 can now accommodate models well in excess of a trillion parameters and still have plenty of memory left over to support meaningful context lengths and batch sizes, at least so long as that model has been trained or quantized to FP8.
As models have grown larger, we're starting to see FP8 models become more popular. Meta's Llama 3.1 405B model, for instance, was launched alongside a version quantized to FP8 in order to fit it into a single Nvidia HGX H100 system. And while you could already run the model at BF16 on a similarly equipped MI300X box, dropping down to FP8 would effectively double the output generation rate.
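For a rough sense of the arithmetic behind those claims, here is a minimal Python sketch that counts weights only and ignores key-value caches, activations, and runtime overhead. The helper names are ours, purely for illustration, and the per-GPU capacities (192GB per MI300X, 80GB per H100) are our assumptions based on published spec sheets rather than figures from AMD's release notes.

```python
# Back-of-the-envelope weight footprint: parameters x bytes per parameter.
# Weights only; KV cache, activations, and runtime overhead are ignored.

GB = 10**9  # decimal gigabytes, as used on spec sheets


def weight_footprint_gb(params: float, bits_per_param: int) -> float:
    """Memory needed for the model weights alone, in GB."""
    return params * bits_per_param / 8 / GB


def fits(params: float, bits_per_param: int, capacity_gb: float) -> bool:
    """True if the weights alone fit within the given memory capacity."""
    return weight_footprint_gb(params, bits_per_param) <= capacity_gb


# Assumed capacities: 8 x 192GB (the 1.5TB MI300X box) and 8 x 80GB (HGX H100).
SYSTEMS_GB = {"8x MI300X": 8 * 192, "8x H100": 8 * 80}

LLAMA_405B = 405e9  # parameter count of Llama 3.1 405B

if __name__ == "__main__":
    for bits in (16, 8):
        need = weight_footprint_gb(LLAMA_405B, bits)
        for name, cap in SYSTEMS_GB.items():
            verdict = "fits" if fits(LLAMA_405B, bits, cap) else "does not fit"
            print(f"{bits}-bit weights: {need:.0f} GB, {verdict} in {name} ({cap} GB)")
```

At one byte per parameter, a trillion-parameter model needs roughly 1TB for its weights, which is why FP8 is the precision at which such a model squeezes into a single 1.5TB box.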
This will no doubt make AMD's MI300X even more attractive to cloud giants, such as Microsoft, running massive frontier models including OpenAI's GPT-4o, particularly amid reports of delays to Nvidia's upcoming Blackwell GPU family.
Bitsandbytes comes to Instinct
On the topic of quantization, ROCm 6.2 also extends support for the popular Bitsandbytes library.
Bitsandbytes is commonly employed alongside PyTorch to automatically quantize models trained at 32-bit or 16-bit precision, usually down to eight or four bits. As we mentioned earlier, a lower precision reduces the amount of memory required and allows for higher throughput when inferencing.
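To illustrate the general idea, here is a simplified sketch of absmax quantization to 8-bit integers in plain Python. This is not Bitsandbytes' actual implementation or API, which works on tensors and uses more sophisticated block-wise schemes. The function names are hypothetical and it only shows the basic trade: one shared scale factor plus small integers in place of full-precision values.

```python
# Simplified absmax quantization: map floats onto the int8 range [-127, 127]
# using a single scale factor. Illustration only, not Bitsandbytes' code.


def quantize_absmax_int8(values):
    """Return (quantized integers, scale) for a list of floats."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_absmax_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("quantized:", q)
    print("worst-case round-trip error:", worst)
```

Each value now occupies one byte rather than two or four, at the cost of a rounding error of at most half the scale factor per value.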
According to AMD, Bitsandbytes support in ROCm 6.2 isn't limited to inferencing, either. "Using 8-bit optimizers, it can reduce memory usage during AI training, enabling developers to work with larger models on limited hardware," the Epyc slinger explained in its release notes.
Unfortunately, it doesn't appear you can pip install bitsandbytes and start using it with Instinct accelerators just yet. ROCm support appears to be delivered via a fork of the project requiring manual installation of the library, at time of writing.
If you're not familiar with Bitsandbytes or quantization in general, Hugging Face has an excellent breakdown of the technology, including how to implement it, which you can find here. You can also find our hands-on guide to post-training quantization for more information on its impact on model size, performance, and accuracy here.
AMD rolls out Omnitrace and Omniperf monitoring and optimization tools
Alongside performance optimizations, AMD is also rolling out a pair of tools that, while still in beta, aim to make it easier for users to monitor and optimize the performance of their Instinct deployments.
The first of these, dubbed Omnitrace, is said to provide a bird's eye view of CPU, GPU, NIC, and network fabric performance. The idea is that this will help users spot and address bottlenecks.
Omniperf, on the other hand, tackles accelerator-level performance by providing real-time, kernel-level analytics, to help developers optimize their code for AMD hardware.
ROCm arrives on Ubuntu 24.04 alongside revamped installer
In addition to new features, ROCm 6.2 also adds extended support for Canonical's latest release of Ubuntu, version 24.04 LTS. Previously, the latest supported Ubuntu release was 22.04, which launched back in early 2022.
AMD has also extended support for Red Hat Enterprise Linux 8.10 and SUSE Linux Enterprise Server version 15 SP6. A full compatibility matrix can be found here.
And the Ryzen designer has rolled out a new offline installer with ROCm 6.2 targeted at users deploying AMD GPUs in environments without internet access or local mirrors.
In addition to installing the relevant binaries and dependencies, AMD says the installer will also handle post-install tasks such as user and group management and driver handling to ensure consistency across nodes.
While this may sound like a strange concept, AMD GPUs are deployed broadly in US Department of Energy supercomputers, which are in some cases air-gapped once deployed. With the MI300A APUs set to power Lawrence Livermore National Lab's El Capitan system and conduct research for America's nuclear arsenal, such an installer is no doubt invaluable. ®