AMD bets on rack-scale compute to boost AI efficiency 20x by 2030

Who'd have thunk? The bigger the iron, the more efficient it gets

With Moore's Law on its last legs and datacenter power consumption a growing concern, AMD has set itself an ambitious new goal: boosting the energy efficiency of its chips 20-fold by 2030. And it sees rack-scale architectures as a key design point to get there.

"The counterintuitive thing here… is the bigger the device, the more efficient it is," AMD SVP and Fellow Sam Naffziger tells El Reg. "But what we're getting is what used to be a whole rack of compute devices in a single package."

AMD was among the first to apply this logic to its CPUs and GPUs, embracing a chiplet architecture that enabled it to overcome reticle limits and squeeze more performance from each watt consumed.

The culmination of this philosophy is AMD's MI300 series of APUs and GPUs, which form a dense sandwich of 3D-stacked compute, I/O dies, and interposers.

Rack scale salvation

Now, AMD is looking beyond the chip package, and even the individual server node, to the rack scale to drive efficiencies over the next few years.

"That's the way we're going to be able to deliver continued significant improvements is being able to architect almost at the data center level," Naffziger said.

AMD isn't the first to reach this conclusion. At GTC last year, Nvidia revealed its first rack-scale system, the GB200 NVL72. 

Traditionally, both companies' GPU systems have used high-speed interconnects like NVLink or InfiniBand to pool their resources, making four or eight accelerators function as one great big one.

With the GB200 NVL72, Nvidia extended this scale-up network to the rack level, using 18 NVLink switch chips to make the 120kW monster's 72 Blackwell GPUs behave as one.

This spring, Nvidia unveiled its plans to extend this architecture to 144 and eventually 576 GPUs and up to 600kW of power.

However, the idea dates back much further.

"Rack scale is really re-inventing the scale-up multi-processing that IBM did in the 80s with shared memory spaces, load and store," but rather than a few dozen System/370 mainframes, we’re now talking about tens, potentially hundreds of GPUs, Naffziger contends.

AMD's first rack-scale compute platform is slated to arrive next year with the launch of its MI400. Naffziger suggests it'll follow the same basic formula as Nvidia's NVL systems, albeit using the Universal Accelerator Link (UALink) interconnect rather than NVLink. However, future designs could end up looking quite a bit different.

Most notably, Naffziger expects that photonic interconnects could replace copper in scale-up fabrics within the next five years. Co-packaged optics (CPO) has long promised greater bandwidth and reach than copper cables or traces, but has been held back by the increased power consumption associated with the lasers.

"Everything's driven by economics, and we're at the point where economics will favor optical," Naffziger said.

For all the advantages co-packaged optics presents, it isn't perfect. 

"There are temperature sensitives with optical," Naffziger said. "There's a lot more to worry about than in electrical space... Now we've got to route fiber attach and make sure it's mechanically robust and not susceptible to vibration."

This might explain why Nvidia has focused its early photonics efforts on the scale-out Ethernet and InfiniBand networks rather than boutique chip-to-chip interconnects. Large scale-out networks already require extensive use of power-hungry pluggable optics. So, for its first batch of photonic switches, Nvidia is using CPO to eliminate the need for these devices.

However, for its NVLink switch fabric, the company appears to be opting for greater rack densities, up to 600kW by 2027, in order to stick to copper.

Hardware-software co-design will be key

As AMD prepares to scale up, Naffziger notes that process technology and improvements in semiconductor packaging will continue to play a role in achieving its 20x30 goal.

"There's still the remnants of Moore's law out there," he said. "We've got to use the latest process nodes."

While process technology isn't shrinking as quickly as it once did, there are still improvements to be had — especially when it comes to memory.

Naffziger pointed to 3D stacking and base-die customization for high-bandwidth memory (HBM) as potential avenues for driving down energy per bit and reducing overall power consumption.

HBM accounts for a substantial amount of accelerator power consumption today. You may recall that with the jump from 192GB on the MI300X to 256GB on the MI325X, power consumption increased by 250W. So any packaging technologies that allow for both higher bandwidth and capacity while also curbing power consumption are worth investigating at the very least.
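
To put energy per bit in perspective, here's a back-of-envelope sketch of how per-bit access energy translates into package power. The pJ/bit and bandwidth figures are illustrative assumptions, not AMD specifications:

```python
# Back-of-envelope memory power model. Both constants are
# illustrative assumptions, not AMD specifications.
PJ_PER_BIT = 4.0       # assumed HBM access energy, picojoules per bit
BANDWIDTH_TBPS = 6.0   # assumed aggregate HBM bandwidth, TB/s

bits_per_second = BANDWIDTH_TBPS * 1e12 * 8
memory_watts = bits_per_second * PJ_PER_BIT * 1e-12
print(f"~{memory_watts:.0f} W spent just moving bits through HBM")

# Halving pJ/bit, for instance via base-die customization, halves
# this line item even as bandwidth keeps climbing.
```

On those assumptions, memory traffic alone accounts for roughly 190W, which is why shaving picojoules per bit matters more than it sounds.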

Even at rack scale, Naffziger says the "biggest improvements are going to be the fruit of hardware-software co-design. The raw hardware gains are reaching diminishing returns."

AMD has trailed in software, particularly when it comes to low-level development. However, the situation has improved considerably in the year and a half since its MI300X made its debut.

The chip shop has invested considerable resources to optimize its ROCm software stack for a wide range of popular inference and training platforms, including vLLM, SGLang, and PyTorch.

These efforts have been bolstered by several acquisitions, including Nod.ai, Mipsology, and Brium. AMD has also been eager to attract AI talent. Most recently, Sharon Zhou, CEO of AMD-friendly startup Lamini, which helps companies tune LLMs to reduce hallucinations, announced on Wednesday that she plans to join the House of Zen's AI software efforts.

"When we talk about a rack-scale goal, there definitely are big opportunities in system architecture, system design, improved components, integration reducing the cost of communication," Naffziger said. "But we've got to map the workload optimally on that hardware."

FP8, and now FP4, support is just one example of this. On the model side, these lower-precision datatypes trade an often imperceptible drop in output quality for a smaller memory footprint. Meanwhile, halving the precision usually doubles an accelerator's floating-point throughput.
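
As a rough illustration, assuming a recent PyTorch build with float8 support, casting a weight tensor from bf16 to FP8 halves its footprint, and the round-trip error gives a feel for the quality trade-off:

```python
import torch

# Cast a bf16 weight matrix to FP8 (E4M3): same element count,
# half the bytes. Requires PyTorch 2.1+ with float8 support.
w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8 = w_bf16.to(torch.float8_e4m3fn)

mib = lambda t: t.numel() * t.element_size() / 2**20
print(f"bf16: {mib(w_bf16):.0f} MiB, fp8: {mib(w_fp8):.0f} MiB")

# The round-trip error hints at the output-quality cost
err = (w_fp8.to(torch.bfloat16) - w_bf16).abs().mean()
print(f"mean abs quantization error: {err.item():.4f}")
```

Production FP8 paths also need per-tensor or per-channel scaling and tuned kernels, which is exactly the sort of plumbing that takes time to land.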

However, it can take time for software to catch up with these new datatypes. It took the better part of a year from the MI300X's launch for the popular vLLM inference engine to extend support to AMD's FP8 implementation.

Software may be key to unlocking the full potential of AMD's silicon, but it also complicates performance measurement, particularly for AI workloads.

The AI ecosystem is moving incredibly quickly. In a matter of months, a model can go from bleeding edge to antiquated. "We can't assume Llama 405B is going to be here in 2030 and have any meaning," Naffziger said.

So, to track progress toward its 20x30 goal, AMD will use a combination of GPU FLOPS, HBM bandwidth, and network bandwidth, weighted differently for inference and training.
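
A minimal sketch of how such a blended metric might be tallied follows; the weights and the geometric-mean blend are assumptions for illustration, not AMD's published methodology:

```python
# Hypothetical blended efficiency metric: a weighted geometric mean
# of gains in compute, memory bandwidth, and network bandwidth.
# Weights and gains below are made up for illustration.
def blended_gain(flops_x, hbm_bw_x, net_bw_x, weights):
    f, h, n = weights  # exponents, assumed to sum to 1.0
    return (flops_x ** f) * (hbm_bw_x ** h) * (net_bw_x ** n)

gains = dict(flops_x=2.0, hbm_bw_x=1.8, net_bw_x=1.2)  # vs. a baseline

# Inference is assumed to lean on memory bandwidth; training on
# compute and the network. Real weightings would differ.
print(f"inference: {blended_gain(**gains, weights=(0.3, 0.5, 0.2)):.2f}x")
print(f"training:  {blended_gain(**gains, weights=(0.4, 0.2, 0.4)):.2f}x")
```

Divide a blended gain like that by system power, and you get the sort of perf-per-watt figure AMD can chart against its 2030 target. ®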
