How AMD, Intel, Nvidia are keeping their cores from starving
Chips are only as fast as the memory that feeds them
Analysis During the recent launch of its 96-core Epyc Genoa CPUs, AMD touched on one of the biggest challenges facing modern computing. For the past several years, the rate at which processors have grown more powerful has outpaced that of the memory subsystems that keep those cores fed with data.
"Anything that's using a very large memory footprint is going to need a lot of bandwidth to drive the cores," Gartner analyst Tony Harvey told The Register. "And if you're accessing that data randomly, then you’re going to be cache missing a lot, so being able to pull in data very quickly is going to be very useful."
And this is by no means a new phenomenon, especially in high-performance computing (HPC) workloads. Our sister site The Next Platform has been tracking the growing ratio of compute power to memory bandwidth for some time now.
But while the move to DDR5 4,800MTps DIMMs delivers a 50 percent bandwidth boost over the fastest DDR4 (3,200MTps), that alone wasn't enough to satiate AMD's 96-core Epycs. AMD's engineers had to make up the difference by increasing the number of memory controllers, and with them memory channels, to 12. Combined with the faster DDR5, Genoa offers more than twice the memory bandwidth of Milan.
The approach isn't without compromise. For one, adding more channels means dedicating more die space to memory controllers. There are also signal-integrity challenges in supporting the larger number of DIMMs attached to those channels. And then there's the problem of physically fitting all of those DIMMs into a conventional chassis, especially in a dual-socket configuration.
Because of this, AMD is likely to stay at 12 channels for at least the next few generations, and instead rely on improving DDR5 memory speeds to boost bandwidth.
Micron expects memory speeds to top 8,800MTps within DDR5's lifetime. In a 12-channel system, that works out to about 840GBps of memory bandwidth.
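For the back-of-the-envelope version: peak theoretical bandwidth is just the transfer rate multiplied by the 8-byte width of a DDR channel, multiplied by the channel count. The short C sketch below runs that arithmetic for an eight-channel DDR4-3200 Milan box, a 12-channel DDR5-4800 Genoa box, and a hypothetical 12-channel system at Micron's projected 8,800MTps; the configurations are illustrative, and real-world throughput will land below these peaks.

```c
#include <stdio.h>

/* Peak theoretical bandwidth: channels x transfer rate (MT/s) x 8 bytes per transfer.
   Result in GB/s (1 GB = 1e9 bytes). Sustained figures will be lower. */
static double peak_gbps(int channels, double megatransfers) {
    return channels * megatransfers * 8.0 / 1000.0;
}

int main(void) {
    double milan  = peak_gbps(8,  3200.0);  /* Epyc Milan: 8 x DDR4-3200, ~204.8 GB/s */
    double genoa  = peak_gbps(12, 4800.0);  /* Epyc Genoa: 12 x DDR5-4800, ~460.8 GB/s */
    double future = peak_gbps(12, 8800.0);  /* 12 channels at DDR5-8800, ~844.8 GB/s */

    printf("Milan  (8 x DDR4-3200):  %.1f GB/s\n", milan);
    printf("Genoa  (12 x DDR5-4800): %.1f GB/s (%.2fx Milan)\n", genoa, genoa / milan);
    printf("Future (12 x DDR5-8800): %.1f GB/s\n", future);
    return 0;
}
```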
"The performance of DDR5 is going to go up over time, but we're still going to have that big difference between the cores that are available and the memory bandwidth, and it's going to be difficult to keep them fed," Harvey said.
Optane lives on. Sort of
While AMD's approach to the problem involves physically cramming more memory controllers into its chips and faster DDR5 memory into the system, Intel has taken a different tack with the Xeon Max CPUs that will power the US Department of Energy's long-delayed Aurora supercomputer.
Formerly known as Sapphire Rapids HBM, the chips package 64GB of HBM2e memory capable of 1TBps of bandwidth in a 56-core 4th-gen Xeon Scalable processor.
And while you can technically run the chip entirely off the HBM, for those who need vast pools of memory for things like large natural-language models, Intel supports tiered memory in two configurations highly reminiscent of those offered by its recently axed Optane persistent memory business.
In Intel's HBM flat mode, any external DDR5 acts as a separately accessible memory pool. Meanwhile in caching mode, the HBM is treated more like a level 4 cache for the DDR5.
Although the latter may be attractive for some use cases since it's transparent and doesn't require any software changes, Harvey argues that if it behaves anything like Intel's Optane persistent memory, the HBM could go underutilized.
"Most of the time, CPUs are good at caching at the instruction level; they're not very good at caching at the application level," he said, adding that running the chip in flat mode could prove promising, even though it would require special considerations from software vendors.
"If you've got a large HBM cache effectively for main memory, then the operating system vendors, the hypervisor vendors, are going to be much better at managing that then the CPU will be," he said. "The CPU can't see about the level of instructions, whereas the hypervisor knows that I'm about to flip between this app and that app and therefore I can preload that app into HBM."
Co-packaged LPDDR
To achieve similarly high bandwidths for its first datacenter CPU, Nvidia is also moving the memory onto the CPU. But unlike Intel's Xeon Max, Nvidia isn’t relying on expensive, low-capacity HBM memory, and is instead using commodity LPDDR5x modules.
Each Grace Superchip fuses two Grace CPU dies — each with 72 Arm Neoverse V2 cores — connected by the chipmaker's 900GB/s NVLink-C2C interconnect. The dies are flanked by rows of LPDDR5x memory modules, providing roughly a terabyte per second of bandwidth and a terabyte of capacity.
While it's hard to know for sure, our best guess is each Grace CPU die is hooked up to eight 64GB LPDDR5x memory modules running somewhere in the neighborhood of 8,533MTps. This would work out to 546GBps of bandwidth for each of the two CPU dies.
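For what it's worth, the arithmetic behind that guess is simple enough: assume each of those LPDDR5x packages sits on a 64-bit bus, and the quoted figures fall out. The C sketch below runs the numbers; the package count, capacity, and bus width are our assumptions, not Nvidia's published specs.

```c
#include <stdio.h>

int main(void) {
    /* Assumed Grace Superchip configuration (educated guesses, not Nvidia specs):
       two CPU dies, eight 64GB LPDDR5x packages per die, each on a 64-bit bus
       running at 8,533 MT/s. */
    const int    dies         = 2;
    const int    pkgs_per_die = 8;
    const double gb_per_pkg   = 64.0;
    const double mtps         = 8533.0;
    const double bus_bytes    = 8.0;    /* 64-bit interface per package */

    double per_die_gbps = pkgs_per_die * mtps * bus_bytes / 1000.0; /* ~546 GB/s */
    double total_gbps   = per_die_gbps * dies;                      /* ~1.1 TB/s  */
    double total_gb     = dies * pkgs_per_die * gb_per_pkg;         /* 1,024 GB   */

    printf("Per die:   %.0f GB/s\n", per_die_gbps);
    printf("Superchip: %.0f GB/s, %.0f GB capacity\n", total_gbps, total_gb);
    return 0;
}
```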
Apple actually employed a similar approach, albeit using slower LPDDR5 6,400MTps memory, to achieve 800GBps of memory bandwidth on its M1 Ultra processors, launched in the Mac Studio earlier this year. However, Apple's reasons for doing so had less to do with per-core memory bandwidth and more to do with feeding the chip's integrated GPUs.
For Nvidia, the method offers a few apparent advantages over something like HBM, the biggest being capacity and cost. HBM2e can be had in capacities of up to 16GB per stack from vendors like Micron, which means Nvidia would need four times as many modules to match the capacity of those 64GB LPDDR5x packages.
But even this approach isn't without compromise, according to Harvey. Baking the memory onto the CPU package means you're giving up flexibility. If you need more than 1TB of system memory, you can't just add more DIMMs to the mix — at least not how Nvidia has implemented things.
However, for Nvidia's target market for these chips, it probably still makes sense, Harvey explained. "Nvidia is very much focused on AI/ML workloads which have a particular set of needs, whereas Intel is more focused at that general purpose workload."
CXL isn't the answer, yet
Both AMD's Genoa and Intel's 4th-gen Xeon Scalable processors add support for the CXL 1.1 interconnect standard.
Early implementations of the tech by companies like Astera Labs and Samsung will allow for novel memory configurations including memory expansion and memory tiering.
However, for the moment, the limited bandwidth available to these devices means their usefulness for addressing the mismatch between CPU and memory performance is limited.
AMD's implementation features 64 lanes dedicated to CXL devices. That works out to about 63GBps of bandwidth for an x16 expansion module, a bit short of the 76.8GBps that two channels of DDR5-4800 can deliver.
"It might open up some stuff for memory bandwidth over time, but I think the initial implementations may not be fast enough," Harvey said.
With future generations of PCIe, that could change. The interconnect typically doubles its bandwidth with each generation, so by PCIe Gen 7.0 a single CXL x16 device would have closer to 250GBps available to it.
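To put rough numbers on that scaling: a CXL x16 link riding on PCIe 5.0 tops out at roughly 64GBps in each direction before overheads, and each PCIe generation doubles it. The C sketch below compares those peak rates with the 38.4GBps a single DDR5-4800 channel provides; this is raw-rate arithmetic only, ignoring protocol overhead and latency.

```c
#include <stdio.h>

int main(void) {
    /* Peak one-direction bandwidth of an x16 link, ignoring encoding and
       protocol overhead: PCIe 5.0 runs at 32 GT/s per lane (~4 GB/s). */
    const int    lanes        = 16;
    const double gen5_gtps    = 32.0;
    const double ddr5_4800_ch = 4800.0 * 8.0 / 1000.0;  /* 38.4 GB/s per channel */

    double gbps = lanes * gen5_gtps / 8.0;  /* bits -> bytes; ~64 GB/s at Gen 5 */
    for (int gen = 5; gen <= 7; gen++) {
        printf("PCIe Gen %d x16: ~%.0f GB/s, about %.1f DDR5-4800 channels\n",
               gen, gbps, gbps / ddr5_4800_ch);
        gbps *= 2.0;  /* bandwidth roughly doubles each generation */
    }
    return 0;
}
```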
For now, Harvey argues, CXL will be most valuable for memory-hungry applications that aren't especially sensitive to bandwidth, or as part of a tiered memory configuration. ®