AMD claims Nvidia's Grace CPU Superchip, Arm are no match for its Epyc Zen 4 cores
But does it matter when all Grace needs to do is babysit GPUs?
Comment AMD has claimed its current datacenter silicon is already more than twice as fast, and up to 2.75 times more efficient, than Nvidia's Grace CPU Superchips.
The chip design firm's assertions came after its own testing, published last week, in which it considered Nvidia's 2022 Grace CPU Superchip.
That product combines a pair of CPU dies packing 72 Arm Neoverse V2 cores apiece, connects them with a 900GB/sec NVLink chip-to-chip interconnect, and backs that with up to 960GB of speedy LPDDR5x memory. However, it appears AMD was testing the 480GB version.
To be clear, this isn't Nvidia's Grace-Hopper Superchip (GH200), which combines a single Grace CPU with up to 480GB of LPDDR5x and a 144GB H100 GPU die.
Against Nvidia's Grace CPU, AMD pitted single- and dual-socket systems running its Epyc 4 Genoa (9654) and Bergamo (9754) processors, each with 768GB of DDR5 4800MT/sec memory.
Across ten workloads – spanning general-purpose compute, server-side Java, power efficiency, transactional databases, decision support systems, web servers, in-memory databases, video encoding, and high-performance compute (HPC) – AMD boasted its kit delivered between 1.5x and 4x the performance of Nvidia's chip.
Your mileage may vary
As with any vendor-supplied benchmarks, take these with a grain of salt. You can find a more detailed breakdown of AMD's performance claims here, but here's a consolidated view.
Here's a consolidated view showing AMD's claimed lead over Nvidia's Grace-Grace CPU Superchip
Meanwhile, in the SPECpower_ssj2008 benchmark, AMD claimed a single 128-core Epyc 9754 offered roughly 2.5x better performance per watt than Nvidia's Arm Neoverse V2-based chip, while a pair of the Bergamo Epycs pushed that advantage to 2.75x.
AMD also attempted to dispel the commonly held belief that Arm systems are more energy-efficient by nature, showing as much as a 2.75x lead in performance per watt in some tests
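For readers wondering how a performance-per-watt lead like that is arrived at, here is a minimal sketch. SPECpower_ssj2008 reports an overall figure that divides the summed throughput (ssj_ops) across all measured load levels by the summed average power across those same levels, active idle included. The function names and every number below are hypothetical, picked only so the arithmetic is easy to follow; they are not AMD's or Nvidia's measurements, and the real benchmark uses more load levels than the three shown.

```python
def overall_ops_per_watt(ssj_ops, avg_watts):
    """Overall efficiency: summed throughput divided by summed average power.

    Both lists hold one entry per measured load level, in the same order.
    The active-idle level contributes zero throughput but non-zero power.
    """
    if len(ssj_ops) != len(avg_watts):
        raise ValueError("need one power reading per load level")
    return sum(ssj_ops) / sum(avg_watts)


def efficiency_ratio(system_a, system_b):
    """How many times more work per watt system_a delivers than system_b."""
    return overall_ops_per_watt(*system_a) / overall_ops_per_watt(*system_b)


# Hypothetical readings at three load levels: 100%, 50%, active idle.
# Each system is (throughput per level, average watts per level).
system_a = ([1000.0, 500.0, 0.0], [200.0, 120.0, 80.0])
system_b = ([600.0, 300.0, 0.0], [300.0, 200.0, 100.0])

print(overall_ops_per_watt(*system_a))       # 1500 / 400 = 3.75
print(overall_ops_per_watt(*system_b))       # 900 / 600 = 1.5
print(efficiency_ratio(system_a, system_b))  # 3.75 / 1.5 = 2.5
```

Because idle power sits in the denominator, a chip that sips power when idle can post a better overall score than its peak-load numbers alone would suggest, which is one reason a single headline ratio deserves scrutiny.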
None of this should come as a surprise to anyone who's been following Grace's development – though the situation isn't as simple as AMD would have you believe.
As our sibling site The Next Platform reported back in February, researchers at Stony Brook and Buffalo Universities compared performance data from Nvidia's Grace CPU Superchip and several x86 processors gathered from multiple scientific research institutes and one cloud builder.
Naturally, most of these tests were HPC-centric, including Linpack, High Performance Conjugate Gradient (HPCG), OpenFOAM, and Gromacs. While the Grace system's performance varied wildly between tests, at worst it fell somewhere between Intel's Skylake architecture (circa 2015) and its Ice Lake (2019) tech, bested AMD's Milan (from 2021) and came within spitting distance of a Xeon Max launched in early 2023.
The findings suggest that AMD's most powerful Genoa and Bergamo Epyc processors might beat out Nvidia's first datacenter CPU – on the right benchmark.
But as we alluded to earlier, all of this is workload dependent. In its Grace CPU Superchip datasheet, Nvidia shows the silicon achieving anywhere from 90 percent to 2.4x the performance of dual 96-core Epyc 9654s – that's the same Genoa Epyc used in AMD's tests – and up to three times the throughput in a variety of cloud and HPC services.
What's this really about?
While a good old CPU shootout might make sense – Grace and Epyc are, at the end of the day, both datacenter CPU platforms – we haven't really seen Nvidia's Grace CPU Superchips deployed widely outside of HPC applications, and usually in preparation for larger scale deployments of the next-generation GH200 silicon. The UK's Isambard-3 and Isambard-AI supercomputers are fine examples of that strategy in action.
Nvidia itself bills the CPU Superchip as one that's designed to "process mountains of data to produce intelligence with maximum energy efficiency," and specifically cites AI, data analytics, hyperscale cloud applications, and HPC applications.
What's more, in the GH200 configuration most of the computation is done by the GPU – Grace mostly keeps the accelerator fed with data. And clearly Nvidia thinks Grace and its NVLink-C2C interconnect are up to the task, as it chose to reuse the CPU on its upcoming GB200 superchips, which we looked at back at Nvidia's GTC developer conference.
That's arguably all Nvidia needs Grace to do for it to be successful. It also explains why the acceleration champ has already started work on its successor.
We have to imagine the number of folks cross-shopping Grace-Grace against 4th-gen Epyc – outside of the HPC arena of course – is a rather short list. In all honesty, we'd have been far more interested to see a head-to-head between the GH200 and AMD's MI300A APUs.
AMD closes its claims with a discussion on Arm compatibility – a topic worthy of many more benchmarks.
We get the sense AMD's tests may just be an exercise in dispelling fears that x86 is running out of steam and that Arm is taking over.
Arm isn't exactly new to the HPC community or the cloud – markets that are far from rejecting the architecture. In fact, every major US cloud provider now has an Arm CPU to call their own.
But if this is really about how AMD's Zen 4 and Zen 4c cores stack up against Arm's Neoverse V2 architecture, a comparison with Amazon Web Services' Graviton4 would have been more useful.
Announced late in 2023, Graviton4 is based on the same Neoverse V2 core as Grace, but boasts 96 cores and supports standard dual-socket configurations and 12 channels of DDR5, as opposed to Grace's soldered-down LPDDR5x modules.
Instances running Graviton4 have been available in preview for months now and became generally available last week. Perhaps more importantly, AWS offers both Epyc 4- and Graviton4-based instances, making the likelihood of someone comparing the two far higher. ®