Ampere heads off Intel, AMD's cloud-optimized CPUs with a 192-core Arm chip
Just don't look too closely at the benchmarks
What's better than 128 cores? 192 of course, or that's the bet Ampere is making with the launch of its next-generation Arm datacenter processors this week.
Since breaking into the datacenter CPU market in 2020 with the launch of its 80-core Ampere Altra parts, the company's strategy of packing a ton of relatively small, efficient Arm cores into a single socket has paid dividends. Today, nearly every major cloud provider, with the exception of Amazon of course, has put Ampere's cores to work in their clouds.
With the launch of its 192-core Ampere One processor family this week, Ampere hopes to cement its hard-won cloud foothold, even as Intel and AMD circle like buzzards with their own core-optimized parts.
So what has Ampere brought to the table this time?
Not just more cores, but mostly that
If it wasn't already obvious: more, faster, feature-packed cores.
To be fair, that's what everyone seems to be doing this time around. Intel and AMD boosted their core counts by 50 percent and Ampere's strategy is no different. The difference is Ampere's chips have gobs more cores — twice as many as AMD and more than three times as many as Intel's priciest Xeon.
The Ampere One family is available in five SKUs ranging from 136 to 192 cores, picking up where the company's last-gen Altra chips left off. Because of this, Ampere tells us it'll be keeping the Altra family around for a little while longer.
And while these chips are still single-threaded, they're now fabbed on a combination of TSMC 5nm and 7nm process tech in a chiplet architecture.
A peek under the integrated heat spreader reveals that Ampere's approach to chiplets differs greatly from either AMD's or Intel's. Where AMD breaks up its 96 cores into 12 eight-core compute tiles that all talk to a single central memory and I/O die, Ampere has taken the opposite approach. All 192 cores reside within a single large die flanked by memory and I/O dies.
This has a couple of benefits, but the main one is that Ampere can theoretically achieve lower latencies. "This, I believe, is a more ideal way of architecting it, because it means that you don't have a bunch of hops from CPU to CPU; they're all sitting there together on the mesh," Ampere Chief Product Officer Jeff Wittich told The Register in a briefing ahead of the launch.
The other advantage is the ability to mix and match process tech. While Ampere's compute tile is fabbed using a 5nm process, its I/O and memory dies are built using an older 7nm process. "There's usually not a huge advantage to moving those over to the latest process node because the analog circuits don't scale in the same way," he explained.
With that said, Ampere is hardly the first to do this. AMD has used a heterogeneous mix of process tech to great effect in its Epyc and Ryzen processors for years now.
Digging deeper into Ampere One reveals the company's latest chips don't just boast more cores, but bigger cores at that. The new chips now feature 2MB of private L2 cache per core. The core design is also entirely new this time around. If Ampere is to be believed, you can expect substantial improvements in virtualization, mesh congestion management, branch prediction, security, and power management.
Ampere is particularly proud of its move away from Arm's prefab cores, as it allowed the company to ship many cloud-centric features like support for nested virtualization, confidential computing, memory tagging, and per-tenant memory bandwidth limits. We expect these features will make Ampere's chips even more attractive to cloud providers as they ramp up their confidential computing offerings.
But just like AMD and Intel's latest chips, there is a cost to adding so many more cores: thermals and power. The Ampere One family is quite a bit hotter and consumes a fair bit more power than its predecessor. Whereas Ampere Altra consumed between 1.25 and 1.4 watts per core, Ampere One has a much higher power budget at around 1.8 watts per core, which translates to about 200-350W per socket.
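As a quick sanity check, the sketch below multiplies core count by per-core power to see how the per-core and per-socket figures line up. It assumes power scales linearly with core count and ignores the memory and I/O dies, so treat it as back-of-the-envelope math rather than how Ampere actually rates its TDP. The function name is ours.

```python
# Back-of-the-envelope socket power from the per-core figures quoted above.
# Assumption: power scales linearly with core count; uncore power (mesh,
# memory and I/O dies) is not modeled separately.

def socket_power_watts(cores: int, watts_per_core: float) -> float:
    """Estimate socket power as cores multiplied by per-core power."""
    return cores * watts_per_core

# Ampere One flagship at roughly 1.8 W per core.
ampere_one = socket_power_watts(192, 1.8)

# For contrast: the same 192 cores at Altra's 1.25-1.4 W per core.
altra_low, altra_high = (socket_power_watts(192, w) for w in (1.25, 1.4))

print(f"Ampere One, 192 cores at 1.8 W: {ampere_one:.0f} W")
print(f"192 cores at Altra's per-core power: {altra_low:.0f}-{altra_high:.0f} W")
```

At 1.8 W per core, 192 cores works out to roughly 346W, right at the top of the quoted range, while the same core count at Altra's per-core figures would land at roughly 240-269W.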
This could be down to how Ampere is reporting the TDP for these chips. When Altra first launched, the chipmaker quoted the socket power — around 250W — rather than the real world power consumption of the chip, which we're told never exceeded 180W. We've asked Ampere whether the quoted 200-350W TDP reflects real-world consumption.
Playing catch-up on I/O
While Ampere is still leading the pack on core counts, the company is only now catching up on I/O. Ampere One is the chipmaker's first CPU to add support for DDR5 and PCIe 5.0.
By comparison, AWS launched its DDR5 and PCIe 5.0 compatible Graviton3 this time last year. Meanwhile AMD rolled out support for the next-gen memory and interface standards in November with the launch of Epyc 4, while Intel joined the party in January.
And like Intel, Ampere is sticking with an eight-channel configuration, with support for DDR5-4800 out of the box. This translates to about 50 percent more bandwidth than Altra. However, with 50 percent more cores, memory bandwidth per core actually remains flat at the top of the stack.
For comparison, AMD boosted its core counts by the same margin but added four memory channels, for a total of 12, in order to provide higher memory bandwidth per core, even on its flagship parts.
While eight channels will be supported at launch, Wittich tells The Register that the company is working on a 12-channel variant, which should perform better in bandwidth-constrained workloads.
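To see why bandwidth per core stays flat, here's a rough calculation of peak theoretical memory bandwidth, assuming standard 64-bit (8-byte) DDR channels. The Altra Max row (128 cores on eight channels of DDR4-3200) is our assumption about the previous top-of-stack part, and the 12-channel Ampere One row is hypothetical: Ampere hasn't said what memory speed or core count that variant will ship with, so it's modeled at DDR5-4800 and 192 cores purely for illustration.

```python
# Peak theoretical DDR bandwidth = channels x transfer rate x 8 bytes per transfer.
# Assumptions: 64-bit channels; Altra Max = 128 cores on 8x DDR4-3200;
# the 12-channel Ampere One entry is hypothetical (DDR5-4800, 192 cores).

BYTES_PER_TRANSFER = 8  # one 64-bit DDR channel

def peak_bandwidth_gbs(channels: int, mts: int) -> float:
    """Peak bandwidth in GB/s: MT/s x bytes gives MB/s, divided by 1000."""
    return channels * mts * BYTES_PER_TRANSFER / 1000

# name: (cores, channels, transfer rate in MT/s)
configs = {
    "Altra Max, 8x DDR4-3200": (128, 8, 3200),
    "Ampere One, 8x DDR5-4800": (192, 8, 4800),
    "Epyc 4 (Genoa), 12x DDR5-4800": (96, 12, 4800),
    "Hypothetical 12-channel Ampere One": (192, 12, 4800),
}

for name, (cores, channels, mts) in configs.items():
    total = peak_bandwidth_gbs(channels, mts)
    print(f"{name}: {total:.1f} GB/s total, {total / cores:.2f} GB/s per core")
```

By this math, Altra Max and the 192-core Ampere One both work out to about 1.6 GB/s per core, Genoa to 4.8 GB/s, and a 12-channel Ampere One to 2.4 GB/s. Sustained real-world bandwidth will be lower than these peak figures.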
Of course it's optimized for AI
No CPU launch in 2023 would be complete without at least a passing mention of AI acceleration. In fact, when Intel finally delivered its Sapphire Rapids Xeons in January, its AMX AI accelerator was one of the few bright spots for a chip that was never intended to compete with AMD's Genoa.
Ampere One is no exception to this rule. In addition to the dual 128-bit vector units per core found in the Altra generation, Ampere One adds support for Bfloat16, a floating point format optimized for machine learning.
However, short of real-world benchmarks, it's not clear how much value this will actually have. Ampere drew comparisons to AMD's Genoa (more on that in a second), while any comparison to Intel's AMX-equipped Xeon Scalables was conspicuously missing. And when we pointed this out, Wittich hemmed and hawed around the omission.
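For the curious, Bfloat16 is essentially the top half of an FP32 number: it keeps the sign bit and the full 8-bit exponent but only seven bits of mantissa, so it covers the same numeric range as FP32 with much less precision. The sketch below emulates the conversion in Python using round-to-nearest-even. It illustrates the number format only, not how Ampere's hardware implements it, and it doesn't special-case NaN; the function name is ours.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round x to bfloat16 and return it widened back to a Python float.

    Sketch only: round-to-nearest-even on the FP32 bit pattern, no NaN
    handling. Values outside the FP32 range raise OverflowError in struct.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw FP32 bits
    lsb = (bits >> 16) & 1                               # low bit of the kept half, for ties-to-even
    bits = (bits + 0x7FFF + lsb) & 0xFFFFFFFF            # round into the upper 16 bits
    upper = bits & 0xFFFF0000                            # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", upper))[0]

print(to_bfloat16(0.1))   # 0.10009765625: noticeably off from 0.1
print(to_bfloat16(1e10))  # still finite, since bfloat16 shares FP32's exponent range
```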
"While Sapphire Rapids and AMX, you know, delivers good imprints performance in some spaces… it's sort of a one-trick pony. It's good at AI inferencing; it's not particularly good at the rest of cloud workloads," he said.
A word on performance
So if Intel is a "one-trick pony," and Ampere holds such an impressive core lead, you might expect it to strut out a bunch of graphs touting superior performance over its x86 competitors, except we didn't really see that.
In a press briefing ahead of Thursday's launch, Ampere offered little in comparison to its x86 rivals. One of the few performance comparisons we did get was for virtual machine density. It claims a rack full of Ampere One CPUs can fit 2.9x as many VMs as AMD's 96-core Epyc 4 and 4.3x as many as Intel's 56-core Xeon 8480+.
But there's a pretty big asterisk there. That's not a CPU-to-CPU performance comparison; that's comparing how many cores each chipmaker can fit into a 16.4kW rack. For reasons that no doubt favor the company's high-core-count parts, Ampere likes to talk about performance and efficiency this way, but it's not exactly intuitive if you're evaluating a single server.
What's more, Ampere's claims don't take into consideration multi-threading on either AMD's or Intel's parts. Running one VM per thread might not be advisable in a multi-tenant environment, but if you're willing to do it, it could certainly net you more VMs.
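To put a rough number on that, the sketch below rescales Ampere's quoted ratios for the case where the x86 boxes run one VM per thread instead of one per core. It assumes the 2.9x and 4.3x figures are core-count ratios (as Ampere's rack comparison suggests), 2-way SMT on both the Epyc and Xeon parts, no SMT on Ampere One, and that the number of x86 servers fitting in the 16.4kW rack doesn't change. The function name is ours, not anything Ampere published.

```python
# Rescale a per-core VM-density ratio when the competitor runs one VM per thread.
# Assumptions: claimed ratios compare cores per rack; 2-way SMT on Epyc and Xeon;
# Ampere One runs one VM per core; rack contents are otherwise unchanged.

def ratio_with_smt(claimed_core_ratio: float, competitor_threads_per_core: int) -> float:
    """Divide the claimed ratio by the competitor's hardware threads per core."""
    return claimed_core_ratio / competitor_threads_per_core

claims = {"Epyc 4 (96 cores)": 2.9, "Xeon 8480+ (56 cores)": 4.3}

for rival, claimed in claims.items():
    adjusted = ratio_with_smt(claimed, 2)
    print(f"{rival}: {claimed}x per core -> {adjusted:.2f}x per thread")
```

Under those assumptions, the 2.9x advantage over Genoa shrinks to about 1.45x and the 4.3x advantage over Sapphire Rapids to about 2.15x.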
Don't miss further commentary and analysis of the 192-core Ampere One processor, code-named Siryn, over at our high-performance computing title The Next Platform.
The only other performance comparison Ampere wanted to show us was for AI inferencing on Stable Diffusion and the DLRM recommender system, which showed a rack full of 160-core Ampere One systems clobbering AMD's 96-core parts. Except it wasn't exactly a fair fight. Not only were the Ampere systems running a newer Linux kernel for the DLRM model, they were running at FP16 without Docker runtime overheads. Meanwhile, the Genoa system was stuck running at FP32. Since lower precision usually nets sizable performance improvements at the expense of accuracy, it's hard to take this as an apples-to-apples comparison.
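To illustrate that trade-off, the sketch below stores the same value as FP32 and FP16 using Python's struct module (format codes 'f' and 'e') and reads it back. FP16 halves the bytes per value, which is one reason lower precision tends to run faster, but the stored value drifts further from the original. It's a toy example of rounding error, not a reproduction of Ampere's or AMD's benchmark setup; the helper name is ours.

```python
import struct

def round_trip(x: float, fmt: str) -> float:
    """Store x in a narrower float format ('e' = FP16, 'f' = FP32) and read it back."""
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, x))[0]

value = 0.1
for name, fmt in (("FP32", "f"), ("FP16", "e")):
    stored = round_trip(value, fmt)
    size = struct.calcsize("<" + fmt)  # bytes per value in this format
    print(f"{name}: {size} bytes, stores {value} as {stored!r}, "
          f"error {abs(stored - value):.1e}")
```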
AMD, Intel close in on Ampere's turf
The unusual performance charts underscore the fact that neither AMD's Genoa nor Intel's Sapphire Rapids are the chips Ampere should be worried about.
By the time Ampere One hits the market in volume, it'll have to contend with AMD's 128-core Bergamo chips — expected sometime next month — and a few months later Intel's efficiency-core-toting Sierra Forest Xeons.
Both Sierra Forest and Bergamo are designed to combat the rise of Arm CPUs in the cloud. They're both built around the same core-concepts as Ampere's Altra and One families of CPUs, in that they'll feature a large number of relatively low-power cores.
And while the Arm ecosystem has matured greatly, helped in large part by Ampere and AWS' efforts to popularize the ISA in the datacenter, you just can't beat x86 for legacy compatibility. If it runs on a Xeon or Epyc today, it'll run on Sierra Forest or Bergamo tomorrow.
Despite this threat, Wittich remains confident that Ampere can hold its own. "It's validating that we're not the only one in this space. It would have been scary if we'd looked around and everyone said what those Ampere guys are doing is useless," he said. ®