AVX10: The benefits of AVX-512 without all the baggage
Turns out bigger isn't always better
Since its introduction, AVX-512 has gotten a bit of a bad rap for being hot, power hungry, and inconsistent in its implementation and feature set.
Recall that Linux kernel dev Linus Torvalds famously said he hoped the SIMD instruction set would "die a painful death."
With the recent introduction of AVX10, Intel signaled its efforts to address many of these frustrations. Detailed as part of Intel's new Advanced Performance Extensions, AVX10 [PDF] is essentially a reset of Intel's AVX spec.
To refresh, AVX-512 first showed up in Intel's quirky PCIe-based Xeon Phi accelerators circa 2013 but eventually made the leap into standard Xeons with the launch of Skylake-based Xeon Scalable processors in 2017.
On paper, the potential for massive 512-bit vector registers to accelerate workloads was substantial, but taking advantage of them, especially in the early days, came with some hefty concessions.
SIMD (Single Instruction/Multiple Data) does just what it claims: It allows the CPU to execute instructions on multiple sets of data simultaneously. The benefit is increased parallelism and by extension, a boost in performance. The downside is these instructions hit the CPU a lot harder, since it's doing more work with every clock cycle. In practice, this has translated into higher power consumption and heat. As a result, executing AVX-512 workloads, at least in the early days, resulted in steep frequency penalties, which weren't great for systems running mixed workloads.
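To make the parallelism concrete, here is a minimal Python sketch that models vector lanes rather than running real SIMD hardware. The function names are hypothetical and chosen for illustration only; the point is simply that a wider register covers the same array with fewer instructions.

```python
import math


def vector_instructions_needed(num_elements, element_bits, register_bits):
    """Count the instructions needed to touch every element once,
    assuming each instruction processes one full register of lanes."""
    lanes = register_bits // element_bits
    return math.ceil(num_elements / lanes)


def simd_add(a, b, lanes):
    """Model a SIMD add: each loop iteration stands in for one vector
    instruction that adds up to `lanes` element pairs at once."""
    result, issued = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        issued += 1
    return result, issued


# Adding 1,024 32-bit floats at different register widths.
for width in (128, 256, 512):
    print(width, vector_instructions_needed(1024, 32, width))
```

In this toy model a 512-bit register handles 16 single-precision values per instruction, so the 1,024-element add takes 64 instructions instead of 1,024 scalar ones. Real-world speedups are smaller, in part because of the frequency penalties described above.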
"That gave it sort of a black eye of 'Hey, if I use AVX-512, I'm gonna have this big frequency drop, so I can't use it'," Intel Fellow Ronak Singhal told The Register. "We've spent a lot of effort on making sure that that really goes away as a barrier to adoption."
Beyond a wider register width, AVX-512 has a couple of advantages over AVX2, SSE, and other SIMD instructions. Two of the biggest, according to Intel Fellow Arjan van de Ven, are its 32 registers, twice that of AVX2, and the introduction of K-masks. "A lot of the performance comes from those extra registers [and] from the K-masks; not so much the rest," he said.
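For readers unfamiliar with K-masks, the idea can be sketched in a few lines of Python. This is a model of the concept, not Intel's implementation: `masked_add` is a hypothetical helper, and the mask is an ordinary integer in which bit i decides whether lane i is written.

```python
def masked_add(dst, a, b, mask, zeroing=False):
    """Model a mask-predicated vector add.

    Lanes whose mask bit is 1 receive a[i] + b[i]. Lanes whose mask
    bit is 0 either keep the old destination value (merge-masking)
    or are cleared to zero (zero-masking).
    """
    result = []
    for lane, (old, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> lane) & 1:
            result.append(x + y)
        else:
            result.append(0 if zeroing else old)
    return result


dst = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

print(masked_add(dst, a, b, 0b0101))                # [11, 9, 33, 9]
print(masked_add(dst, a, b, 0b0101, zeroing=True))  # [11, 0, 33, 0]
```

Per-lane predication of this kind lets a compiler vectorize loops that contain conditionals or ragged tails without a separate blend step, which is one plausible reading of why Van de Ven credits the masks with so much of the gain.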
Another recent addition to the AVX-512 spec was support for FP16 and bfloat16 math. This functionality showed up in Cooper Lake in 2019 and is commonly employed in machine learning applications.
All of these features are only available if you are using AVX-512. This means that even if you don't care about the wider register width, you have to use AVX-512 to take advantage of any of them, Singhal explained.
Benefits without the baggage
This is one of the problems that AVX10 seeks to address. "The idea here is, how do I bring those capabilities, the value of those capabilities, to everyone regardless of whether you invest in building support for the 512 bit width?" Singhal said. With AVX10, "now, even if I'm focused on the 256 bit width, I can still get the goodness of everything in AVX-512."
In this regard, AVX10 isn't so much the next evolution of AVX-512, but a redistribution of features across Intel's entire AVX implementation. It's "an opportunity to reset what is the baseline and the foundation that software can count on," he said.
Under the new spec, AVX10-compatible chips will, for the most part, share a common feature set, including 32 registers, K-masks, and FP16 support, and will support 256-bit-wide registers at a minimum.
Van de Ven notes the new spec should address many of the frustrations raised by Torvalds back in 2020. "We listened very carefully to his feedback… part of his gripe was inconsistency," he said. "Inconsistency makes it harder for people to use it, and, if it's hard to use, it doesn't get used."
In terms of implementation, Van de Ven told The Register that, once the spec is fully fleshed out, most applications should be able to take advantage of the new SIMD instructions with nothing more than a recompile. Intel says it will, of course, provide additional tools for the one percent that want to further optimize their code.
Another benefit of decoupling these features from AVX-512 is lower power overhead. "In terms of power and thermals, the extra registers and K-masks make the same code more efficient. That gives you a performance benefit, but the performance benefit is also a power benefit," Van de Ven said. "As an example, if your matrix multiply is suddenly 10 percent faster, you take 10 percent less time; your total power consumption is down by give or take 10 percent."
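Van de Ven's back-of-the-envelope math can be checked with a short Python sketch. It assumes energy is simply average power multiplied by runtime and, unless told otherwise, that average power draw stays the same; both function names are hypothetical.

```python
def energy_joules(avg_power_watts, runtime_seconds):
    """Energy consumed by a job: average power times runtime."""
    return avg_power_watts * runtime_seconds


def energy_saving_fraction(speedup, power_ratio=1.0):
    """Fraction of energy saved by a faster run.

    speedup     = old_runtime / new_runtime
    power_ratio = new_avg_power / old_avg_power (1.0 means unchanged)
    """
    return 1.0 - power_ratio / speedup


# "10 percent faster" read as 1.1x throughput: about 9.1 percent saved.
print(round(energy_saving_fraction(1.1), 4))

# "10 percent less time" read literally: about 10 percent saved.
print(round(energy_saving_fraction(1 / 0.9), 4))
```

Either reading lands within his "give or take 10 percent," so long as the faster code does not draw meaningfully more power while it runs.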
What AVX10 means for Intel's lineup
While AVX-512 has been a mainstay of Intel's Xeon processors and could be had in the chipmaker's high-end desktop (HEDT) parts, the instruction set only appeared in Intel's consumer platform beginning with its 11th-gen parts in 2021.
However, the availability of AVX-512 instructions on consumer hardware was short-lived. With the introduction of its 12th-gen Core-series processors a few months later, Intel disabled and eventually fused off AVX-512 support entirely. We're told this was due in part to the chips' hybrid core architecture, which featured a combination of performance and efficiency cores.
"At the time, our e-cores didn't support AVX-512," Singhal explained. So, to ensure consistency between the two core architectures, AVX-512 was disabled on the p-cores by default.
While the introduction of AVX10 means chips gain many of AVX-512's features, it doesn't necessarily mean we'll see 512-bit vector registers on Intel's e-cores any time soon.
For now, 512-bit vectors and 64-bit opmask registers will be available on some P-core processors to support vector-heavy compute workloads that benefit from the wider vector length. So, there's no guarantee 512-bit registers are going back to consumer platforms either.
Singhal told The Register the company is leaving itself some room in the specification, and that 256-bit instructions will be the minimum width required by the AVX10 instruction set. In other words, Intel isn't ruling out the possibility of 512-bit vector registers on e-core chips, but don't expect them anytime soon.
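For software, the practical upshot is that 256-bit code can be treated as the floor and 512-bit code as an opt-in. The Python sketch below illustrates that decision only; `pick_vector_width` is a hypothetical helper, and how a program actually learns the maximum width a chip supports is not covered here.

```python
# Minimum vector width an AVX10 chip must support, per Intel's comments.
AVX10_MIN_WIDTH_BITS = 256


def pick_vector_width(max_supported_bits, preferred_bits=512):
    """Choose the vector width a kernel should use.

    max_supported_bits: widest vector length the chip reports
                        (assumed to be supplied by the caller).
    preferred_bits:     widest length the software has code for.
    """
    if max_supported_bits < AVX10_MIN_WIDTH_BITS:
        raise ValueError("chip does not meet the 256-bit AVX10 baseline")
    # Never go below the baseline, never exceed what the chip offers.
    return max(AVX10_MIN_WIDTH_BITS, min(max_supported_bits, preferred_bits))


print(pick_vector_width(256))                      # 256
print(pick_vector_width(512))                      # 512
print(pick_vector_width(512, preferred_bits=256))  # 256
```

The same 256-bit path runs everywhere in this model, and only chips that report the wider length take the 512-bit route.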
This makes sense when you consider that Intel's Xeon roadmap will embrace p-cores and e-cores but not on the same chips. Sierra Forest will be Intel's first Xeon based entirely around e-cores. However, it appears that Granite Rapids, which is slated to launch some time after Sierra Forest in 2024, will be the first to include AVX10 support. For the time being, Intel emphasizes that it will continue to support AVX-512 on older Xeons.
Intel's plan to revamp its AVX implementation with AVX10 comes almost a year after AMD rolled out AVX-512 support on its processors. However, it remains unclear whether widespread availability of compatible hardware will drive a new round of investment in the technology. ®