SiFive expands from RISC-V cores for AI chips to designing its own full-fat accelerator
Seems someone's looking for an Arm wrestle
SiFive, having designed RISC-V CPU cores for various AI chips, is now offering to license the blueprints for its own homegrown full-blown machine-learning accelerator.
Announced this week, SiFive's Intelligence XM series clusters promise a scalable building block for developing AI chips large and small. The idea is that others can license the RISC-V-based designs to integrate into processors and system-on-chips – to be placed in products from edge and IoT gear to datacenter servers – and hopefully foster more competition between architectures.
Fabless SiFive is no stranger to the AI arena. As we've previously reported, at least some of Google's tensor processing units are already using SiFive's X280 RISC-V CPU cores to manage the machine-learning accelerators and keep their matrix multiplication units (MXUs) fed with work and data.
Likewise, John Ronco, SVP and GM of SiFive UK, told The Register that SiFive's RISC-V designs also underpin the CPU cores found in Tenstorrent's newly disclosed Blackhole accelerator, which we looked at in detail at Hot Chips last month.
And in a canned statement, SiFive CEO Patrick Little claimed the US-based outfit is now supplying RISC-V-based chip designs to five of the "Magnificent 7" companies – Microsoft, Apple, Nvidia, Alphabet, Amazon, Meta and Tesla – though we suspect not all that silicon necessarily involves AI.
What sets SiFive's Intelligence XM-series apart from previous engagements with the likes of Google or Tenstorrent is that rather than having its CPU cores attached to a third-party matrix math engine, all packaged up in the same chip, SiFive is instead bringing out its own complete AI accelerator design for customers to license and put into silicon. This isn't aimed at semiconductor players capable of crafting their own accelerators, such as Google and Tenstorrent – it's aimed at organizations that want to take an off-the-shelf design, customize it, and send it to the fab.
"For some customers, it's still going to be right for them to do their own hardware," Ronco said. "But, for some customers, they wanted more of a one-stop shop from SiFive."
In this sense, these XM clusters are a bit like Arm's Compute Subsystem (CSS) designs in that they offer customers a more comprehensive building block for designing custom silicon. But instead of general application processors, SiFive is targeting those who want to make their own AI accelerators.
A closer look at the XM Cluster
SiFive's base XM cluster is built around four of its Intelligence X-series RISC-V CPU cores, connected to an in-house matrix math engine designed specifically for powering through neural network calculations in hardware. If you're not familiar, we've previously explored SiFive's X280 and newer X390 X-series core designs, the latter of which can be configured with a pair of 1,024-bit vector arithmetic logic units.
The base XM cluster comprises four Intelligence X RISC-V CPU cores tied to a matrix engine – Click to enlarge. Source: SiFive
Each of these clusters boasts support for up to 1TB/sec of memory bandwidth via a coherent hub interface, and is expected to deliver up to 16 TOPS (tera-operations per second) of INT8 or 8 teraFLOPS of BF16 performance per gigahertz.
TeraFLOPS per gigahertz might seem like an odd metric, but it's important to remember this isn't a complete chip and performance is going to be determined in large part by how many clusters the customer places in their component, how it's all wired up internally, what else is on the die, what the power and cooling situation is, and how fast it ends up clocked.
At face value, these XM clusters may not sound that powerful – especially when you consider SiFive expects most chips based on the design to operate at around 1GHz. However, stick a few together and the performance potential adds up quickly.
Ronco expects most chips based on the design will utilize somewhere between four and eight XM clusters, which in theory would allow for between 4–8TB/sec of peak memory bandwidth and up to 32–64 teraFLOPS of BF16 performance – and that's assuming a 1GHz operating clock.
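To put those per-gigahertz figures in chip-level terms, here's a rough back-of-the-envelope sketch – our arithmetic, not SiFive's – using only the per-cluster specs quoted above:

```python
# Rough napkin math: per-cluster figures are quoted per gigahertz, so chip-level
# peaks scale with cluster count and clock speed. Numbers below are only the
# per-cluster specs SiFive has disclosed; the function itself is illustrative.

def xm_chip_peaks(clusters: int, clock_ghz: float = 1.0) -> dict:
    """Theoretical peak figures for a hypothetical chip built from XM clusters."""
    return {
        "int8_tops": clusters * 16 * clock_ghz,   # 16 TOPS per cluster per GHz
        "bf16_tflops": clusters * 8 * clock_ghz,  # 8 teraFLOPS per cluster per GHz
        "mem_bw_tbs": clusters * 1.0,             # up to 1TB/sec per cluster
    }

for n in (4, 8):
    print(n, "clusters:", xm_chip_peaks(n))
# 4 clusters at 1GHz -> 32 teraFLOPS BF16, 4TB/sec; 8 clusters -> 64 teraFLOPS, 8TB/sec
```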
That's still far slower than something like an Nvidia H100, which can churn out nearly a petaFLOPS of dense BF16 performance. But as we mentioned earlier, FLOPS aren't everything – especially when it comes to bandwidth-constrained workloads like AI inferencing. There are considerations like price, power, process node, and everything else.
- SiFive offers potential Neoverse N2 rival – the P870-D RISC-V core for datacenters
- RISC-V PCIe 5 SSD controller for the rest of us hits 14GB/s
- Alibaba's research arm promises server-class RISC-V processor due this year
- Tenstorrent's Blackhole chips boast 768 RISC-V cores and almost as many FLOPS
For this reason, Ronco expects SiFive's XM clusters probably won't be used as widely for AI training. That said, the design isn't limited to eight clusters.
Ronco was hesitant to say how far the design can scale – some of this is probably down to process tech and die area. However, the company's product slide deck suggests 512 XM clusters is within the realm of possibility. Again, this will be up to the customer to decide what's appropriate for their specific application.
SiFive suggests that as many as 512 XM clusters could be packed together to achieve 4 petaFLOPS of AI performance – Click to enlarge
Assuming the end customer can actually maintain a 1GHz clock speed without running into thermal or power limitations, 512 XM clusters would deliver roughly four petaFLOPS of BF16 matrix compute, rivaling Nvidia's upcoming Blackwell accelerators. For comparison, Nvidia's top-specced Blackwell GPUs boast 2.5 petaFLOPS of BF16 performance.
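Running the same napkin math on that top-end configuration – again, our arithmetic, assuming the per-cluster BF16 figure holds at scale:

```python
# 512 clusters x 8 teraFLOPS per cluster per GHz x 1GHz, expressed in petaFLOPS
clusters, tflops_per_cluster_per_ghz, clock_ghz = 512, 8, 1.0
print(clusters * tflops_per_cluster_per_ghz * clock_ghz / 1000, "petaFLOPS BF16")
# -> 4.096 petaFLOPS, versus the 2.5 petaFLOPS quoted for Nvidia's top Blackwell GPUs
```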
Along with its new XM clusters, SiFive says it will also offer an open source reference implementation of its SiFive Kernel Library to reduce barriers to adoption for RISC-V architectures. ®
PS: Arm this week announced it is adding its Kleidi library to PyTorch and ExecuTorch, allowing apps using those frameworks to use a device's host Arm cores to accelerate AI work. That's acceleration using specialized instructions in the CPUs, rather than a dedicated accelerator.