NextSilicon Maverick-2 promises to blow away the HPC market Nvidia left behind

The one chip startup building accelerators for something other than AI boasts performance up to 10x that of modern GPUs using a fraction of the power

Researchers and engineers working in particle physics, materials analysis, or drug discovery haven't exactly been spoiled for choice when it comes to chips capable of the highly precise double-precision calculations these workloads depend on. NextSilicon aims to change that with Maverick-2, a chip aimed not at AI but at the high-performance computing (HPC) community.

This week, the chip startup offered the best look yet at how its dataflow accelerators — now deployed by several customers, including Sandia National Laboratories — hold up in the wild, claiming up to a 10x advantage over leading GPUs.

In the High-Performance Conjugate Gradient (HPCG) benchmark, popularized by the biannual Top500 HPCG leaderboard of supercomputers, NextSilicon claims to match "leading GPU" performance while consuming half the power. In the test, the startup says the chip achieved 600 gigaFLOPS at a mere 750 watts.
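A quick bit of back-of-the-envelope arithmetic puts those claimed figures in perspective — 600 gigaFLOPS at 750 watts works out to 0.8 gigaFLOPS per watt on HPCG:

```python
# Efficiency implied by NextSilicon's claimed HPCG figures.
hpcg_gflops = 600.0   # claimed HPCG throughput, gigaFLOPS
power_watts = 750.0   # claimed power draw, watts

gflops_per_watt = hpcg_gflops / power_watts
print(f"{gflops_per_watt:.2f} GFLOPS/W")  # prints "0.80 GFLOPS/W"
```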

Meanwhile, in PageRank, NextSilicon says the chip achieves 10x higher graph analytics performance compared to modern GPUs — noting that for graphs larger than 25 GB the GPUs couldn't even finish the benchmark.
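PageRank is a sparse, irregular graph kernel — each iteration scatters tiny contributions across the whole graph, which is exactly the memory-access pattern GPUs struggle with at scale. A minimal power-iteration sketch (illustrative only, not NextSilicon's benchmark code; it assumes every node has at least one outgoing edge):

```python
# Minimal PageRank power iteration over an edge list.
def pagerank(edges, n, damping=0.85, iters=50):
    """edges: list of (src, dst) pairs; n: number of nodes."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Each node starts with the teleport term...
        new = [(1.0 - damping) / n] * n
        # ...then collects damped contributions from its in-neighbors.
        for s, d in edges:
            new[d] += damping * rank[s] / out_deg[s]
        rank = new
    return rank

# Tiny 3-node cycle: ranks stay uniform at 1/3 each.
ranks = pagerank([(0, 1), (1, 2), (2, 0)], 3)
```

The per-edge scatter (`new[d] += ...`) is the random-access step that blows out GPU memory on very large graphs — the behavior NextSilicon is alluding to with its 25 GB figure.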

Finally, for high-throughput databases — an area with some applications for AI agents, retrieval augmented generation (RAG), and vector search — NextSilicon claims Maverick-2 achieved 32.6 giga-updates per second (GUPS), making it 22x faster than CPUs and 6x faster than competitor GPUs.
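GUPS measures random read-modify-write throughput on a large table — a workload bound by memory latency rather than arithmetic. A sketch of the access pattern being measured (not the official HPCC RandomAccess benchmark, and with toy sizes so it runs quickly):

```python
import random
import time

def random_updates(table_size=1 << 16, n_updates=1 << 18):
    """Hammer a table with pseudo-random read-modify-write updates
    and return the achieved rate in GUPS (giga-updates per second)."""
    table = [0] * table_size
    rng = random.Random(0)
    start = time.perf_counter()
    for _ in range(n_updates):
        idx = rng.getrandbits(64) % table_size  # pseudo-random address
        table[idx] ^= 1                         # tiny read-modify-write
    elapsed = time.perf_counter() - start
    return n_updates / elapsed / 1e9

gups = random_updates()
```

Because each update touches an unpredictable address, caches and prefetchers help very little — which is why CPUs and GPUs post such modest GUPS numbers and why NextSilicon is keen to highlight the metric.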

What CPUs and GPUs is NextSilicon using to draw these comparisons? We're still waiting to hear back.

New competition in the HPC arena will be welcome among those working in the sciences. Amid the AI boom, chip designers like Nvidia have traded double-precision (FP64) and single-precision (FP32) performance for ultra-low-precision datatypes like FP8 and now FP4, which are better suited to AI inference and training.

Each of the GPUs in Nvidia's GB300 offers just 1.3 teraFLOPS of FP64 performance, down from 45 teraFLOPS in the GB200. This has left AMD's Instinct family of GPUs as one of the last bastions for truly high-performance double-precision compute, with 81.7 vector and 163.4 matrix FP64 teraFLOPS.

With the launch of Maverick-2 in both 96GB PCIe cards and 192GB OAM modules, HPC buyers now have another option. NextSilicon has yet to share vector or matrix performance at either precision, but has said that the chip achieves 4x the performance per watt at double precision of an Nvidia B200, and 20x that of Intel's 32-core Sapphire Rapids parts.

However, as our sibling site The Next Platform previously pointed out, how many peak teraFLOPS of FP64 Maverick-2 can theoretically deliver may not even matter. That's because, for a variety of reasons, CPUs and GPUs rarely approach these figures in the real world anyway. If you need proof, just compare Rmax and Rpeak for supercomputers on the Top500.

Taking dataflow mainstream

NextSilicon claims that its silicon's dataflow architecture is a whole lot more efficient in part because it dedicates the vast majority of the chip's die area to compute logic.

The chip is essentially a grid of arithmetic logic units (ALUs) interconnected in a graph, where each unit is configured to perform a specific operation, whether it be multiplication, addition, or some other logical operation. When an input arrives at one of these units, the computation triggers.

This, the company says, eliminates overhead because there are no instructions to fetch, decode, or schedule; data simply flows through the compute.

"In a traditional processor you have a cookbook (program) that you follow step by step regardless of whether the ingredients (data) are ready," CEO Elad Raz explained in a blog post. "In a dataflow processor, each cooking station activates the moment its ingredients arrive, working in parallel with other stations."

NextSilicon isn't the first to attempt this kind of architecture. As we understand it, Groq's LPUs are based on a similar principle. The tricky bit, as is the case with any new accelerator, is getting them to run without requiring developers to start over from scratch.

If NextSilicon is to be believed, this shouldn't be a problem for Maverick-2, thanks to a compiler that can take C++, Python, Fortran, CUDA, and other AI frameworks and map the compute graphs directly to the chip.

The performance results shared in today's blog post were supposedly achieved using unmodified code, which, if true, should make the chips quite attractive to supercomputing centers that have struggled to adopt GPUs up to this point.

It's hard to say how well NextSilicon's compiler actually works in practice, but we may not have to wait long to find out. That's because Maverick-2 is already running at Sandia National Laboratories in the Vanguard-II supercomputer.

A CPU core in the works

NextSilicon isn't just designing accelerators; it's also got a new high-performance RISC-V CPU core in the works called Arbel.

The company isn't new to RISC-V CPU design. Maverick-2 also used a custom RISC-V core to handle serial code that couldn't easily be parallelized. That chip performed well enough that the org has opted to pursue a standalone core.

NextSilicon says the core, which has apparently already been implemented in TSMC 5nm, will support clock speeds up to 2.5 GHz, feature a 10-wide issue pipeline and a 480-entry reorder buffer, support 16 scalar instructions, and integrate four 128-bit vector units for single instruction, multiple data (SIMD) workloads.

However, it doesn't appear that NextSilicon plans to license the tech as others, like Tenstorrent, are doing. Instead, the company's goal is to vertically integrate, similar to what Nvidia has done with its Grace CPUs in its GH200 and GB200 superchips.

"When you control both general purpose computing and specialized acceleration, you can optimize the entire stack in ways that simply aren't possible when you're dependent on someone else's CPU architecture," Raz explained. ®
