Tenstorrent's Blackhole chips boast 768 RISC-V cores and almost as many FLOPS
Shove 32 of 'em in a box and you've got nearly 24 petaFLOPS of FP8 perf
Hot Chips RISC-V champion Tenstorrent offered the closest look yet at its upcoming Blackhole AI accelerators at Hot Chips this week, which it claims can outperform an Nvidia A100 in raw compute and scalability.
Each Blackhole chip boasts 745 teraFLOPS of FP8 performance (372 teraFLOPS at FP16), 32GB of GDDR6 memory, and an Ethernet-based interconnect capable of 1TBps of total bandwidth across its ten 400Gbps links.
The accelerator's 140 Tensix cores promise up to 745 teraFLOPS of FP8 performance.
Tenstorrent shows how its latest chip can offer a modest advantage in performance over an Nvidia A100 GPU, though it falls behind in both memory capacity and bandwidth.
However, just like the A100, Tenstorrent's Blackhole is designed to be deployed as part of a scale-out system. The AI chip startup plans to cram 32 Blackhole accelerators connected in a 4x8 mesh into a single node, which it calls the Blackhole Galaxy.
Tenstorrent's Blackhole Galaxy systems will mesh together 32 Blackhole accelerators for nearly 24 petaFLOPS of FP8 performance.
In total, a single Blackhole Galaxy promises 23.8 petaFLOPS of FP8 or 11.9 petaFLOPS at FP16, along with 1TB of memory capable of 16 TBps of raw bandwidth. What's more, Tenstorrent says the chip's core-dense architecture — we'll dive into that in a little bit — means each of these systems can function as a compute or memory node or as a high-bandwidth 11.2TBps AI switch.
"You can make an entire training cluster just using this as a Lego," said Davor Capalija, senior fellow of AI software and architecture at Tenstorrent.
Tenstorrent contends an entire training cluster can be built using nothing but Blackhole Galaxy systems as "Lego blocks."
By comparison, Nvidia's densest HGX/DGX A100 systems top out at eight GPUs per box, and manage just under 2.5 petaFLOPS of dense FP16 performance, making the Blackhole Galaxy nearly 4.8x faster. In fact, at the system level, Blackhole Galaxy should be competitive with Nvidia's HGX/DGX H100 and H200 systems, which manage roughly 15.8 petaFLOPS of dense FP8.
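For those keeping score, the system-level figures fall straight out of the per-chip numbers Tenstorrent quoted. Here's a quick sketch of the arithmetic, using Nvidia's published dense A100 and H100 figures as the comparison points:

```cpp
#include <cstdio>

int main() {
    // Per-chip figures from Tenstorrent's Hot Chips presentation
    const double fp8_per_chip_tflops  = 745.0;  // dense FP8
    const double fp16_per_chip_tflops = 372.0;  // dense FP16
    const double mem_per_chip_gb      = 32.0;   // GDDR6
    const int    chips_per_galaxy     = 32;     // 4x8 mesh

    // Aggregate Blackhole Galaxy figures
    double galaxy_fp8_pflops  = fp8_per_chip_tflops  * chips_per_galaxy / 1000.0; // ~23.8
    double galaxy_fp16_pflops = fp16_per_chip_tflops * chips_per_galaxy / 1000.0; // ~11.9
    double galaxy_mem_tb      = mem_per_chip_gb      * chips_per_galaxy / 1000.0; // ~1.0

    // Nvidia comparison points: 8x A100 at 312 dense FP16 TFLOPS apiece,
    // 8x H100 at roughly 1,979 dense FP8 TFLOPS apiece
    double hgx_a100_fp16_pflops = 8 * 312.0  / 1000.0;  // ~2.5
    double hgx_h100_fp8_pflops  = 8 * 1979.0 / 1000.0;  // ~15.8

    printf("Galaxy: %.1f PFLOPS FP8, %.1f PFLOPS FP16, %.1f TB memory\n",
           galaxy_fp8_pflops, galaxy_fp16_pflops, galaxy_mem_tb);
    printf("FP16 advantage over 8x A100: %.1fx\n",
           galaxy_fp16_pflops / hgx_a100_fp16_pflops);  // ~4.8x
    printf("8x H100 dense FP8: %.1f PFLOPS\n", hgx_h100_fp8_pflops);
    return 0;
}
```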
Tenstorrent's use of onboard Ethernet means it avoids the challenge of juggling multiple interconnect technologies for chip-to-chip and node-to-node networking, as Nvidia must with NVLink and InfiniBand/Ethernet. In this respect, Tenstorrent's scale-out strategy is quite similar to Intel's Gaudi platform, which also uses Ethernet as its primary interconnect.
Considering just how many Blackhole accelerators Tenstorrent plans to cram in one box, let alone a training cluster, it'll be interesting to see how they handle hardware failures.
Baby RISC-V meets Big RISC-V
Unlike its prior Grayskull and Wormhole parts, which were deployed as PCIe-based accelerators, Tenstorrent's Blackhole — not to be confused with Nvidia's similarly named Blackwell architecture — is designed to function as a standalone AI computer.
This, according to Jasmina Vasiljevic, senior fellow of ML frameworks and programming models at Tenstorrent, is possible thanks to the inclusion of 16 "Big RISC-V" 64-bit, dual-issue, in-order CPU cores arranged in four clusters. Critically, these cores are beefy enough to serve as an on-device host running Linux. These CPU cores are paired with 752 "Baby RISC-V" cores, which are responsible for memory management, off-die communications, and data processing.
The Blackhole accelerator is packed with 16 Big RISC-V and 752 Baby RISC-V cores.
The actual compute, however, is handled by 140 of Tenstorrent's Tensix cores, each of which is composed of five "Baby RISC-V" cores, a pair of routers, a compute complex, and some L1 cache.
The compute complex consists of a tile math engine designed to accelerate matrix workloads and a vector math engine. The former will support Int8, TF32, BF/FP16, FP8, as well as block floating point datatypes ranging from two to eight bits, while the vector engine targets FP32, Int16, and Int32.
Each of Blackhole's Tensix cores features five RISC-V baby cores, two routers, L1 cache, and matrix and vector engines.
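To put that layout in more concrete terms, here's an illustrative sketch of how one Tensix core breaks down, based solely on what Tenstorrent described; the type and field names are ours, not Tenstorrent's actual hardware or software definitions:

```cpp
#include <array>
#include <cstdint>

// Datatype split between the two engines, per Tenstorrent's description.
// Enum and struct names below are our own shorthand.
enum class TileEngineDType   { Int8, TF32, BF16, FP16, FP8, BlockFP /* 2- to 8-bit */ };
enum class VectorEngineDType { FP32, Int16, Int32 };

struct TensixCore {
    std::array<uint8_t, 5> baby_riscv;  // five "Baby RISC-V" cores orchestrate the engines
    std::array<uint8_t, 2> routers;     // a pair of network-on-chip routers
    uint32_t l1_cache_bytes;            // per-core L1
    // compute complex = one tile (matrix) math engine + one vector engine
};
```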
According to Capalija, this configuration means the chip can support a variety of common data patterns in AI and HPC applications including matrix multiplication, convolutions, and sharded data layouts.
Blackhole's baby cores can be programmed to support a variety of data movement patterns.
In total, Blackhole's Tensix cores account for 700 of the 752 so-called baby RISC-V cores on board. The remaining 52 are responsible for memory management ("D" for DRAM), off-chip communications ("E" for Ethernet), system management ("A"), and PCIe ("P").
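If you want to check Tenstorrent's sums, the core count arithmetic is simple enough; a quick sketch:

```cpp
#include <cstdio>

int main() {
    const int big_riscv  = 16;   // 64-bit, dual-issue, in-order host cores
    const int baby_riscv = 752;  // everything else
    const int tensix     = 140;  // Tensix compute cores
    const int babies_per_tensix = 5;

    int tensix_babies = tensix * babies_per_tensix;   // 700 baby cores live inside Tensix
    int housekeeping  = baby_riscv - tensix_babies;   // 52 left for DRAM, Ethernet,
                                                      // system management, and PCIe
    printf("total RISC-V cores: %d\n", big_riscv + baby_riscv);  // 768, the headline figure
    printf("baby cores on D/E/A/P duty: %d\n", housekeeping);    // 52
    return 0;
}
```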
Building a software ecosystem
Along with the new chip, Tenstorrent also disclosed its TT-Metalium low-level programming model for its accelerators.
As anyone who's familiar with Nvidia's CUDA platform knows, software can make or break the success of even the highest-performing hardware. In fact, TT-Metalium is somewhat reminiscent of GPU programming models like CUDA or OpenCL in that it's a heterogeneous programming model, but differs in that it was built from the "ground up for AI and scale out" compute, explained Capalija.
One of these differences is that the kernels themselves are plain C++ with APIs. "We didn't see a need for a special kernel language," he explained.
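To give a flavor of what that means in practice, here's a purely illustrative sketch of the shape such a kernel could take: ordinary C++ calling library functions to move tiles and drive the math engine, rather than code in a bespoke kernel language. The function and namespace names below are placeholders of our own invention, not TT-Metalium's actual API:

```cpp
// Illustrative only: ordinary C++ calling a made-up device-side API.
// None of these identifiers are TT-Metalium's real functions.
#include <cstdint>
#include <cstdio>

namespace fake_tt {  // placeholder namespace, not the real API
// Stubs standing in for library calls that would move tiles and fire the tile engine
inline void wait_for_input_tiles(int cb, int n)  { printf("wait cb%d x%d\n", cb, n); }
inline void matmul_tiles(int a, int b, int out)  { printf("matmul cb%d*cb%d->cb%d\n", a, b, out); }
inline void push_output_tiles(int cb, int n)     { printf("push cb%d x%d\n", cb, n); }
}

// A kernel is just a C++ function the host launches on each Tensix core --
// no special kernel language required
void matmul_kernel(uint32_t num_tiles) {
    constexpr int CB_A = 0, CB_B = 1, CB_OUT = 16;  // circular-buffer IDs, chosen arbitrarily
    for (uint32_t t = 0; t < num_tiles; ++t) {
        fake_tt::wait_for_input_tiles(CB_A, 1);     // block until an input tile arrives
        fake_tt::wait_for_input_tiles(CB_B, 1);
        fake_tt::matmul_tiles(CB_A, CB_B, CB_OUT);  // kick off the tile math engine
        fake_tt::push_output_tiles(CB_OUT, 1);      // hand the result downstream
    }
}

int main() { matmul_kernel(4); }  // the host side would normally enqueue this on the device
```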
Tenstorrent aims to support many standard model runtimes like TensorFlow, PyTorch, ONNX, JAX, and vLLM.
Combined with its other software libraries, including TT-NN, TT-MLIR, and TT-Forge, the stack is meant to let customers run any AI model on Tenstorrent's accelerators using commonly used runtimes like PyTorch, ONNX, JAX, TensorFlow, and vLLM.
Support for these high-level programming models should help abstract the complexity of deploying workloads across these accelerators, similar to what we've seen with AMD and Intel accelerators. ®