AI chip startup d-Matrix aspires to rack scale with JetStream I/O cards
Who needs HBM when you can juggle SRAM speed and LPDDR bulk across racks
AI chip startup d-Matrix is pushing into rack scale with the introduction of its JetStream I/O cards, which are designed to allow larger models to be distributed across multiple servers or even racks while minimizing performance bottlenecks.
At face value, JetStream presents itself as a fairly standard PCIe 5.0 NIC. It supports two ports at 200 Gb/s or a single port at 400 Gb/s and operates over standard Ethernet.
According to CEO Sid Sheth, the startup looked at using off-the-shelf NICs from the likes of Nvidia, but company researchers weren't satisfied with the latency they could achieve using them. Instead, they opted to design their own NIC using a field programmable gate array (FPGA) that consumes about 150W. More importantly, d-Matrix claims that it was able to cut network latency to just two microseconds.
Two of these NICs are designed to be paired with up to eight of the chip vendor's 600-watt Corsair AI accelerators in a topology resembling the following.
d-Matrix's reference design has one JetStream I/O card serving up to four of its Corsair AI accelerators
The custom ASICs are no slouch either, with each card capable of churning out 2.4 petaFLOPS when using the MXINT8 data type or 9.6 petaFLOPS when using the lower-precision MXINT4 type.
d-Matrix's memory hierarchy pairs a rather large amount of blisteringly fast SRAM with much, much slower but higher-capacity LPDDR5 memory, the same kind you'd find in a high-end notebook.
d-Matrix's Corsair accelerators feature eight compute chiplets capable of churning out up to 9600 teraFLOPS of MXINT4
Each Corsair card contains 2GB of SRAM good for 150 TB/s along with 256GB of LPDDR5 capable of 400 GB/s. To put those figures in perspective, a single Nvidia B200 offers 180GB of HBM3e and 8TB/s of bandwidth.
As a reminder, AI inference is usually a bandwidth-constrained workload, which means the faster your memory is, the quicker it'll churn out tokens.
"Depending on the tradeoffs the customer wants to make between speed and cost, they can pick the type of memory they want to run the models on," Sheth said.
Two terabytes of LPDDR5 per node gives you enough capacity to run multi-trillion parameter models at 4-bit precision, but with 3.2 TB/s of aggregate memory bandwidth, it won't be quick.
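Those figures pencil out from the per-card specs. Here's a back-of-the-envelope sketch, assuming eight Corsair cards per node per d-Matrix's reference design and that each generated token requires streaming the full set of weights from memory once; it ignores KV caches, batching, and interconnect overhead, so treat it as illustrative rather than a benchmark.

```python
# Back-of-envelope: bandwidth-bound decode ceiling for one d-Matrix node.
# Assumes eight Corsair cards per node (per the reference design) and that
# every generated token streams the full set of weights once. Ignores KV
# caches, batching, and interconnect overhead -- illustrative only.

CARDS_PER_NODE = 8
LPDDR_PER_CARD_GB = 256        # LPDDR5 capacity per Corsair card
LPDDR_BW_PER_CARD_GBS = 400    # LPDDR5 bandwidth per card, in GB/s

def decode_ceiling_tokens_per_s(params_billions: float, bits_per_weight: int,
                                bandwidth_gbs: float) -> float:
    """Crude upper bound: aggregate bandwidth / bytes of weights per token."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / weight_bytes

lpddr_capacity_gb = CARDS_PER_NODE * LPDDR_PER_CARD_GB     # 2,048 GB (~2 TB)
lpddr_bw_gbs = CARDS_PER_NODE * LPDDR_BW_PER_CARD_GBS      # 3,200 GB/s (3.2 TB/s)

print(f"Node LPDDR5: {lpddr_capacity_gb} GB at {lpddr_bw_gbs} GB/s")
# A 4-trillion-parameter model at 4-bit weighs in around 2 TB -- it fits,
# but the decode ceiling out of LPDDR is correspondingly modest.
print(f"4T-param, 4-bit decode ceiling: "
      f"~{decode_ceiling_tokens_per_s(4000, 4, lpddr_bw_gbs):.1f} tokens/s")
```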
For maximum performance, d-Matrix suggests running models entirely in SRAM instead. But with 16GB of ultra-fast on-chip memory per node, those wanting to run larger, more capable models are going to need a lot of systems. With eight nodes per rack, d-Matrix estimates it can run models up to about 200 billion parameters at MXINT4 precision, with larger models possible when scaling across multiple racks.
With 16GB of SRAM per node, you'll need an awful lot of them, potentially spread across multiple racks, to run larger models on Corsair's faster memory tier
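For a rough sense of how the SRAM tier scales, here's a minimal capacity calculation, assuming 16GB of usable SRAM per node (eight 2GB Corsair cards), 4-bit weights, and no allowance for the KV cache or activations that would eat into that headroom in practice:

```python
# Rough sizing sketch: how many Corsair nodes does a model need to fit
# entirely in SRAM? Assumes 16 GB of SRAM per node (eight 2 GB cards) and
# 4-bit (MXINT4) weights; ignores KV cache and activation overhead.
import math

SRAM_PER_NODE_GB = 8 * 2   # eight Corsair cards, 2 GB of SRAM each

def nodes_needed(params_billions: float, bits_per_weight: int = 4) -> int:
    weight_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return math.ceil(weight_gb / SRAM_PER_NODE_GB)

for params in (70, 200, 405, 1000):
    n = nodes_needed(params)
    racks = math.ceil(n / 8)   # eight nodes per rack, per d-Matrix
    print(f"{params}B params @ 4-bit -> {n} node(s), ~{racks} rack(s)")
```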
And this is exactly what d-Matrix's JetStream I/O cards are designed to enable. The company is utilizing a combination of tensor, expert, data, and pipeline parallelism to maximize the performance of these rack-scale compute clusters.
With just 800 Gb/s of aggregate bandwidth coming off the two JetStream NICs in each system, d-Matrix appears to be following a similar strategy to the one we've seen in GPU systems to date: tensor parallelism to distribute the model weights and computational workload across the node's eight accelerators, and some combination of pipeline or expert parallelism to scale that compute across multiple nodes or racks.
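The arithmetic behind that split is straightforward: tensor parallelism triggers an all-reduce across the participating accelerators at essentially every layer, while a pipeline hop only ships a single token's activations between stages. The sketch below puts rough numbers on both for a hypothetical 70B-class model; the 8,192 hidden dimension, 80 layers, and 16-bit activations are our assumptions, not d-Matrix's figures.

```python
# Rough traffic comparison: why tensor parallelism stays inside the node while
# pipeline parallelism crosses the 800 Gb/s Ethernet links. Model dimensions
# are for a hypothetical 70B-class network; figures are illustrative only.

HIDDEN_DIM = 8192          # model hidden size (assumed)
NUM_LAYERS = 80            # transformer layers (assumed)
ACT_BYTES = 2              # 16-bit activations
TP_DEGREE = 8              # tensor parallelism across eight accelerators

# Pipeline parallelism: one activation vector crosses the wire per stage hop.
pp_bytes_per_token_hop = HIDDEN_DIM * ACT_BYTES              # ~16 KB

# Tensor parallelism: roughly two all-reduces of the hidden-state vector per
# layer, each shuffling something like (TP_DEGREE - 1) copies of that vector
# across the fabric.
tp_bytes_per_token = 2 * NUM_LAYERS * HIDDEN_DIM * ACT_BYTES * (TP_DEGREE - 1)

print(f"Pipeline hop per token:    ~{pp_bytes_per_token_hop / 1024:.0f} KB")
print(f"Tensor-parallel per token: ~{tp_bytes_per_token / 1e6:.0f} MB")
# Even a modest 100 GB/s of node-to-node bandwidth (800 Gb/s) comfortably
# carries the former; the latter is why weight sharding stays on the
# in-node fabric.
```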
The result is a bit of an inference assembly line, where the model is broken into chunks that are processed in sequence, one node at a time.
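In code, that assembly line looks something like the toy sketch below, with the model's layers carved into sequential stages, one per node, and each token's activations flowing through them in order. It's a generic pipeline-parallel pattern, not d-Matrix's actual software stack.

```python
# Toy sketch of pipeline-parallel inference: layers are split into sequential
# stages (one per node) and activations flow through them like an assembly
# line. Generic illustration only, not d-Matrix's software stack.
import numpy as np

HIDDEN = 1024
NUM_LAYERS = 16
NUM_STAGES = 4   # e.g. four nodes in the pipeline

rng = np.random.default_rng(0)
# Stand-in "layers": a weight matrix each. Real transformer layers are far
# more involved, but the data flow is the same.
layers = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.01 for _ in range(NUM_LAYERS)]

# Carve the layer list into contiguous stages, one per node.
per_stage = NUM_LAYERS // NUM_STAGES
stages = [layers[i * per_stage:(i + 1) * per_stage] for i in range(NUM_STAGES)]

def run_stage(stage_layers, activations):
    """What one node does: apply its slice of the model, pass the result on."""
    for w in stage_layers:
        activations = np.tanh(activations @ w)
    return activations

x = rng.standard_normal((1, HIDDEN))   # one token's activations
for node_id, stage in enumerate(stages):
    x = run_stage(stage, x)            # in reality this hop crosses JetStream
print("final activations shape:", x.shape)
```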
If that sounds familiar, this is how competing chip startups like Groq (not to be confused with xAI's authoritarian-obsessed LLM Grok) or Cerebras have managed to build extremely high-performance inference services without relying on high-bandwidth memory.
If you thought using 64 AI accelerators to run a 200 billion parameter model was a lot, Groq used 576 of its language processing units (LPUs) to run Llama 2 70B, albeit at a higher precision.
Despite the added complexity of distributing models across that many accelerators, keeping the weights in SRAM clearly has its benefits. d-Matrix says its Corsair chips can achieve generation latencies as low as 2 ms per token in models like Llama 3.1 70B. Add in some speculative decoding, like Cerebras has done with its own SRAM-packed chips, and we would not be surprised to see that performance jump by 2x or 3x.
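Speculative decoding, for the unfamiliar, has a small draft model guess several tokens ahead, which the big model then verifies in a single pass, keeping the longest run the two agree on. The toy sketch below shows the accept/reject loop in its simplest greedy form; it's a generic illustration of the technique, not Cerebras' or d-Matrix's implementation.

```python
# Toy greedy speculative decoding: a cheap draft model proposes a few tokens,
# the large target model verifies them, and we keep the longest agreeing
# prefix. Generic illustration of the technique only.
from typing import Callable, List

def speculative_decode(target_next: Callable[[List[int]], int],
                       draft_next: Callable[[List[int]], int],
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft model speculates k tokens ahead (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Target model checks each speculated position.
        accepted = 0
        for i in range(k):
            if target_next(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3. On a mismatch (or full acceptance), take one token from the
        #    target so the loop always makes progress.
        tokens.append(target_next(tokens))
    return tokens[:len(prompt) + max_new]

# Demo with stand-in "models": both follow the same fixed cycle, so the draft
# is usually right and most speculated tokens get accepted.
cycle = [3, 1, 4, 1, 5, 9, 2, 6]
target = lambda ctx: cycle[len(ctx) % len(cycle)]
draft  = lambda ctx: cycle[len(ctx) % len(cycle)]
print(speculative_decode(target, draft, prompt=[0, 0], max_new=10))
```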
- If Broadcom is helping OpenAI build AI chips, here's what they might look like
- India hails 'first' home-grown chip as a milestone despite very modest specs
- Nvidia details its itty bitty GB10 superchip for local AI development
- Top AWS chip designer reportedly defects to Arm as it weighs push into silicon
With JetStream, d-Matrix is better positioned to compete in this arena. However, the limited interconnect bandwidth afforded by the NICs means the company's rack-scale push is currently confined to scale-out, as opposed to the scale-up architectures Nvidia and AMD are now transitioning to with their NVL72 and Helios rack systems, the latter of which, we'll note, won't actually ship until next year at the earliest.
For its next-gen Raptor accelerators, d-Matrix plans to transition to an NVLink-style scale-up fabric based on scale-up Ethernet or UALink
However, it won't be long before d-Matrix joins them with its next-gen Raptor family of accelerators, which, in addition to 3D-stacked SRAM for higher capacity, will use an integrated electrical I/O chiplet to achieve rack-scale networking akin to Nvidia's NVLink.
While still a ways off, the d-Matrix roadmap has it eventually transitioning to an optical I/O chiplet, which will allow the architecture to scale across multiple racks or even rows of systems.
According to d-Matrix, JetStream is currently sampling to customers with production expected to ramp before the end of the year. ®
