d-Matrix bets on in-memory compute to undercut Nvidia in smaller AI models
Microsoft among in-memory AI chip startup's backers
Generative AI infrastructure builds have given chip startups a hardware niche yet to be targeted by larger players, and in-memory biz d-Matrix has just scored $110 million in Series-B funding to take its shot.
d-Matrix's cash came from investors including Microsoft and Singapore's sovereign wealth fund Temasek.
Founded in 2019, the Santa Clara-based chipmaker has developed a novel in-memory compute platform to support inferencing workloads. That distinguishes it from rivals that have focused on training AI models – a topic that has generated a lot of attention as generative AI systems like Midjourney and large language models (LLMs) like GPT-4 and Llama 2 grab the headlines.
Training often involves crunching tens or even hundreds of billions of parameters, necessitating massive banks of expensive high-performance GPUs. And in preparation for the new AI world order, titans like Microsoft, Meta, Google, and others are buying up tens of thousands of those accelerators.
But training models is only part of the job. Inferencing – the process of putting an AI to work in a chat bot, image generation, or some other machine learning workload – also benefits from dedicated hardware.
d-Matrix thinks it has a shot at competing with GPU juggernauts like Nvidia with specialist inferencing kit.
Compute in a sea of SRAM
d-Matrix has developed a series of in-memory compute systems designed to alleviate some of the bandwidth and latency constraints associated with AI inferencing.
The startup's latest chip, which will form the basis of its Corsair accelerator sometime next year, is called the Jayhawk II. It features 256 compute engines per chiplet, integrated directly into a large pool of shared static random-access memory (SRAM). For reference, your typical CPU has multiple layers of SRAM cache, some of it shared and some of it tied to a specific core.
In a recent interview, d-Matrix CEO Sid Sheth explained that his team has managed to collapse the cache and compute into a single construct. "Our compute engine is the cache. Each of them can hold weights and can compute," he said.
The result is a chip with extremely high memory bandwidth – even compared to High Bandwidth Memory (HBM) – while also being cheaper, the chip biz claims. The downside is SRAM can only hold a fraction of the data stored in HBM. Whereas a single HBM3 stack might top out at 16GB or 24GB of capacity, each of d-Matrix's Jayhawk II chiplets contains just 256MB of shared SRAM.
Because of this, Sheth says the outfit's first commercial product will feature eight chiplets connected via a high-speed fabric, for a total of 2GB of SRAM. He claims the 350-watt card should deliver somewhere in the neighborhood of 2,000 TFLOPS of FP8 performance and as much as 9,600 TOPS of Int4 or block floating point math.
As we understand it, that's only for models that can fit within the card's SRAM.
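To give a rough sense of what "fits within the card's SRAM" means, here's a back-of-the-envelope sketch. The bytes-per-parameter values and the small overhead allowance for activations are our own assumptions, not d-Matrix figures – the point is simply that only fairly small or aggressively quantized models squeeze into a 2GB weight budget.

```python
# Back-of-the-envelope sketch (not d-Matrix's numbers): does a model's
# weight footprint fit in one Corsair card's 2GB of SRAM? The bytes-per-
# parameter values and 10 percent overhead allowance are our assumptions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}
CARD_SRAM_GB = 2  # eight chiplets x 256MB of shared SRAM

def weights_gb(params_billion: float, precision: str, overhead: float = 0.10) -> float:
    """Estimated weight footprint in GB, with a fudge factor for activations."""
    return params_billion * BYTES_PER_PARAM[precision] * (1 + overhead)

for params in (3, 7, 13, 40):
    for prec in ("fp8", "int4"):
        size = weights_gb(params, prec)
        verdict = "fits" if size <= CARD_SRAM_GB else "spills over"
        print(f"{params}B params @ {prec}: ~{size:.1f}GB -> {verdict} a single {CARD_SRAM_GB}GB card")
```

By this arithmetic, a 3 billion parameter model quantized to Int4 squeaks in, while anything much bigger has to spill over or spread across multiple cards – which is where the rest of the design comes in.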
For larger models of up to 40 billion parameters, each card is equipped with 256GB of LPDDR memory – good for 400GB/sec of bandwidth – to handle any overflow, though Sheth admits doing so incurs a performance penalty. Instead, he says, early customers piloting its chips have distributed their models across as many as 16 cards – 32GB of SRAM in total.
There's a penalty associated with doing this too, but Sheth argues performance is still predictable – so long as you stay within a single node.
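Extending the same rough arithmetic – again using our own assumed bytes-per-parameter figures, not d-Matrix's – shows why the 16-card ceiling matters: a 40 billion parameter model at 8-bit precision wants around 40GB of weights, more than 16 cards' worth of SRAM, while the same model at Int4 fits comfortably.

```python
import math

# Continuing the rough arithmetic above: how many 2GB cards would a model's
# weights occupy, and does that stay under the 16-card / 32GB SRAM ceiling
# Sheth describes? Bytes-per-parameter figures remain our own assumptions.
CARD_SRAM_GB = 2
MAX_CARDS = 16

def cards_needed(params_billion: float, bytes_per_param: float) -> int:
    """Number of 2GB cards needed to hold the weights entirely in SRAM."""
    return math.ceil(params_billion * bytes_per_param / CARD_SRAM_GB)

for prec, bpp in (("fp8", 1.0), ("int4", 0.5)):
    n = cards_needed(40, bpp)
    verdict = "stays in SRAM" if n <= MAX_CARDS else "overflows to LPDDR"
    print(f"40B model @ {prec}: ~{n} cards -> {verdict} (node cap: {MAX_CARDS} cards)")
```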
AI is not a one-size-fits-all affair
Because of this limitation, d-Matrix has its sights set on the lower end of the datacenter AI market.
"We are not really focused on the 100 billion-plus, 200 billion-plus [parameter models] where people want to do a lot of generic tasks with extremely large language models. Nvidia has a great solution for that," Sheth conceded. "We think … most of the consumers are concentrated in that 3–60 billion [parameter] bucket."
Karl Freund, an analyst at Cambrian AI, largely agrees. "Most enterprises will not be deploying trillion parameter models. They may start from a trillion parameter model, but then they'll use fine tuning to focus that model on the company's data," he predicted in an interview with The Register. "Those models are going to be much, much smaller; they're gonna be … 4–20 billion parameters."
And for models of this size, an Nvidia H100 isn't necessarily the most economical option for AI inference. We've seen the PCIe cards selling for as much as $40,000 on eBay.
Much of the cost associated with running these models, Freund explained, comes down to the use of speedy high-bandwidth memory. By comparison, the SRAM used in d-Matrix's accelerators is faster and cheaper, but also limited in capacity.
Lower costs appear to have already caught the attention of M12, Microsoft's Venture Fund. "We're entering the production phase when LLM inference TCO becomes a critical factor in how much, where, and when enterprises use advanced AI in their services and applications," M12's Michael Stewart explained in a statement.
"d-Matrix has been following a plan that will enable industry-leading TCO for a variety of potential model service scenarios using flexible, resilient chiplet architecture based on a memory-centric approach."
A narrow window of opportunity
But while the silicon upstart's AI accelerator might make sense for smaller LLMs, Freund notes that it has a fairly short window of opportunity to make its mark. "One must assume that Nvidia will have something in market by this time next year."
One could argue that Nvidia already has a card tailored to smaller models: the recently announced L40S. The 350-watt card tops out at 1,466 TFLOPS of FP8 and trades HBM for 48GB of cheaper, but still performant, GDDR6. Even so, Freund is convinced Nvidia will have a more competitive AI inferencing platform before long.
Meanwhile, several cloud providers are pushing ahead with custom silicon tuned to inferencing. Amazon has its Inferentia chips and Google recently showed off its fifth-gen Tensor Processing Unit.
Microsoft is also said to be working on its own datacenter chips – and, last we heard, is hiring electrical engineers to spearhead the project. That said, all three of the big cloud providers are known to hedge their custom silicon bets against commercial offerings. ®