Video Groq, an AI chip startup founded by ex-Googlers, today said it is now shipping its hardware to customers.
That includes the Groq Node, a 5U data-center-grade box designed to handle machine-learning workloads that consumes 3.3kW of power and delivers up to six POPS, we're told. That's six peta-operations-per-second, or more specifically, six quadrillion INT8 calculations per second. Each Node houses eight Groq PCIe cards, and Nodes can be interconnected via all sorts of topologies using 200G Ethernet or InfiniBand HDR. There are two AMD second-generation Epyc processors inside gluing the tech together.
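For a sense of scale, the Node figures above can be broken down per card and per watt. This is a back-of-the-envelope sketch rather than anything Groq has published: it assumes the six POPS is split evenly across the eight cards, and it charges the whole 3.3kW, host processors and networking included, against the INT8 throughput.

```python
# Figures quoted in the article for one Groq Node.
NODE_INT8_OPS_PER_SEC = 6e15   # "six POPS"
NODE_POWER_WATTS = 3.3e3       # 3.3kW
CARDS_PER_NODE = 8

# Assumption: throughput is divided evenly between the eight cards.
pops_per_card = NODE_INT8_OPS_PER_SEC / CARDS_PER_NODE / 1e15

# Assumption: the full 3.3kW (CPUs, fans, and NICs included) is counted.
tops_per_watt = NODE_INT8_OPS_PER_SEC / NODE_POWER_WATTS / 1e12

print(f"{pops_per_card:.2f} POPS per card")
print(f"{tops_per_watt:.2f} TOPS per watt")
```

On those assumptions, each card contributes 0.75 POPS and the box as a whole lands at roughly 1.8 INT8 TOPS per watt.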
The cards each feature Groq's custom AI accelerator chip, which, it is claimed, can do 18,900 inferences per second using ResNet-50 v2 at batch size one. The startup reckons its silicon – dubbed the Tensor Streaming Processor or TSP – is the "fastest commercially available AI/ML accelerator, with a responsiveness measured in hundredths of a millisecond," beating the likes of Nvidia.
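The "hundredths of a millisecond" claim can be sanity-checked against the throughput figure. A minimal sketch, assuming inferences at batch size one are handled strictly one after another so that time per inference is simply the reciprocal of throughput (a pipelined chip could show a different end-to-end latency):

```python
def per_inference_ms(inferences_per_second: float) -> float:
    """Time per inference in milliseconds, assuming no overlap between inferences."""
    return 1000.0 / inferences_per_second

# 18,900 inferences per second is the ResNet-50 v2 figure Groq claims.
latency_ms = per_inference_ms(18_900)
print(f"{latency_ms:.4f} ms per inference")
```

That works out to about 0.053ms, or roughly five hundredths of a millisecond, which is in line with the startup's wording.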
The chip, we note, was designed with the help of Marvell, which provided building blocks to form the ASIC and its interfaces with the outside world, while Groq concentrated on the heart of the beast: the AI acceleration. There's also a software development kit available for tapping into the hardware, and Nimbix hosts instances of the TSP in its cloud.
The processor is a 14nm affair fabbed by GlobalFoundries. It can do up to 1 INT8 POPS at 1.25GHz, or 0.82 INT8 POPS at 1GHz, and 205 TFLOPS at 1GHz using FP16. It has 80TB/s of on-die memory bandwidth. It sports 220MB of chip memory, has a 725mm² die with 28.6 billion transistors, and uses PCIe 4 to communicate, according to an in-depth report by analysts at The Linley Group. It is said to be shipping in production quantities this year.
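Dividing those peak rates by the clock speed gives the amount of work done per cycle. The sketch below uses only the numbers quoted above; the one added assumption is the common convention that a multiply-accumulate counts as two operations, which the figures themselves do not state.

```python
OPS_PER_MAC = 2  # assumption: one multiply-accumulate = one multiply + one add

def ops_per_cycle(ops_per_second: float, clock_hz: float) -> float:
    """Peak operations completed in a single clock cycle."""
    return ops_per_second / clock_hz

int8_at_1ghz = ops_per_cycle(0.82e15, 1.0e9)      # 0.82 INT8 POPS at 1GHz
int8_at_1_25ghz = ops_per_cycle(1.0e15, 1.25e9)   # 1 INT8 POPS at 1.25GHz
fp16_at_1ghz = ops_per_cycle(205e12, 1.0e9)       # 205 TFLOPS at 1GHz

print(f"INT8 ops/cycle at 1GHz:    {int8_at_1ghz:,.0f}")
print(f"INT8 ops/cycle at 1.25GHz: {int8_at_1_25ghz:,.0f}")
print(f"INT8 MACs/cycle at 1GHz:   {int8_at_1ghz / OPS_PER_MAC:,.0f}")
print(f"INT8-to-FP16 ratio:        {int8_at_1ghz / fp16_at_1ghz:.1f}x")
```

The two INT8 figures give 820,000 and 800,000 operations per cycle respectively, a small gap that presumably comes from the 1.25GHz number being rounded to a whole POPS. Under the two-operations-per-MAC assumption that is a little over 400,000 MACs per cycle, and INT8 peak throughput is four times the FP16 rate.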
You can read a paper explaining the low-latency TSP architecture, by Groq employees, here [PDF]. It is not like your usual microprocessor or accelerator. In fact, it throws everything normal out the window. It clocks instructions vertically down through a grid of hundreds of function units, and clocks so-called superlanes of data horizontally through these units.
The compiler is left to schedule these streams of data so that the bytes coincide with the necessary instructions as they move down the pipelines. It looks like a 144-wide VLIW architecture, basically. It would not be the first processor to rely on a compiler to schedule its instructions. For Groq, the end result, among other things, is the execution of more than 400,000 integer multiply-and-accumulate operations per cycle, potentially.
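The idea of compile-time scheduling can be illustrated with a toy model. This is emphatically not Groq's compiler: it assumes a stream of data enters at one edge of the chip and advances exactly one function unit per clock with no stalls, so its arrival time at every unit is known in advance. The `Op` class, `issue_cycles` function, and unit positions are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    unit_position: int  # hypothetical: hops between stream entry and this unit

def issue_cycles(ops, stream_start_cycle):
    """Map each op to the cycle its instruction must issue.

    Toy assumption: data moves one function unit per cycle and never stalls,
    so it reaches position p exactly p cycles after it enters.
    """
    return {op.name: stream_start_cycle + op.unit_position for op in ops}

ops = [Op("load", 0), Op("multiply", 3), Op("accumulate", 5), Op("store", 9)]
schedule = issue_cycles(ops, stream_start_cycle=10)

for name, cycle in schedule.items():
    print(f"cycle {cycle}: issue {name}")
```

Because nothing in this model depends on run-time conditions, the whole schedule is fixed before the program runs, which is the property that lets hardware of this kind do without dynamic scheduling logic.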
Groq has an interesting history in that it was founded by Googlers, it's based in Mountain View near the web giant's Silicon Valley HQ, and its co-founder and CEO Jonathan Ross played a role in the development of Google's custom AI accelerator, the TPU, aka the Tensor Processing Unit. You can watch an overview of the startup's technology here, or a more technical introduction below. The TSP is most certainly not a redo of the TPU, we note.
Nicole Hemsoth over at our sister site, The Next Platform, took a look at Groq late last year, and interviewed Ross, here. That conversation revealed the startup seems to be primarily focused on inference rather than training. It sees neural-network training as work that can be done by throwing a lot of compute at something one time, with a fixed cost, whereas inference has to scale efficiently to meet demand, and has to be dynamic and in real-time. You can train a model using 4,000 servers over 40 hours, but what happens when the model needs to make a decision for 40,000 users within a second? That's where Groq's TSP might come in.
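The contrast can be put into rough numbers. The sketch below treats training as a one-off bill in server-hours and inference as capacity that must grow with demand. It borrows the 18,900 inferences per second that Groq claims for ResNet-50 v2, and assumes, purely for illustration, that each user's decision costs one such inference.

```python
import math

# One-time training cost from the example above.
TRAINING_SERVERS = 4_000
TRAINING_HOURS = 40
training_server_hours = TRAINING_SERVERS * TRAINING_HOURS

def accelerators_needed(requests_per_second: int, inferences_per_second_each: int) -> int:
    """Chips required to keep up with demand, rounded up to whole devices."""
    return math.ceil(requests_per_second / inferences_per_second_each)

# Assumption: one ResNet-50-sized inference per user decision.
CLAIMED_INFERENCES_PER_SECOND = 18_900

chips_for_40k = accelerators_needed(40_000, CLAIMED_INFERENCES_PER_SECOND)
chips_for_400k = accelerators_needed(400_000, CLAIMED_INFERENCES_PER_SECOND)

print(f"Training: {training_server_hours:,} server-hours, paid once")
print(f"Inference: {chips_for_40k} chips for 40,000 users a second")
print(f"Inference: {chips_for_400k} chips for 400,000 users a second")
```

The training bill stays at 160,000 server-hours however popular the model becomes, while the inference fleet goes from three chips to 22 as demand rises tenfold.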
"From the time of the first TPU deployments it became clear inference was the much bigger problem," Ross told The Next Platform. "Training is pretty much a solved problem. They can always chip away at accuracy and precision but the time it takes to train is not as big of a problem any longer. Costs are down and it’s a one-time cost, not a revolving one.
"Inference is an inherently larger market. Training scales with the number of machine learning researchers you have, inference scales with the number of queries or users. Training is compiling, inference is running.
"Inference is much more difficult too, for that matter. Training can be solved by throwing a bunch of money at the problem; it can be solved at a system level by taking existing architectures, stitching a bunch of chips together, and getting a sufficient gain. With inference, it's about deploying that across a large fleet of devices, perhaps millions of servers, each with their own inference device."