Next-gen Meta AI chip serves up ads while sipping power
Fresh silicon won't curb Zuck's appetite for GPUs just yet
After teasing its second-gen AI accelerator in February, Meta is ready to spill the beans on this homegrown silicon, which is already said to be powering ad recommendations in 16 regions.
The Facebook goliath has been designing custom accelerators for all manner of workloads, ranging from video streaming to the machine learning that drives the recommender models behind its advertising empire.
The latest addition to the Meta Training and Inference Accelerator (MTIA) family claims 3x higher performance and a 1.5x perf-per-watt advantage over the first-gen part, which our friends at The Next Platform analyzed last year.
According to Meta, the second-generation chip, which we're going to call MTIA v2 for the sake of consistency, was designed to balance compute, memory capacity, and bandwidth to get the best possible performance for the hyperscaler's internal ranking and recommender models.
Digging into the design, the accelerator features an 8x8 grid of processing elements (PEs), which together offer 3.5x higher dense compute performance than MTIA v1, or 7x higher with sparsity enabled.
Meta's latest AI accelerator, above, is already powering the hyperscaler's ranking and recommender models. Source: Meta
Beyond using a smaller 5nm TSMC process node and boosting the clock speed from 800MHz to 1.35GHz, Meta notes several architectural and design improvements that contributed to the latest part's performance gains. These include support for sparse computation, more on-die and off-die memory, and an upgraded network-on-chip (NoC) with twice the bandwidth of the old model. Here's how the first and second generation compare:
|  | MTIA v1 | MTIA v2 |
| --- | --- | --- |
| Process tech | 7nm TSMC | 5nm TSMC |
| Die area | 373mm² | 421mm² |
| PEs | 8x8 grid | 8x8 grid |
| Clock speed | 800MHz | 1.35GHz |
| INT8 perf | 102 TOPS | 354/708* TOPS |
| FP16/BF16 perf | 51.2 TFLOPS | 177/354* TFLOPS |
| PE mem | 128KB per PE | 384KB per PE |
| On-chip mem | 128MB | 256MB |
| Off-chip mem | 64GB | 128GB |
| Off-chip mem BW | 176GB/s | 204GB/s |
| Connectivity | 8x PCIe Gen 4.0 - 16GB/s | 8x PCIe Gen 5.0 - 32GB/s |
| TDP | 25W | 90W |
* Sparse performance. You can find a full breakdown of both chips here.
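Those gen-on-gen claims fall straight out of the peak figures in the table. A quick back-of-envelope check — peak INT8 numbers only, not measured model throughput:

```python
# Back-of-envelope check of the gen-on-gen uplift using the peak
# INT8 figures from the table above (datasheet peaks, not measured
# model throughput).
mtia_v1_int8_tops = 102           # dense
mtia_v2_int8_dense_tops = 354
mtia_v2_int8_sparse_tops = 708

print(f"Dense uplift:  {mtia_v2_int8_dense_tops / mtia_v1_int8_tops:.1f}x")   # ~3.5x
print(f"Sparse uplift: {mtia_v2_int8_sparse_tops / mtia_v1_int8_tops:.1f}x")  # ~6.9x
```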
It should be noted that the MTIA v2 won't eliminate the web goliath's need for GPUs. Meta supremo Mark Zuckerberg has previously said his mega-corporation will deploy 350,000 Nvidia H100 accelerators and will have the equivalent of 600,000 H100s operational by year's end.
Instead, MTIA follows an increasingly familiar pattern for Meta (and others) of developing custom silicon tailored to specific tasks. The idea is that while the kit may not be as flexible as a CPU or GPU, an ASIC deployed at scale can be more efficient.
While the latest chip consumes nearly four times the power of its predecessor, it's capable of up to 7x the floating point performance. Pitted against a GPU, Meta's latest accelerator manages 7.8 TOPS per watt (TOPS/W), which, as we discussed in our Blackwell coverage, beats out Nvidia's H100 SXM at 5.65 TOPS/W and is more than twice the A100 SXM's 3.12 TOPS/W.
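Those per-watt figures work out from peak sparse INT8 throughput divided by TDP, using the chipmakers' published specs — a rough sketch of the arithmetic:

```python
# Rough perf-per-watt arithmetic from published peak sparse INT8
# throughput and TDP figures.
chips = {
    "MTIA v2":  {"sparse_int8_tops": 708,  "tdp_watts": 90},
    "H100 SXM": {"sparse_int8_tops": 3958, "tdp_watts": 700},
    "A100 SXM": {"sparse_int8_tops": 1248, "tdp_watts": 400},
}

for name, spec in chips.items():
    tops_per_watt = spec["sparse_int8_tops"] / spec["tdp_watts"]
    print(f"{name}: {tops_per_watt:.2f} TOPS/W")
# MTIA v2: 7.87, H100 SXM: 5.65, A100 SXM: 3.12
```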
Having said that, it's clear that Meta has gone to great lengths to size the chip to its internal workloads — namely inferencing on recommender models. These are designed to render personalized suggestions such as people you may know or, more importantly for Meta's business model, which ads are most likely to be relevant to you.
The chips are also designed to scale out as needed and can be deployed in a rack-based system containing 72 accelerators in total: each system combines three chassis, each housing 12 compute boards with two MTIA v2 chips per board.
Each MTIA v2 chassis contains 12 compute boards, each sporting a pair of accelerators. Source: Meta
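Multiplied out across one of those systems, Meta's published per-chip numbers give a rough sense of scale — a back-of-envelope sketch, counting accelerator TDP only:

```python
# Back-of-envelope totals for one MTIA v2 rack-based system:
# 3 chassis x 12 compute boards x 2 accelerators per board.
chassis_per_system = 3
boards_per_chassis = 12
chips_per_board = 2

accelerators = chassis_per_system * boards_per_chassis * chips_per_board  # 72

print(f"Accelerators per system: {accelerators}")
print(f"Aggregate off-chip memory: {accelerators * 128 / 1024:.0f} TB")   # 128GB per chip
print(f"Peak sparse INT8: {accelerators * 708 / 1000:.1f} POPS")
print(f"Accelerator power budget: {accelerators * 90 / 1000:.2f} kW")     # chip TDP only, excludes hosts
```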
In terms of deploying workloads, Meta is leaning heavily on the PyTorch framework and Triton compiler. We've seen this combination used to perform tasks on various GPUs and accelerators, in part because it largely eliminates the need to develop code optimized for specific hardware.
Meta has been a major proponent of PyTorch, which it developed before handing the reins over to the Linux Foundation, as it gives engineers the flexibility to develop AI applications that can run across a variety of GPU hardware from Nvidia and AMD. So it makes sense that Meta would want to employ the same technologies with its own chips.
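Meta hasn't detailed its MTIA toolchain publicly, but the general shape of that flow is familiar from PyTorch on GPUs: write the model once and let torch.compile hand it to a backend compiler that emits Triton kernels. Here's a minimal, illustrative sketch of that pattern — the toy ranking model is ours, and nothing below is Meta's actual MTIA stack:

```python
import torch
import torch.nn as nn

# Toy ranking model: an embedding table feeding a small MLP scorer.
# Illustrative of the PyTorch + torch.compile flow described above,
# not Meta's production recommender code.
class TinyRanker(nn.Module):
    def __init__(self, num_items: int = 10_000, dim: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.item_emb(item_ids)).squeeze(-1)

model = TinyRanker()

# torch.compile traces the model and hands it to a backend compiler.
# On Nvidia GPUs the default Inductor backend emits Triton kernels;
# a custom accelerator like MTIA would plug its own backend in here.
compiled = torch.compile(model)

scores = compiled(torch.randint(0, 10_000, (32,)))  # score a batch of 32 candidate items
print(scores.shape)  # torch.Size([32])
```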
In fact, Meta claims that by co-developing its software and hardware it was able to achieve greater efficiency than existing GPU platforms, and it expects to eke out even more performance through future optimizations.
MTIA v2 certainly won't be the last silicon we see from Meta. The social media giant says it has several chip design programs underway, including one that will support future generative AI systems. ®