Broadcom chases AI craze with ML-tuned switch ASICs
Faster GPUs don't mean much if you've got a network bottleneck
Broadcom is aiming to capitalize on the AI arms race with a switch chip tuned for large GPU clusters.
The company's Jericho3-AI ASIC, unveiled this week, is designed to deliver high-performance switching at port speeds up to 800Gbps and scale to connect more than 32,000 GPUs.
To do this, Broadcom is using an asymmetric arrangement of serializer/deserializers (SerDes) that prioritizes fabric connectivity. The chip itself boasts 304 106Gbps PAM4 SerDes, with 144 dedicated to switch ports and 160 allocated to the switch fabric. The latter is important as it allows multiple ASICs to be stitched together to support massive GPU clusters.
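The port counts quoted later in this piece fall out of that 144-lane budget. Here's a back-of-the-envelope sketch; the assumption that each 106Gbps PAM4 lane carries an effective 100Gbps of Ethernet payload after encoding and FEC overhead is ours, not Broadcom's:

```python
# Rough port math for the Jericho3-AI's asymmetric SerDes split.
# Assumption: each 106Gbps PAM4 lane nets ~100Gbps of usable Ethernet
# bandwidth once encoding/FEC overhead is accounted for.

TOTAL_SERDES = 304
PORT_SERDES = 144    # lanes facing switch ports
FABRIC_SERDES = 160  # lanes facing the chip-to-chip fabric

assert PORT_SERDES + FABRIC_SERDES == TOTAL_SERDES

EFFECTIVE_GBPS_PER_LANE = 100

def max_ports(port_speed_gbps: int) -> int:
    """How many ports at a given speed the 144 port-facing lanes can serve."""
    lanes_per_port = port_speed_gbps // EFFECTIVE_GBPS_PER_LANE
    return PORT_SERDES // lanes_per_port

print(max_ports(800))  # 18 ports at 800GbE, 8 lanes apiece
print(max_ports(400))  # 36 ports at 400GbE, 4 lanes apiece
```

Those figures line up with the 18x800Gbps and 36x400Gbps configurations Broadcom describes below.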
According to Broadcom's Pete Del Vecchio, this asymmetric split also helps the chip better contend with network congestion and overcome network failures.
Because large AI models have to be distributed across multiple nodes, these factors can have an outsized impact on completion times compared to running smaller models on a single node. If Broadcom's internal benchmarks are to be believed, its Jericho3-AI ASICs performed about 10 percent better in an "All-to-All" AI workload versus "alternative network solutions."
While most 400Gbps and 800Gbps switches, like Broadcom's Tomahawk 5 announced last year, are designed with aggregation in mind, the Jericho3-AI was developed as a high-performance top-of-rack switch that interfaces directly with clients. But while Broadcom claims the switch supports up to 18 ports at 800Gbps each, that use case isn't quite ready for prime time.
"In general, the high-end AI systems are moving from 200GbE now to 400GbE in the future," Del Vecchio said. "We have a lot of customers that have AI/ML training chips that are under development that are specifically saying they want to have an 800GbE interface."
For the moment, that puts the practical limit at 400Gbps per port, as this is the maximum bandwidth supported by the PCIe 5.0 bus. And remember, that's only on the latest generation of server platforms from AMD and Intel. Older Intel Ice Lake and AMD Milan systems will cap out at 200Gbps per NIC. But because the switch uses 106Gbps PAM4 SerDes, the ASIC can be tuned to support 100, 200, and 400Gbps port speeds.
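The PCIe ceiling is easy to check from the raw link rates. A sketch of the arithmetic for a 16-lane slot, using the 128b/130b line encoding PCIe has employed since Gen 3:

```python
# Why the host bus caps NIC speed: usable bandwidth of a PCIe x16 link.
# 128b/130b encoding means ~1.5 percent of raw transfer rate is overhead.

def x16_bandwidth_gbps(gt_per_s: float) -> float:
    """Usable Gbps of a 16-lane PCIe link after 128b/130b encoding."""
    return gt_per_s * (128 / 130) * 16

gen4 = x16_bandwidth_gbps(16.0)  # PCIe 4.0 (Ice Lake, Milan): ~252 Gbps
gen5 = x16_bandwidth_gbps(32.0)  # PCIe 5.0 (Sapphire Rapids, Genoa): ~504 Gbps

print(round(gen4), round(gen5))
```

A Gen 4 x16 slot's ~252Gbps can't feed a 400GbE NIC, hence the 200Gbps cap on older platforms, while Gen 5's ~504Gbps leaves headroom for 400GbE but falls well short of 800GbE.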
However, Del Vecchio notes that several chipmakers are integrating NICs directly into the accelerator — Nvidia's H100 CNX for example — to avoid these bottlenecks. So it's possible we could see 800Gbps ports built into accelerators before the first PCIe 6.0-compatible systems make it to market.
Still, 400Gbps appears to be the sweet spot for the Jericho3-AI, which supports up to 36 ports at that speed. While this might sound like overkill for a top-of-rack switch, it's not uncommon to see GPU nodes with one 200-400Gbps NIC per GPU. Nvidia's DGX H100, for instance, features eight 400Gbps ConnectX-7 NICs, one for each of its eight SXM5 GPUs. For a four-node rack — physical size, power consumption, and rack power often prevent greater densities — that works out to 32 ports, well within the capabilities of Broadcom's new ASIC.
Looking at Broadcom's Jericho3-AI, it's hard not to draw comparisons to Nvidia's Spectrum Ethernet and Quantum InfiniBand switches, which are widely deployed in high-performance compute and AI environments, including in the cluster Microsoft built for OpenAI that was detailed by our sister site The Next Platform last month.
Nvidia's Quantum-2 InfiniBand switch boasts 25.6Tbps of bandwidth and support for 64 ports at 400Gbps — enough for about eight DGX H100 systems from our earlier example.
Del Vecchio argues that many hyperscalers are developing AI accelerators of their own — AWS and Google both spring to mind — and want to stick with industry-standard Ethernet rather than InfiniBand.
While Broadcom says its Jericho3-AI chips are making their way to customers now, it'll be a while longer before those chips are integrated into OEM chassis and can make their debut in the datacenter. ®