Ethernet advances will end Nvidia's InfiniBand lead in AI networks
Desire to build brainboxes will also see optical interconnects go mainstream
Three imminent improvements to the Ethernet standard will make it a better option for hosting AI workloads, and vendors will then back the tech as an alternative to Nvidia's InfiniBand kit, which is set to dominate for the next two years.
That's the opinion of analyst firm Gartner in a piece published this week titled "Emerging Tech: Top Trends in Networking for Generative AI." Penned by director analyst Anushree Verma, a member of Gartner's Emerging Technologies and Trends group, the paper predicts InfiniBand adoption among technology providers like vendors and clouds will reach around 25 percent by 2026 and stay there.
Ethernet will achieve the same adoption rate among providers in that year, then accelerate to the point where it is offered by over 80 percent of providers in a decade.
That shift by tech providers will mean that by 2028, 45 percent of Gen AI workloads will run on Ethernet – up from less than 20 percent now.
The swing will come because Ethernet is improving. Gartner currently rates it as "not ideal" for AI training, but Verma highlighted three innovations she feels will make Ethernet a worthy – even superior – contender against InfiniBand:
- RDMA over Converged Ethernet (RoCE) – will allow direct memory access between devices over Ethernet, improving performance and reducing CPU utilization;
- Lossless Ethernet – will bring advanced flow control, improved congestion handling, hashing improvements, buffering, and advanced flow telemetry that improve on the capabilities of modern switches;
- The Ultra Ethernet Consortium's (UEC) spec in 2024 – designed specifically to make Ethernet AI-ready.
Because Ethernet is open, Verma expects many suppliers will implement the three innovations mentioned above, giving buyers choice and creating competition.
InfiniBand, by contrast, is more expensive than Ethernet and will remain so for five years. Verma believes it "has scalability limitations and requires special skills to manage," which means some network designers avoid it for fear it will become unmanageably complex.
She nonetheless predicted 30 percent of generative AI workloads will run on InfiniBand – up from fewer than 20 percent today.
That growth will be dwarfed by the rise of optical interconnects in networks used to carry generative AI traffic. Verma found that under one percent of networks used for AI workloads employ optical interconnects today, but predicted that will rise to 25 percent by 2030.
She cautioned that while the tech has big backers – such as Intel, TSMC, and HPE – it will not become widely used until around 2028. Once it matures, users can expect it will improve the scalability of compute clusters beyond 100Tbit/sec, while also requiring less power than electrical switching.
PCIe is also on the rise. Gartner expects that PCIe and the CXL spec, which servers use to share memory across the bus, will both become prevalent in AI workloads.
Again, Verma predicts uptake is a few years off: CXL debuted in early 2023, and she feels that serious adoption will start in 2026 – the same time she expects PCIe 6.0 implementation to ramp.
Verma recommends that users "Evaluate early adoption opportunities to gain competitive advantage by establishing partnerships at the design stage with leading technology providers," and otherwise make sure they understand the technologies outlined above.
And for those contemplating InfiniBand, she wrote it will be necessary to "reevaluate networking choices for performance, reliability, scalability and price by assessing the InfiniBand-based switches versus Ultra-Ethernet-based switches." ®