Broadcom says Nvidia Spectrum-X's 'lossless Ethernet' isn't new
Been there, done that, SVP Ram Velaga tells El Reg
At Computex, Nvidia promised "lossless Ethernet" for generative AI workloads with the launch of its Spectrum-X platform – but if you ask Broadcom, it's not even a new idea.
"There's nothing unique about their device that we don't already have," Ram Velaga, SVP of Broadcom's core switching group, told The Register.
He explained that what Nvidia has actually done with Spectrum-X is build a vertically integrated Ethernet platform that's good at managing congestion in a way that minimizes tail latencies and reduces AI job completion times.
Velaga argues that this is no different than what Broadcom has done with its Tomahawk5 and Jericho3-AI switch ASICs. He also sees it as an admission by Nvidia that Ethernet makes more sense for handling GPU flows in AI.
Nvidia, for its part, hasn't given up on InfiniBand networking. InfiniBand is great for those running a handful of very large workloads – like GPT-3 or digital twins. However, Gilad Shainer, VP of marketing for Nvidia's networking division, told The Register that in some environments, particularly multi-tenant clouds, Ethernet is preferred.
For smaller AI/ML workloads, Shainer said, traditional Ethernet infrastructure has worked just fine – but now that these workloads are growing beyond one node, it's simply too slow.
Nvidia's Spectrum-X platform is pitched as the answer to this challenge.
To be clear, Nvidia's Spectrum-X isn't a product. It's a collection of hardware and software, most of which we've covered in the past. The core components include Nvidia's 51.2Tbit/sec Spectrum-4 Ethernet switch and BlueField-3 data processing unit (DPU).
The basic idea is that so long as you're using both Nvidia's switch and its DPU, they'll work together to mitigate traffic congestion and – if Nvidia is to be believed – eliminate packet loss altogether.
While Shainer claims this is a completely new capability unique to Nvidia, Velaga makes the case that the idea of "lossless Ethernet" is just marketing. "It's not so much that it's lossless, but you're effectively managing the congestion so well that you have a very high-efficiency Ethernet fabric," he argued.
In other words, rather than packet loss being a given on an Ethernet network, it becomes the exception to the rule. Or that's the idea, anyway.
What's more, Velaga claims this kind of congestion management is already built into Broadcom's latest generation of switch ASICs – only they work with any vendor or cloud service provider's smartNIC or DPU. "You don't have to do it at the NIC, you can do it from one Jericho3-AI leaf to another Jericho3-AI leaf," he added.
When we asked Shainer about Broadcom's Tomahawk5 and Jericho3-AI, he declined to draw comparisons to the chips, arguing that Spectrum-X was in a class of its own and implying that some vendors were simply tacking "AI" to existing products.
"There is nothing out there, regardless of how you call it, that has those capabilities that are designed for AI," he said.
- Nvidia to power more supercomputers as AI frenzy kicks in
- Intel says AI is overwhelming CPUs, GPUs, even clouds – so all Meteor Lakes get a VPU
- Nvidia creates open server spec to house its own chips – and the occasional x86
- Look mom, no InfiniBand: Nvidia's DGX GH200 glues 256 superchips with NVLink
Vertical integration vs disaggregation
According to Velaga, the kind of vertical integration Nvidia is trying to achieve is in conflict with Ethernet. "The whole reason why Ethernet is successful today is it's a very open ecosystem," he said.
Because of this, Nvidia's Spectrum-X could prove to be a tough sell for cloud providers, which tend to avoid vendor lock-in wherever possible. Their intense desire to avoid it led to the widespread adoption of vendor-agnostic network operating systems like SONiC. This allowed them to run their clouds on any compatible switch.
For what it's worth, Nvidia's Spectrum-4 does support SONiC, as well as its own Cumulus NOS and the Linux Switch driver. However, because the Spectrum-X platform relies on having both the Spectrum-4 and BlueField-3, you can't swap in another SONiC-compatible switch or DPU without losing out on features.
Speaking of DPUs, many of the largest cloud-service providers already have SmartNICs tuned to their environments. Amazon Web Services has Nitro, Google co-developed an ASIC-based SmartNIC with Intel, and Microsoft acquired Fungible in January. These devices are incredibly valuable to cloud providers, as they allow them to offload common networking, storage, and security workloads – freeing up the CPU to run tenant workloads.
Shainer says this is perfectly fine. He argues cloud providers can use their existing DPUs to manage their infrastructure and control north-south traffic, and use Nvidia's BlueField-3 for east-west traffic between the nodes in the cluster.
He added that there's nothing stopping someone from deploying Nvidia's switches or DPUs as standalone products either.
"If someone wants to take our switches and build their own stuff, they're more than welcome. If someone wants to take our DPUs and use someone else's switches, sure – go ahead. You can develop this stuff yourself," Shainer said. "But if you want to get something that is fully optimized, full stack … and get the system up in four weeks and not six, seven, or eight months? Priceless."
Broadcom's Velaga isn't so sure how this idea will be received by customers. "It's hard to say how they will sell the value of a vertically integrated Ethernet solution in a world where everything is disaggregated." ®