Coherent lights the way to massive AI clusters with optical circuit switches
Could end-to-end lasers keep long training jobs on track?
Networking biz Coherent unveiled an optical circuit switch designed to support high-density AI clusters at the Optical Fiber Communication Conference on Monday.
Unlike the switches you might typically find in AI clusters, this one handles the actual switching entirely optically, rather than using transceivers to convert photons into electrons and back again. Laser light simply enters one port and exits another – with a little bit of attenuation, of course.
The appliance, which is slated to ship in volume next year, features 300 input and 300 output ports and is based on Coherent's Datacenter Light Wave Cross Connect tech. As we understand it, it works by manipulating liquid crystal cells to control which wavelength of light goes where.
Coherent's latest optical circuit switch on display at OFC boasts 300 input and 300 output ports
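To make the "light enters one port and exits another" idea concrete, here is a minimal sketch of a circuit switch modeled as a one-to-one mapping between input and output ports. The class name, its methods, and the zero-based port numbering are hypothetical and for illustration only; nothing here reflects Coherent's actual control interface, and the only figure taken from the article is the 300-port count.

```python
from typing import Dict, Optional


class OpticalCrossConnect:
    """Toy model of a circuit switch (hypothetical, not Coherent's API).

    Each input port is patched to at most one output port, and each
    output port is fed by at most one input port.
    """

    def __init__(self, num_ports: int = 300) -> None:
        self.num_ports = num_ports
        self._in_to_out: Dict[int, int] = {}

    def connect(self, in_port: int, out_port: int) -> None:
        """Set up a light path from in_port to out_port."""
        for port in (in_port, out_port):
            if not 0 <= port < self.num_ports:
                raise ValueError(f"port {port} is out of range")
        if in_port in self._in_to_out:
            raise ValueError(f"input port {in_port} is already in use")
        if out_port in self._in_to_out.values():
            raise ValueError(f"output port {out_port} is already in use")
        self._in_to_out[in_port] = out_port

    def disconnect(self, in_port: int) -> None:
        """Tear down the light path that starts at in_port, if any."""
        self._in_to_out.pop(in_port, None)

    def route(self, in_port: int) -> Optional[int]:
        """Return the output port that in_port is patched to, or None."""
        return self._in_to_out.get(in_port)


ocs = OpticalCrossConnect()
ocs.connect(0, 299)
print(ocs.route(0))  # 299
print(ocs.route(1))  # None
```

The point of the model is what it leaves out: there is no packet inspection or buffering, just a path that stays put until someone reconfigures it.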
Dell’Oro Group analyst Sameh Boujelbene told The Register that optical circuit switches offer a couple of benefits. In addition to high-bandwidth, low-latency networking, networks built around switches of this type tend to be less expensive to operate, as they require substantially fewer electrical switches and optical transceivers.
Additionally, Coherent notes that this kind of optical switching tends to be more reliable – something that will pay dividends in very large clusters in which mean time to failure tends to be quite low.
This is one of the reasons that Google developed its own optical circuit switches for its TPUv4 pods. Speaking at Hot Chips last year, Andy Swing, a technical lead for Google's TPU group, explained that by using OCS, Google was able to connect very large numbers of accelerators.
These pods consist of 64 racks, each containing 64 Tensor Processing Units (TPUs), for 4,096 TPUs in total. Each of these racks is connected optically back to one of Google's internally developed OCS switches, for an all-to-all mesh.
Swing explained this approach has a couple of benefits – including the ability to reconfigure the cluster size dynamically. Another is that all of the accelerators are connected to one another, which improves reliability – a desirable quality as training workloads can last months depending on the model's parameter count and the size of the dataset.
In the case of Google's TPUv4 pods, if one of the nodes were to fail, the switch could be reconfigured to work around the issue.
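As a rough illustration of what "reconfigured to work around the issue" could mean, the sketch below picks healthy racks for a job and works out which circuits would link them. The function names, the eight-rack example, and the ring wiring are all assumptions made for illustration; the article does not describe how Google's scheduler or switch control software actually works.

```python
from typing import Iterable, List, Set, Tuple


def assign_racks(all_racks: Iterable[int], failed: Set[int], racks_needed: int) -> List[int]:
    """Pick the first racks_needed racks that are not marked as failed."""
    available = [rack for rack in all_racks if rack not in failed]
    if len(available) < racks_needed:
        raise RuntimeError("not enough healthy racks for this job")
    return available[:racks_needed]


def ring_circuits(racks: List[int]) -> List[Tuple[int, int]]:
    """Circuits (source rack, destination rack) that join the racks in a ring.

    A ring is used only to keep the example small; it is not the
    topology the article describes.
    """
    return [(racks[i], racks[(i + 1) % len(racks)]) for i in range(len(racks))]


# Hypothetical eight-rack system running a four-rack job.
before = assign_racks(range(8), failed=set(), racks_needed=4)
print(ring_circuits(before))  # [(0, 1), (1, 2), (2, 3), (3, 0)]

# Rack 2 dies: recompute the circuits so a spare rack takes its place.
after = assign_racks(range(8), failed={2}, racks_needed=4)
print(ring_circuits(after))   # [(0, 1), (1, 3), (3, 4), (4, 0)]
```

Because the paths are set in the switch rather than by recabling, swapping in a spare is a matter of programming a new set of circuits.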
Swing also noted that the approach allows for various network topologies to be used depending on the model. For example, in testing, Google saw a sizable boost in network bandwidth by using a twisted torus topology, in which accelerators are meshed together in something resembling a twisted loop.
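For a feel of why a twist can help, here is a small sketch that compares a plain 2D torus with a twisted one, in which wrapping around one dimension also shifts the position in the other. It measures the worst-case hop count (the diameter) using a breadth-first search. This is a toy under stated assumptions: the 8x4 grid, the twist of 4, and the function names are all illustrative, the sketch is two-dimensional, and hop count is only a proxy for the bandwidth gains Swing described rather than the metric Google reported.

```python
from collections import deque
from typing import List, Tuple

Node = Tuple[int, int]


def torus_neighbors(x: int, y: int, width: int, height: int, twist: int) -> List[Node]:
    """Neighbors of (x, y) on a width x height torus.

    Wrapping around the y dimension shifts x by `twist`.
    twist=0 gives an ordinary torus.
    """
    nbrs = [((x + 1) % width, y), ((x - 1) % width, y)]
    if y + 1 < height:
        nbrs.append((x, y + 1))
    else:
        nbrs.append(((x + twist) % width, 0))           # wrap upward, with twist
    if y - 1 >= 0:
        nbrs.append((x, y - 1))
    else:
        nbrs.append(((x - twist) % width, height - 1))  # wrap downward, undoing it
    return nbrs


def diameter(width: int, height: int, twist: int = 0) -> int:
    """Longest shortest path between any two nodes, found by BFS from every node."""
    worst = 0
    for sx in range(width):
        for sy in range(height):
            dist = {(sx, sy): 0}
            queue = deque([(sx, sy)])
            while queue:
                node = queue.popleft()
                for nbr in torus_neighbors(*node, width, height, twist):
                    if nbr not in dist:
                        dist[nbr] = dist[node] + 1
                        queue.append(nbr)
            worst = max(worst, max(dist.values()))
    return worst


print(diameter(8, 4, twist=0))  # 6 hops for the plain torus
print(diameter(8, 4, twist=4))  # 4 hops for the twisted torus
```

In this example the same 32 nodes and the same number of links per node yield a shorter worst-case path once the wraparound links are twisted.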
But while Coherent's new OCS appliances may allow others to build optically switched clusters similar to Google's, Dell’Oro's Boujelbene noted that OCS is still a relatively new technology in the datacenter.
"So far only Google, after many years in development, was able to deploy it en masse in its datacenter networks," she said. "Additionally, OCS switches may require a change in installed base of fiber depending on the cloud service provider." ®