AI ambition is pushing copper to its breaking point

Ayar Labs contends silicon photonics will be key to scaling beyond the rack and taming the heat

SC24 Datacenters have been trending toward denser, more power-hungry systems for years. In case you missed it, 19-inch racks are now pushing power demands beyond 120 kilowatts in high-density configurations, with many making the switch to direct liquid cooling to tame the heat.

Much of this trend has been driven by a need to support ever larger AI models. According to researchers at Fujitsu, the number of parameters in AI systems is growing 32-fold approximately every three years. To support these models, chip designers like Nvidia use extremely high-speed interconnects — on the order of 1.8 terabytes a second — to make eight or more GPUs look and behave like a single device.
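To put Fujitsu's figure in perspective, a quick back-of-envelope sketch (assuming smooth exponential growth, which real model releases only roughly follow) shows what 32-fold every three years implies:

```python
import math

# Fujitsu's figure: AI parameter counts grow ~32x every three years.
growth_factor = 32
period_years = 3

# Implied doubling time: period divided by log2 of the growth factor
doubling_months = period_years * 12 / math.log2(growth_factor)
print(f"Doubling time: {doubling_months:.1f} months")  # 7.2 months

# Annualized growth rate: the cube root of 32
annual_factor = growth_factor ** (1 / period_years)
print(f"Per-year growth: {annual_factor:.1f}x")  # ~3.2x per year
```

In other words, parameter counts doubling roughly every seven months, which goes some way to explaining why interconnect bandwidth keeps becoming the bottleneck.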

The problem, though, is that the faster you push data across a wire, the shorter the distance over which the signal can be maintained. At those speeds, copper cables are good for only a meter or two.

The alternative is to use optics, which can maintain a signal over a much larger distance. In fact, optics are already employed in many rack-to-rack scale-out fabrics like those used in AI model training. Unfortunately, in their current form, pluggable optics aren't particularly efficient or particularly fast.

Earlier in 2024 at GTC, Nvidia CEO Jensen Huang said that if the company had used optics as opposed to copper to stitch together the 72 GPUs that make up its NVL72 rack systems, it would have required an additional 20 kilowatts of power.
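Divide Huang's claim out per accelerator and the scale of the overhead becomes clear. A rough sketch (assuming the 20 kW is spread evenly across the rack's 72 GPUs, which Nvidia hasn't broken down):

```python
# Back-of-envelope on Huang's GTC claim: stitching an NVL72 together
# with pluggable optics rather than copper would add ~20 kW.
extra_power_w = 20_000
gpus = 72

per_gpu_w = extra_power_w / gpus
print(f"~{per_gpu_w:.0f} W of optics overhead per GPU")  # ~278 W
```

Nearly 280 watts per GPU just to move bits around is a hard sell when each accelerator already draws on the order of a kilowatt.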

So, does this mean optics are off the table and denser racks are unavoidable? Well, not according to the folks at Ayar Labs, who contend that by integrating the optics directly into the compute, chipmakers can not only alleviate bandwidth bottlenecks but lower the rack densities required to support growing model parameter counts.

Lighting the way to de-densification

There is no shortage of photonics startups looking to overcome the limitations of copper interconnects and improve the efficiency of optical I/O, but Ayar is among the earliest entrants.

The San Francisco-based startup has been developing an optical interconnect chiplet going back to 2015. These optical devices are designed to be co-packaged alongside a CPU or GPU in order to achieve higher bandwidth over a longer distance than would be possible with copper.

For applications like large-scale AI training and inference, the optical fibers could potentially take the place of Nvidia's NVLink or AMD's Infinity Fabric to connect multiple chips together.

In this render, Ayar shows how it imagines optical interconnects might be integrated into a GPU platform for scale out compute – click to enlarge

"If you want to get out of that one rack and go to multiple racks and expand the compute base to more than 64-72 GPUs, you have to do something different that's not copper and electrical," Terry Thorn, VP of commercial operations at Ayar Labs, told El Reg in a recent interview. "The pluggables that exist today don't meet the need. When you get to in-package optical-I/O, you start to meet the need, and you start to open up the ability to have that kind of scale up fabric."

While the technology could allow for a compute and memory domain to stretch across hundreds of GPUs spread among dozens of racks, it also means it's no longer necessary to cram nearly as many accelerators into a single rack, alleviating some of the power and thermal challenges datacenter operators face today.

"You may feel like with copper, you have to stay in that rack, and you may be limited to how many you can connect based on that power density, footprint, and square footage," Thorn explained. "If you start to incorporate optical I/O, you can start to spread out the distribution of the power and, therefore, give people who are constrained by power the ability to set up AI connected infrastructure across a larger set of square footage."

In other words, compute no longer has to be in the same box, let alone the same rack, to function as a logical system, which means per-rack power and thermal densities could be reduced considerably.

More work to be done

For all of silicon photonics' promise, the technology faces no shortage of challenges before it can be integrated into production hardware. These range from developing a chip that can match existing copper interconnects on power and bandwidth to developing communication protocols, like UCIe, so that the two can talk to each other.

Ayar is no stranger to these hurdles, having worked to integrate its silicon photonics chiplets into a number of prototype systems over the past few years. We've previously explored Ayar's integration with a super-threaded graph database accelerator Intel built for DARPA a few years back. Ayar has also integrated its chiplets in Intel's Agilex FPGAs.

More recently, Ayar disclosed that it's working with Fujitsu to integrate two next-gen photonics chips, each capable of about 8Tbps of bidirectional bandwidth, into Fujitsu's CPUs.

At SC24, Ayar showed a mockup of what a pair of its TeraPHY chiplets might look like co-packaged alongside an A64FX processor, but there's no indication that's what will actually be built or that Fujitsu intends to commercialize the tech. As with Intel's efforts, this could simply be an experiment to test the viability of the tech.
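On paper, a pair of those chiplets lands in NVLink territory. A rough sketch (assuming "8Tbps bidirectional" counts both directions of traffic and ignoring encoding overhead, neither of which Ayar has spelled out):

```python
# Rough bandwidth comparison between the mockup's chiplet pair
# and Nvidia's per-GPU NVLink figure cited earlier.
chiplet_tbps = 8                    # per chiplet, bidirectional
chiplet_tbytes = chiplet_tbps / 8   # 8 bits per byte -> 1 TB/s
pair_tbytes = 2 * chiplet_tbytes    # the mockup pairs two chiplets
nvlink_tbytes = 1.8                 # Nvidia's NVLink figure, TB/s

print(pair_tbytes, nvlink_tbytes)   # 2.0 vs 1.8 -- same ballpark
```

If those numbers hold up in silicon, two chiplets would put optical I/O within spitting distance of copper NVLink on raw bandwidth, with the distance advantage on top.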

Here's a closer look at the concept, showing Ayar's optical I/O modules packaged alongside a Fujitsu A64FX – click to enlarge

However, building and integrating a photonic chiplet is only a piece of a much larger puzzle. Because they're being permanently bonded to a costly accelerator, they need to be reliable.

With optical pluggables, if something goes wrong, replacing it is relatively easy and cheap, at least compared to something like a GPU: just swap out the bad one and get back to work. If an optical chiplet fails, there goes your $40,000 accelerator.

"I think there's a few things that we are taking action on that address some of the concerns that come up anytime you talk about optics in a compute chip," Thorn said.

One of the first actions was separating the light source from the chiplet. "There are ways to do the laser inside the chip, but that puts the laser itself in very high dynamic range temperatures… and that tends to affect their reliability and long term viability," he explained.

The benefit of this approach is that if the laser does fail, it won't take the GPU or accelerator with it and can be replaced or potentially upgraded down the line.

Ayar is also in the process of developing an optical testing pipeline to suss out bad dies before they're bonded to the GPU at the fab. "We're establishing how you do the optical and electrical test on the wafer to help identify known good dies," Thorn said, adding that this should help to avoid chips getting ruined by faulty optics.

At SC24, Ayar demoed a new method of attaching optical fibers to its TeraPHY chiplets developed by Intel Foundry – click to enlarge

Speaking of faulty optics, it's not just the chiplet you have to worry about, but the fibers themselves. Over the years, Ayar has explored a couple of different approaches to fiber attach, including one developed by Intel Foundry that slots horizontally into the side of the chip. We're told the testing of the attach is still in the early phases, but Ayar has successfully transmitted data over it.

As we mentioned earlier, Ayar isn't the only company working to overcome these challenges, and in many cases, developments like optical fiber attaches, communications protocols, and testing and validation methodologies are likely to be standardized to the benefit of the broader ecosystem. ®
