Nvidia just made a killing on AI – where is everyone else?

It doesn't matter whose GPUs are better at training if no one can get hold of them

Comment Nvidia's latest quarter marked a defining moment for AI adoption.

Demand for the tech titan's GPUs drove its revenues to new heights, as enterprises, cloud providers, and hyperscalers all scrambled to stay relevant under the new AI world order. 

But while Nvidia's execs expect to extract multi-billion-dollar gains from this demand over the next several quarters, the question on many minds is whether Nvidia and its partners can actually build enough GPUs to satisfy demand, and what happens if they can't.

In a call with financial analysts, Nvidia CFO Colette Kress assured Wall Street that the graphics processor giant was working closely with partners to reduce cycle times and add supply capacity. When pressed for specifics, Kress repeatedly dodged the question, arguing that Nvidia's gear involves so many suppliers that it's difficult to say how much capacity they'll be able to bring to bear, and when.

A Financial Times report, meanwhile, suggested Nvidia plans to at least triple production of its top-spec H100 accelerator in 2024, to between 1.5 and 2 million units, up from roughly half a million this year. If true, that's great news for Nvidia's bottom line - but some companies aren't waiting around for Nvidia to catch up, and are instead looking to alternative architectures.

Unmet demand breeds opportunity

One of the most compelling examples is the United Arab Emirates' G42 Cloud, which tapped Cerebras Systems to build nine AI supercomputers capable of a combined 36 exaflops of sparse FP16 performance, at a mere $100 million apiece.

Cerebras's accelerators are wildly different from the GPUs that power Nvidia's HGX and DGX systems. Rather than packing four or eight GPUs into a rack-mount chassis, Cerebras's accelerators are enormous dinner-plate-sized sheets of silicon, each packing 850,000 cores and 40GB of SRAM. The chipmaker claims just 16 of these accelerators are required to achieve 1 exaflop of sparse FP16 performance, a feat that, by our estimate, would require north of 500 Nvidia H100s.
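
For the curious, that estimate is simple arithmetic. A minimal sketch, assuming Nvidia's spec-sheet figure of roughly 1,979 teraflops of sparse FP16 for the SXM version of the H100:

```python
# Back-of-the-envelope check of the "north of 500 H100s" estimate.
# Assumes Nvidia's claimed ~1,979 teraflops of sparse FP16 for the
# H100 SXM; one exaflop is 1,000,000 teraflops.
H100_SPARSE_FP16_TFLOPS = 1_979
ONE_EXAFLOP_IN_TFLOPS = 1_000_000

h100s_needed = ONE_EXAFLOP_IN_TFLOPS / H100_SPARSE_FP16_TFLOPS
print(f"H100s per sparse FP16 exaflop: {h100s_needed:.0f}")  # ~505
```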

And for others willing to venture out beyond Nvidia's walled garden, there's no shortage of alternatives. Last we heard, Amazon is using Intel's Gaudi AI training accelerators to supplement its own custom Trainium chips — though it isn't clear in what volumes.

Intel's Gaudi2 processors, which launched last May, claim roughly twice the performance of Nvidia's A100, at least on the ResNet-50 image classification and BERT natural language processing models. And for those in China, Intel recently introduced a cut-down version of the chip for sale in the region. Intel is expected to launch an even more powerful version of the processor, predictably called Gaudi3, to compete with Nvidia's current-gen H100 sometime next year.

Then of course, there's AMD, which, having enjoyed a recent string of high-profile wins in the supercomputing space, has turned its attention to the AI market.

At its Datacenter and AI event in June, AMD detailed its Instinct MI300X, which is slated to start shipping by the end of the year. The accelerator packs 192GB of speedy HBM3 memory and eight CDNA 3 GPU dies into a single package.

Our sister site The Next Platform estimates the chip will deliver roughly 3 petaflops of FP8 performance. While that's only about 75 percent of an Nvidia H100 in terms of performance, the MI300X offers 2.4x the memory capacity, which could allow customers to get away with using fewer GPUs to train their models.
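
Those figures are easy to sanity-check. A quick sketch, taking The Next Platform's estimate at face value and assuming the H100 SXM's roughly 4 petaflops of sparse FP8 and 80GB of HBM3:

```python
# Rough ratios behind the MI300X-versus-H100 comparison. The MI300X
# figure is The Next Platform's estimate; the H100 numbers assume the
# SXM part's ~4 petaflops of sparse FP8 and 80GB of HBM3.
MI300X_FP8_PFLOPS, MI300X_HBM_GB = 3.0, 192
H100_FP8_PFLOPS, H100_HBM_GB = 4.0, 80

print(f"Relative compute: {MI300X_FP8_PFLOPS / H100_FP8_PFLOPS:.0%}")  # 75%
print(f"Relative memory:  {MI300X_HBM_GB / H100_HBM_GB:.1f}x")         # 2.4x
```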

The prospect of a GPU that not only delivers compelling performance, but which you can actually buy, has clearly piqued some interest. During AMD's Q2 earnings call this month, CEO Lisa Su boasted that the company's AI engagements had grown sevenfold during the quarter. "In the datacenter alone, we expect the market for AI accelerators to reach over $150 billion by 2027," she said.

Barriers to adoption

So if Nvidia reckons it's currently addressing only a third of the demand for its AI-focused silicon, why aren't its rivals stepping up to fill the gap and cash in on the hype?

The most obvious issue is that of timing. Neither AMD nor Intel will have accelerators capable of challenging Nvidia's H100, at least in terms of performance, ready for months. However, even after that, customers will still have to contend with less mature software.

Then there's the fact that Nvidia's rivals will be fighting for the same supplies and manufacturing capacity that Nvidia wants to secure or has already secured. AMD, for example, relies on TSMC for chip fabrication just as Nvidia does. And though semiconductor demand is broadly in a slump, with fewer people snapping up PCs, phones, and the like lately, demand for server accelerators to train models and power machine-learning applications remains substantial.

But back to the code: Nvidia's close-knit hardware and software ecosystem has been around for years. As a result, there's a lot of code, including many of the most popular AI models, optimized for Nvidia's industry-dominating CUDA framework.

That's not to say rival chip houses aren't trying to change this dynamic. Intel's oneAPI includes tools to help users convert code written for Nvidia's CUDA to SYCL, which can then run on Intel's suite of AI platforms. Similar efforts have been made to convert CUDA workloads to run on AMD's Instinct GPU family using the HIP API.
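
How much work that conversion actually entails depends on where the CUDA dependency lives. As a minimal sketch - an illustration of a typical workload, not any vendor's roadmap - high-level framework code is often already portable, since PyTorch's ROCm builds expose AMD GPUs through the same torch.cuda API; it's the hand-written CUDA kernels underneath that HIP and SYCL conversion tools have to tackle:

```python
# Minimal sketch: high-level PyTorch code is largely vendor-agnostic.
# On Nvidia hardware the "cuda" device is backed by CUDA and cuBLAS;
# on PyTorch's ROCm builds the same API is backed by HIP and rocBLAS,
# so this snippet runs unmodified on either.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
y = x @ x  # dispatched to whichever accelerator backs the "cuda" device
print(device, y.shape)
```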

Many of these same chipmakers are also soliciting the help of companies like Hugging Face, which develops tools for building ML apps, to lower the barrier to running popular models on their hardware. These investments recently drove Hugging Face's valuation to over $4 billion.

Other chip outfits, like Cerebras, have looked to sidestep this particular issue by developing custom AI models for their hardware, which customers can leverage rather than having to start from scratch. Back in March, Cerebras announced Cerebras-GPT, a collection of seven LLMs ranging from 111 million to 13 billion parameters in size.
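
Since those checkpoints live on the Hugging Face Hub under the cerebras organization, kicking the tires takes only a few lines of transformers code. A quick sketch using the smallest, 111-million-parameter model:

```python
# Quick sketch: loading the smallest Cerebras-GPT checkpoint from the
# Hugging Face Hub. The Cerebras-GPT family uses a standard GPT-2-style
# architecture, so the stock transformers auto-classes handle it.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "cerebras/Cerebras-GPT-111M"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("AI accelerators are", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```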

For more technical customers with the resources to devote to developing, optimizing, or porting legacy code to newer, less mature architectures, opting for an alternative hardware platform may be worth the potential cost savings or reduced lead times. Both Google and Amazon have already gone down this route with their TPU and Trainium accelerators, respectively.

However, for those that lack these resources, embracing infrastructure without a proven software stack - no matter how performant it may be - could be seen as a liability. In that case, Nvidia will likely remain the safe bet. ®
