Intel Gaudi's third and final hurrah is an AI accelerator built to best Nvidia's H100
Goodbye dedicated AI hardware and hello to a GPU that fuses Xe graphics DNA with Habana chemistry
An H100 contender?
As we mentioned earlier, Intel's Habana team has never been a big fan of using FLOPS as an analog for AI performance, preferring instead to highlight real-world performance.
There are plenty of reasons for this. One issue is that FLOPS have to be qualified: at what precision are they measured; is that sparse or dense performance; and what level of utilization can you actually achieve on the accelerator? There's not much point in highlighting floating point performance if the accelerator can only manage 60 percent utilization – or so the argument goes.
"People are used to GPUs – and GPUs are great – but the utilization of compute and memory is quite different between Gaudi and a typical GPU," Medina says. "That's why you typically see us talk about workload and application level performance."
So how does Gaudi3 stack up against Nvidia's H100? Well, if Intel's performance claims are to be believed, the accelerator is as much as 1.7x faster in training workloads and 1.3x faster on average for inference compared to Nvidia's Hopper GPUs.
As always, take these claims with a heavy dose of salt, and note the word "projections" rather than benchmarks. According to Intel, the H100 performance figures shown here are taken from Nvidia's own testing. In effect, Intel looked at how Nvidia arrived at its figures and used its methodology as a point of comparison for Gaudi3's performance.
"We have to be absolutely transparent," he says. "I don't want to get into a situation where I'm asking my team to take an H100 for instance and then download software and then run it ourselves because someone could claim 'Hey, you didn't use our best optimization.'"
Nvidia has previously spoken out about AMD's test methodology and comparisons following the launch of the MI300X in December, a spat that we covered in depth.
Intel is clearly trying to communicate that Gaudi3 is not only capable of competing with Nvidia's chips, but, under the right conditions, of beating them.

Intel claims Gaudi3 will deliver between 1.4x and 1.7x faster training times compared to Nvidia's H100
For training, the strongest gains appear to be for smaller models like Llama2 7B or 13B, which can be trained at FP8 using one or two nodes respectively. For larger models, like GPT3-175B, Intel's performance claims are based on a 1,024-node cluster of Gaudi3 accelerators. Keep in mind that this is projected performance.
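To get a rough feel for those node counts, here's a hedged sketch of training-state memory under a conventional mixed-precision Adam recipe – the byte counts are textbook assumptions, activations and framework overhead are ignored, and the 128 GB figure is Gaudi3's per-chip HBM capacity. It shows why a 7B model fits comfortably in a single eight-accelerator node while GPT3-175B cannot (the 1,024-node run is as much about throughput as capacity):

```python
# Rough, hedged estimate of training-state memory under a conventional
# mixed-precision Adam setup: 2 B weights + 2 B grads + 4 B FP32 master
# weights + 8 B Adam moments = ~16 bytes per parameter. Activations,
# framework overhead, and any FP8-specific savings are ignored.

BYTES_PER_PARAM = 2 + 2 + 4 + 8   # ~16 bytes per parameter
NODE_HBM_GB = 8 * 128             # eight Gaudi3s per node x 128 GB HBM each

for model, billions in [("Llama2 7B", 7), ("Llama2 13B", 13), ("GPT3-175B", 175)]:
    state_gb = billions * BYTES_PER_PARAM   # 1e9 params x 16 B = 16 GB per billion
    print(f"{model:<10} ~{state_gb:>5,} GB of state vs {NODE_HBM_GB:,} GB of HBM per node")
```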
Having said that, Intel hasn't been shy about making regular submissions to the MLPerf training and inference benchmarks, where its scores have been enough to warrant a rare acknowledgement from Nvidia. Intel's choice to mirror Nvidia's testing suggests it's quite confident in Gaudi3 as an alternative to the H100.
Inference – the actual act of running these models once trained – is a bit more of a mixed bag, with Intel pitting Gaudi3 against both Nvidia's H100 and upcoming H200.

Against Nvidia's H200, Intel's Gaudi3 appears to struggle a bit, only pulling ahead in larger models like Falcon 180B
Against the newer H200, with its larger 141 GB of speedy HBM3e, Intel expects Gaudi3 to match or slightly underperform Nvidia, at least for smaller models. However, with larger models, like Falcon 180B, and larger input and output lengths, Medina claims Gaudi3's AI-first architecture takes the lead.
Pitted against the older H100, the story is largely the same, with Gaudi3 pulling ahead in Llama2 70B. This makes sense as the model was right on the edge of what Nvidia was able to cram into a single H100 when running at FP8.
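The arithmetic behind that edge is easy to sketch. Assuming one byte per parameter at FP8 plus a KV cache that scales with batch and context – the layer and head counts are Llama2 70B's published configuration, while the batch sizes and FP8 cache precision are our assumptions:

```python
# Back-of-the-envelope for why Llama2 70B sits right at the edge of one
# 80 GB H100 at FP8: weights plus a KV cache that grows with batch and
# context. Layer/head counts are Llama2 70B's published GQA config;
# batch sizes and the FP8 KV cache are illustrative assumptions.

WEIGHTS_GB = 70e9 * 1 / 1e9            # 70B params at 1 byte each (FP8)
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CACHE_B_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM  # K and V at 1 B each

def footprint_gb(batch: int, seq_len: int) -> float:
    return WEIGHTS_GB + batch * seq_len * CACHE_B_PER_TOKEN / 1e9

for batch in (1, 8, 32):
    print(f"batch={batch:<2} @ 4K context: {footprint_gb(batch, 4096):5.1f} GB of 80 GB")
```

At modest batch sizes the model squeaks in under 80 GB; push the batch and it spills over, which is exactly where extra HBM starts paying off.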
Taming the heat
For those looking to get their hands on Gaudi3, Intel says it began sampling the chips to OEM/ODM partners in Q1 and will ramp volume shipments beginning in Q3 for air-cooled parts and Q4 for liquid-cooled designs.

Intel has already started sampling Gaudi3 accelerators to OEM/ODM customers with the chip expected to ramp beginning in Q3
The chip will be available in one of three form factors: an OCP-compliant OAM module, a universal baseboard similar to Nvidia's HGX boards with eight accelerators, and as a PCIe add-in card.
Apart from a lower 600 W power limit on the PCIe card compared to the OAM module's 900 W TDP, Medina tells us all these configurations share the same silicon. The only caveat is that the PCIe card's smaller dual-slot cooler and lower TDP may result in lower performance in some workloads, though he insists that for the majority of inference workloads performance should be comparable.
Medina also notes that customers shouldn't expect any of the chips to actually run at the power limit. "TDP is not the average power," he says. "On inference, we see much lower power draw."

Despite a 900 W TDP, Intel claims Gaudi3 is as much as 2.3x more efficient compared to the H100
Compared to the H100, Intel claims that Gaudi3 is between 1.2x and 2.3x more efficient when measuring tokens per second, per card, per watt. As we saw with the inference comparison, much of this is down to Gaudi3 having more HBM than the H100, allowing it to run larger models using fewer accelerators and therefore achieve higher performance per watt.
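For clarity on how that metric shakes out, here's a minimal sketch – the throughput and power figures below are hypothetical, not Intel's or Nvidia's:

```python
# Minimal sketch of the tokens-per-second, per-card, per-watt metric.
# Throughput and power figures are hypothetical, not measured results.

def tokens_per_card_watt(tokens_per_sec: float, cards: int, watts_per_card: float) -> float:
    return tokens_per_sec / cards / watts_per_card

# An accelerator with enough HBM to host a model on one card can win on
# perf/W even with a higher TDP than a two-card setup of a rival part.
one_card = tokens_per_card_watt(1_800, cards=1, watts_per_card=900)  # hypothetical
two_card = tokens_per_card_watt(1_500, cards=2, watts_per_card=700)  # hypothetical
print(f"{one_card:.2f} vs {two_card:.2f} tokens/s/card/W -> {one_card / two_card:.1f}x")
```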
A Blackwell boogeyman, Intel's salvation, and an uncertain future
At launch, Intel has four OEM partners lined up to carry its Gaudi3 accelerators: Dell Tech, Hewlett Packard Enterprise, Lenovo, and Supermicro. For those looking to port their software stack over from CUDA, Intel will also make the chips available in its Developer Cloud in the second half of this year.
Intel's timing positions it to compete squarely against Nvidia's H200, which is expected to start shipping in Q2. However, the bigger boogeyman is Nvidia's Blackwell GPUs, which we looked at in detail back in March.
Nvidia claims Blackwell, when it arrives later this year, will deliver up to 5x the floating point performance of the H100 or H200, along with substantially higher memory capacity at 192 GB and 8 TBps of bandwidth.
As we noted earlier, we have a strong suspicion it'll be 2025 before most customers get their hands on Nvidia's latest and greatest accelerators, but even so, Gaudi3's window to claim market share is rather narrow. For what it's worth, the same can be said of AMD's MI300X.
The good news for Intel is that Nvidia can't make enough GPUs to keep up with demand. Even with Nvidia expected to ship between 1.5 million and 2 million more H100s than in 2023, experts expect them to remain in extremely tight supply.
The one wrinkle in all of this, which might scare some customers off, is Gaudi's trajectory from here on out. In 2025, the Habana team's IP will be fused into Intel's Falcon Shores platform.

As Intel discussed at ISC last year, its GPU Max and Habana product lines are set to converge in 2025
Falcon Shores was initially envisioned as an APU – or, as Intel prefers, XPU – which would combine CPU and GPU cores on a single package, similar to what AMD did with MI300A.
However, those plans were later scrapped in favor of a next-gen GPU architecture that would combine Intel's Xe graphics lineage with Habana's AI accelerator chemistry. This merger has led to questions about Gaudi3's longevity.
According to Intel, for developers coding against higher-level frameworks, like PyTorch, the migration should be seamless. Meanwhile, for those building AI apps at a lower level, Intel says it'll provide additional guidance and resources via its Developer Cloud in the lead-up to Falcon Shores' debut.
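To illustrate what "seamless at the framework level" means in practice, here's a hedged sketch of the usual Gaudi PyTorch pattern – work targets a device handle rather than CUDA-specific calls, so retargeting is mostly a device-string change. The habana_frameworks import reflects Intel's current Gaudi/PyTorch bridge; whether Falcon Shores keeps that exact entry point is Intel's call:

```python
# Hedged sketch: framework-level code targets a device handle, not
# CUDA-specific calls. The habana_frameworks import reflects Intel's
# current Gaudi/PyTorch bridge; Falcon Shores tooling may differ.
import torch

try:
    import habana_frameworks.torch.core as htcore  # registers the "hpu" device
    device = torch.device("hpu")
except ImportError:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
y = model(x)
if device.type == "hpu":
    htcore.mark_step()  # flush Gaudi's lazy-mode execution graph
print(tuple(y.shape), "computed on", device)
```

Code written at that level should carry across; it's the lower-level tuning beneath it that Intel's promised guidance will need to cover. ®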