
Congrats, Nvidia and Google: You're still the best (out of five) at training neural networks

ML Perf could do with more entrants' results

Analysis Nvidia and Google continue to dominate in AI hardware, according to the latest benchmarking results from the ML Perf project published this week.

Tech companies all want a little slice of the machine learning pie: Nvidia has rebranded GPUs for AI, Intel is busy trying to tout its CPUs and push its new ASIC known as the NNP-T out on time, and Google – not traditionally known for its hardware – went and built its own accelerator chip too. Several AI hardware startups have also cropped up, like Graphcore and SambaNova.

So a group of hardware nerds from industry and academia decided to get together to create ML Perf for "fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services".

Despite ML Perf being supported by several big names, it looks like only five companies have bothered submitting results: Google, Nvidia, Intel, Fujitsu, and Alibaba. Notable names like AMD are missing and there's nothing from any of the many AI chip startups either.

It's not a cheap endeavor after all, since it requires renting or buying hundreds or thousands of chips. Or perhaps these companies retracted their results because they don't want to look bad next to their competitors.

"Part of the motivation is that when a submitter makes a mistake, not all mistakes can be fixed after submission. So if a submitter makes a mistake that isn't fixable, they need to be able to retract," David Kanter, co-chair of the MLPerf Inference Working Group, told The Register. When we asked if ML Perf did indeed receive more submissions than what's been made public, he replied: "It's entirely possible."

ML Perf tests different models for four tasks – image recognition, object detection, machine translation, and reinforcement learning. The results are split further into two categories – closed division times and open division times.

Closed division is more useful to compare hardware or software platforms and submissions are bound by specific rules so that the results are more of an "apples-to-apples" comparison. The open division, however, gives developers more flexibility to reach a particular result.

In the closed division, Google and Nvidia have submitted results for the first three tasks. Intel has only published results for reinforcement learning, and Alibaba only has one submission in image classification. So it's difficult to compare them side by side.

The fastest times to train the ResNet-50 architecture on ImageNet, a popular computer vision dataset, were pretty comparable for Nvidia and Google. The Chocolate Factory's TPU3 clocked in at 1.28 minutes running on TensorFlow, and Nvidia's time was 1.33 minutes using Amazon's MXNet framework.

It all requires pretty ridiculous amounts of hardware, however. You'd need to rent 1,024 TPU3s on Google Cloud or have 1,536 Tesla V100s to hand. Alibaba's only entry trailed behind at 24.37 minutes employing 64 of Nvidia's Tesla V100s connected using PCI Express to shuttle data between GPUs and CPUs more quickly. All of that was running on software called Sinian.
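To put those hardware counts in perspective, here is a minimal Python sketch that multiplies each submitter's chip count by its training time. "Chip-minutes" is our own illustrative figure, not an ML Perf metric, and it treats a TPU3 and a Tesla V100 as if they were interchangeable, which they are not.

```python
# Closed-division ResNet-50 results as quoted in the article:
# (submitter, accelerator, chip count, minutes to train).
RESNET50_RESULTS = [
    ("Google", "TPU3", 1024, 1.28),
    ("Nvidia", "Tesla V100", 1536, 1.33),
    ("Alibaba", "Tesla V100", 64, 24.37),
]


def chip_minutes(chips: int, minutes: float) -> float:
    """Chip count times wall-clock minutes; illustrative only, not an ML Perf metric."""
    return chips * minutes


for submitter, accelerator, chips, minutes in RESNET50_RESULTS:
    total = chip_minutes(chips, minutes)
    print(f"{submitter:8s} {chips:5d} x {accelerator:10s} {minutes:6.2f} min -> {total:8.2f} chip-minutes")
```

On that crude measure, Alibaba's 64-GPU run works out to roughly 1,560 chip-minutes against roughly 2,043 for Nvidia's 1,536-GPU run, which suggests that piling on more chips buys speed at a diminishing rate.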

For object detection on the ResNet-34 neural network, Google came out on top at 1.21 minutes, again based on 1,024 TPU3s, and Nvidia was second with 2.23 minutes by cramming together 240 Tesla V100s optimised using PyTorch. Nvidia fared better when it came to training the neural machine translation model, beating Google by just under 20 seconds using 384 Tesla V100s compared to 1,024 TPU3s. Their times were 1.80 and 2.11 minutes respectively.

But when it came to the Transformer model, Google came out on top at 0.85 minutes, based on 1,024 TPU3s again, compared to Nvidia's 480 Tesla V100s running on PyTorch.

Intel's only strong point appears to be reinforcement learning. In this category, it entered twice: once using 64 Cascade Lake CLX-8260 chips and once using two Cascade Lake CLX 9282 chips. The first setup managed to train MiniGo, a system that can play the strategy board game Go, in 14.43 minutes, while the second did this in 77.95 minutes.

Nvidia had a slight edge over Intel in this department, however, and trained MiniGo in 13.57 minutes using 24 Tesla V100s running on TensorFlow.
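For a quick feel of how tight these races were, the sketch below collects the closed-division times quoted in this article and prints the winner and its margin in seconds for each task. The dictionary layout and the `winner_and_margin` helper are our own illustration, not part of ML Perf's tooling, and the Transformer task is left out because only Google's time is given here.

```python
# Closed-division times in minutes, as quoted in the article.
# Intel's entry is the faster of its two runs (the 64-chip one).
CLOSED_DIVISION_MINUTES = {
    "ResNet-50 (image classification)": {"Google": 1.28, "Nvidia": 1.33, "Alibaba": 24.37},
    "ResNet-34 (object detection)": {"Google": 1.21, "Nvidia": 2.23},
    "Neural machine translation": {"Nvidia": 1.80, "Google": 2.11},
    "MiniGo (reinforcement learning)": {"Nvidia": 13.57, "Intel": 14.43},
}


def winner_and_margin(times):
    """Return (fastest submitter, runner-up, margin in seconds) for one task."""
    ranked = sorted(times.items(), key=lambda item: item[1])
    (first, best), (second, next_best) = ranked[0], ranked[1]
    return first, second, (next_best - best) * 60


for task, times in CLOSED_DIVISION_MINUTES.items():
    first, second, margin = winner_and_margin(times)
    print(f"{task}: {first} beats {second} by {margin:.0f} s")
```

Bear in mind that the chip counts behind each time differ wildly, so a narrow margin on the clock says little about which chip is faster one for one.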

There's only one open division result. Here, Fujitsu trained a ResNet-50 model on the ImageNet dataset using 2,048 Tesla V100s in 1.1 minutes. The developers achieved this by tweaking hyperparameters that were fixed in the closed division, Kanter explained.

So what did we learn from ML Perf? Software tricks can help slash training times, using more chips also trains models more quickly, and Nvidia and Google are leading in AI hardware. Basically, nothing we didn't know already.

If you want a complete breakdown and code from all the results, you can find them here. ®
