Tesla's Dojo supercomputer is a billion-dollar bet to make AI better at driving than humans

More data means better neural net training, but it also means more cores


Tesla says it is spending upwards of $1 billion on its Dojo supercomputer between now and the end of 2024 to help develop autonomous vehicle software.

Dojo was first mentioned by CEO Elon Musk during a Tesla investor day in 2019. It was built specifically for training machine learning models needed for video processing and recognition to enable the vehicles to be self-driving.

During Tesla's Q2 earnings call this week, Musk said Tesla was not going to be "open loop" on its Dojo expenditure, but the sum involved would certainly be "north of a billion through the end of next year."

"In order to copy us, you would also need to spend billions of dollars on training compute," Musk claimed, saying that developing a reliable autonomous driving system is "one of the hottest problems ever."

"You need the data and you need the training computers, the things needed to actually achieve this at scale toward a generalized solution for autonomy."

Musk pointed out that training complex machine learning models needs huge volumes of data, the more the better, and this is what Tesla has access to, thanks to all the telemetry from its vehicles.

"With respect to Autopilot and Dojo, in order to build autonomy, we obviously need to train our neural net with data from millions of vehicles. This has been proven over and over again, the more training data you have, the better the results," he said.

"It barely works at 2 million [training examples]. At 3 million, it's like, wow, OK, we're seeing something. But then, you get to, like, 10 million training examples, it becomes incredible. So there's just no substitute for massive amount of data. And obviously, Tesla has more vehicles on the road collecting this data than all of the other companies combined. I think maybe even an order of magnitude," Musk claimed.

On the Dojo system itself, Musk said it was designed to significantly reduce the cost of neural net training, and has been "somewhat optimized" for the kind of training that Tesla requires, which is video training.

"We see a demand for really vast training resources. And we think we may reach in-house neural net training capability of 100 exaFLOPS by the end of next year," Musk claimed, which is quite a lot of compute power, to put it mildly.

Dojo is based largely on Tesla's own technology, starting with the D1 chip that comprises 354 custom CPU cores. Twenty-five of these D1 chips are interlinked into a 5x5 array inside a "training tile" module, building up to the base Dojo V1 configuration featuring 53,100 D1 cores, according to our colleagues at The Next Platform.
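Those published figures are internally consistent, as a quick back-of-the-envelope check shows — 25 chips of 354 cores each gives 8,850 cores per training tile, so the quoted 53,100-core base configuration works out to six tiles:

```python
# Sanity check of the published Dojo V1 figures:
# 354 cores per D1 chip, 25 chips (a 5x5 array) per training tile.
cores_per_d1 = 354
chips_per_tile = 5 * 5
cores_per_tile = cores_per_d1 * chips_per_tile
print(cores_per_tile)  # 8850 cores per tile

# The base Dojo V1 configuration is quoted at 53,100 D1 cores,
# which divides evenly into six training tiles.
total_cores = 53_100
print(total_cores // cores_per_tile)  # 6 tiles
```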

Musk believes that with all of the training data and a "high-efficiency inference computer" in the car, Tesla's autonomous driving system will soon make its vehicles not just as proficient as a human driver, but eventually much better. When? He didn't say, and he has form in making grand claims.

"To date, over 300 million miles have been driven using FSD [Full Self-Driving] Beta. That 300-million-mile number is going to seem very small, very quickly. And FSD will go from being as good as a human to then being vastly better than a human. We see a clear path to full self-driving being 10 times safer than the average human driver," he claimed.

This is important, Musk explained, because "right now, I believe there's something in the order of a million automotive deaths per year. And if you're 10 times better than a human, that would still mean 100,000 deaths. So, it's like, we'd rather be a hundred times better, and we want to achieve as perfect a safety as possible."
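The arithmetic behind that claim is simple: if a system is N times safer than the average human driver, the implied death toll scales by 1/N, all else being equal. A minimal sketch of Musk's numbers:

```python
# Musk's figures: roughly 1 million automotive deaths per year today.
# A system N times safer than the average human driver would imply
# deaths scaled by 1/N (holding miles driven and everything else equal).
annual_deaths = 1_000_000
for safety_factor in (10, 100):
    print(f"{safety_factor}x safer -> {annual_deaths // safety_factor:,} deaths/year")
# 10x safer  -> 100,000 deaths/year
# 100x safer -> 10,000 deaths/year
```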

Dojo is not the only supercomputer Tesla has for video training. The company also built a compute cluster equipped with 5,760 Nvidia A100 GPUs, but Musk said they simply couldn't get enough GPUs for the task.
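To put the 5,760-GPU cluster and the 100 exaFLOPS target side by side, here is a rough scale sketch. It assumes Nvidia's published A100 dense BF16/FP16 tensor peak of 312 TFLOPS per GPU — a spec-sheet figure, not a sustained training number, so treat the results as order-of-magnitude only:

```python
# Rough scale comparison, assuming the A100's published dense
# BF16/FP16 tensor peak of 312 TFLOPS (spec-sheet, not sustained).
a100_peak = 312e12                 # FLOPS per A100
cluster = 5_760 * a100_peak        # Tesla's A100 cluster, aggregate peak
target = 100e18                    # 100 exaFLOPS goal

print(cluster / 1e18)              # ~1.8 exaFLOPS for the A100 cluster
print(round(target / a100_peak))   # ~320,000 A100-equivalents to hit 100 EF
```

On this crude measure, the A100 cluster sits around 1.8 exaFLOPS of peak compute — two orders of magnitude short of the 100 exaFLOPS Musk is targeting, which helps explain why Tesla is building its own silicon rather than waiting on GPU deliveries.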

"We'll actually take the hardware as fast as Nvidia will deliver it to us," he said, adding: "If they could deliver us enough GPUs, we might not need Dojo, but they can't because they've got so many customers." ®
