Tesla's Dojo supercomputer is a billion-dollar bet to make AI better at driving than humans
More data means better neural net training, but it also means more cores
Tesla says it is spending upwards of $1 billion on its Dojo supercomputer between now and the end of 2024 to help develop autonomous vehicle software.
Dojo was first mentioned by CEO Elon Musk during a Tesla investor day in 2019. It was built specifically to train the machine learning models for video processing and recognition that Tesla needs to make its vehicles self-driving.
During Tesla's Q2 earnings call this week, Musk said Tesla was not going to be "open loop" on its Dojo expenditure, but the sum involved would certainly be "north of a billion through the end of next year."
"In order to copy us, you would also need to spend billions of dollars on training compute," Musk claimed, saying that developing a reliable autonomous driving system is "one of the hottest problems ever."
"You need the data and you need the training computers, the things needed to actually achieve this at scale toward a generalized solution for autonomy."
Musk pointed out that training complex machine learning models needs huge volumes of data, the more the better, and this is what Tesla has access to, thanks to all the telemetry from its vehicles.
"With respect to Autopilot and Dojo, in order to build autonomy, we obviously need to train our neural net with data from millions of vehicles. This has been proven over and over again, the more training data you have, the better the results," he said.
"It barely works at 2 million [training examples]. At 3 million, it's like, wow, OK, we're seeing something. But then, you get to, like, 10 million training examples, it becomes incredible. So there's just no substitute for massive amount of data. And obviously, Tesla has more vehicles on the road collecting this data than all of the other companies combined. I think maybe even an order of magnitude," Musk claimed.
On the Dojo system itself, Musk said it was designed to significantly reduce the cost of neural net training, and has been "somewhat optimized" for the kind of training that Tesla requires, which is video training.
"We see a demand for really vast training resources. And we think we may reach in-house neural net training capability of 100 exaFLOPS by the end of next year," Musk claimed, which is quite a lot of compute power, to put it mildly.
Dojo is based largely on Tesla's own technology, starting with the D1 chip that comprises 354 custom CPU cores. Twenty-five of these D1 chips are interlinked into a 5x5 array inside a "training tile" module, building up to the base Dojo V1 configuration featuring 53,100 D1 cores, according to our colleagues at The Next Platform.
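As a quick sanity check on those figures, the arithmetic can be sketched in a few lines of Python. The core count per chip, the 5x5 tile layout, and the 53,100-core total come from the paragraph above; the chip and tile counts are derived here, on the assumption that the base configuration is built from whole training tiles, and the variable names are purely illustrative.

```python
# Figures quoted in the article
CORES_PER_D1 = 354          # custom CPU cores per D1 chip
D1_PER_TILE = 5 * 5         # D1 chips in one training tile (5x5 array)
TOTAL_CORES_V1 = 53_100     # cores in the base Dojo V1 configuration

# Cores in a single training tile
cores_per_tile = CORES_PER_D1 * D1_PER_TILE

# Derived values (assumption: the system is made of whole chips and whole tiles)
d1_chips, leftover_cores = divmod(TOTAL_CORES_V1, CORES_PER_D1)
tiles, leftover_chips = divmod(d1_chips, D1_PER_TILE)

print(f"cores per tile: {cores_per_tile}")
print(f"D1 chips: {d1_chips} (leftover cores: {leftover_cores})")
print(f"training tiles: {tiles} (leftover chips: {leftover_chips})")
```

Both divisions come out exact, so the quoted numbers are at least consistent with one another: 53,100 cores works out to 150 D1 chips, or six training tiles.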
Musk believes that with all of the training data and a "high-efficiency inference computer" in the car, Tesla's autonomous driving system will soon make its vehicles not just as proficient as a human driver, but eventually much better. When? He didn't say, and he has form for making grand claims.
"To date, over 300 million miles have been driven using FSD [Full Self-Driving] Beta. That 300-million-mile number is going to seem very small, very quickly. And FSD will go from being as good as a human to then being vastly better than a human. We see a clear path to full self-driving being 10 times safer than the average human driver," he claimed.
This is important, Musk explained, because "right now, I believe there's something in the order of a million automotive deaths per year. And if you're 10 times better than a human, that would still mean 100,000 deaths. So, it's like, we'd rather be a hundred times better, and we want to achieve as perfect a safety as possible."
Dojo is not the only supercomputer Tesla has for video training. The company also built a compute cluster equipped with 5,760 Nvidia A100 GPUs, but Musk said they simply couldn't get enough GPUs for the task.
"We'll actually take the hardware as fast as Nvidia will deliver it to us," he said, adding: "If they could deliver us enough GPUs, we might not need Dojo, but they can't because they've got so many customers." ®