Tesla hedges Dojo supercomputer bet with 10K Nvidia H100 GPU cluster
Keeping full self-driving dream on the road just needs more graphics chips?
Tesla still dreams of fueling its motors with actual full self-driving (FSD) capabilities, and it's blowing piles of cash on AI infrastructure to reach that milestone.
The American EV manufacturer's latest investment is in a 10,000 GPU compute cluster, revealed in a xeet by Tesla AI Engineer Tim Zaman over the weekend. The system, which came online Monday, will help crunch the data collected by its vehicles and accelerate development of the FSD functionality we've heard so much about. The automaker declined to comment further.
Tesla has been teasing fully autonomous driving capabilities since 2016. So far what's been delivered is essentially super-cruise-control: a driver assistance system that is not truly self-driving and requires a human to keep their hands on the wheel.
CEO Elon Musk has no problem throwing money at his goal of achieving FSD. Last month Tesla revealed it would invest $1 billion to build out its Dojo supercomputer between now and the end of 2024 to speed the development of its autonomous driving software.
That particular AI supercomputer uses the company's massive 15kW Dojo Training tiles, six of which make up a one-exaFLOPS (BF16) Dojo V1 system that we took a look at last year. Each tile is made up of a set of D1 chip dies, all designed by Tesla and fabbed by TSMC.
It's no secret that Tesla still employs thousands of GPUs in its infrastructure. In 2021 the automaker deployed a cluster of 720 GPU nodes, each equipped with eight of Nvidia's then-bleeding-edge A100 accelerators, for a total of 5,760 GPUs. Combined, the system offered up to 1.8 exaFLOPS of FP16 performance.
"We'll actually take the hardware as fast as Nvidia will deliver it to us," Musk previously said. "If they could deliver us enough GPUs, we might not need Dojo, but they can't because they've got so many customers."
This latest deployment is nearly twice as large and uses Nvidia's latest-generation H100 GPUs, which offer roughly three times the FP16 performance of their predecessor. The chip also added support for FP8 math.
As you drop down the precision scale, you give up some accuracy in exchange for greater performance. In the case of Nvidia's H100, FP8 nets you just shy of four petaFLOPS of peak performance with sparsity.
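To put numbers on that tradeoff, here's a quick sketch using the peak tensor throughput figures from Nvidia's public H100 datasheet for the SXM5 part. Note the "sparse" figures assume the 2:4 structured-sparsity feature, which real workloads won't always exploit:

```python
# Peak tensor throughput per SXM5 H100 GPU, in TFLOPS, as quoted by
# Nvidia's public datasheet. Sparse figures assume 2:4 structured sparsity.
h100_peak_tflops = {
    "FP16 (dense)": 990,
    "FP16 (sparse)": 1979,
    "FP8 (dense)": 1979,
    "FP8 (sparse)": 3958,
}

# Each halving of precision roughly doubles peak throughput.
for precision, tflops in h100_peak_tflops.items():
    print(f"{precision:>14}: {tflops / 1000:.2f} petaFLOPS")
```

The FP8-with-sparsity figure works out to about 3.96 petaFLOPS per GPU, hence "just shy of four."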
Assuming Tesla is using Nvidia's most powerful SXM5 H100 modules, which plug into the accelerator giant's HGX chassis, we're looking at 1,250 nodes, each with eight GPUs, for a combined 39.5 exaFLOPS of FP8 performance.
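The back-of-the-envelope math behind those figures is simple enough to sketch; the per-GPU number below is Nvidia's quoted FP8 peak with sparsity for one SXM5 H100, and the eight-GPU node size is the standard HGX H100 configuration:

```python
# Rough sizing for a 10,000-GPU H100 cluster.
total_gpus = 10_000
gpus_per_hgx_node = 8            # one HGX H100 chassis holds eight SXM5 modules
fp8_sparse_tflops_per_gpu = 3_958  # Nvidia's quoted FP8 peak with sparsity

nodes = total_gpus // gpus_per_hgx_node
aggregate_exaflops = total_gpus * fp8_sparse_tflops_per_gpu / 1_000_000

print(f"{nodes} nodes, ~{aggregate_exaflops:.2f} exaFLOPS FP8 (sparse)")
```

That lands at 1,250 nodes and roughly 39.6 exaFLOPS of theoretical peak, in line with the 39.5 figure above once you round conservatively.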
According to Zaman, the system is supported by a hot tier cache capacity of more than 200 petabytes.
We also know that Tesla isn't just renting a bunch of GPUs from cloud providers like Microsoft or Google. Zaman says the entire system is housed on-prem at Tesla's facilities.
"Many orgs say 'We have' which usually means 'We rented' few actually own, and therefore fully vertically integrate. This bothers me because owning and maintaining is hard. Renting is easy," he wrote.
Tesla may be looking to expand its datacenter footprint to accommodate additional capacity. Earlier this month the carmaker posted a job opening for a senior engineering program manager for datacenters, who would "lead the end-to-end design and engineering of Tesla's first of its kind datacenters and will be one of the key members of its engineering team."
While we can only speculate what a first-of-its-kind datacenter might involve, the opening suggests this individual could oversee construction of a new facility. ®