Tesla wants to take machine learning silicon to the Dojo
Just how bad does existing AI hardware have to be to start from scratch?
To quench the thirst for ever larger AI and machine learning models, Tesla has revealed a wealth of details at Hot Chips 34 on their fully custom supercomputing architecture called Dojo.
The system is essentially a massive composable supercomputer, although unlike what we see on the Top 500, it's built from an entirely custom architecture that spans the compute, networking, and input/output (I/O) silicon to instruction set architecture (ISA), power delivery, packaging, and cooling. All of it was done with the express purpose of running tailored, specific machine learning training algorithms at scale.
"Real world data processing is only feasible through machine learning techniques, be it natural-language processing, driving in streets that are made for human vision to robotics interfacing with the everyday environment," Ganesh Venkataramanan, senior director of hardware engineering at Tesla, said during his keynote speech.
However, he argued that traditional methods for scaling distributed workloads have failed to accelerate at the rate necessary to keep up with machine learning's demands. In effect, Moore's Law is not cutting it, and neither are the systems available for AI/ML training at scale, namely some combination of CPUs and GPUs or, in rarer circumstances, speciality AI accelerators.
"Traditionally we build chips, we put them on packages, packages go on PCBs, which go into systems. Systems go into racks," said Venkataramanan. The problem is each time data moves from the chip to the package and off the package, it incurs a latency and bandwidth penalty.
A datacenter sandwich
So to get around the limitations, Venkataramanan and his team started over from scratch.
"Right from my interview with Elon, he asked me what can you do that is different from CPUs and GPUs for AI. I feel that the whole team is still answering that question."
This led to the development of the Dojo training tile, a self-contained compute cluster occupying a half-cubic foot capable of 556 TFLOPS of FP32 performance in a 15kW liquid-cooled package.
Each tile is equipped with 11GB of SRAM and is connected over a 9TB/s fabric using a custom transport protocol throughout the entire stack.
"This training tile represents unparalleled amounts of integration from computer to memory to power delivery, to communication, without requiring any additional switches," Venkataramanan said.
At the heart of the training tile is Tesla's D1, a 50 billion transistor die, based on TSMC's 7nm process. Tesla says each D1 is capable of 22 TFLOPS of FP32 performance at a TDP of 400W. However, Tesla notes that the chip is capable of running a wide range of floating point calculations including a few custom ones.
"If you compare transistors for millimeter square, this is probably the bleeding edge of anything which is out there," Venkataramanan said.
Tesla then took 25 D1s, binned them for known good dies, and then packaged them using TSMC's system-on-wafer technology to "achieve a huge amount of compute integration at very low latency and very-high bandwidth," he said.
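As a rough sanity check on the quoted figures, the per-die numbers can be scaled up to a 25-die tile. This is a minimal sketch that assumes performance and power scale linearly with die count; that assumption and the function name are ours, not Tesla's.

```python
# Figures quoted in the article; the multiplication is our own sanity check.
D1_FP32_TFLOPS = 22           # per-die FP32 throughput
D1_TDP_WATTS = 400            # per-die TDP
DIES_PER_TILE = 25
QUOTED_TILE_FP32_TFLOPS = 556
QUOTED_TILE_POWER_KW = 15


def tile_estimate(dies=DIES_PER_TILE):
    """Scale per-die figures linearly to a tile (an assumption, not Tesla's method)."""
    return {
        "fp32_tflops": dies * D1_FP32_TFLOPS,
        "die_power_kw": dies * D1_TDP_WATTS / 1000,
    }


estimate = tile_estimate()
print(estimate["fp32_tflops"], "TFLOPS estimated vs", QUOTED_TILE_FP32_TFLOPS, "quoted")
print(estimate["die_power_kw"], "kW for the dies vs", QUOTED_TILE_POWER_KW, "kW quoted for the tile")
```

The 550 TFLOPS estimate lands within about 1 percent of the quoted 556 TFLOPS, which would be consistent with the per-die figure being rounded. The dies alone account for roughly 10kW of the 15kW envelope; the article does not say how the remainder is spent.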
However, the system-on-wafer design and vertically stacked architecture introduced challenges when it came to power delivery.
According to Venkataramanan, most accelerators today place power directly adjacent to the silicon. And while proven, this approach means a large area of the accelerator has to be dedicated to those components, which made it impractical for Dojo, he explained. Instead, Tesla designed their chips to deliver power directly through the bottom of the die.
Putting it all together
"We could build an entire datacenter or an entire building out of this training tile, but the training tile is just the compute portion. We also need to feed it," Venkataramanan said.
For this, Tesla also developed the Dojo Interface Processor (DIP), which functions as a bridge between the host CPU and training processors. The DIP also serves as a source of shared high-bandwidth memory (HBM) and as a high-speed 400Gbit/sec NIC.
Each DIP features 32GB of HBM. Up to five of these cards can be connected to a training tile at 900GB/s each, for an aggregate of 4.5TB/s to the host and a total of 160GB of HBM per tile.
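The per-tile totals follow directly from the per-card figures. The short sketch below reproduces that arithmetic; the constant and function names are illustrative only.

```python
# Per-card figures quoted in the article.
DIP_HBM_GB = 32            # HBM per Dojo Interface Processor card
DIP_LINK_GB_PER_S = 900    # bandwidth per card to the training tile
MAX_DIPS_PER_TILE = 5


def dip_totals(cards=MAX_DIPS_PER_TILE):
    """Return (total HBM in GB, aggregate bandwidth in TB/s) for a number of DIP cards."""
    hbm_gb = cards * DIP_HBM_GB
    bandwidth_tb_per_s = cards * DIP_LINK_GB_PER_S / 1000
    return hbm_gb, bandwidth_tb_per_s


hbm_gb, bandwidth_tb_per_s = dip_totals()
print(f"{hbm_gb}GB of HBM, {bandwidth_tb_per_s}TB/s aggregate")
```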
Tesla's V1 configuration pairs six of these tiles – or 150 D1 dies – in an array supported by four host CPUs, each equipped with five DIP cards, to achieve a claimed exaflop of BF16 or CFP8 performance.
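The component counts implied by this configuration can be worked out from figures given elsewhere in the article (25 D1 dies per tile, 32GB of HBM per DIP). The sketch below does that arithmetic; the tile count, card count, and HBM total are derived here rather than stated by Tesla, and the function name is hypothetical.

```python
# Figures quoted in the article.
TOTAL_D1_DIES = 150
DIES_PER_TILE = 25
HOST_CPUS = 4
DIPS_PER_HOST = 5
DIP_HBM_GB = 32


def v1_inventory():
    """Derive component counts for the V1 configuration from the quoted figures."""
    tiles = TOTAL_D1_DIES // DIES_PER_TILE   # derived, not quoted
    dip_cards = HOST_CPUS * DIPS_PER_HOST    # derived, not quoted
    return {
        "tiles": tiles,
        "dip_cards": dip_cards,
        "total_hbm_gb": dip_cards * DIP_HBM_GB,
    }


print(v1_inventory())
```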
Put together, Venkataramanan says the architecture – detailed in depth by The Next Platform – enables Tesla to overcome the limitations associated with traditional accelerators from the likes of Nvidia and AMD.
"How traditional accelerators work, typically you try to fit an entire model into each accelerator. Replicate it, and then flow the data through each of them," he said. "What happens if we have bigger and bigger models? These accelerators can fall flat because they run out of memory."
This isn't a new problem, he noted. Nvidia's NVSwitch, for example, enables memory to be pooled across large banks of GPUs. However, Venkataramanan argues this not only adds complexity, but also introduces latency and compromises on bandwidth.
"We thought about this right from the get go. Our compute tiles and each of the dies were made for fitting big models," Venkataramanan said.
Such a specialized compute architecture demands a specialized software stack. However, Venkataramanan and his team recognized that programmability would either make or break Dojo.
"Ease of programmability for software counterparts is paramount when we design these systems," he said. "Researchers won't wait for your software folks to write a handwritten kernel for adapting to a new algorithm that we want to run."
To do this, Tesla ditched the idea of using kernels, and designed Dojo's architecture around compilers.
"What we did was we used PyTorch. We created an intermediate layer, which helps us parallelize to scale out hardware beneath it. Underneath everything is compiled code," he said. "This is the only way to create software stacks that are adaptable to all those future workloads."
Despite the emphasis on software flexibility, Venkataramanan notes that the platform, which is currently running in their labs, is limited to Tesla use for the time being.
"We are focused on our internal customers first," he said. "Elon has made it public that over time, we will make this available to researchers, but we don't have a time frame for that." ®