Google boffins pull back more of the curtain hiding TPU v4 secrets
And Nvidia's having none of Big G's claims of superiority
Google on Wednesday revealed more details of its fourth-generation Tensor Processing Unit chip (TPU v4), claiming that its silicon is faster and uses less power than Nvidia's A100 Tensor Core GPU.
TPU v4 "is 1.2x–1.7x faster and uses 1.3x–1.9x less power than the Nvidia A100," said researchers from Google and UC Berkeley in a paper published ahead of a June presentation at the International Symposium on Computer Architecture. Our pals over at The Next Platform previously dived into the TPU v4's architecture, based on earlier material released about the chips.
After Google's reveal this week, Nvidia coincidentally published a blog post in which founder and CEO Jensen Huang noted that the A100 debuted three years ago and that Nv's more recent H100 (Hopper) GPUs deliver 4x more performance than A100 based on MLPerf 3.0 benchmarks.
Google's TPU v4 also entered service three years ago, in 2020, and has since been refined. The Google/UC Berkeley authors explain that they chose not to measure TPU v4 against the more recent H100 (announced in 2022) because Google prefers to write papers about technologies after they have been deployed and used to run production apps.
"Both TPU v4s and A100s deployed in 2020 and both use 7nm technology," the paper explains. "The newer, 700W H100 was not available at AWS, Azure, or Google Cloud in 2022. The appropriate H100 match would be a successor to TPU v4 deployed in a similar time frame and technology (e.g., in 2023 and 4nm)."
The TPU v4, the researchers say, represents the company's fifth domain specific architecture (DSA) – tuned for machine learning – and its third supercomputer for machine learning models. It's nonetheless called "v4".
TPU for you and you
The ad biz introduced its first TPU back in 2016, before AI sauce had been ladled onto every product and press release. The new TPU v4 outperforms its v3 predecessor by 2.1x and boasts 2.7x better performance per Watt, it's claimed.
The salient innovations in TPU v4 involve the introduction of Optical Circuit Switches (OCS) with optical data links and the integration of SparseCores (SC), dataflow processors that accelerate calculations for models that rely on embeddings, like recommender systems.
OCS interconnection hardware allows Google's 4K-node TPU supercomputer to keep operating without problems even though its 1,000 CPU hosts are occasionally (0.1–1.0 percent of the time) unavailable.
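To see why a fraction of a percent matters at this scale, here is a back-of-the-envelope sketch of our own, not a calculation from the paper. It assumes host failures are independent and asks how often all 1,000 hosts would be up at the same moment; the function name and the independence assumption are ours, for illustration only.

```python
def all_hosts_up_probability(host_availability: float, num_hosts: int) -> float:
    """Chance that every host is up at once, ASSUMING independent failures.

    This is a simplification for illustration; it is not the model
    used in the Google/UC Berkeley paper.
    """
    return host_availability ** num_hosts


NUM_HOSTS = 1000  # CPU host count cited in the article

# 0.1 percent and 1.0 percent unavailability, per the range quoted above
for availability in (0.999, 0.990):
    p = all_hosts_up_probability(availability, NUM_HOSTS)
    print(f"host availability {availability:.1%}: "
          f"all {NUM_HOSTS} hosts up {p:.4%} of the time")
```

Under that simplified assumption, hosts that are each up 99.9 percent of the time are all up together only about 37 percent of the time, and at 99.0 percent the figure drops to roughly 0.004 percent. A machine that needed every host healthy at once would rarely get a clean run, which is the gap that routing around failures is meant to close.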
"An OCS raises availability by routing around failures," the researchers explain, noting that host availability must be 99.9 percent without OCS. With OCS, effective throughput ("goodput") in Google's TPU supercomputer can be achieved with host availability around 99.0 percent.
SC, the researchers explain, is a DSA for embedding training that debuted with TPU v2 and was improved in subsequent iterations. SC processors "accelerate models that rely on embeddings by 5x–7x yet use only five percent of die area and power," they say.
That appears to be a reasonable price to pay given that embedding-dependent deep learning recommendation models (DLRMs) represent a quarter of Google's workloads. These are used, the boffins note, in Google's advertising, search ranking, YouTube, and Google Play applications.
Take 4,096 TPU v4 nodes unified into a supercomputer in a datacenter, as Google Cloud has done, and the resulting hardware uses ~2–6x less energy and produces ~20x fewer carbon dioxide emissions than rival DSAs, the boffins claim.
"A ~20x reduction in carbon footprint greatly increases the chances of delivering on the amazing potential of ML in a sustainable manner," the mostly Google-employed authors declare, though they stop short of endorsing low-lying coastal property as a sound long-term investment.
Google has dozens of these supercomputers deployed for internal and external use. So enjoy your YouTube recommendations with slightly less guilt about collateral climate harm. Just remember to multiply your existential dread by the growing demand for machine learning applications. ®