Revealed: Blueprints to Google's AI FPU aka the Tensor Processing Unit
PCIe-connected super-calculator trounces outdated competition
Analysis In 2013, Google realized that its growing dependence on machine learning would force it to double the number of data centers it operates to handle projected workloads.
Based on the scant details Google provides about its data center operations – which include 15 major sites – the search-and-ad giant was looking at additional capital expenditures of perhaps $15bn, assuming that a large Google data center costs about $1bn.
The internet king assembled a team to produce a custom chip capable of handling part of its neural network workflow known as inference, which is where the software makes predictions based on data developed through the time-consuming and computationally intensive training phase. The processor sits on the PCIe bus and accepts commands from the host CPU: it is akin to the discrete FPUs and math coprocessors of yesteryear, but obviously souped up to today's standards.
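To make that split concrete, here is a toy sketch in Python (NumPy) – nothing to do with Google's actual code – of the difference: training runs a forward pass and then adjusts the weights using gradients, while inference simply runs the forward pass over weights that are already fixed, which is the part the TPU offloads from the host.

```python
import numpy as np

# Toy single-layer classifier: y = softmax(x @ W)
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))              # weights learned during training

def forward(x, W):
    """Inference: a forward pass over fixed weights -- the work a TPU-style accelerator handles."""
    logits = x @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def train_step(x, target, W, lr=0.1):
    """Training: forward pass plus a gradient update -- the compute-heavy phase done elsewhere."""
    probs = forward(x, W)
    grad = np.outer(x, probs - target)   # gradient of cross-entropy loss w.r.t. W
    return W - lr * grad

x = rng.normal(size=4)
target = np.array([1.0, 0.0, 0.0])
W = train_step(x, target, W)             # training adjusts the weights
print(forward(x, W))                     # inference just predicts with them
```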
The goal was to improve cost-performance over GPUs tenfold. By Google's own estimation, it succeeded, though its in-house hardware was competing against chips that have since been surpassed.
In a paper published in conjunction with a technical presentation at the National Academy of Engineering meeting at the Computer History Museum in Silicon Valley, Google engineers – more than 70 of them – have revealed how the web giant's Tensor Processing Unit (TPU), a custom ASIC designed to process TensorFlow machine-learning jobs, performs in its data centers.
Google introduced its TPU at Google I/O 2016. Distinguished hardware engineer – and top MIPS CPU architect – Norm Jouppi said in a blog post that Google had been running TPUs in its data centers since 2015 and that the specialized silicon delivered "an order of magnitude better-optimized performance per watt for machine learning."
Jouppi went so far as to suggest that the improvements amounted to traveling forward in time seven years – about three chip generations under Moore's Law.
Google executives had previously declared that artificial intelligence, in the form of machine learning and related technologies, was critical to the company's future. The existence of hardware custom-built for that purpose reinforced those statements.
Now, performance tests the company has run against Intel's Haswell CPU and Nvidia's Tesla K80 GPU appear to validate its approach.
Based on workloads involving neural network inference, "the TPU is 15x to 30x faster than contemporary GPUs and CPUs," said Jouppi in a blog post on Wednesday, and achieves "30x to 80x improvement" as measured in TOPS/Watt (trillions of operations per second per watt).
Not so fast
In a post to Reddit, Justin Johnson, a PhD student in the Stanford Vision Lab, points out that Google's researchers conducted their comparison with a Tesla K80 GPU, which is two generations old and lacks the hardware support for low-precision 8-bit computation found in the TPU.
"The comparison doesn't look quite so rosy next to the current-gen Tesla P40 GPU, which advertises 47 INT8 TOP/s at 250W TDP; compared to the P40, the TPU is about 1.9x faster and 6.5x more energy-efficient," Johnson wrote.
Still, Google's results suggest that the premise laid out in its paper – that "major improvements in cost-energy-performance must now come from domain-specific hardware" – has merit. In other words, semiconductor makers may become more inclined to match the hardware they design with anticipated applications.
In an email to The Register, Johnson explained that the TPU is special-purpose hardware designed to accelerate the inference phase in a neural network, in part through quantizing 32-bit floating point computations into lower-precision 8-bit arithmetic.
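In plain terms, quantization maps 32-bit floating-point values onto a small integer range, so the multiply-accumulate units can be narrower, denser and cheaper to drive. The sketch below shows the general idea with a simple symmetric linear scheme – an illustration of the technique, not Google's specific method:

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 values onto int8 [-127, 127] using a single linear scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values to compare against the originals."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=5).astype(np.float32)
q, scale = quantize_int8(w)
print(w)                      # original 32-bit weights
print(dequantize(q, scale))   # 8-bit approximation: small error, a quarter of the memory
```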
"This allows it to achieve significantly faster speeds and better energy efficiency than general-purpose GPUs," said Johnson. "Energy efficiency is particularly important in a large-scale datacenter scenario, where improving energy efficiency can significantly reduce cost when running at scale."
Johnson said he wasn't sure about the broad significance of the TPU. "Since it is not intended for training, I think that researchers will likely stick with Nvidia hardware for the near future," he said. "Designing your own custom hardware is a huge engineering effort that is likely beyond the capabilities of most companies, so I don't expect each and every company to have its own bespoke TPU-esque chips any time soon."
Nonetheless, he speculates TPUs could help Google Cloud Platform undercut competing services from Amazon Web Services, at least among customers running trained neural network models in production. ®