- Norman P. Jouppi
- Cliff Young
- Nishant Patil
- David Patterson
- Gaurav Agrawal
- Raminder Bajwa
- Sarah Bates
- Suresh Bhatia
- Nan Boden
- Al Borchers
- Rick Boyle
- Pierre-luc Cantin
- Clifford Chao
- Chris Clark
- Jeremy Coriell
- Mike Daley
- Matt Dau
- Jeffrey Dean
- Ben Gelb
- Tara Vazir Ghaemmaghami
- Rajendra Gottipati
- William Gulland
- Robert Hagmann
- C. Richard Ho
- Doug Hogberg
- John Hu
- Robert Hundt
- Dan Hurt
- Julian Ibarz
- Aaron Jaffey
- Alek Jaworski
- Alexander Kaplan
- Harshit Khaitan
- Andy Koch
- Naveen Kumar
- Steve Lacy
- James Laudon
- James Law
- Diemthu Le
- Chris Leary
- Zhuyuan Liu
- Kyle Lucke
- Alan Lundin
- Gordon MacKean
- Adriana Maggiore
- Maire Mahony
- Kieran Miller
- Rahul Nagarajan
- Ravi Narayanaswami
- Ray Ni
- Kathy Nix
- Thomas Norrie
- Mark Omernick
- Narayana Penukonda
- Andy Phelps
- Jonathan Ross
Abstract
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
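As a rough check on the peak-throughput figure (assuming the 256×256 systolic array organization and the 700 MHz clock rate reported in the full paper, neither of which is restated in this abstract), the number follows directly from the MAC count: 65,536 MACs × 2 ops per MAC (one multiply and one add) × 0.7 GHz ≈ 91.8 TOPS, i.e., roughly the quoted 92 TOPS.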