In-Datacenter Performance Analysis of a Tensor Processing Unit

  • Norman P. Jouppi
  • Cliff Young
  • Nishant Patil
  • David Patterson
  • Gaurav Agrawal
  • Raminder Bajwa
  • Sarah Bates
  • Suresh Bhatia
  • Nan Boden
  • Al Borchers
  • Rick Boyle
  • Pierre-luc Cantin
  • Clifford Chao
  • Chris Clark
  • Jeremy Coriell
  • Mike Daley
  • Matt Dau
  • Jeffrey Dean
  • Ben Gelb
  • Tara Vazir Ghaemmaghami
  • Rajendra Gottipati
  • William Gulland
  • Robert Hagmann
  • C. Richard Ho
  • Doug Hogberg
  • John Hu
  • Robert Hundt
  • Dan Hurt
  • Julian Ibarz
  • Aaron Jaffey
  • Alek Jaworski
  • Alexander Kaplan
  • Harshit Khaitan
  • Andy Koch
  • Naveen Kumar
  • Steve Lacy
  • James Laudon
  • James Law
  • Diemthu Le
  • Chris Leary
  • Zhuyuan Liu
  • Kyle Lucke
  • Alan Lundin
  • Gordon MacKean
  • Adriana Maggiore
  • Maire Mahony
  • Kieran Miller
  • Rahul Nagarajan
  • Ravi Narayanaswami
  • Ray Ni
  • Kathy Nix
  • Thomas Norrie
  • Mark Omernick
  • Narayana Penukonda
  • Andy Phelps
  • Jonathan Ross
ISCA (2017) (to appear)

Abstract

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
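As a rough illustration of the numbers quoted above, the sketch below checks the peak-throughput arithmetic and mimics 8-bit quantized inference in NumPy. The 256x256 MAC array, 700 MHz clock, and 32-bit accumulators are figures from the paper's body rather than the abstract; the NumPy matmul is only a functional analogy of the TPU's matrix unit, not its systolic-array implementation.

```python
import numpy as np

# Back-of-the-envelope peak throughput: a 256x256 array of 8-bit MACs,
# each performing a multiply and an add (2 ops) per cycle at 700 MHz
# (array size and clock rate are taken from the paper's body).
macs = 256 * 256                      # 65,536 MACs
ops_per_cycle = 2 * macs              # multiply + accumulate per MAC
clock_hz = 700e6                      # 700 MHz
peak_tops = ops_per_cycle * clock_hz / 1e12
print(f"peak throughput ~ {peak_tops:.1f} TOPS")  # ~91.8, i.e. the quoted 92 TOPS

# Functional analogy of 8-bit quantized inference: int8 weights and
# activations, with products accumulated in wider 32-bit integers.
rng = np.random.default_rng(0)
activations = rng.integers(-128, 127, size=(4, 256), dtype=np.int8)
weights = rng.integers(-128, 127, size=(256, 256), dtype=np.int8)
accum = activations.astype(np.int32) @ weights.astype(np.int32)   # int32 accumulation
print(accum.shape, accum.dtype)       # (4, 256) int32
```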
