Cliff Young

Cliff Young is a member of the Google Brain team, whose mission is to develop deep learning technologies and deploy them throughout Google. He is one of the designers of Google’s Tensor Processing Unit (TPU), which is used in production applications including Search, Maps, Photos, and Translate. TPUs also powered AlphaGo’s historic 4-1 victory over Go champion Lee Sedol. Before joining Google, Cliff worked at D. E. Shaw Research, building special-purpose supercomputers for molecular dynamics, and at Bell Labs.
Authored Publications
    Mesh-TensorFlow: Deep Learning for Supercomputers
    Noam Shazeer
    Youlong Cheng
    Ryan Sepassi
    Niki J. Parmar
    Blake Hechtman
    Ashish Vaswani
    Mingsheng Hong
    HyoukJoong Lee
    Peter Hawkins
    NeurIPS (2018)
    Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow the user can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-TensorFlow is available at https://github.com/tensorflow/mesh
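A minimal Python sketch of the core idea described above, not the Mesh-TensorFlow API: a layout maps named tensor dimensions onto named mesh axes, which determines each processor's local slice shape. The function name, mesh axis names, and sizes here are illustrative; pure batch-splitting is the special case where only the "batch" dimension is mapped.

```python
# Illustrative sketch of Mesh-TensorFlow-style layouts (not the real API).
# A "mesh" names its axes and their sizes; a "layout" says which tensor
# dimension is split across which mesh axis. Unmapped dimensions are replicated.

def local_shape(tensor_dims, mesh_shape, layout):
    """Compute the per-processor slice shape of a named tensor.

    tensor_dims: dict of tensor dimension name -> global size
    mesh_shape:  dict of mesh axis name -> number of processors along it
    layout:      dict of tensor dimension name -> mesh axis name
    """
    shape = {}
    for dim, size in tensor_dims.items():
        axis = layout.get(dim)
        split = mesh_shape[axis] if axis else 1
        assert size % split == 0, f"{dim} must divide evenly across {axis}"
        shape[dim] = size // split
    return shape

# A 2D mesh of 512 cores arranged 16 x 32 (illustrative numbers).
mesh = {"rows": 16, "cols": 32}
activations = {"batch": 4096, "hidden": 8192}

# Pure data parallelism: only the batch dimension is split.
print(local_shape(activations, mesh, {"batch": "rows"}))
# {'batch': 256, 'hidden': 8192}  -- hidden is replicated on every core

# Data + model parallelism: batch across rows, hidden across columns.
print(local_shape(activations, mesh, {"batch": "rows", "hidden": "cols"}))
# {'batch': 256, 'hidden': 256}
```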
    In-Datacenter Performance Analysis of a Tensor Processing Unit
    Mark Omernick
    Diemthu Le
    Robert Hagmann
    Kathy Nix
    Clifford Chao
    Jeremy Coriell
    Pierre-luc Cantin
    Andy Koch
    Rahul Nagarajan
    Mike Daley
    Al Borchers
    Chris Clark
    Adriana Maggiore
    Raminder Bajwa
    Matt Dau
    Ben Gelb
    Alan Lundin
    Ray Ni
    Rick Boyle
    Steve Lacy
    Alek Jaworski
    John Hu
    Thomas Norrie
    Aaron Jaffey
    Rajendra Gottipati
    James Law
    Ravi Narayanaswami
    Jonathan Ross
    Harshit Khaitan
    Kyle Lucke
    C. Richard Ho
    Alexander Kaplan
    Andy Phelps
    Narayana Penukonda
    Nan Boden
    Sarah Bates
    Maire Mahony
    William Gulland
    Doug Hogberg
    Gordon MacKean
    Zhuyuan Liu
    Tara Vazir Ghaemmaghami
    Dan Hurt
    Kieran Miller
    Suresh Bhatia
    Gaurav Agrawal
    Julian Ibarz
    Nishant Patil
    Norman P. Jouppi
    Naveen Kumar
    Chris Leary
    ISCA (2017)
    Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
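The 92 TOPS peak can be sanity-checked from the MAC count: the matrix unit is a 256x256 array of 8-bit MACs, and counting each MAC as two operations at the 700 MHz clock reported in the paper (the clock figure is taken from the paper, not this abstract) gives roughly the quoted number. A quick Python check:

```python
# Back-of-the-envelope check of the TPU's quoted peak throughput.
# Assumes the 700 MHz clock reported in the ISCA paper; one MAC = 2 ops.

macs_per_cycle = 256 * 256   # 65,536 8-bit MACs in the matrix multiply unit
ops_per_mac = 2              # multiply + accumulate
clock_hz = 700e6             # 700 MHz

peak_tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(f"peak throughput ~= {peak_tops:.1f} TOPS")   # ~91.8, quoted as 92 TOPS
```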
    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
    Stephan Gouws
    Yonghui Wu
    Wei Wang
    Łukasz Kaiser
    Xiaobing Liu
    Alex Rudnick
    Qin Gao
    Keith Stevens
    Mike Schuster
    Mohammad Norouzi
    Macduff Hughes
    Nishant Patil
    Jason Smith
    Apurva Shah
    Taku Kudo
    Maxim Krikun
    George Kurian
    CoRR, abs/1609.08144 (2016)
    Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
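A minimal Python sketch of the length-normalized, coverage-penalized beam scoring the abstract mentions. The length-penalty and coverage-penalty forms follow the GNMT paper, but the function name, the alpha/beta values, and the toy attention weights are illustrative placeholders, not the production implementation.

```python
import math

def rescore(log_prob, attention, alpha=0.6, beta=0.2):
    """Length-normalized, coverage-penalized score for one beam hypothesis.

    log_prob:  total log P(Y|X) of the candidate translation Y
    attention: attention[j][i] = weight of target token j on source token i
    alpha, beta: tuning constants; the values here are illustrative
    """
    target_len = len(attention)
    source_len = len(attention[0])

    # Length penalty: keeps the search from favoring short outputs.
    lp = ((5 + target_len) ** alpha) / ((5 + 1) ** alpha)

    # Coverage penalty: rewards hypotheses whose attention covers every
    # source token, with each source position's coverage capped at 1.0.
    cp = beta * sum(
        math.log(min(sum(attention[j][i] for j in range(target_len)), 1.0))
        for i in range(source_len)
    )
    return log_prob / lp + cp

# Toy example: 2 source tokens, 3 target tokens, made-up attention weights.
attn = [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]
print(rescore(log_prob=-4.2, attention=attn))
```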