Sanjay Ghemawat

Sanjay Ghemawat

I have been working at Google since late 1999 on distributed systems, performance tools, indexing systems, compression schemes, memory management, data representation languages, RPC systems, and other systems infrastructure projects. I graduated with a Ph.D. in Computer Science from MIT. Before joining Google, I was a member of the research staff at DEC Systems Research Center in Palo Alto, CA.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    arxiv:2204.02311 (2022)
    Preview abstract Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. View details
    Preview abstract We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network. View details
    Dynamic Control Flow in Large-Scale Machine Learning
    Yuan Yu
    Eugene Brevdo
    Mike Burrows
    Tim Harley
    Peter Hawkins
    Manjunath Kudlur
    Rajat Monga
    Xiaoqiang Zheng
    Proceedings of EuroSys 2018
    Preview abstract Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability. View details
    TensorFlow: A system for large-scale machine learning
    Jianmin Chen
    Matthieu Devin
    Geoffrey Irving
    Manjunath Kudlur
    Rajat Monga
    Benoit Steiner
    Paul Tucker
    Vijay Vasudevan
    Pete Warden
    Yuan Yu
    Xiaoqiang Zheng
    12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), USENIX Association (2016), pp. 265-283
    Preview abstract TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous “parameter server” designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that Tensor- Flow achieves for several real-world applications. View details
    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
    Ashish Agarwal
    Eugene Brevdo
    Craig Citro
    Matthieu Devin
    Ian Goodfellow
    Andrew Harp
    Geoffrey Irving
    Yangqing Jia
    Rafal Jozefowicz
    Lukasz Kaiser
    Manjunath Kudlur
    Dan Mané
    Rajat Monga
    Chris Olah
    Mike Schuster
    Jonathon Shlens
    Benoit Steiner
    Ilya Sutskever
    Kunal Talwar
    Paul Tucker
    Vijay Vasudevan
    Pete Warden
    Yuan Yu
    Xiaoqiang Zheng
    tensorflow.org (2015)
    Preview abstract TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org. View details
    Spanner: Google's Globally Distributed Database
    Michael Epstein
    Andrew Fikes
    Christopher Frost
    J. J. Furman
    Andrey Gubarev
    Christopher Heiser
    Sebastian Kanthak
    Eugene Kogan
    Hongyi Li
    Sergey Melnik
    David Mwaura
    David Nagle
    Rajesh Rao
    Lindsay Rolig
    Yasushi Saito
    Michal Szymaniak
    Christopher Taylor
    Ruth Wang
    Dale Woodford
    ACM Trans. Comput. Syst., 31 (2013), pp. 8
    Preview
    Spanner: Google's Globally-Distributed Database
    Michael Epstein
    Andrew Fikes
    Christopher Frost
    JJ Furman
    Andrey Gubarev
    Christopher Heiser
    Peter Hochschild
    Sebastian Kanthak
    Eugene Kogan
    Hongyi Li
    Sergey Melnik
    David Mwaura
    David Nagle
    Rajesh Rao
    Lindsay Rolig
    Dale Woodford
    Yasushi Saito
    Christopher Taylor
    Michal Szymaniak
    Ruth Wang
    OSDI (2012)
    Preview abstract Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner. View details
    Back-off Language Model Compression
    Boulos Harb
    Proceedings of Interspeech 2009, International Speech Communication Association (ISCA), pp. 325-355
    Preview abstract With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively. The best compression algorithm achieves 2.6 bytes/n-gram at ≈18X slower than uncompressed. For faster LM operation we found it feasible to represent the LM at ≈4.0 bytes/n-gram, and ≈3X slower than the uncompressed LM. The memory footprint of a LM containing one billion n-grams can thus be reduced to 3–4 Gbytes without impacting its speed too much. See the presentation material from a talk about this paper. View details