James Laudon

James Laudon is a member of the Google Brain team, whose mission is to develop deep learning technologies and deploy them throughout Google. His research interests focus on hardware and software co-design for high-performance systems, and he's currently working on domain-specific computer architectures for machine learning and on applying machine learning to system design. Before joining the Brain team in 2017, James was the founder and site director of the Google Madison office. Prior to joining Google in 2007, he contributed to the architecture and implementation of multiple computer systems, including the Stanford DASH, SGI Origin 2000, and Sun UltraSPARC T1. James has a B.S. in Electrical Engineering from the University of Wisconsin – Madison and an M.S. and Ph.D. in Electrical Engineering from Stanford University.
Authored Publications
    Graph Transformer: A Generalized Method for Computation Graph Optimizations
    Amirali Abdolrashidi
    Azalia Mirhoseini
    Daniel Wong
    Hanxiao Liu
    Mangpo Phothilimthana
    Qiumin Xu
    Shen Wang
    Sudip Roy
    (2020)
    Runtime and scalability of neural networks can be significantly affected by computational graph optimization during compilation. Most existing automated graph optimizations are impractical for deployment due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an end-to-end deep reinforcement learning method named Graph Transformer (GTf), based on a scalable sequential attention mechanism over an inductive graph neural network that is transferable to new, unseen graphs. GTf generates decisions on the entire graph in a single shot, rather than on each individual node progressively, drastically speeding up the search compared to prior methods. Moreover, we propose recurrent attention layers to jointly optimize dependent graph optimization tasks and demonstrate a 33%-60% speedup on three graph optimization tasks compared to TensorFlow's default optimizations. On a diverse set of representative graphs consisting of 1k-80k nodes, including Inception-v3, Transformer-XL, and WaveNet, GTf achieves an average 21% improvement over human experts and an 18% improvement over the prior state of the art, with 15x faster convergence, on a device placement task evaluated in real systems.
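    To make the single-shot placement idea concrete, here is a minimal, hypothetical sketch (not the paper's model): a toy graph neural network encodes every node of a computation graph, a self-attention layer lets each decision depend on the whole graph, and one forward pass emits a device assignment for all nodes at once rather than placing nodes one by one. All shapes, the random graph, and the feature choices are illustrative assumptions.

```python
# Minimal sketch, assuming toy shapes and a random graph: one-shot device
# placement in the spirit of the abstract above, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

num_nodes, feat_dim, hidden_dim, num_devices = 16, 8, 32, 4
features = rng.normal(size=(num_nodes, feat_dim))                       # per-op features (e.g. shape, type)
adjacency = (rng.random((num_nodes, num_nodes)) < 0.2).astype(float)    # toy dataflow edges

# One round of inductive message passing: each node mixes in its neighbors' features.
w_msg = rng.normal(size=(feat_dim, hidden_dim)) * 0.1
node_emb = np.tanh((features + adjacency @ features) @ w_msg)

# Simple self-attention over node embeddings so placements can condition on the whole graph.
scale = np.sqrt(hidden_dim)
attn = np.exp(node_emb @ node_emb.T / scale)
attn /= attn.sum(axis=1, keepdims=True)
context = attn @ node_emb

# Single-shot decision head: one matrix multiply yields a device distribution per node,
# so the entire placement is produced in a single forward pass.
w_out = rng.normal(size=(hidden_dim, num_devices)) * 0.1
logits = context @ w_out
placement = logits.argmax(axis=1)    # one device id per graph node, decided together
print(placement)
```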
    Apollo: Transferable Architecture Exploration
    The looming end of Moore's Law and the ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Accelerator design forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints (e.g., area budget) or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use Apollo to optimize accelerator configurations for a diverse set of neural architectures with alternative design constraints. We show that Apollo finds optimal design configurations more sample-efficiently than baseline approaches. We further show that transferring knowledge between target architectures with different design constraints helps to find optimal configurations faster. This encouraging outcome portrays a promising path forward in shortening the timeline for accelerator design.
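    As a rough illustration of the constrained design-space search that Apollo automates, the toy sketch below exhaustively scores a tiny hypothetical accelerator space under an area budget. The parameters (PE count, buffer size), the analytical cost model, and the budget are invented stand-ins for the costly-to-evaluate objective described above; they do not represent Apollo's actual method or search strategy, which is sample-efficient rather than exhaustive.

```python
# Rough, hypothetical sketch of the constrained search problem (not Apollo itself):
# score a tiny toy accelerator design space under an area budget. The parameters,
# the analytical cost model, and the budget are illustrative assumptions standing
# in for a costly-to-evaluate objective such as a cycle-accurate simulation.
import itertools

PE_COUNTS = [64, 128, 256, 512]        # number of processing elements
BUFFER_KB = [256, 512, 1024, 2048]     # on-chip buffer size in KiB
AREA_BUDGET_MM2 = 30.0                 # example design constraint (area budget)

def area_mm2(pes: int, buf_kb: int) -> float:
    # Hypothetical area model: both compute and SRAM cost silicon area.
    return 0.04 * pes + 0.008 * buf_kb

def latency_ms(pes: int, buf_kb: int) -> float:
    # Hypothetical objective: more compute and buffering help, with diminishing
    # returns. A real evaluation would be a slow accelerator simulation.
    return 100.0 / (pes ** 0.5) + 50.0 / (buf_kb ** 0.5)

best = None
for pes, buf_kb in itertools.product(PE_COUNTS, BUFFER_KB):
    if area_mm2(pes, buf_kb) > AREA_BUDGET_MM2:
        continue                       # skip configurations that violate the budget
    score = latency_ms(pes, buf_kb)
    if best is None or score < best[0]:
        best = (score, pes, buf_kb)

print("best feasible (latency_ms, PEs, buffer_KiB):", best)
```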
    In-Datacenter Performance Analysis of a Tensor Processing Unit
    Norman P. Jouppi
    Nishant Patil
    Gaurav Agrawal
    Raminder Bajwa
    Sarah Bates
    Suresh Bhatia
    Nan Boden
    Al Borchers
    Rick Boyle
    Pierre-luc Cantin
    Clifford Chao
    Chris Clark
    Jeremy Coriell
    Mike Daley
    Matt Dau
    Ben Gelb
    Tara Vazir Ghaemmaghami
    Rajendra Gottipati
    William Gulland
    Robert Hagmann
    C. Richard Ho
    Doug Hogberg
    John Hu
    Dan Hurt
    Julian Ibarz
    Aaron Jaffey
    Alek Jaworski
    Alexander Kaplan
    Harshit Khaitan
    Andy Koch
    Naveen Kumar
    Steve Lacy
    James Law
    Diemthu Le
    Chris Leary
    Zhuyuan Liu
    Kyle Lucke
    Alan Lundin
    Gordon MacKean
    Adriana Maggiore
    Maire Mahony
    Kieran Miller
    Rahul Nagarajan
    Ravi Narayanaswami
    Ray Ni
    Kathy Nix
    Thomas Norrie
    Mark Omernick
    Narayana Penukonda
    Andy Phelps
    Jonathan Ross
    ISCA (2017)
    Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
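    As a quick sanity check of the peak-throughput figure quoted above, the short calculation below reproduces the 92 TOPS number. The 65,536 MACs come from the abstract; the 700 MHz clock and the convention of counting each MAC as two operations (a multiply and an add) are taken from the TPU paper itself.

```python
# Back-of-the-envelope check of the peak-throughput figure quoted in the abstract.
# The 65,536 MACs and ~92 TOPS are from the abstract above; the 700 MHz clock and
# the "1 MAC = 2 ops (multiply + add)" convention are from the TPU paper.
macs = 256 * 256          # 65,536 8-bit MACs in the systolic matrix multiply unit
ops_per_mac = 2           # each multiply-accumulate counts as two operations
clock_hz = 700e6          # TPU clock frequency

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"peak throughput ~ {peak_tops:.1f} TOPS")   # ~91.8, i.e. the quoted 92 TOPS
```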