James Laudon
James Laudon is a member of the Google Brain team, whose mission is to develop deep learning technologies and deploy them throughout Google. His research interests focus on hardware and software co-design for high-performance systems; he is currently working on domain-specific computer architectures for machine learning and on applying machine learning to system design. Before joining the Brain team in 2017, James was the founder and site director of the Google Madison office. Prior to joining Google in 2007, he contributed to the architecture and implementation of multiple computer systems, including the Stanford DASH, SGI Origin 2000, and Sun UltraSPARC T1. James has a B.S. in Electrical Engineering from the University of Wisconsin–Madison and an M.S. and Ph.D. in Electrical Engineering from Stanford University.
Authored Publications
Graph Transformer: A Generalized Method for Computation Graph Optimizations
Amirali Abdolrashidi
Anna Darling Goldie
Azalia Mirhoseini
Daniel Wong
Hanxiao Liu
Qiumin Xu
Shen Wang
Sudip Roy
(2020)
Abstract
Runtime and scalability of neural networks can be significantly affected by computational graph optimization during compilation. Most existing automated graph optimizations are impractical for deployment due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an end-to-end deep reinforcement learning method named Graph Transformer (GTf), based on a scalable sequential attention mechanism over an inductive graph neural network that is transferable to new, unseen graphs. GTf generates decisions on the entire graph in a single-shot fashion, rather than on each individual node progressively, drastically speeding up the search compared to prior methods. Moreover, we propose recurrent attention layers to jointly optimize dependent graph optimization tasks and demonstrate a 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimizations. On a diverse set of representative graphs consisting of 1k-80k nodes, including Inception-v3, Transformer-XL, and WaveNet, GTf achieves an average 21% improvement over human experts and 18% improvement over the prior art with 15x faster convergence, on a device placement task evaluated in real systems.
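The single-shot idea above can be sketched in miniature. This is an illustrative toy, not the paper's actual model: all weights are random and untrained, and the graph, feature, and device sizes are made up. It contrasts single-shot placement (embed every op once with one message-passing layer, then score all node-device pairs in one forward pass) with placing nodes one at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy computation graph: 6 ops with data-dependency edges.
num_nodes, feat_dim, num_devices = 6, 4, 2
features = rng.normal(size=(num_nodes, feat_dim))  # per-op features (e.g., shape, op type)
adj = np.eye(num_nodes)                            # self-loops
for src, dst in [(0, 2), (1, 2), (2, 3), (3, 4), (3, 5)]:
    adj[dst, src] = 1.0                            # dst aggregates from src

# One inductive message-passing layer (hypothetical weights, untrained),
# so embeddings depend only on local structure and generalize to new graphs.
W = rng.normal(size=(feat_dim, feat_dim))
node_emb = np.tanh((adj / adj.sum(1, keepdims=True)) @ features @ W)

# Single-shot decision: score every (node, device) pair in one pass
# and take the argmax per node, instead of placing nodes sequentially.
device_head = rng.normal(size=(feat_dim, num_devices))
placement = (node_emb @ device_head).argmax(axis=1)
print(placement)  # one device id per op, produced in a single forward pass
```

In the real method the scoring network is trained with reinforcement learning against measured runtime; the point of the sketch is only that the cost of a decision is one forward pass over the whole graph, not one pass per node.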
Apollo: Transferable Architecture Exploration
Albin Jones
Ravi Narayanaswami
Sat Chatterjee
ML for Systems Workshop at NeurIPS 2020
Abstract
The looming end of Moore's Law and the ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Accelerator design forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints (e.g. area budget) or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use Apollo to optimize accelerator configurations of a diverse set of neural architectures with alternative design constraints. We show that Apollo finds optimal design configurations more sample-efficiently than baseline approaches. We further show that transferring knowledge between target architectures with different design constraints helps to find optimal configurations faster. This encouraging outcome portrays a promising path forward in shortening the timeline for accelerator design.
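The framing above can be made concrete with a toy example. Everything here is hypothetical: the two design parameters, the area and runtime models, and the plain filtered search are stand-ins for Apollo's real design space, hardware cost models, and black-box optimizer. The sketch shows only the shape of the problem (constrained black-box minimization) and the transfer idea (seeding a new search with a related task's best configuration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy accelerator design space (hypothetical parameters, not Apollo's actual space).
PE_COUNTS = [32, 64, 128, 256]     # processing elements
BUFFER_KB = [128, 256, 512, 1024]  # on-chip buffer size

def area(cfg):
    pes, buf = cfg
    return pes * 0.05 + buf * 0.01          # stand-in area model (mm^2)

def runtime(cfg, workload_weight):
    pes, buf = cfg                           # stand-in cost model: more PEs and
    return workload_weight / pes + 50.0 / buf  # buffer -> faster, workload-dependent

def search(area_budget, workload_weight, seeds=(), trials=20):
    """Black-box search; 'seeds' transfers good configs from a related task."""
    candidates = list(seeds)
    while len(candidates) < trials:
        candidates.append((rng.choice(PE_COUNTS), rng.choice(BUFFER_KB)))
    feasible = [c for c in candidates if area(c) <= area_budget]
    return min(feasible, key=lambda c: runtime(c, workload_weight))

# Optimize for task A, then reuse its best config to warm-start task B
# under a tighter area budget -- the transfer idea, in miniature.
best_a = search(area_budget=20.0, workload_weight=400.0)
best_b = search(area_budget=18.0, workload_weight=300.0, seeds=[best_a])
print(best_a, best_b)
```

Apollo replaces the random candidate generator here with sample-efficient black-box optimizers, and replaces the analytic cost model with expensive accelerator simulations, which is exactly why sample efficiency and transfer matter.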
In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman P. Jouppi
Nishant Patil
Gaurav Agrawal
Raminder Bajwa
Sarah Bates
Suresh Bhatia
Nan Boden
Al Borchers
Rick Boyle
Pierre-luc Cantin
Clifford Chao
Chris Clark
Jeremy Coriell
Mike Daley
Matt Dau
Ben Gelb
Tara Vazir Ghaemmaghami
Rajendra Gottipati
William Gulland
Robert Hagmann
C. Richard Ho
Doug Hogberg
John Hu
Dan Hurt
Julian Ibarz
Aaron Jaffey
Alek Jaworski
Alexander Kaplan
Harshit Khaitan
Andy Koch
Naveen Kumar
Steve Lacy
James Law
Diemthu Le
Chris Leary
Zhuyuan Liu
Kyle Lucke
Alan Lundin
Gordon MacKean
Adriana Maggiore
Maire Mahony
Kieran Miller
Rahul Nagarajan
Ravi Narayanaswami
Ray Ni
Kathy Nix
Thomas Norrie
Mark Omernick
Narayana Penukonda
Andy Phelps
Jonathan Ross
ISCA (2017) (to appear)
Abstract
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
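The arithmetic at the heart of the abstract's 65,536-MAC matrix unit (8-bit multiplies accumulated into wider registers) can be mirrored numerically in a few lines. This is a minimal sketch of the quantized-matmul idea, not TPU code: the `quantize` helper and the simple per-tensor max scaling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x):
    """Hypothetical per-tensor quantizer: map floats onto int8 with one scale."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

acts = rng.normal(size=(4, 8)).astype(np.float32)
weights = rng.normal(size=(8, 3)).astype(np.float32)

qa, sa = quantize(acts)
qw, sw = quantize(weights)

# int8 x int8 products accumulated in int32 -- what each MAC in the
# matrix unit does, so the narrow datapath never overflows.
acc = qa.astype(np.int32) @ qw.astype(np.int32)
result = acc * (sa * sw)  # dequantize back to float

# Close to the float32 product, at a fraction of the datapath width.
print(np.max(np.abs(result - acts @ weights)))
```

The narrow multipliers are what let the TPU pack so many MACs into a small, low-power die while still approximating the floating-point result closely enough for inference.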
Throughput-Oriented Multicore Processors
Robert Golla
Greg Grohoski
Multicore Processors and Systems, Springer (2009), pp. 205-230
The Coming Wave of Multithreaded Chip Multiprocessors
Lawrence Spracklen
International Journal of Parallel Programming, vol. 35 (2007), pp. 299-330
Virtual Private Caches
Kyle J. Nesbit
James E. Smith
Proceedings of the 34th Annual International Symposium on Computer Architecture (2007), pp. 57-68
Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
Fair Queuing Memory Systems
Kyle J. Nesbit
Nidhi Aggarwal
James E. Smith
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), IEEE, Orlando, FL, USA (2006)
Maximizing CMP Throughput with Mediocre Cores
John D. Davis
Kunle Olukotun
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society, Saint Louis, MO, USA (2005), pp. 51-62
The SGI Origin 2000: A ccNUMA Highly Scalable Server
Daniel Lenoski
Proceedings of the 24th Annual International Symposium on Computer Architecture, ACM, Denver, CO, USA (1997), pp. 241-251
Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations
Anoop Gupta
Mark Horowitz
Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, San Jose, CA, USA (1994), pp. 308-318
The DASH Prototype: Logic Overhead and Performance
Daniel Lenoski
Truman Joe
David Nakahira
Luis Stevens
Anoop Gupta
John Hennessy
IEEE Transactions on Parallel and Distributed Systems, vol. 4 (1993), pp. 41-61
The DASH Prototype: Implementation and Performance
Daniel Lenoski
Truman Joe
David Nakahira
Luis Stevens
Anoop Gupta
John Hennessy
Proceedings of the 19th Annual International Symposium on Computer Architecture, ACM, Queensland, Australia (1992), pp. 92-103
The Stanford Dash Multiprocessor
Daniel Lenoski
Kourosh Gharachorloo
Anoop Gupta
John L. Hennessy
Mark Horowitz
Monica S. Lam
IEEE Computer, vol. 25 (1992), pp. 63-79
Overview and Status of the Stanford DASH Multiprocessor
Daniel Lenoski
Kourosh Gharachorloo
Anoop Gupta
John Hennessy
Proceedings of the International Symposium on Shared Memory Multiprocessing, Tokyo, Japan (1991)
Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors
Kourosh Gharachorloo
Daniel Lenoski
Phillip Gibbons
Anoop Gupta
John Hennessy
Proceedings of the 17th Annual International Symposium on Computer Architecture, ACM, Seattle, WA, USA (1990), pp. 15-26
The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor
Daniel Lenoski
Kourosh Gharachorloo
Anoop Gupta
John Hennessy
Proceedings of the 17th Annual International Symposium on Computer Architecture, ACM, Seattle, WA, USA (1990), pp. 148-159
The ZS-1 Central Processor
James E. Smith
Greg E. Dermer
Brian D. Vanderwarn
Steve D. Klinger
Chris M. Rozewski
Dan L. Fowler
Keith R. Scidmore
Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, IEEE, Palo Alto, CA, USA (1987), pp. 199-204