Robert Hundt

Robert Hundt received a degree in Computer Science from the Technical University in Munich in 1992. Until 1999 he worked for Terrasat GmbH in Germany, a 20+ person R&D company he co-owned. He played many roles, from company lead to booth cat, while writing and optimizing software for surveying and navigation with satellite systems.

In 2000 he started working for Hewlett-Packard Company in California on bringing up the new and scalable high-level optimizer SYZYGY for the HP C/C++/FORTRAN compilers with a new inter-procedural optimizer, a new loop optimizer, and a new scalar optimizer. Before joining the compiler group, Robert was responsible for dynamic binary instrumentation for Intel Itanium processors, co-creating and designing the performance analysis tool HP Caliper.

Since the beginning of 2007, Robert has been working for Google. He has led various compiler and performance projects: he served as Tech Lead for compiler optimization for servers (x86), Android (ARM), and GPUs (open-source CUDA compiler), built datacenter profiling and performance analysis tools, and worked on Gmail/Apps performance, from Chrome to the datacenter. For many years Robert was the software lead for Google TPU, supercomputers that accelerate machine learning inference and training, an effort that includes the open-source TensorFlow compiler XLA. Today he is the tech lead (TL) for ML compilers, runtimes, and performance for TPU, GPU, and CPU. In parallel, he works on the open-source High-Level Synthesis toolchain XLS and dabbles in quantum computing. He remains strongly engaged in compiler and datacenter research.

In real life, he enjoys spending time with his family, playing the piano (at which he sucks), playing volleyball (which he used to do fairly well), and everything related to delicious high-quality food (his main reason for joining Google ;-)

Authored Publications
    Quantum Computing for Programmers
    Cambridge University Press, Cambridge CB2 8BS, United Kingdom (2022)
    Abstract: This introduction to quantum computing from a classical programmer's perspective is meant for students and practitioners alike. Over 25 fundamental algorithms are explained with full mathematical derivations and classical code for simulation, using an open-source code base developed from the ground up in Python and C++. After presenting the basics of quantum computing, the author focuses on algorithms and the infrastructure to simulate them efficiently, beginning with quantum teleportation, superdense coding, and Deutsch-Jozsa. Coverage of advanced algorithms includes the quantum supremacy experiment, quantum Fourier transform, phase estimation, Shor's algorithm, Grover's algorithm with derivatives, quantum random walks, and the Solovay–Kitaev algorithm for gate approximation. Quantum simulation is explored with the variational quantum eigensolver, quantum approximate optimization, and the Max-Cut and Subset-Sum algorithms. The book also discusses issues around programmer productivity, quantum noise, error correction, and challenges for quantum programming languages, compilers, and tools, with a final section on compiler techniques for transpilation.
    In-Datacenter Performance Analysis of a Tensor Processing Unit
    Norman P. Jouppi
    Nishant Patil
    Gaurav Agrawal
    Raminder Bajwa
    Sarah Bates
    Suresh Bhatia
    Nan Boden
    Al Borchers
    Rick Boyle
    Pierre-luc Cantin
    Clifford Chao
    Chris Clark
    Jeremy Coriell
    Mike Daley
    Matt Dau
    Ben Gelb
    Tara Vazir Ghaemmaghami
    Rajendra Gottipati
    William Gulland
    Robert Hagmann
    C. Richard Ho
    Doug Hogberg
    John Hu
    Dan Hurt
    Julian Ibarz
    Aaron Jaffey
    Alek Jaworski
    Alexander Kaplan
    Harshit Khaitan
    Andy Koch
    Naveen Kumar
    Steve Lacy
    James Law
    Diemthu Le
    Chris Leary
    Zhuyuan Liu
    Kyle Lucke
    Alan Lundin
    Gordon MacKean
    Adriana Maggiore
    Maire Mahony
    Kieran Miller
    Rahul Nagarajan
    Ravi Narayanaswami
    Ray Ni
    Kathy Nix
    Thomas Norrie
    Mark Omernick
    Narayana Penukonda
    Andy Phelps
    Jonathan Ross
    ISCA (2017) (to appear)
    Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
    GPUCC - An Open-Source GPGPU Compiler
    Jingyue Wu
    Mark Heffernan
    Chris Leary
    Bjarke Roune
    Rob Springer
    Xuetian Weng
    Proceedings of the 2016 International Symposium on Code Generation and Optimization, ACM, New York, NY, pp. 105-116
    Abstract: Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has not been a fully open-source compiler targeting the CUDA environment, hampering general compiler and architecture research and making deployment difficult in datacenter or supercomputer environments. In this paper, we present gpucc, an LLVM-based, fully open-source, CUDA compatible compiler for high performance computing. It performs various general and CUDA-specific optimizations to generate high performance code. The Clang-based frontend supports modern language features such as those in C++11 and C++14. Compile time is 8% faster than NVIDIA’s toolchain (nvcc) and it reduces compile time by up to 2.4x for pathological compilations (>100 secs), which tend to dominate build times in parallel build environments. Compared to nvcc, gpucc’s runtime performance is on par for several open-source benchmarks, such as Rodinia (0.8% faster), SHOC (0.5% slower), or Tensor (3.7% faster). It outperforms nvcc on internal large-scale end-to-end benchmarks by up to 51.0%, with a geometric mean of 22.9%.
    Whare-Map: Heterogeneity in “Homogeneous” Warehouse-Scale Computers
    Jason Mars
    Lingjia Tang
    Proceedings of the 2013 ACM/IEEE International Symposium on Computer Architecture (ISCA), IEEE (to appear)
    Abstract: Modern “warehouse scale computers” (WSCs) continue to be embraced as homogeneous computing platforms. However, due to frequent machine replacements and upgrades, modern WSCs are in fact composed of diverse commodity microarchitectures and machine configurations. Yet, current WSCs are architected with the assumption of homogeneity, leaving a potentially significant performance opportunity unexplored. In this paper, we expose and quantify the performance impact of the “homogeneity assumption” for modern production WSCs using industry-strength large-scale web-service workloads. In addition, we argue for, and evaluate the benefits of, a heterogeneity-aware WSC using commercial web-service production workloads including Google’s websearch. We also identify key factors impacting the available performance opportunity when exploiting heterogeneity and introduce a new metric, opportunity factor, to quantify an application’s sensitivity to the heterogeneity in a given WSC. To exploit heterogeneity in “homogeneous” WSCs, we propose “Whare-Map,” the WSC Heterogeneity Aware Mapper that leverages already in-place continuous profiling subsystems found in production environments. When employing “Whare-Map”, we observe a cluster-wide performance improvement of 15% on average over heterogeneity-oblivious job placement and up to an 80% improvement for web-service applications that are particularly sensitive to heterogeneity.
    Optimizing Google's Warehouse Scale Computers: The NUMA Experience
    Lingjia Tang
    Jason Mars
    Robert Hagmann
    The 19th IEEE International Symposium on High Performance Computer Architecture (2013)
    JSWhiz - Static Analysis for JavaScript Memory Leaks
    Proceedings of the 10th annual IEEE/ACM international symposium on Code generation and optimization, IEEE (2013)
    Abstract: JavaScript is the dominant language for implementing dynamic web pages in browsers. Even though it is standardized, many browsers implement language and browser bindings in different and incompatible ways. As a result, a plethora of web development frameworks were developed to hide cross-browser issues and to ease development of large web applications. An unwelcome side-effect of these frameworks is that they can introduce memory leaks, despite the fact that JavaScript is garbage collected. Memory bloat is a major issue for web applications, as it affects user perceived latency and may even prevent large web applications from running on devices with limited resources. In this paper we present JSWhiz, an extension to the open-source Closure JavaScript compiler. Based on experiences analyzing memory leaks in Gmail, JSWhiz detects five identified common problem patterns. JSWhiz found a total of 89 memory leaks across Google's Gmail, Docs, Spreadsheets, Books, and Closure itself. It contributed significantly in a recent effort to reduce Gmail memory footprint, which resulted in bloat reduction of 75% at the 99th percentile, and by roughly 50% at the median.
    Bubble-Up: Increasing Utilization In Modern Warehouse Scale Computers Via Sensible Co-Locations
    Jason Mars
    Lingjia Tang
    Kevin Skadron
    Mary Lou Soffa
    Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011, IEEE, New York, NY, USA
    Abstract: As much of the world’s computing continues to move into the cloud, the over-provisioning of computing resources to ensure the performance isolation of latency-sensitive tasks, such as web search, in modern datacenters is a major contributor to low machine utilization. Being unable to accurately predict performance degradation due to contention for shared resources on multicore systems has led to the heavy-handed approach of simply disallowing the co-location of high-priority, latency-sensitive tasks with other tasks. Performing this precise prediction has been a challenging and unsolved problem. In this paper, we present Bubble-Up, a characterization methodology that enables the accurate prediction of the performance degradation that results from contention for shared resources in the memory subsystem. By using a bubble to apply a tunable amount of “pressure” to the memory subsystem on processors in production datacenters, our methodology can predict the performance interference between co-located applications with an accuracy within 1% to 2% of the actual performance degradation. Using this methodology to arrive at “sensible” co-locations in Google’s production datacenters with real-world large-scale applications, we can improve the utilization of a 500-machine cluster by 50% to 90% while guaranteeing a high quality of service of latency-sensitive applications.
    RACEZ: A Lightweight and Non-Invasive Race Detection Tool for Production Applications
    Tianwei Sheng
    Neil Vachharajani
    Stephane Eranian
    ICSE, ACM (2011), pp. 401-410
    Abstract: Concurrency bugs, particularly data races, are notoriously difficult to debug and are a significant source of unreliability in multithreaded applications. Many tools to catch data races rely on program instrumentation to obtain memory instruction traces. Unfortunately, this instrumentation introduces significant runtime overhead, is extremely invasive, or has a limited domain of applicability making these tools unsuitable for many production systems. Consequently, these tools are typically used during application testing where many data races go undetected. This paper proposes RACEZ, a novel race detection mechanism which uses a sampled memory trace collected by the hardware performance monitoring unit rather than invasive instrumentation. The approach introduces only a modest overhead making it usable in production environments. We validate RACEZ using two open source server applications and the PARSEC benchmarks. Our experiments show that RACEZ catches a set of known bugs with reasonable probability while introducing only 2.8% runtime slow down on average.
    MAO - an Extensible Micro-Architectural Optimizer
    Easwaran Raman
    Martin Thuresson
    Neil Vachharajani
    Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, ACM (2011)
    Abstract: Performance matters, and so does repeatability and predictability. Today's processors' micro-architectures have become so complex as to now contain many undocumented, not understood, and even puzzling performance cliffs. Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and performance optimization efforts to perceived unwanted randomness. This paper presents MAO, an extensible micro-architectural assembly to assembly optimizer, which seeks to address this problem for x86/64 processors. In essence, MAO is a thin wrapper around a common open source assembler infrastructure. It offers basic operations, such as creation or modification of instructions, simple data-flow analysis, and advanced infrastructure, such as loop recognition, and a repeated relaxation algorithm to compute instruction addresses and lengths. This infrastructure enables a plethora of passes for pattern matching, alignment specific optimizations, peep-holes, experiments (such as random insertion of NOPs), and fast prototyping of more sophisticated optimizations. MAO can be integrated into any compiler that emits assembly code, or can be used standalone. MAO can be used to discover micro-architectural details semi-automatically. Initial performance results are encouraging.
    The Impact of Memory Subsystem Resource Sharing on Datacenter Applications
    Lingjia Tang
    Jason Mars
    Neil Vachharajani
    Mary-Lou Soffa
    ISCA, ACM (2011)
    Abstract: In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing resources. In this paper, we first present a study of the importance of thread-to-core mappings for applications in the datacenter as threads can be mapped to share or to not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over status quo thread-to-core mapping and performs within 3% of optimal.