Jump to Content
Robert Hundt

Robert Hundt

Robert Hundt received a degree in Computer Science from Technical University in Munich in 1992. Until 1999 he worked for Terrasat GmbH in Germany, a 20+ people R&D company he co-owned. He played many roles - from company lead to booth cat - while writing and optimizing software for surveying and navigation with satellite systems.

In 2000 he started working for Hewlett-Packard Company in California on bringing up the new and scalable high-level optimizer SYZYGY for the HP C/C++/FORTRAN compilers with a new inter-procedural optimizer, a new loop optimizer, and a new scalar optimizer. Before joining the compiler group, Robert was responsible for dynamic binary instrumentation for Intel Itanium processors, co-creating and designing the performance analysis tool HP Caliper.

Since beginning of 2007 Robert has been working for Google. He created various compiler and performance projects, e.g., he served as Tech Lead for compiler optimization for servers (x86), Android (ARM), and GPUs (open-source CUDA compiler), built datacenter profiling and performance analysis tools, and worked on GMail/Apps performance, from Chrome to datacenter. For many years Robert was the SW lead for Google TPU - supercomputers to accelerate machine learning inference and training, which include the open-source TensorFlow compiler XLA. Today he is working on the open-source High-Level Synthesis toolchain XLS and dabbles in Quantum Computing. He remains strongly engaged in compiler and datacenter research.

In real life, he enjoys spending time with his family, playing the piano (at which he sucks), playing Volleyball (which he used to do fairly well) and everything related to delicious high quality food (his main reason for joining Google ;-)

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Quantum Computing for Programmers
    Cambridge University Press, Cambridge CB2 8BS, United Kingdom (2022)
    Preview abstract This introduction to quantum computing from a classical programmer's perspective is meant for students and practitioners alike. Over 25 fundamental algorithms are explained with full mathematical derivations and classical code for simulation, using an open-source code base developed from the ground up in Python and C++. After presenting the basics of quantum computing, the author focuses on algorithms and the infrastructure to simulate them efficiently, beginning with quantum teleportation, superdense coding, and Deutsch-Jozsa. Coverage of advanced algorithms includes the quantum supremacy experiment, quantum Fourier transform, phase estimation, Shor's algorithm, Grover's algorithm with derivatives, quantum random walks, and the Solovay–Kitaev algorithm for gate approximation. Quantum simulation is explored with the variational quantum eigensolver, quantum approximate optimization, and the Max-Cut and Subset-Sum algorithms. The book also discusses issues around programmer productivity, quantum noise, error correction, and challenges for quantum programming languages, compilers, and tools, with a final section on compiler techniques for transpilation. View details
    In-Datacenter Performance Analysis of a Tensor Processing Unit
    Norman P. Jouppi
    Nishant Patil
    Gaurav Agrawal
    Raminder Bajwa
    Sarah Bates
    Suresh Bhatia
    Nan Boden
    Al Borchers
    Rick Boyle
    Pierre-luc Cantin
    Clifford Chao
    Chris Clark
    Jeremy Coriell
    Mike Daley
    Matt Dau
    Ben Gelb
    Tara Vazir Ghaemmaghami
    Rajendra Gottipati
    William Gulland
    Robert Hagmann
    C. Richard Ho
    Doug Hogberg
    John Hu
    Dan Hurt
    Julian Ibarz
    Aaron Jaffey
    Alek Jaworski
    Alexander Kaplan
    Harshit Khaitan
    Andy Koch
    Naveen Kumar
    Steve Lacy
    James Law
    Diemthu Le
    Chris Leary
    Zhuyuan Liu
    Kyle Lucke
    Alan Lundin
    Gordon MacKean
    Adriana Maggiore
    Maire Mahony
    Kieran Miller
    Rahul Nagarajan
    Ravi Narayanaswami
    Ray Ni
    Kathy Nix
    Thomas Norrie
    Mark Omernick
    Narayana Penukonda
    Andy Phelps
    Jonathan Ross
    ISCA (2017) (to appear)
    Preview abstract Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU. View details
    GPUCC - An Open-Source GPGPU Compiler
    Jingyue Wu
    Mark Heffernan
    Chris Leary
    Bjarke Roune
    Rob Springer
    Xuetian Weng
    Proceedings of the 2016 International Symposium on Code Generation and Optimization, ACM, New York, NY, pp. 105-116
    Preview abstract Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has not been a fully open-source compiler targeting the CUDA environment, hampering general compiler and architecture research and making deployment difficult in datacenter or supercomputer environments. In this paper, we present gpucc, an LLVM-based, fully open-source, CUDA compatible compiler for high performance computing. It performs various general and CUDA-specific optimizations to generate high performance code. The Clang-based frontend supports modern language features such as those in C++11 and C++14. Compile time is 8% faster than NVIDIA’s toolchain (nvcc) and it reduces compile time by up to 2.4x for pathological compilations (>100 secs), which tend to dominate build times in parallel build environments. Compared to nvcc, gpucc’s runtime performance is on par for several open-source benchmarks, such as Rodinia (0.8% faster), SHOC (0.5% slower), or Tensor (3.7% faster). It outperforms nvcc on internal large-scale end-to-end benchmarks by up to 51.0%, with a geometric mean of 22.9%. View details
    Whare-Map: Heterogeneity in “Homogeneous” Warehouse-Scale Computers
    Jason Mars
    Lingjia Tang
    Proceedings of the 2013 ACM/IEEE International Symposium on Computer Architecture (ISCA), IEEE (to appear)
    Preview abstract Modern “warehouse scale computers” (WSCs) continue to be embraced as homogeneous computing platforms. However, due to frequent machine replacements and upgrades, modern WSCs are in fact composed of diverse commodity microarchitectures and machine configurations. Yet, current WSCs are architected with the assumption of homogeneity, leaving a potentially significant performance opportunity unexplored. In this paper, we expose and quantify the performance impact of the “homogeneity assumption” for modern production WSCs using industry-strength large-scale web-service workloads. In addition, we argue for, and evaluate the benefits of, a heterogeneity-aware WSC using commercial web-service production workloads including Google’s websearch. We also identify key factors impacting the available performance opportunity when exploiting heterogeneity and introduce a new metric, opportunity factor, to quantify an application’s sensitivity to the heterogeneity in a given WSC. To exploit heterogeneity in “homogeneous” WSCs, we propose “Whare-Map,” the WSC Heterogeneity Aware Mapper that leverages already in-place continuous profiling subsystems found in production environments. When employing “Whare-Map”, we observe a cluster-wide performance improvement of 15% on average over heterogeneity–oblivious job placement and up to an 80% improvement forweb-service applications that are particularly sensitive to heterogeneity View details
    JSWhiz - Static Analysis for JavaScript Memory Leaks
    Proceedings of the 10th annual IEEE/ACM international symposium on Code generation and optimization, IEEE (2013)
    Preview abstract JavaScript is the dominant language for implementing dynamic web pages in browsers. Even though it is standardized, many browsers implement language and browser bindings in different and incompatible ways. As a result, a plethora of web development frameworks were developed to hide cross-browser issues and to ease development of large web applications. An unwelcome side-effect of these frameworks is that they can introduce memory leaks, despite the fact that JavaScript is garbage collected. Memory bloat is a major issue for web applications, as it affects user perceived latency and may even prevent large web applications from running on devices with limited resources. In this paper we present JSWhiz, an extension to the open-source Closure JavaScript compiler. Based on experiences analyzing memory leaks in Gmail, JSWhiz detects five identified common problem patterns. JSWhiz found a total of 89 memory leaks across Google's Gmail, Docs, Spreadsheets, Books, and Closure itself. It contributed significantly in a recent effort to reduce Gmail memory footprint, which resulted in bloat reduction of 75% at the 99th percentile, and by roughly 50% at the median. View details
    Optimizing Google's Warehouse Scale Computers: The NUMA Experience
    Lingjia Tang
    Jason Mars
    Robert Hagmann
    The 19th IEEE International Symposium on High Performance Computer Architecture (2013)
    RACEZ: A Lightweight and Non-Invasive Race Detection Tool for Production Applications
    Tianwei Sheng
    Neil Vachharajani
    Stephane Eranian
    ICSE, ACM (2011), pp. 401-410
    Preview abstract Concurrency bugs, particularly data races, are notoriously difficult to debug and are a significant source of unreliability in multithreaded applications. Many tools to catch data races rely on program instrumentation to obtain memory instruction traces. Unfortunately, this instrumentation introduces significant runtime overhead, is extremely invasive, or has a limited domain of applicability making these tools unsuitable for many production systems. Consequently, these tools are typically used during application testing where many data races go undetected. This paper proposes RACEZ, a novel race detection mechanism which uses a sampled memory trace collected by the hardware performance monitoring unit rather than invasive instrumentation. The approach introduces only a modest overhead making it usable in production environments. We validate RACEZ using two open source server applications and the PARSEC benchmarks. Our experiments show that RACEZ catches a set of known bugs with reasonable probability while introducing only 2.8% runtime slow down on average. View details
    The Impact of Memory Subsystem Resource Sharing on Datacenter Applications
    Lingjia Tang
    Jason Mars
    Neil Vachharajani
    Mary-Lou Soffa
    ISCA, ACM (2011)
    Preview abstract In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing resources. In this paper, we first present a study of the importance of thread-tocore mappings for applications in the datacenter as threads can be mapped to share or to not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over status quo thread-to-core mapping and performs within 3% of optimal. View details
    Bubble-Up: Increasing Utilization In Modern Warehouse Scale Computers Via Sensible Co-Locations
    Jason Mars
    Linjia Tang
    Kevin Skadron
    Mary Lou Souffa
    Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011, IEEE, New York, NY, USA
    Preview abstract As much of the world’s computing continues to move into the cloud, the over-provisioning of computing resources to ensure the performance isolation of latency-sensitive tasks, such as web search, in modern datacenters is a major contributor to low machine utilization. Being unable to accurately predict performance degradation due to contention for shared resources on multicore systems has led to the heavy handed approach of simply disallowing the co-location of high-priority, latency-sensitive tasks with other tasks. Performing this precise prediction has been a challenging and unsolved problem. In this paper, we present Bubble-Up, a characterization methodology that enables the accurate prediction of the performance degradation that results from contention for shared resources in the memory subsystem. By using a bubble to apply a tunable amount of “pressure” to the memory subsystem on processors in production datacenters, our methodology can predict the performance interference between co-locate applications with an accuracy within 1% to 2% of the actual performance degradation. Using this methodology to arrive at “sensible” co-locations in Google’s production datacenters with real-world large-scale applications, we can improve the utilization of a 500-machine cluster by 50% to 90% while guaranteeing a high quality of service of latency-sensitive applications. View details
    Preview abstract In this experience report we encode a well specified, compact benchmark in four programming languages, namely C++, Java, Go, and Scala. The implementations each use the languages’ idiomatic container classes, looping constructs, and memory/object allocation schemes. It does not attempt to exploit specific language and runtime features to achieve maximum performance. This approach allows an almost fair comparison of language features, code complexity, compilers and compile time, binary sizes, runtimes, and memory footprint. While the benchmark itself is simple and compact, it employs many language features, in particular, higher-level data structures (lists, maps, lists and arrays of sets and lists), a few algorithms (union/find, dfs / deep recursion, and loop recognition based on Tarjan), iterations over collection types, some object oriented features, and interesting memory allocation patterns. We do not explore any aspects of multi-threading, or higher level type mechanisms, which vary greatly between the languages. The benchmark points to very large differences in all examined dimensions of the language implementations. After publication of the benchmark internally at Google, several engineers produced highly optimized versions of the benchmark. While this whole effort is an anectodal comparison only, the benchmark and subsequent tuning effort might be indicatie of typical performance pain points in the respective languages. View details
    MAO - an Extensible Micro-Architectural Optimizer
    Easwaran Raman
    Martin Thuresson
    Neil Vachharajani
    Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, ACM (2011)
    Preview abstract Performance matters, and so does repeatability and predictability. Today's processors' micro-architectures have become so complex as to now contain many undocumented, not understood, and even puzzling performance cliffs. Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and performance optimization efforts to perceived unwanted randomness. This paper presents MAO, an extensible micro-architectural assembly to assembly optimizer, which seeks to address this problem for x86/64 processors. In essence, MAO is a thin wrapper around a common open source assembler infrastructure. It offers basic operations, such as creation or modification of instructions, simple data-flow analysis, and advanced infra-structure, such as loop recognition, and a repeated relaxation algorithm to compute instruction addresses and lengths. This infrastructure enables a plethora of passes for pattern matching, alignment specific optimizations, peep-holes, experiments (such as random insertion of NOPs), and fast prototyping of more sophisticated optimizations. MAO can be integrated into any compiler that emits assembly code, or can be used standalone. MAO can be used to discover micro-architectural details semi-automatically. Initial performance results are encouraging. View details
    Heterogeneity in “Homogeneous” Warehouse-Scale Computers: A Performance Opportunity
    Jason Mars
    Lingjia Tang
    IEEE Computer Architecture Letters (CAL), vol. Vol. 10 No. 2 (2011), pp. 29-32
    Preview abstract The class of modern datacenters recently coined as “warehouse scale computers” (WSCs) has traditionally been embraced as homogeneous computing platforms. However, due to frequent machine replacements and upgrades, modern WSCs are in fact composed of diverse commodity microarchitectures and machine configurations. Yet, current WSCs are designed with an assumption of homogeneity, leaving a potentially significant performance opportunity unexplored. In this paper, we investigate the key factors impacting the available heterogeneity in modern WSCs, and the benefit of exploiting this heterogeneity to maximize overall performance. We also introduce a new metric, opportunity factor, which can be used to quantify an application’s sensitivity to the heterogeneity in a given WSC. For applications that are sensitive to heterogeneity, we observe a performance improvement of up to 70% when employing our approach. In a WSC composed of state-of-the-art machines, we can improve the overall performance of the entire datacenter by 16% over the status quo. View details
    Preview abstract Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses. Furthermore, GWP introduces novel applications of its profiles, such as application- platform affinity measurements and identification of platform-specific, microarchitectural peculiarities. View details
    Lightweight Feedback-Directed Cross-Module Optimization
    Raksit Ashok
    Proceedings of International Symposium on Code Generation and Optimization (CGO), IEEE (2010)
    Preview abstract Cross-module inter-procedural compiler optimization (IPO) and Feedback-Directed Optimization (FDO) are two important compiler techniques delivering solid performance gains. The combination of IPO and FDO delivers peak performance, but also multiplies both techniques' usability problems. In this paper, we present LIPO, a novel static IPO framework, which integrates IPO and FDO. Compared to existing approaches, LIPO no longer requires writing of the compiler's intermediate representation, eliminates the link-time inter-procedural optimization phase entirely, and minimizes code re-generation overhead, thus improving scalability by an order of magnitude. Compared to an FDO baseline, and without further specific tuning, LIPO improves performance of SPEC2006 INT by 2.5%, and of SPEC2000 INT by 4.4%, with up to 23% for one benchmarks. We confirm our scalability results on a set of large industrial applications, demonstrating 2.9% performance improvements on average. Compile time overhead for full builds is less than 30%, incremental builds take a few seconds on average, and storage requirements increase by only 24%, all compared to the FDO baseline. View details
    Taming Hardware Event Samples for FDO Compilation
    Dehao Chen
    Neil Vachharajani
    Shih-wei Liao
    Vinodha Ramasamy
    Paul Yuan
    Wenguang Chen
    Weiming Zheng
    Proceedings of International Symposium on Code Generation and Optimization (CGO) (2010)
    Preview abstract Feedback-directed optimization (FDO) is effective in improving application runtime performance, but has not been widely adopted due to the tedious dual-compilation model, the difficulties in generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated edge profiles overcomes these drawbacks. Yet, hardware event samples are typically not precise at the instruction or basic-block granularity. These inaccuracies lead to missed performance when compared to instrumentation-based FDO. In this paper, we use multiple hardware event profiles and supervised learning techniques to generate heuristics for improved precision of basic-block-level sample profiles, and to further improve the smoothing algorithms used to construct edge profiles. We demonstrate that sampling-based FDO can achieve an average of 78% of the performance gains obtained using instrumentation-based exact edge profiles for SPEC2000 benchmarks, matching or beating instrumentation-based FDO in many cases. The overhead of collection is only 0.74% on average, while compiler based instrumentation incurs 6.8%–53.5% overhead (and 10x overhead on an industrial web search application), and dynamic instrumentation incurs 28.6%–1639.2% overhead. View details
    Contention Aware Execution: Online Contention Detection and Response
    Jason Mars
    Neil Vachharajani
    Mary Lou Souffa
    Proceedings of International Symposium on Code Generation and Optimization (CGO), IEEE (2010)
    Scenario Based Optimization: A Framework for Statically Enabling Online Optimizations
    Jason Mars
    Proceedings of the 2009 Symposium on Code Generation and Optimization (CGO), IEEE Computer Society, 10662 Los Vaqueros Circle, P.O. Box 3014, Los Alamitos, CA, 90720, pp. 169-170
    Feedback-Directed Optimizations in GCC with Estimated Edge Profiles from Hardware Event Sampling
    Vinodha Ramasamy
    Paul Yuan
    Dehao Chen
    Proceedings of GCC Summit 2008, pp. 87-102
    Preview abstract Traditional feedback-directed optimization (FDO) in GCC uses static instrumentation to collect edge and value profiles. This method has shown good application performance gains, but is not commonly used in practice due to the high runtime overhead of profile collection, the tedious dual-compile usage model, and difficulties in generating representative training data sets. In this paper, we show that edge frequency estimates can be successfully constructed with heuristics using profile data collected by sampling of hardware events, incurring low runtime overhead (e.g., less then 2%), and requiring no instrumentation, yet achieving competitive performance gains. We describe the motivation, design, and implementation of FDO using sample profiles in GCC and also present our initial experimental results with SPEC2000int C benchmarks that show approximately 70% to 90% of the performance gains obtained using traditional FDO with exact edge profiles. View details
    No Results Found