Parthasarathy Ranganathan


Parthasarathy (Partha) Ranganathan is currently at Google, designing their next-generation systems. Before this, he was an HP Fellow and Chief Technologist at Hewlett Packard Labs, where he led research on systems and datacenters. Dr. Ranganathan's research interests are in systems architecture and manageability, energy efficiency, and systems modeling and evaluation. He has done extensive work in these areas, including key contributions around energy-aware user interfaces, heterogeneous multi-core processors, power capping and power-aware server designs, federated enterprise power management, energy modeling and benchmarking, disaggregated blade server architectures, and, most recently, storage hierarchy and systems redesign for non-volatile memory. He was also one of the primary developers of the publicly distributed Rice Simulator for ILP Multiprocessors (RSIM).

Dr. Ranganathan's work has had broad impact on both academia and industry, including several commercial products such as Power Capping and HP Moonshot servers. He holds more than 50 patents (with another 45 pending) and has published extensively, including several award-winning papers. He also teaches regularly (most recently at Stanford) and has contributed to several popular computer architecture textbooks. Dr. Ranganathan and his work have been featured on numerous occasions in the press, including the New York Times, Wall Street Journal, Business Week, San Francisco Chronicle, Times of India, Slashdot, YouTube, and Tom's Hardware Guide.

Dr. Ranganathan has been named one of the world's top young innovators by MIT Technology Review and one of the top 15 enterprise technology rock stars by Business Insider, and has been recognized with several other awards, including the ACM SIGARCH Maurice Wilkes Award and Rice University's Outstanding Young Engineering Alumni award. He received his B.Tech degree from the Indian Institute of Technology, Madras, and his M.S. and Ph.D. from Rice University, Houston. He is also an ACM and IEEE Fellow.
Authored Publications
    CDPU: Co-designing Compression and Decompression Processing Units for Hyperscale Systems
    Ani Udipi
    JunSun Choi
    Joonho Whangbo
    Jerry Zhao
    Edwin Lim
    Vrishab Madduri
    Yakun Sophia Shao
    Borivoje Nikolic
    Krste Asanovic
    Proceedings of the 50th Annual International Symposium on Computer Architecture, Association for Computing Machinery, New York, NY, USA (2023)
    General-purpose lossless data compression and decompression ("(de)compression") are used widely in hyperscale systems and are key "datacenter taxes". However, designing optimal hardware compression and decompression processing units ("CDPUs") is challenging due to the variety of algorithms deployed, input data characteristics, and evolving costs of CPU cycles, network bandwidth, and memory/storage capacities. To navigate this vast design space, we present the first large-scale data-driven analysis of (de)compression usage at a major cloud provider by profiling Google's datacenter fleet. We find that (de)compression consumes 2.9% of fleet CPU cycles and 10-50% of cycles in key services. Demand is also artificially limited: 95% of bytes compressed in the fleet use less capable algorithms to reduce compute, motivating a CDPU that changes cost vs. size tradeoffs. Prior work has improved the microarchitectural state of the art for CDPUs supporting various algorithms in fixed contexts. However, we find that higher-level design parameters such as CDPU placement, hash table sizing, and history window sizes have as significant an impact on the viability of CDPU integration, but are not well studied. Thus, we present the first end-to-end design/evaluation framework for CDPUs, including: (1) an open-source RTL-based CDPU generator that supports many run-time and compile-time parameters; (2) integration into an open-source RISC-V SoC for rapid performance and silicon area evaluation across CDPU placements and parameters; and (3) an open-source (de)compression benchmark, HyperCompressBench, that is representative of (de)compression usage in Google's fleet. Using our framework, we perform an extensive design space exploration running HyperCompressBench. Our exploration spans a 46× range in CDPU speedup and a 3× range in silicon area (for a single pipeline), and evaluates a variety of CDPU integration techniques to optimize CDPU designs for hyperscale contexts. Our final hyperscale-optimized CDPU instances are 10× to 16× faster than a single Xeon core, while consuming a small fraction (as little as 2.4% to 4.7%) of its area.
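The compute-vs-ratio tradeoff the abstract describes can be seen even with stock software codecs. The sketch below is illustrative only, using Python's `zlib` levels (not anything CDPU-specific) to show why services default to cheaper, less capable settings:

```python
import zlib

def compress_stats(data: bytes, level: int) -> int:
    """Return the compressed size of `data` at one zlib effort level."""
    return len(zlib.compress(data, level))

# Repetitive sample input; real fleet data is far more varied.
sample = b"timestamp=1699999999 service=search status=OK " * 200

orig = len(sample)
fast_size = compress_stats(sample, 1)  # cheap, weaker compression
best_size = compress_stats(sample, 9)  # expensive, stronger compression

# The stronger level buys a better ratio at higher CPU cost; a CDPU's goal
# is to make the stronger setting cheap enough to use by default.
assert best_size <= fast_size < orig
```

A hardware unit shifts the cost side of this curve, which is why the paper argues fleet demand for strong compression is artificially suppressed today.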
    Profiling Hyperscale Big Data Processing
    Aasheesh Kolli
    Abraham Gonzalez
    Samira Khan
    Sihang Liu
    Krste Asanovic
    ISCA (2023)
    Computing demand continues to grow exponentially, largely driven by "big data" processing on hyperscale data stores. At the same time, the slowdown in Moore's law is leading the industry to embrace custom computing in large-scale systems. Taken together, these trends motivate the need to characterize live production traffic on these large data processing platforms and understand the opportunity of acceleration at scale. This paper addresses this key need. We characterize three important production distributed database and data analytics platforms at Google to identify key hardware acceleration opportunities, and we perform a comprehensive limits study to understand the trade-offs among various hardware acceleration strategies. We observe that hyperscale data processing platforms spend significant time on distributed storage and other remote work across distributed workers; therefore, optimizing storage and remote work in addition to compute acceleration is critical for these platforms. We present a detailed breakdown of the compute-intensive functions in these platforms and identify dominant key data operations related to datacenter and systems taxes. We observe that no single accelerator can provide a significant benefit, but that collectively a sea of accelerators can accelerate many of these smaller platform-specific functions. We demonstrate the potential gains of the sea-of-accelerators proposal in a limits study and analytical model. We perform a comprehensive study to understand the trade-offs between accelerator location (on-chip/off-chip) and invocation model (synchronous/asynchronous). We propose and evaluate a chained accelerator execution model in which identified compute-intensive functions are accelerated and pipelined to avoid invocation from the core, achieving a 3x improvement over the baseline system while nearly matching the performance of an ideal fully asynchronous execution model.
    CRISP: Critical Slice Prefetching
    Heiner Litz
    Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2022), pp. 300-313
    The high access latency of DRAM continues to be a performance challenge for contemporary microprocessor systems. Prefetching is a well-established technique to address this problem; however, existing implemented designs fail to provide any performance benefit in the presence of irregular memory access patterns, and the hardware complexity of prior techniques that can predict irregular memory accesses, such as runahead execution, has proven untenable for implementation in real hardware. We propose a lightweight mechanism to hide the high latency of irregular memory access patterns by leveraging criticality-based scheduling. In particular, our technique executes delinquent loads and their load slices as early as possible, hiding a significant fraction of their latency. Furthermore, we observe that the latency induced by branch mispredictions and other high-latency instructions can be hidden with a similar approach. Our proposal requires only minimal hardware modifications, performing memory access classification, load and branch slice extraction, and priority analysis exclusively in software. As a result, our technique is feasible to implement, introducing only a simple new instruction prefix while requiring minimal modifications of the instruction scheduler. Our technique increases the IPC of memory-latency-bound applications by up to 38%, and by 8.4% on average.
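CRISP's key software step is identifying a delinquent load's backward slice (the instructions that compute its address) so that slice can be prioritized and executed early. A minimal sketch over a toy register-dataflow trace; the instruction names and trace format are invented for illustration:

```python
# Toy dataflow trace: (instr_name, dest_reg, source_regs).
trace = [
    ("i0", "r1", []),      # r1 = base pointer
    ("i1", "r2", ["r1"]),  # r2 = r1 + 8   (address generation)
    ("i2", "r3", []),      # unrelated work
    ("i3", "r4", ["r2"]),  # r4 = load [r2]   <- delinquent load
]

def backward_slice(trace, target):
    """Transitively collect the instructions the target depends on."""
    srcs_of = {name: srcs for name, _dest, srcs in trace}
    producer = {dest: name for name, dest, _srcs in trace}
    result, work = set(), [target]
    while work:
        name = work.pop()
        if name in result:
            continue
        result.add(name)
        for reg in srcs_of[name]:
            if reg in producer:
                work.append(producer[reg])
    return result

# The slice to hoist is the load plus its address computation, not "i2".
critical = backward_slice(trace, "i3")
```

In the real system this analysis runs offline on profiles, and the resulting slice instructions are tagged with the new priority prefix for the hardware scheduler.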
    We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation. We refer to a core that develops such behavior as "mercurial". Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem, one that will require collaboration between hardware designers, processor vendors, and systems software architects. This paper is a call to action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolation mechanisms to methods for tolerating the silent data corruption they cause.
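One of the software approaches the paper speculates about is redundant execution: run a computation more than once and flag disagreement. A toy sketch, assuming a pure function; a real detector must also handle nondeterminism, cost, and placement across cores (the fault below is injected purely for illustration):

```python
def checked(fn, *args, runs=2):
    """Run fn redundantly and flag disagreement as suspected corruption."""
    results = [fn(*args) for _ in range(runs)]
    return results[0], all(r == results[0] for r in results)

# A deterministic computation agrees with itself.
value, ok = checked(lambda x: x * x, 12)

# A "mercurial" computation that intermittently mis-computes is caught
# by the redundant run.
faults = iter([0, 1])
def flaky_square(x):
    return x * x + next(faults)

_, flagged_ok = checked(flaky_square, 12)
```

The catch, as the paper notes, is that redundancy is expensive at fleet scale, which is why detection and isolation of the rare offending cores matters.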
    Warehouse-Scale Video Acceleration: Co-design and Deployment in the Wild
    Danner Stodolsky
    Jeff Calow
    Jeremy Dorfman
    Clint Smullen
    Aki Kuusela
    Aaron James Laursen
    Alex Ramirez
    Amir Salek
    Anna Cheung
    Ben Gelb
    Brian Fosco
    Cho Mon Kyaw
    Dake He
    David Alexander Munday
    David Wickeraad
    Devin Persaud
    Don Stark
    Elisha Indupalli
    Fong Lou
    Hon Kwan Wu
    In Suk Chong
    Indira Jayaram
    Jia Feng
    JP Maaninen
    Maire Mahony
    Mark Steven Wachsler
    Mercedes Tan
    Niranjani Dasharathi
    Poonacha Kongetira
    Prakash Chauhan
    Raghuraman Balasubramanian
    Ramon Macias
    Richard Ho
    Rob Springer
    Roy W Huffman
    Sandeep Bhatia
    Sathish K Sekar
    Srikanth Muroor
    Ville-Mikko Rautio
    Yolanda Ripley
    Yoshiaki Hase
    Yuan Li
    Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, USA (2021), pp. 600-615
    Video sharing (e.g., YouTube, Vimeo, Facebook, TikTok) accounts for the majority of internet traffic, and video processing is also foundational to several other key workloads (video conferencing, virtual/augmented reality, cloud gaming, video in Internet-of-Things devices, etc.). The importance of these workloads motivates larger video processing infrastructures and, with the slowing of Moore's law, specialized hardware accelerators to deliver more computing at higher efficiencies. This paper describes the design and deployment, at scale, of a new accelerator targeted at warehouse-scale video transcoding. We present our hardware design, including a new accelerator building block, the video coding unit (VCU), and discuss key design trade-offs for balanced systems at data center scale and for co-designing accelerators with large-scale distributed software systems. We evaluate these accelerators "in the wild" serving live data center jobs, demonstrating 20-33x improved efficiency over our prior well-tuned non-accelerated baseline. Our design also enables effective adaptation to changing bottlenecks, improved failure management, and new workload capabilities not otherwise possible with prior systems. To the best of our knowledge, this is the first work to discuss video acceleration at scale in large warehouse-scale environments.
    A Hardware Accelerator for Protocol Buffers
    Chris Leary
    Jerry Zhao
    Dinesh Parimi
    Borivoje Nikolic
    Krste Asanovic
    Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-54), Association for Computing Machinery, New York, NY, USA (2021), pp. 462-478
    Serialization frameworks are a fundamental component of scale-out systems, but they introduce significant compute overheads. However, they are amenable to acceleration with specialized hardware. To understand the trade-offs involved in architecting such an accelerator, we present the first in-depth study of serialization framework usage at scale by profiling Protocol Buffers ("protobuf") usage across Google's datacenter fleet. We use this data to build HyperProtoBench, an open-source benchmark representative of key serialization-framework user services at scale. In doing so, we identify key insights that challenge prevailing assumptions about serialization framework usage. We use these insights to develop a novel hardware accelerator for protobufs, implemented in RTL and integrated into a RISC-V SoC. Applications can easily harness the accelerator, as it integrates with a modified version of the open-source protobuf library and is wire-compatible with standard protobufs. We have fully open-sourced our RTL, which, to the best of our knowledge, is the only such implementation currently available to the community. We also present a first-of-its-kind, end-to-end evaluation of our entire RTL-based system running hyperscale-derived benchmarks and microbenchmarks. We boot Linux on the system using FireSim to run these benchmarks and implement the design in a commercial 22nm FinFET process to obtain area and frequency metrics. We demonstrate an average 6.2x to 11.2x performance improvement vs. our baseline RISC-V SoC with BOOM OoO cores and, despite the RISC-V SoC's weaker uncore/supporting components, an average 3.8x improvement vs. a Xeon-based server.
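Much of the CPU time such an accelerator targets is low-level wire-format work. One standard piece of the protobuf wire format (this is the documented encoding, not the accelerator's internals) is the base-128 varint, sketched here:

```python
def encode_varint(n: int) -> bytes:
    """Protobuf base-128 varint: 7 data bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return result

# The classic wire-format example: 300 encodes as 0xAC 0x02.
assert encode_varint(300) == b"\xac\x02"
assert decode_varint(encode_varint(1_000_000)) == 1_000_000
```

The bit-twiddling, byte-at-a-time nature of this loop is exactly the kind of work that is cheap in dedicated hardware but costly when repeated billions of times on general-purpose cores.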
    Memory allocation represents a significant compute cost at the warehouse scale, and its optimization can yield considerable cost savings. One classical approach is to increase the efficiency of an allocator to minimize the cycles spent in the allocator code. However, memory allocation decisions also impact overall application performance via data placement, offering opportunities to improve fleetwide productivity by completing more units of application work using fewer hardware resources. Here, we focus on hugepage coverage. We present TEMERAIRE, a hugepage-aware enhancement of TCMALLOC to reduce CPU overheads in the application's code. We discuss the design and implementation of TEMERAIRE, including strategies for hugepage-aware memory layouts to maximize hugepage coverage and minimize fragmentation overheads. We present application studies for 8 applications, improving requests-per-second (RPS) by 7.7% and reducing RAM usage by 2.4%. We present the results of a 1% experiment at fleet scale as well as the longitudinal rollout in Google's warehouse-scale computers. This yielded 6% fewer TLB miss stalls and a 26% reduction in memory wasted due to fragmentation. We conclude with a discussion of additional techniques for improving the allocator development process and potential optimization strategies for future memory allocators.
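The core idea, maximizing how much live data sits on hugepages while limiting fragmentation, can be illustrated with a toy first-fit packer. The sizes and the policy are invented for illustration; TCMALLOC's actual hugepage heuristics are far richer:

```python
HUGEPAGE = 2 * 1024 * 1024  # 2 MiB

def place(allocs):
    """First-fit pack allocation sizes into hugepage-backed regions."""
    free = []  # bytes remaining in each hugepage
    for size in allocs:
        for i, f in enumerate(free):
            if f >= size:
                free[i] -= size
                break
        else:
            free.append(HUGEPAGE - size)  # back a fresh hugepage
    return free

requests = [512 * 1024] * 3 + [1024 * 1024] + [256 * 1024] * 2
free = place(requests)
# Fraction of hugepage-backed memory holding live data ("coverage").
coverage = sum(requests) / (len(free) * HUGEPAGE)
```

Dense packing means fewer hugepages and fewer TLB entries cover the same live data; scattering the same requests across more hugepages would lower coverage and waste memory to fragmentation.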
    A Hierarchical Neural Model of Data Prefetching
    Zhan Shi
    Akanksha Jain
    Calvin Lin
    Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2021)
    This paper presents Voyager, a novel neural network for data prefetching. Unlike previous neural models for prefetching, which are limited to learning delta correlations, our model can also learn address correlations, which are important for prefetching irregular sequences of memory accesses. The key to our solution is its hierarchical structure, which separates addresses into pages and offsets and introduces a mechanism for learning important relations among pages and offsets. Voyager provides significant prediction benefits over current data prefetchers. For a set of irregular programs from the SPEC 2006 and GAP benchmark suites, Voyager sees an average IPC improvement of 41.6% over a system with no prefetcher, compared with 21.7% and 28.2%, respectively, for idealized Domino and ISB prefetchers. We also find that for two commercial workloads for which current data prefetchers see very little benefit, Voyager dramatically improves both accuracy and coverage. At present, slow training and prediction preclude neural models from being practically used in hardware, but Voyager's overheads are significantly lower, in every dimension, than those of previous neural models. For example, computation cost is reduced by 15-20×, and storage overhead is reduced by 110-200×. Thus, Voyager represents a significant step towards a practical neural prefetcher.
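Voyager's hierarchical input split is simple to state: decompose each address into a page number and a page offset, so the model learns over two much smaller vocabularies instead of one enormous address space. A sketch assuming 4 KiB pages (the addresses below are arbitrary examples):

```python
PAGE_BITS = 12  # 4 KiB pages

def split(addr):
    """Separate an address into (page number, page offset) features."""
    return addr >> PAGE_BITS, addr & ((1 << PAGE_BITS) - 1)

trace = [0x7f001230, 0x7f001238, 0x7f002230]
features = [split(a) for a in trace]
# The first two accesses share a page and differ only in offset, which a
# page/offset model can exploit; a flat address vocabulary cannot.
```

The two feature streams feed separate embedding tables, which is where the 110-200× storage reduction over flat-address neural models comes from.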
    Program execution speed critically depends on increasing cache hits, as cache hits are orders of magnitude faster than misses. To increase cache hits, we focus on the problem of cache replacement: choosing which cache line to evict upon inserting a new line. This is challenging because it requires planning far ahead, and currently there is no known practical solution. As a result, current replacement policies typically resort to heuristics designed for specific common access patterns, which fail on more diverse and complex access patterns. In contrast, we propose an imitation learning approach to automatically learn cache access patterns by leveraging Belady's, an oracle policy that computes the optimal eviction decision given the future cache accesses. While directly applying Belady's is infeasible since the future is unknown, we train a policy conditioned only on past accesses that accurately approximates Belady's even on diverse and complex access patterns, and call this approach PARROT. When evaluated on 13 of the most memory-intensive SPEC applications, PARROT increases cache hit rates by 20% over the current state of the art. In addition, on a large-scale web search benchmark, PARROT increases cache hit rates by 61% over a conventional LRU policy. We release a Gym environment to facilitate research in this area, as data is plentiful and further advancements can have significant real-world impact.
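The oracle PARROT imitates is Belady's policy: on a miss, evict the resident line whose next use lies farthest in the future (or never occurs). A direct, inefficient-but-clear sketch on a toy trace:

```python
def belady_misses(trace, capacity):
    """Count misses under Belady's optimal replacement policy."""
    cache, misses = set(), 0
    for i, line in enumerate(trace):
        if line in cache:
            continue  # hit
        misses += 1
        if len(cache) < capacity:
            cache.add(line)
            continue
        def next_use(resident):
            # Index of the resident line's next access, inf if never reused.
            for j in range(i + 1, len(trace)):
                if trace[j] == resident:
                    return j
            return float("inf")
        # Evict the line reused farthest in the future.
        cache.remove(max(cache, key=next_use))
        cache.add(line)
    return misses

trace = ["a", "b", "c", "a", "b", "d", "a", "b"]
# With capacity 2, Belady takes 6 misses on this trace; LRU takes 8.
```

Belady's is unrealizable online because it peeks at the future, which is exactly why PARROT trains a past-only policy to approximate its decisions.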
    Classifying Memory Access Patterns for Prefetching
    Heiner Litz
    Christos Kozyrakis
    Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery (2020), pp. 513-526
    Prefetching is a well-studied technique for addressing the memory access stall time of contemporary microprocessors. However, despite a large body of related work, the memory access behavior of applications is not well understood, and it remains difficult to predict whether a particular application will benefit from a given prefetcher technique. In this work we propose a novel methodology to classify the memory access patterns of applications, enabling well-informed reasoning about the applicability of a certain prefetcher. Our approach leverages instruction dataflow information to uncover a wide range of access patterns, including arbitrary combinations of offsets and indirection. These combinations, or prefetch kernels, represent reuse, strides, reference locality, and complex address generation. By determining the complexity and frequency of these access patterns, we enable reasoning about prefetcher timeliness and criticality, exposing the limitations of existing prefetchers today. Moreover, using these kernels, we are able to compute the next address for the majority of top-missing instructions, and we propose a software prefetch injection methodology that is able to outperform state-of-the-art hardware prefetchers.
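A heavily simplified flavor of this classification: look at the deltas of one instruction's address stream and label the pattern. The paper's prefetch kernels are much richer (capturing indirection and combined offsets), so treat this as a toy, not the paper's taxonomy:

```python
def classify(addrs):
    """Label a single load's address stream from its deltas."""
    deltas = [b - a for a, b in zip(addrs, addrs[1:])]
    if all(d == 0 for d in deltas):
        return "reuse"             # same address touched repeatedly
    if len(set(deltas)) == 1:
        return "constant stride"   # easy prey for a stride prefetcher
    return "irregular"             # needs dataflow-aware analysis

assert classify([0x100, 0x100, 0x100]) == "reuse"
assert classify([0x100, 0x140, 0x180]) == "constant stride"
assert classify([0x100, 0x9a0, 0x210]) == "irregular"
```

The "irregular" bucket is where delta-based hardware prefetchers give up, and where the paper's dataflow-derived kernels (e.g., pointer-chase or index-array indirection) still recover the next address.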
    Autonomous Warehouse-Scale Computers
    Proceedings of the 57th Annual Design Automation Conference 2020, Association for Computing Machinery, New York, NY, USA (2020)
    Modern Warehouse-Scale Computers (WSCs), composed of many generations of servers and a myriad of domain specific accelerators, are becoming increasingly heterogeneous. Meanwhile, WSC workloads are also becoming incredibly diverse with different communication patterns, latency requirements, and service level objectives (SLOs). Insufficient understanding of the interactions between workload characteristics and the underlying machine architecture leads to resource over-provisioning, thereby significantly impacting the utilization of WSCs. We present Autonomous Warehouse-Scale Computers, a new WSC design that leverages machine learning techniques and automation to improve job scheduling, resource management, and hardware-software co-optimization to address the increasing heterogeneity in WSC hardware and workloads. Our new design introduces two new layers in the WSC stack, namely: (a) a Software-Defined Server (SDS) Abstraction Layer which redefines the hardware-software boundary and provides greater control of the hardware to higher layers of the software stack through stable abstractions; and (b) a WSC Efficiency Layer which regularly monitors the resource usage of workloads on different hardware types, autonomously quantifies the performance sensitivity of workloads to key system configurations, and continuously improves scheduling decisions and hardware resource QoS policies to maximize cluster level performance. Our new WSC design has been successfully deployed across all WSCs at Google for several years now. The new WSC design improves throughput of workloads (by 7-10%, on average), increases utilization of hardware resources (up to 2x), and reduces performance variance for critical workloads (up to 25%).
    Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale
    Shaohong Li
    Sreekumar Kodakara
    14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1241-1255
    As the demand for data center capacity continues to grow, hyperscale providers have used power oversubscription to increase efficiency and reduce costs. Power oversubscription requires power capping systems to smooth out the spikes that risk overloading power equipment by throttling workloads. Modern compute clusters run latency-sensitive serving and throughput-oriented batch workloads on the same servers, provisioning resources to ensure low latency for the former while using the latter to achieve high server utilization. When power capping occurs, it is desirable to maintain low latency for serving tasks and throttle the throughput of batch tasks. To achieve this, we seek a system that can gracefully throttle batch workloads and has task-level quality-of-service (QoS) differentiation. In this paper we present Thunderbolt, a hardware-agnostic power capping system that ensures safe power oversubscription while minimizing impact on both long-running throughput-oriented tasks and latency-sensitive tasks. It uses a two-threshold, randomized unthrottling/multiplicative decrease control policy to ensure power safety with minimized performance degradation. It leverages the Linux kernel's CPU bandwidth control feature to achieve task-level QoS-aware throttling. It is robust even in the face of power telemetry unavailability. Evaluation results at the node and cluster levels demonstrate the system's responsiveness, effectiveness for reducing power, capability of QoS differentiation, and minimal impact on latency and task health. We have deployed this system at scale, in multiple production clusters. As a result, we enabled power oversubscription gains of 9%-25%, where none was previously possible.
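The control policy named in the abstract has two thresholds: when measured power is dangerously high, the batch CPU cap is cut multiplicatively; when power is safely low, the cap is raised gently and only with some probability, so a fleet of nodes does not unthrottle in lockstep. A toy sketch; the threshold values, step sizes, and probability below are invented for illustration:

```python
import random

def step(cap, power, high=0.95, low=0.85, cut=0.5, p_up=0.25, rng=random):
    """One control step over the batch CPU bandwidth cap (fraction 0..1)."""
    if power > high:                   # overload risk: multiplicative decrease
        return max(cap * cut, 0.05)
    if power < low and rng.random() < p_up:
        return min(cap + 0.05, 1.0)    # safely below: randomized unthrottle
    return cap                         # in the hysteresis band: hold steady

cap = step(1.0, power=0.99)  # emergency: cap halves to 0.5
```

In the deployed system the cap is enforced per task class via the Linux CPU bandwidth controller, which is what gives batch tasks a throttle knob that leaves latency-sensitive tasks untouched.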
    Data Center Power Oversubscription with a Medium Voltage Power Plane and Priority-Aware Capping
    David Landhuis
    Shaohong Li
    Darren De Ronde
    Thomas Blooming
    Anand Ramesh
    James Kennedy
    Christopher Malone
    Jimmy Clidaras
    Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, USA (2020), 497–511
    As major web and cloud service providers continue to accelerate the demand for new data center capacity worldwide, the importance of power oversubscription as a lever to reduce provisioning costs has never been greater. Building on insights from Google-scale deployments, we design and deploy a new architecture across hardware and software to improve power oversubscription significantly. Our design includes (1) a new medium voltage power plane to enable larger power sharing domains (across tens of MW of equipment) and (2) a scalable, fast, and robust power capping service coordinating multiple priorities of workload on every node. Over several years of production deployment, our co-design has enabled power oversubscription of 25% or higher, saving hundreds of millions of dollars of data center costs, while preserving the desired availability and performance of all workloads.
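The priority-aware half of the co-design can be sketched as shedding power from the lowest-priority workloads first until the sharing domain fits its budget. The numbers and the all-or-nothing throttle below are simplifications; the production service throttles gradually and coordinates across many nodes:

```python
def shed(tasks, budget_watts):
    """Throttle lowest-priority tasks first until under the power budget."""
    total = sum(watts for _name, _prio, watts in tasks)
    victims = []
    # tasks: (name, priority, watts); lower priority value = less important.
    for name, _prio, watts in sorted(tasks, key=lambda t: t[1]):
        if total <= budget_watts:
            break
        total -= watts            # all-or-nothing throttle, for simplicity
        victims.append(name)
    return victims, total

tasks = [("serving", 2, 40), ("batch-a", 0, 30), ("batch-b", 1, 30)]
victims, power = shed(tasks, budget_watts=75)
```

Larger sharing domains make this ordering more effective: with tens of MW pooled behind one medium-voltage plane, there is almost always enough low-priority load available to shed before any serving task is touched.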
    As the performance of computer systems stagnates due to the end of Moore's Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn static representations of source code, these representations do not understand how code executes at runtime. In this work, we propose a new approach using GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution, and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state of the art by 26% and 45%, respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related algorithm classification task.
    A significant effort has been made to train neural networks that replicate algorithmic reasoning, but they often fail to learn the abstract concepts underlying these algorithms. This is evidenced by their inability to generalize to data distributions that are outside of their restricted training sets, namely larger inputs and unseen data. We study these generalization issues at the level of numerical subroutines that comprise common algorithms like sorting, shortest paths, and minimum spanning trees. First, we observe that transformer-based sequence-to-sequence models can learn subroutines like sorting a list of numbers, but their performance rapidly degrades as the length of lists grows beyond those found in the training set. We demonstrate that this is due to attention weights that lose fidelity with longer sequences, particularly when the input numbers are numerically similar. To address the issue, we propose a learned conditional masking mechanism, which enables the model to strongly generalize far outside of its training range with near-perfect accuracy on a variety of algorithms. Second, to generalize to unseen data, we show that encoding numbers with a binary representation leads to embeddings with rich structure once trained on downstream tasks like addition or multiplication. This allows the embedding to handle missing data by faithfully interpolating numbers not seen during training.
    Software-defined far memory in warehouse-scale computers
    Andres Lagar-Cavilla
    Suleiman Souhlal
    Neha Agarwal
    Radoslaw Burny
    Junaid Shahid
    Greg Thelen
    Kamil Adam Yurtsever
    Yu Zhao
    International Conference on Architectural Support for Programming Languages and Operating Systems (2019)
    Increasing memory demand and the slowdown in technology scaling pose important challenges to the total cost of ownership (TCO) of warehouse-scale computers (WSCs). One promising idea to reduce the memory TCO is to add a cheaper, but slower, "far memory" tier and use it to store infrequently accessed (or cold) data. However, introducing a far memory tier brings new challenges around dynamically responding to workload diversity and churn, minimizing stranding of capacity, and addressing brownfield (legacy) deployments. We present a novel software-defined approach to far memory that proactively compresses cold memory pages to effectively create a far memory tier in software. Our end-to-end system design encompasses new methods to define performance service-level objectives (SLOs), a mechanism to identify cold memory pages while meeting the SLO, and our implementation in the OS kernel and node agent. Additionally, we design learning-based autotuning to periodically adapt our design to fleet-wide changes without a human in the loop. Our system has been successfully deployed across Google's WSCs since 2016, serving thousands of production services. Our software-defined far memory is significantly cheaper (67% or higher memory cost reduction) at relatively good access speeds (6 us) and allows us to store a significant fraction of infrequently accessed data (on average, 20%), translating to significant TCO savings at warehouse scale.
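The mechanism, a software tier of compressed cold pages, can be sketched as follows. The fixed 2-minute coldness threshold and zlib here stand in for the paper's SLO-driven cold-page identification and production codec, so this is only a toy of the idea:

```python
import zlib

def compress_cold(pages, now, threshold_s=120):
    """Split pages into a hot (near) tier and a compressed cold (far) tier."""
    near, far = {}, {}
    for addr, (data, last_access) in pages.items():
        if now - last_access > threshold_s:
            far[addr] = zlib.compress(data)   # cold: keep only compressed
        else:
            near[addr] = data                 # hot: keep uncompressed
    return near, far

pages = {
    0x1000: (b"A" * 4096, 1000.0),  # idle for 300 s: cold
    0x2000: (b"B" * 4096, 1290.0),  # touched 10 s ago: hot
}
near, far = compress_cold(pages, now=1300.0)
```

A far-memory access then means decompressing the page back into the near tier, which is why the coldness threshold must be chosen against a performance SLO rather than fixed as it is here.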
    AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
    Nayana Prasad Nagendra
    David I. August
    Christos Kozyrakis
    Trivikram Krishnamurthy
    Heiner Litz
    International Symposium on Computer Architecture (ISCA) (2019)
    The large instruction working sets of private and public cloud workloads lead to frequent instruction cache misses and costs in the millions of dollars. While prior work has identified the growing importance of this problem, to date, there has been little analysis of where the misses come from and what the opportunities are to improve them. To address this challenge, this paper makes three contributions. First, we present the design and deployment of a new, always-on, fleet-wide monitoring system, AsmDB, that tracks front-end bottlenecks. AsmDB uses hardware support to collect bursty execution traces, fleet-wide temporal and spatial sampling, and sophisticated offline post-processing to construct full-program dynamic control-flow graphs. Second, based on a longitudinal analysis of AsmDB data from real-world online services, we present two detailed insights on the sources of front-end stalls: (1) cold code that is brought in along with hot code leads to significant cache fragmentation and a correspondingly large number of instruction cache misses; (2) distant branches and calls that are not amenable to traditional cache-locality or next-line prefetching strategies account for a large fraction of cache misses. Third, we prototype two optimizations that target these insights. For misses caused by fragmentation, we focus on memcmp, one of the hottest functions contributing to cache misses, and show how fine-grained layout optimizations lead to significant benefits. For misses at the targets of distant jumps, we propose new hardware support for software code prefetching and prototype a new feedback-directed compiler optimization that combines static program flow analysis with dynamic miss profiles to demonstrate significant benefits for several large warehouse-scale workloads. Improving upon prior work, our proposal avoids invasive hardware modifications by prefetching via software in an efficient and scalable way. Simulation results show that such an approach can eliminate up to 96% of instruction cache misses with negligible overheads.
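The pipeline this abstract describes, stitching sampled branch bursts into a weighted dynamic control-flow graph and mining it for distant jumps, can be sketched in a few lines. Everything below (the trace tuples, the 4 KiB "distant" threshold, the function names) is a hypothetical illustration, not AsmDB's actual implementation:

```python
from collections import defaultdict

# Hypothetical trace: (source_pc, target_pc, count) tuples from sampled
# execution bursts. The real system aggregates such samples fleet-wide.
SAMPLED_BRANCHES = [
    (0x1000, 0x1010, 900),   # short forward branch: next-line prefetch covers it
    (0x1010, 0x9000, 500),   # distant call: a software-prefetch candidate
    (0x9000, 0x1020, 500),   # distant return
    (0x1020, 0x1000, 400),   # hot loop back-edge
]

DISTANT = 0x1000  # assume jumps beyond 4 KiB defeat next-line prefetching

def build_cfg(branches):
    """Merge branch samples into a weighted dynamic control-flow graph."""
    cfg = defaultdict(int)
    for src, dst, count in branches:
        cfg[(src, dst)] += count
    return cfg

def distant_jump_candidates(cfg, threshold=DISTANT):
    """Edges spanning more than `threshold` bytes, sorted by execution
    weight: the targets a compiler pass could prefetch in software."""
    edges = [(src, dst, w) for (src, dst), w in cfg.items()
             if abs(dst - src) > threshold]
    return sorted(edges, key=lambda e: -e[2])

cfg = build_cfg(SAMPLED_BRANCHES)
candidates = distant_jump_candidates(cfg)
```

In spirit, the highest-weight distant edges are the targets a feedback-directed compiler pass would instrument with software code prefetches.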
    Kelp: QoS for Accelerators in Machine Learning Platforms
    Haishan Zhu
    Mattan Erez
    International Symposium on High Performance Computer Architecture (2019)
    Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering efforts. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning for large models. In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high-priority accelerated ML tasks from memory resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks, and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.
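The core control idea, backing off best-effort work when the accelerated task's measured slowdown exceeds a budget, can be sketched as a simple feedback loop. The interfaces, thresholds, and step sizes below are invented for illustration and are not Kelp's actual mechanism:

```python
# Minimal sketch of a Kelp-style isolation loop (hypothetical interfaces):
# when best-effort (BE) tasks push the accelerated ML job past its slowdown
# budget, throttle their memory bandwidth share; regrow it when contention
# subsides.

def isolation_step(accel_slowdown, be_share, budget=0.05,
                   step=0.1, floor=0.0, ceiling=1.0):
    """One control iteration.

    accel_slowdown: measured slowdown of the accelerated task (0.0 = none)
    be_share:       current fraction of memory bandwidth granted to BE tasks
    Returns the new BE bandwidth share.
    """
    if accel_slowdown > budget:
        return max(floor, be_share - step)        # back off aggressors
    return min(ceiling, be_share + step / 2)      # cautiously regrow

# Simulated run: heavy contention at first, then it subsides.
share = 1.0
for slowdown in [0.40, 0.30, 0.20, 0.04, 0.03]:
    share = isolation_step(slowdown, share)
```

The asymmetric step sizes (shrink fast, grow slowly) mirror the usual design choice in such controllers: violations of the protected task's budget are costlier than under-using spare bandwidth.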
    Memory Hierarchy for Web Search
    Jung Ho Ahn
    Christos Kozyrakis
    International Symposium on High Performance Computer Architecture (HPCA) (2018)
    Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature has precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling, all driven by production workloads servicing real-world user requests. Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we rethink a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.
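One way to see how reuse can exist yet go uncaptured is reuse distance: the number of distinct addresses touched between successive uses of the same one. A minimal sketch with an invented trace (not the paper's methodology):

```python
# Sketch: reuse distance explains why data reuse can exist in a workload
# yet still miss in a cache of a given size. Trace is invented.

def reuse_distances(trace):
    """Return the reuse distance (count of unique intervening addresses)
    for each re-referenced address in the trace."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # unique addresses since the previous touch of `addr`
            window = set(trace[last_seen[addr] + 1 : i])
            distances.append(len(window))
        last_seen[addr] = i
    return distances

# "A" is reused, but only after four other distinct lines intervene.
trace = ["A", "B", "C", "D", "E", "A"]
dists = reuse_distances(trace)
```

A fully associative cache needs capacity greater than the reuse distance to turn that reuse into a hit; reuse at distances beyond the L3 but within a large on-package L4 is the intuition behind the proposed hierarchy.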
    The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly explored. In this paper, we demonstrate the potential of deep learning to address the von Neumann bottleneck of memory performance. We focus on the critical problem of learning memory access patterns, with the goal of constructing accurate and efficient memory prefetchers. We relate contemporary prefetching strategies to n-gram models in natural language processing, and show how recurrent neural networks can serve as a drop-in replacement. On a suite of challenging benchmark datasets, we find that neural networks consistently demonstrate superior performance in terms of precision and recall. This work represents the first step towards practical neural-network-based prefetching, and opens a wide range of exciting directions for machine learning in computer architecture research.
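The n-gram analogy the abstract draws can be made concrete: a table-based prefetcher that maps the last n access deltas to the most frequent next delta is exactly an n-gram model over the delta "vocabulary". This is an illustrative toy, not the paper's implementation:

```python
from collections import defaultdict, Counter

class NGramPrefetcher:
    """Predict the next memory-access delta from the last n deltas --
    the n-gram baseline that recurrent networks would replace.
    (Hypothetical sketch for illustration only.)"""

    def __init__(self, n=2):
        self.n = n
        self.table = defaultdict(Counter)  # history tuple -> next-delta counts
        self.history = []

    def train(self, deltas):
        for d in deltas:
            if len(self.history) == self.n:
                self.table[tuple(self.history)][d] += 1
            self.history = (self.history + [d])[-self.n:]

    def predict(self, history):
        """Most frequent next delta for this history, or None if unseen."""
        counts = self.table.get(tuple(history))
        if not counts:
            return None
        return counts.most_common(1)[0][0]

# A mostly-strided access stream produces repeating deltas.
p = NGramPrefetcher(n=2)
p.train([8, 8, 8, 8, 16, 8, 8, 8])
```

The table's weakness is apparent even in this toy: any history never seen verbatim yields no prediction, which is the generalization gap a learned model aims to close.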
    Improving Resource Efficiency at Scale with Heracles
    Christos Kozyrakis
    ACM Transactions on Computer Systems (TOCS), vol. 34 (2016), 6:1-6:33
    User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With the slowdown in technology scaling caused by the sunsetting of Moore's law, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
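Heracles' top-level loop, growing best-effort (BE) resources when the latency-critical (LC) job has SLO slack and shrinking or disabling them when it does not, can be sketched as follows. The thresholds, core counts, and function names are hypothetical:

```python
# Sketch of a Heracles-style feedback controller (invented numbers and
# interfaces): each interval, compare measured LC latency against the SLO
# and adjust the cores granted to BE tasks.

def heracles_step(latency_ms, slo_ms, be_cores, total_cores=16):
    """One control interval: return (new_be_cores, be_enabled)."""
    slack = (slo_ms - latency_ms) / slo_ms
    if slack < 0.0:
        return 0, False                             # SLO violated: kill BE growth
    if slack < 0.10:
        return max(0, be_cores - 1), True           # little headroom: shrink BE
    return min(total_cores - 1, be_cores + 1), True  # headroom: grow BE

# Simulated run against a 50 ms SLO: load ramps up until the SLO breaks.
be, enabled = 4, True
for latency in [20.0, 35.0, 48.0, 55.0]:
    be, enabled = heracles_step(latency, 50.0, be)
```

The real system coordinates several such actuators at once (cores, cache ways, memory and network bandwidth); a single core-count knob is shown here only to make the feedback structure visible.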
    Heracles: Improving Resource Efficiency at Scale
    Christos Kozyrakis
    Proceedings of the 42nd Annual International Symposium on Computer Architecture (2015)
    User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy-efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
    Profiling a warehouse-scale computer
    Juan Darago
    Kim Hazelwood
    Gu-Yeon Wei
    David Brooks
    ISCA '15 Proceedings of the 42nd Annual International Symposium on Computer Architecture, ACM (2015), pp. 158-169
    With the increasing prevalence of warehouse-scale computing (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
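The "datacenter tax" figure is, at heart, an aggregation over a fleet-wide cycle profile. A sketch with invented percentages (the component names follow the paper's examples of shared low-level building blocks; the numbers do not):

```python
# Illustrative only: every percentage below is made up. The point is the
# computation -- summing cycles spent in shared low-level components that
# recur across otherwise diverse jobs.

CYCLE_PROFILE = {
    "application logic": 55.0,
    "kernel/scheduling": 17.0,
    "protobuf (de)serialization": 9.0,
    "rpc stack": 8.0,
    "compression": 5.0,
    "memory allocation": 4.0,
    "hashing": 2.0,
}

TAX_COMPONENTS = {
    "protobuf (de)serialization", "rpc stack",
    "compression", "memory allocation", "hashing",
}

def datacenter_tax(profile, tax_components):
    """Fraction of fleet cycles spent in common low-level building blocks,
    the prime candidates for hardware specialization."""
    return sum(pct for name, pct in profile.items() if name in tax_components)

tax = datacenter_tax(CYCLE_PROFILE, TAX_COMPONENTS)
```

Because each tax component is individually small but shared across thousands of applications, specializing hardware for any one of them pays off fleet-wide rather than per-application.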
    Although the field of datacenter computing is arguably still in its relative infancy, a sizable body of work from both academia and industry is already available and some consistent technological trends have begun to emerge. This special issue presents a small sample of the work underway by researchers and professionals in this new field. The selection of articles presented reflects the key role that hardware-software codesign plays in the development of effective datacenter-scale computer systems.
    Evaluating impact of manageability features on device performance
    Jacob Leverich
    Vanish Talwar
    Christos Kozyrakis
    CNSM (2010), pp. 426-430
    Power Management of Datacenter Workloads Using Per-Core Power Gating
    Jacob Leverich
    Matteo Monchiero
    Vanish Talwar
    Christos Kozyrakis
    Computer Architecture Letters, vol. 8 (2009), pp. 48-51
    Models and Metrics to Enable Energy-Efficiency Optimizations
    Suzanne Rivoire
    Mehul A. Shah
    Christos Kozyrakis
    Justin Meza
    IEEE Computer, vol. 40 (2007), pp. 39-48
    JouleSort: a balanced energy-efficiency benchmark
    Suzanne Rivoire
    Mehul A. Shah
    Christos Kozyrakis
    SIGMOD Conference (2007), pp. 365-376
    The new (system) balance of power and opportunities for optimizations
    ISLPED (2014), pp. 331-332
    Hardware acceleration for similarity measurement in natural language processing
    Prateek Tandon
    Vahed Qazvinian
    Ronald G. Dreslinski
    Thomas F. Wenisch
    ISLPED (2013), pp. 409-414
    Thin servers with smart pipes: designing SoC accelerators for memcached
    Kevin T. Lim
    David Meisner
    Ali G. Saidi
    Thomas F. Wenisch
    ISCA (2013), pp. 36-47
    Consistent, durable, and safe memory management for byte-addressable non volatile main memory
    Iulian Moraru
    David G. Andersen
    Michael Kaminsky
    Niraj Tolia
    Nathan L. Binkert
    TRIOS@SOSP (2013), pp. 1
    Meet the walkers: accelerating index traversals for in-memory databases
    Yusuf Onur Koçberber
    Boris Grot
    Javier Picorel
    Babak Falsafi
    Kevin T. Lim
    MICRO (2013), pp. 468-479
    An FPGA memcached appliance
    Sai Rahul Chalamalasetti
    Kevin T. Lim
    Mitch Wright
    Alvin AuYoung
    Martin Margala
    FPGA (2013), pp. 245-254
    Totally green: evaluating and designing servers for lifecycle environmental impact
    Justin Meza
    Amip Shah
    Rocky Shih
    Cullen Bash
    ASPLOS (2012), pp. 25-36
    Free-p: A Practical End-to-End Nonvolatile Memory Protection Mechanism
    Doe Hyun Yoon
    Naveen Muralimanohar
    Norman P. Jouppi
    Mattan Erez
    IEEE Micro, vol. 32 (2012), pp. 79-87
    System-level implications of disaggregated memory
    Kevin T. Lim
    Yoshio Turner
    Jose Renato Santos
    Alvin AuYoung
    Thomas F. Wenisch
    HPCA (2012), pp. 189-200
    Evaluating FPGA-acceleration for real-time unstructured search
    Sai Rahul Chalamalasetti
    Martin Margala
    Wim Vanderbauwhede
    Mitch Wright
    ISPASS (2012), pp. 200-209
    (Re)Designing Data-Centric Data Centers
    IEEE Micro, vol. 32 (2012), pp. 66-70
    Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management
    Justin Meza
    HanBin Yoon
    Onur Mutlu
    Computer Architecture Letters, vol. 11 (2012), pp. 61-64
    BOOM: Enabling mobile memory based low-power server DIMMs
    Doe Hyun Yoon
    Naveen Muralimanohar
    ISCA (2012), pp. 25-36
    Exploring latency-power tradeoffs in deep nonvolatile memory hierarchies
    Doe Hyun Yoon
    Tobin Gonzalez
    Robert S. Schreiber
    Conf. Computing Frontiers (2012), pp. 95-102
    A limits study of benefits from nanostore-based future data-centric system architectures
    Trevor N. Mudge
    David Roberts
    Mehul A. Shah
    Kevin T. Lim
    Conf. Computing Frontiers (2012), pp. 33-42
    Loosely coupled coordinated management in virtualized data centers
    Sanjay Kumar
    Vanish Talwar
    Vibhore Kumar
    Karsten Schwan
    Cluster Computing, vol. 14 (2011), pp. 259-274
    System-level integrated server architectures for scale-out datacenters
    Sheng Li
    Kevin T. Lim
    Paolo Faraboschi
    Norman P. Jouppi
    MICRO (2011), pp. 260-271
    From Microprocessors to Nanostores: Rethinking Data-Centric Systems
    IEEE Computer, vol. 44 (2011), pp. 39-48
    Everything as a Service: Powering the New Information Economy
    Prith Banerjee
    Rich Friedrich
    Cullen Bash
    P. Goldsack
    Bernardo A. Huberman
    J. Manley
    Chandrakant D. Patel
    A. Veitch
    IEEE Computer, vol. 44 (2011), pp. 36-43
    Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory
    Shivaram Venkataraman
    Niraj Tolia
    Roy H. Campbell
    FAST (2011), pp. 61-75
    On energy efficiency for enterprise and data center networks
    Priya Mahadevan
    Sujata Banerjee
    Puneet Sharma
    Amip Shah
    IEEE Communications Magazine, vol. 49 (2011), pp. 94-100
    Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems
    Vishakha Gupta
    Karsten Schwan
    Niraj Tolia
    Vanish Talwar
    USENIX Annual Technical Conference (2011)
    Saving the World, One Server at a Time, Together
    IEEE Computer, vol. 44 (2011), pp. 91-93
    FREE-p: Protecting non-volatile memory against both hard and soft errors
    Doe Hyun Yoon
    Naveen Muralimanohar
    Norman P. Jouppi
    Mattan Erez
    HPCA (2011), pp. 466-477
    Topology-aware resource allocation for data-intensive workloads
    Gunho Lee
    Niraj Tolia
    Randy H. Katz
    Computer Communication Review, vol. 41 (2011), pp. 120-124
    sNICh: efficient last hop networking in the data center
    Kaushik Kumar Ram
    Jayaram Mudigonda
    Alan L. Cox
    Scott Rixner
    Jose Renato Santos
    ANCS (2010), pp. 26
    Guest Editors' Introduction: Datacenter-Scale Computing
    Luiz André Barroso
    IEEE Micro, vol. 30 (2010), pp. 6-7
    Online detection of utility cloud anomalies using metric distributions
    Chengwei Wang
    Vanish Talwar
    Karsten Schwan
    NOMS (2010), pp. 96-103
    Recipe for efficiency: principles of power-aware computing
    Commun. ACM, vol. 53 (2010), pp. 60-67
    Topology-aware resource allocation for data-intensive workloads
    Gunho Lee
    Niraj Tolia
    Randy H. Katz
    ApSys (2010), pp. 1-6
    Disaggregated memory for expansion and sharing in blade servers
    Kevin T. Lim
    Trevor N. Mudge
    Steven K. Reinhardt
    Thomas F. Wenisch
    ISCA (2009), pp. 267-278
    Industrial perspectives panel
    PPOPP (2009), pp. 197
    Energy Efficiency: The New Holy Grail of Data Management Systems Research
    Stavros Harizopoulos
    Mehul A. Shah
    Justin Meza
    CoRR, vol. abs/0909.1784 (2009)
    A Power Benchmarking Framework for Network Devices
    Priya Mahadevan
    Puneet Sharma
    Sujata Banerjee
    Networking (2009), pp. 795-808
    Sustainable data centers: enabled by supply and demand side management
    Prith Banerjee
    Chandrakant D. Patel
    Cullen Bash
    DAC (2009), pp. 884-887
    vManage: loosely coupled platform and virtualization management in data centers
    Sanjay Kumar
    Vanish Talwar
    Vibhore Kumar
    Karsten Schwan
    ICAC (2009), pp. 127-136
    Industrial perspectives panel
    HPCA (2009), pp. 325-326
    Models and Metrics for Energy-Efficient Computing
    Suzanne Rivoire
    Justin D. Moore
    Advances in Computers, vol. 75 (2009), pp. 159-233
    Server Designs for Warehouse-Computing Environments
    Kevin T. Lim
    Chandrakant D. Patel
    Trevor N. Mudge
    Steven K. Reinhardt
    IEEE Micro, vol. 29 (2009), pp. 41-49
    Energy Efficiency: The New Holy Grail of Data Management Systems Research
    Stavros Harizopoulos
    Mehul A. Shah
    Justin Meza
    CIDR (2009)
    Tracking the power in an enterprise decision support system
    Justin Meza
    Mehul A. Shah
    Mike Fitzner
    Judson Veazey
    ISLPED (2009), pp. 261-266
    Fabric convergence implications on systems architecture
    Kevin Leigh
    Jaspal Subhlok
    HPCA (2008), pp. 15-26
    Implementing high availability memory with a duplication cache
    Nidhi Aggarwal
    James E. Smith
    Kewal K. Saluja
    Norman P. Jouppi
    MICRO (2008), pp. 71-82
    Power management from cores to datacenters: where are we going to get the next ten-fold improvements?
    ISLPED (2008), pp. 139-140
    Using Asymmetric Single-ISA CMPs to Save Energy on Operating Systems
    Jayaram Mudigonda
    Nathan L. Binkert
    Vanish Talwar
    IEEE Micro, vol. 28 (2008), pp. 26-41
    Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments
    Kevin T. Lim
    Chandrakant D. Patel
    Trevor N. Mudge
    Steven K. Reinhardt
    ISCA (2008), pp. 315-326
    No "power" struggles: coordinated multi-level power management for the data center
    Ramya Raghavendra
    Vanish Talwar
    Zhikui Wang
    Xiaoyun Zhu
    ASPLOS (2008), pp. 48-59
    Delivering Energy Proportionality with Non Energy-Proportional Systems - Optimizing the Ensemble
    Niraj Tolia
    Zhikui Wang
    Manish Marwah
    Cullen Bash
    Xiaoyun Zhu
    HotPower (2008)
    Active storage revisited: the case for power and performance benefits for unstructured data processing applications
    Clinton Wills Smullen IV
    Shahrukh Rohinton Tarapore
    Sudhanva Gurumurthi
    Mustafa Uysal
    Conf. Computing Frontiers (2008), pp. 293-304
    General-purpose blade infrastructure for configurable system architectures
    Kevin Leigh
    Jaspal Subhlok
    Distributed and Parallel Databases, vol. 21 (2007), pp. 115-144
    Isolation in Commodity Multicore Processors
    Nidhi Aggarwal
    Norman P. Jouppi
    James E. Smith
    IEEE Computer, vol. 40 (2007), pp. 49-59
    Configurable isolation: building high availability systems with commodity multi-core processors
    Nidhi Aggarwal
    Norman P. Jouppi
    James E. Smith
    ISCA (2007), pp. 470-481
    Cost-aware scheduling for heterogeneous enterprise machines (CASH'EM)
    Jennifer Burge
    Janet L. Wiener
    CLUSTER (2007), pp. 481-487
    Motivating co-ordination of power management solutions in data centers
    Ramya Raghavendra
    Vanish Talwar
    Xiaoyun Zhu
    Zhikui Wang
    CLUSTER (2007), pp. 473
    Ensemble-level Power Management for Dense Blade Servers
    Phil Leech
    David E. Irwin
    Jeffrey S. Chase
    ISCA (2006), pp. 66-77
    Energy-Aware User Interfaces and Energy-Adaptive Displays
    Erik Geelhoed
    Meera Manahan
    Ken Nicholas
    IEEE Computer, vol. 39 (2006), pp. 31-38
    Weatherman: Automated, Online and Predictive Thermal Mapping and Management for Data Centers
    Justin D. Moore
    Jeffrey S. Chase
    ICAC (2006), pp. 155-164
    IT Infrastructure in Emerging Markets: Arguing for an End-to-End Perspective
    Ajay Gupta 0005
    Prashant Sarin
    Mehul A. Shah
    IEEE Pervasive Computing, vol. 5 (2006), pp. 24-31
    Heterogeneous Chip Multiprocessors
    Rakesh Kumar 0002
    Dean M. Tullsen
    Norman P. Jouppi
    IEEE Computer, vol. 38 (2005), pp. 32-38
    Making Scheduling "Cool": Temperature-Aware Workload Placement in Data Centers
    Justin D. Moore
    Jeffrey S. Chase
    Ratnesh K. Sharma
    USENIX Annual Technical Conference, General Track (2005), pp. 61-75
    Enterprise IT Trends and Implications for Architecture Research
    Norman P. Jouppi
    HPCA (2005), pp. 253-256
    Investigating the Relationship Between Battery Life and User Acceptance of Dynamic, Energy-Aware Interfaces on Handhelds
    Lance Bloom
    Rachel Eardley
    Erik Geelhoed
    Meera Manahan
    Mobile HCI (2004), pp. 13-24
    Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance
    Rakesh Kumar 0002
    Dean M. Tullsen
    Norman P. Jouppi
    Keith I. Farkas
    ISCA (2004), pp. 64-75
    Energy-aware user interfaces: an evaluation of user acceptance
    Tim Harter
    Sander Vroegindeweij
    Erik Geelhoed
    Meera Manahan
    CHI (2004), pp. 199-206
    Energy-Adaptive Display System Designs for Future Mobile Environments
    Subu Iyer
    Lu Luo
    Robert N. Mayo
    MobiSys (2003)
    Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction
    Rakesh Kumar 0002
    Keith I. Farkas
    Norman P. Jouppi
    Dean M. Tullsen
    MICRO (2003), pp. 81-92
    Energy Consumption in Mobile Devices: Why Future Systems Need Requirements-Aware Energy Scale-Down
    Robert N. Mayo
    PACS (2003), pp. 26-40
    Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architectures
    Rakesh Kumar 0002
    Keith I. Farkas
    Norman P. Jouppi
    Dean M. Tullsen
    Computer Architecture Letters, vol. 2 (2003)
    Topological navigation and qualitative localization for indoor environment using multi-sensory perception
    Jean-Bernard Hayet
    Michel Devy
    Seth Hutchinson
    Frédéric Lerasle
    Robotics and Autonomous Systems, vol. 41 (2002), pp. 137-144
    Energy-Driven Statistical Sampling: Detecting Software Hotspots
    Fay Chang
    Keith I. Farkas
    Workshop on Power Aware Computing Systems (PACS) (2002), pp. 110-129
    Reconfigurable caches and their application to media processing
    Sarita V. Adve
    Norman P. Jouppi
    ISCA (2000), pp. 214-224
    The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors
    Vijay S. Pai
    Hazim Abdel-Shafi
    Sarita V. Adve
    IEEE Trans. Computers, vol. 48 (1999), pp. 218-226
    Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions
    Sarita V. Adve
    Norman P. Jouppi
    ISCA (1999), pp. 124-135
    Performance of database workloads on shared-memory systems with out-of-order processors
    Kourosh Gharachorloo
    Sarita V. Adve
    ASPLOS-VIII: Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, ACM, New York, NY, USA (1998), pp. 307-318
    The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology
    Vijay S. Pai
    Sarita V. Adve
    HPCA (1997), pp. 72-83
    The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems
    Vijay S. Pai
    Hazim Abdel-Shafi
    Sarita V. Adve
    ISCA (1997), pp. 144-156
    Using Speculative Retirement and Larger Instruction Windows to Narrow the Performance Gap Between Memory Consistency Models
    Vijay S. Pai
    Sarita V. Adve
    SPAA (1997), pp. 199-210
    An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors
    Vijay S. Pai
    Sarita V. Adve
    Tracy Harton
    ASPLOS (1996), pp. 12-23