Jichuan Chang
Authored Publications
Sort By
Profiling Hyperscale Big Data Processing
Aasheesh Kolli
Abraham Gonzalez
Samira Khan
Sihang Liu
Krste Asanovic
ISCA (2023)
Preview abstract
Computing demand continues to grow exponentially, largely driven by "big data" processing on hyperscale data stores. At the same time, the slowdown in Moore's law is leading the industry to embrace custom computing in large-scale systems. Taken together, these trends motivate the need to characterize live production traffic on these large data processing platforms and understand the opportunity of acceleration at scale.
This paper addresses this key need. We characterize three important production distributed database and data analytics platforms at Google to identify key hardware acceleration opportunities and perform a comprehensive limits study to understand the trade-offs among various hardware acceleration strategies.
We observe that hyperscale data processing platforms spend significant time on distributed storage and other remote work across distributed workers. Therefore, optimizing storage and remote work in addition to compute acceleration is critical for these platforms. We present a detailed breakdown of the compute-intensive functions in these platforms and identify dominant key data operations related to datacenter and systems taxes. We observe that no single accelerator can provide a significant benefit but collectively, a sea of accelerators, can accelerate many of these smaller platform-specific functions. We demonstrate the potential gains of the sea of accelerators proposal in a limits study and analytical model. We perform a comprehensive study to understand the trade-offs between accelerator location (on-chip/off-chip) and invocation model (synchronous/asynchronous). We propose and evaluate a chained accelerator execution model where identified compute-intensive functions are accelerated and pipelined to avoid invocation from the core, achieving a 3x improvement over the baseline system while nearly matching identical performance to an ideal fully asynchronous execution model.
View details
Software-defined far memory in warehouse-scale computers
Andres Lagar-Cavilla
Suleiman Souhlal
Neha Agarwal
Radoslaw Burny
Shakeel Butt
Junaid Shahid
Greg Thelen
Kamil Adam Yurtsever
Yu Zhao
International Conference on Architectural Support for Programming Languages and Operating Systems (2019)
Preview abstract
Increasing memory demand and slowdown in technology scaling pose important challenges to total cost of ownership (TCO) of warehouse-scale computers (WSCs). One promising idea to reduce the memory TCO is to add a cheaper, but slower, "far memory" tier and use it to store infrequently accessed (or cold) data. However, introducing a far memory tier brings new challenges around dynamically responding to workload diversity and churn, minimizing stranding of capacity, and addressing brownfield (legacy) deployments.
We present a novel software-defined approach to far memory that proactively compresses cold memory pages to effectively create a far memory tier in software. Our end-to-end system design encompasses new methods to define performance service-level objectives (SLOs), a mechanism to identify cold memory pages while meeting the SLO, and our implementation in the OS kernel and node agent. Additionally, we design learning-based autotuning to periodically adapt our design to fleet-wide changes without a human in the loop. Our system has been successfully deployed across Google's WSC since 2016, serving thousands of production services. Our software-defined far memory is significantly cheaper (67% or higher memory cost reduction) at relatively good access speeds (6 us) and allows us to store a significant fraction of infrequently accessed data (on average, 20%), translating to significant TCO savings at warehouse scale.
View details
Learning Memory Access Patterns
Jamie Alexander Smith
Heiner Litz
Christos Kozyrakis
ICML (2018)
Preview abstract
The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly explored. In this paper, we demonstrate the potential of deep learning to address the von Neumann bottleneck of memory performance. We focus on the critical problem of learning memory access patterns, with the goal of constructing accurate and efficient memory prefetchers. We relate contemporary prefetching strategies to n-gram models in natural language processing, and show how recurrent neural networks can serve as a drop-in replacement. On a suite of challenging benchmark datasets, we find that neural networks consistently demonstrate superior performance in terms of precision and recall. This work represents the first step towards practical neural-network based prefetching, and opens a wide range of exciting directions for machine learning in computer architecture research.
View details
Near-Data Processing: Insights from a MICRO-46 Workshop
Rajeev Balasubramonian
Troy Manning
Jaime H. Moreno
Richard Murphy
Ravi Nair
Steven Swanson
IEEE Micro (Special Issue on Big Data), 34 (2014), pp. 36-43
Preview abstract
The cost of data movement in big-data systems motivates careful examination of near-data processing (NDP) frameworks. The concept of NDP was actively researched in the 1990s, but gained little commercial traction. After a decade-long dormancy, interest in this topic has spiked. A workshop on NDP was organized at MICRO-46 and was well attended. Given the interest, the organizers and keynote speakers have attempted to capture the key insights from the workshop into an article that can be widely disseminated. This article describes the many reasons why NDP is compelling today and identifies key upcoming challenges in realizing the potential of NDP.
View details