Liqun Cheng
Liqun Cheng is a Distinguished Engineer at Google, where he is a technical lead for performance, TCO, and efficiency of data centers. His interests span computer architecture, distributed systems, energy-proportional computing, and machine learning, with a particular focus on interactions across domains and on software-hardware co-design. He obtained his PhD from the University of Utah and his BS from Shanghai Jiao Tong University. Prior to Google, Liqun was a performance architect at Intel.
Authored Publications
    Searching for Fast Models on Datacenter Accelerators
    Ruoming Pang
    Andrew Li
    Norm Jouppi
    Conference on Computer Vision and Pattern Recognition (2021)
    Abstract: Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high-accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of the emerging datacenter (DC) accelerators. In this paper, we search for fast and accurate CNN model families for efficient inference on DC accelerators. We first analyze DC accelerators and find that existing CNNs suffer from insufficient operational intensity, parallelism, and execution efficiency, and exhibit FLOPs-latency nonproportionality. These insights let us create a DC-accelerator-optimized search space, with space-to-depth, space-to-batch, hybrid fused convolution structures with vanilla and depthwise convolutions, and block-wise activation functions. We further propose latency-aware compound scaling (LACS), the first multi-objective compound scaling method optimizing both accuracy and latency. LACS discovers that network depth should grow much faster than image size and network width, which differs markedly from the observations of previous compound scaling. With the new search space and LACS, our search and scaling on datacenter accelerators result in a new model series named EfficientNet-X. EfficientNet-X is more than 2X faster than EfficientNet (a model series with state-of-the-art trade-off on FLOPs and accuracy) on TPUv3 and GPUv100, with comparable accuracy. EfficientNet-X is also up to 7X faster than recent RegNet and ResNeSt on TPUv3 and GPUv100. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/tpu
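
    A minimal sketch of the latency-aware compound scaling idea from this abstract, assuming a simple grid search over scaling bases; the accuracy and latency proxies below are illustrative stand-ins, not the paper's models:

```python
# Sketch of latency-aware compound scaling (LACS): pick depth/width/
# resolution multipliers that maximize a proxy accuracy subject to a
# latency budget, instead of the FLOPs budget of classic compound scaling.
import itertools

def lacs_search(accuracy_proxy, latency_proxy, phi=2.0, budget=4.0):
    best = None
    grid = [1.0 + 0.1 * i for i in range(1, 11)]   # candidate scaling bases
    for alpha, beta, gamma in itertools.product(grid, repeat=3):
        depth, width, res = alpha ** phi, beta ** phi, gamma ** phi
        if latency_proxy(depth, width, res) > budget:
            continue                               # violates latency budget
        acc = accuracy_proxy(depth, width, res)
        if best is None or acc > best[0]:
            best = (acc, depth, width, res)
    return best

# Toy proxies: accuracy grows with all dimensions; latency is dominated
# by width*resolution while depth is comparatively cheap, which is
# roughly why LACS finds that depth should grow fastest.
acc = lambda d, w, r: d ** 0.4 * w ** 0.3 * r ** 0.3
lat = lambda d, w, r: 0.2 * d + 0.4 * w * r
print(lacs_search(acc, lat))
```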
    Autonomous Warehouse-Scale Computers
    Proceedings of the 57th Annual Design Automation Conference (2020), Association for Computing Machinery, New York, NY, United States
    Abstract: Modern Warehouse-Scale Computers (WSCs), composed of many generations of servers and a myriad of domain-specific accelerators, are becoming increasingly heterogeneous. Meanwhile, WSC workloads are also becoming incredibly diverse, with different communication patterns, latency requirements, and service-level objectives (SLOs). Insufficient understanding of the interactions between workload characteristics and the underlying machine architecture leads to resource over-provisioning, significantly impacting the utilization of WSCs. We present Autonomous Warehouse-Scale Computers, a new WSC design that leverages machine learning techniques and automation to improve job scheduling, resource management, and hardware-software co-optimization, addressing the increasing heterogeneity in WSC hardware and workloads. Our new design introduces two new layers in the WSC stack: (a) a Software-Defined Server (SDS) Abstraction Layer, which redefines the hardware-software boundary and gives higher layers of the software stack greater control of the hardware through stable abstractions; and (b) a WSC Efficiency Layer, which regularly monitors the resource usage of workloads on different hardware types, autonomously quantifies the performance sensitivity of workloads to key system configurations, and continuously improves scheduling decisions and hardware resource QoS policies to maximize cluster-level performance. Our new WSC design has been successfully deployed across all WSCs at Google for several years. It improves workload throughput by 7-10% on average, increases utilization of hardware resources by up to 2x, and reduces performance variance for critical workloads by up to 25%.
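
    An illustrative sketch of the WSC Efficiency Layer control loop described above: monitor per-workload performance on each hardware type and steer scheduling weights toward platforms where a workload runs well. All names and the scoring heuristic are assumptions for illustration, not Google's implementation:

```python
# Track normalized workload performance per platform and derive
# placement weights, a stand-in for the paper's learned sensitivity models.
from collections import defaultdict

class EfficiencyLayer:
    def __init__(self):
        # perf[workload][platform] -> running average of normalized perf
        self.perf = defaultdict(dict)

    def observe(self, workload, platform, normalized_perf, alpha=0.2):
        prev = self.perf[workload].get(platform, normalized_perf)
        # Exponential moving average smooths noisy cluster measurements.
        self.perf[workload][platform] = (1 - alpha) * prev + alpha * normalized_perf

    def placement_weights(self, workload):
        """Scheduling weights favoring platforms where this workload
        has historically performed best."""
        scores = self.perf.get(workload, {})
        total = sum(scores.values()) or 1.0
        return {p: s / total for p, s in scores.items()}

layer = EfficiencyLayer()
layer.observe("websearch", "platform_a", 1.00)
layer.observe("websearch", "platform_b", 0.85)
print(layer.placement_weights("websearch"))
```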
    Kelp: QoS for Accelerators in Machine Learning Platforms
    Haishan Zhu
    Rama Govindaraju
    Mattan Erez
    International Symposium on High Performance Computer Architecture (2019)
    Abstract: Development and deployment of machine learning (ML) accelerators in Warehouse-Scale Computers (WSCs) demand significant capital investments and engineering efforts. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning for large models. In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high-priority accelerated ML tasks from memory resource interference. We evaluate Kelp with both production and artificial aggressor workloads and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.
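
    A minimal sketch of the Kelp idea, assuming a runtime that throttles the CPU quota of best-effort jobs when host memory bandwidth contention threatens the accelerated ML task. The read_memory_bw() and set_cpu_quota() hooks are hypothetical stand-ins for platform counters and cgroup-style controls; the thresholds are illustrative:

```python
# Periodically read host memory bandwidth; back off best-effort tasks
# quickly under contention, and restore their quota slowly otherwise.
import time

BW_LIMIT_GBPS = 60.0            # contention threshold (illustrative)
MIN_QUOTA, MAX_QUOTA = 0.1, 1.0

def isolation_loop(read_memory_bw, set_cpu_quota, steps=60, period_s=1.0):
    quota = MAX_QUOTA
    for _ in range(steps):
        bw = read_memory_bw()                     # total host memory traffic
        if bw > BW_LIMIT_GBPS:
            quota = max(MIN_QUOTA, quota * 0.5)   # back off aggressors fast
        else:
            quota = min(MAX_QUOTA, quota + 0.05)  # recover slowly
        set_cpu_quota(quota)                      # applies to best-effort tasks only
        time.sleep(period_s)
```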
    WSMeter: A Fast, Accurate, and Low-Cost Performance Evaluation for Warehouse-Scale Computers
    Jaewon Lee
    Changkyu Kim
    Rama Govindaraju
    Jangwoo Kim
    Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (2018)
    Abstract: A warehouse-scale computer (WSC) is a vast collection of tightly networked computers providing modern internet services, and it is becoming increasingly popular as the most cost-effective approach to serving users at global scale. It is, however, extremely difficult to accurately measure the holistic performance of a WSC. Existing load-testing benchmarks are tailored towards a dedicated machine model and do not address shared infrastructure environments. Evaluating the performance of a live, shared production WSC environment presents many challenges due to the lack of holistic performance metrics, high evaluation costs, and the potential service disruptions evaluation may cause. WSC providers and customers need a cost-effective methodology to accurately evaluate the holistic performance of their platforms and hosted services. To address these challenges, we propose WSMeter, a cost-effective framework and methodology to accurately evaluate the holistic performance of a WSC in a live production environment. We define a new performance metric that accurately reflects the holistic performance of a WSC running a wide variety of unevenly distributed jobs. We propose a model that statistically embraces the performance variances amplified by co-located jobs, to evaluate holistic performance at minimum cost. To validate our approach, we analyze two real-world use cases and show that WSMeter accurately discerns 7% and 1% performance improvements using only 0.9% and 6.6% of the machines in the WSC, respectively. We also present a Cloud customer case study in which WSMeter helped quantify the performance benefits of service software optimization with minimal costs.
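
    A sketch of a WSMeter-style holistic metric: aggregate per-job performance measured on a small sample of machines, weighted by each job's share of cluster resources. The weighting scheme here is an assumption made for illustration; see the paper for the exact metric:

```python
# Weighted cluster-level performance score from sampled per-job measurements.
def holistic_performance(jobs):
    """jobs: list of (normalized_perf, resource_share) pairs measured
    on a sampled subset of machines."""
    total_share = sum(share for _, share in jobs)
    return sum(perf * share for perf, share in jobs) / total_share

# Compare a baseline cluster against one running optimized software.
baseline = holistic_performance([(1.00, 0.5), (1.00, 0.3), (1.00, 0.2)])
upgraded = holistic_performance([(1.09, 0.5), (1.05, 0.3), (1.02, 0.2)])
print(f"improvement: {(upgraded / baseline - 1):.1%}")   # ~6.4%
```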
    Improving Resource Efficiency at Scale with Heracles
    Rama Govindaraju
    Christos Kozyrakis
    ACM Transactions on Computer Systems (TOCS), 34 (2016), 6:1-6:33
    Abstract: User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With the slowdown in technology scaling caused by the sunsetting of Moore's law, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
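
    A condensed sketch of a Heracles-style feedback step: grow best-effort (BE) resources while the latency-critical (LC) job has latency slack, and claw them back under SLO pressure. The measurement values and step sizes are hypothetical stand-ins for the real isolation mechanisms (CPU, memory, network) the paper manages:

```python
# One controller step: map observed LC tail latency to a new BE resource share.
def heracles_step(lc_latency_ms, slo_ms, be_share, step=0.05):
    slack = (slo_ms - lc_latency_ms) / slo_ms
    if slack < 0:
        return 0.0                           # SLO violated: evict BE resources
    if slack < 0.1:
        return max(0.0, be_share - step)     # thin slack: shrink BE
    return min(0.9, be_share + step)         # healthy slack: grow BE

share = 0.2
for latency in [40, 45, 52, 49, 38]:         # sampled tail latencies (ms)
    share = heracles_step(latency, slo_ms=50, be_share=share)
    print(f"latency={latency}ms -> BE share={share:.2f}")
```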
    Heracles: Improving Resource Efficiency at Scale
    Rama Govindaraju
    Christos Kozyrakis
    Proceedings of the 42nd Annual International Symposium on Computer Architecture (2015)
    Abstract: User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
    Towards Energy Proportionality for Large-Scale Latency-Critical Workloads
    Rama Govindaraju
    Luiz André Barroso
    Christos Kozyrakis
    Proceedings of the 41st Annual International Symposium on Computer Architecture, ACM (2014)
    Abstract: Reducing the energy footprint of warehouse-scale computer (WSC) systems is key to their affordability, yet difficult to achieve in practice. The lack of energy proportionality of typical WSC hardware and the fact that important workloads (such as search) require all servers to remain up regardless of traffic intensity render existing power management techniques ineffective at reducing WSC energy use. We present PEGASUS, a feedback-based controller that significantly improves the energy proportionality of WSC systems, as demonstrated by a real implementation in a Google search cluster. PEGASUS uses request latency statistics to dynamically adjust server power management limits in a fine-grain manner, running each server just fast enough to meet global service-level latency objectives. In large cluster experiments, PEGASUS reduces power consumption by up to 20%. We also estimate that a distributed version of PEGASUS can nearly double these savings.
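
    An illustrative sketch of the PEGASUS idea: tighten or relax a per-server power cap based on observed request latency versus the service-level objective, so each server runs just fast enough. The latency values, step sizes, and cap bounds are made up for illustration (real systems might actuate through RAPL-style interfaces):

```python
# One controller step: adjust the power cap from tail latency vs. the SLO.
def pegasus_step(tail_latency_ms, slo_ms, cap_watts,
                 min_w=80.0, max_w=200.0):
    if tail_latency_ms > slo_ms:
        return min(max_w, cap_watts + 10.0)   # behind SLO: speed up
    if tail_latency_ms < 0.8 * slo_ms:
        return max(min_w, cap_watts - 5.0)    # ample slack: save power
    return cap_watts                          # near target: hold steady

cap = 200.0
for latency in [28, 30, 33, 41, 36]:          # sampled latencies, SLO 40ms
    cap = pegasus_step(latency, slo_ms=40, cap_watts=cap)
    print(f"latency={latency}ms -> cap={cap:.0f}W")
```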