Xiao Zhang

Xiao Zhang received his PhD in Computer Science from the University of Rochester in 2010. Before that, he earned a BS in Computer Science from the University of Science and Technology of China. His research focuses on operating systems and computer architecture; he currently works primarily on systems for accelerating domain-specific computing, and also on power management systems.
Authored Publications
    Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale
    Shaohong Li
    Sreekumar Kodakara
    14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1241-1255
    Abstract: As the demand for data center capacity continues to grow, hyperscale providers have used power oversubscription to increase efficiency and reduce costs. Power oversubscription requires power capping systems to smooth out the spikes that risk overloading power equipment by throttling workloads. Modern compute clusters run latency-sensitive serving and throughput-oriented batch workloads on the same servers, provisioning resources to ensure low latency for the former while using the latter to achieve high server utilization. When power capping occurs, it is desirable to maintain low latency for serving tasks and throttle the throughput of batch tasks. To achieve this, we seek a system that can gracefully throttle batch workloads and has task-level quality-of-service (QoS) differentiation. In this paper we present Thunderbolt, a hardware-agnostic power capping system that ensures safe power oversubscription while minimizing impact on both long-running throughput-oriented tasks and latency-sensitive tasks. It uses a two-threshold, randomized unthrottling/multiplicative decrease control policy to ensure power safety with minimized performance degradation. It leverages the Linux kernel's CPU bandwidth control feature to achieve task-level QoS-aware throttling. It is robust even in the face of power telemetry unavailability. Evaluation results at the node and cluster levels demonstrate the system's responsiveness, effectiveness for reducing power, capability of QoS differentiation, and minimal impact on latency and task health. We have deployed this system at scale, in multiple production clusters. As a result, we enabled power oversubscription gains of 9%-25%, where none was previously possible.
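    For illustration, here is a minimal sketch of the kind of task-level, QoS-aware throttling the abstract describes, built on the Linux CPU bandwidth controller (the cpu.cfs_period_us/cpu.cfs_quota_us cgroup files) with a two-threshold, randomized-unthrottling/multiplicative-decrease control step. The cgroup paths, thresholds, and constants are assumptions for the sketch, not the production implementation.

import os
import random

# Illustrative cgroup v1 layout; the paper's production hierarchy is not public.
CGROUP_ROOT = "/sys/fs/cgroup/cpu"
PERIOD_US = 100_000  # default CFS bandwidth period

def set_cpu_quota(task_group, fraction):
    """Cap a task group at `fraction` of one CPU via CFS bandwidth control."""
    base = os.path.join(CGROUP_ROOT, task_group)
    with open(os.path.join(base, "cpu.cfs_period_us"), "w") as f:
        f.write(str(PERIOD_US))
    with open(os.path.join(base, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(int(fraction * PERIOD_US)))

def capping_step(power_w, soft_limit_w, hard_limit_w, batch_frac):
    """One control tick: multiplicative decrease above the hard threshold,
    randomized additive unthrottling below the soft threshold."""
    if power_w >= hard_limit_w:
        batch_frac = max(0.05, batch_frac * 0.5)   # throttle batch hard
    elif power_w < soft_limit_w and random.random() < 0.25:
        # Randomized so that many machines do not unthrottle in lockstep,
        # which would cause synchronized power rebounds.
        batch_frac = min(1.0, batch_frac + 0.05)
    set_cpu_quota("batch", batch_frac)  # only batch is throttled; serving
                                        # stays uncapped to preserve latency
    return batch_frac

    Throttling only the batch cgroup is what gives the QoS differentiation: serving tasks never see a reduced quota.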
    Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
    Richard Wu
    Xiangling Kong
    Yangyi Chen
    Robert Hagmann
    Rohit Jnagal
    IEEE CLOUD 2019
    Abstract: Non-uniform memory access (NUMA) has been extensively studied at the machine level, but few studies have examined NUMA optimizations at the cluster level. This paper introduces a holistic NUMA-aware scheduling policy that combines both machine-level and cluster-level NUMA-aware optimizations. We evaluate our holistic NUMA-aware scheduling policy on Google's production cluster trace with a cluster scheduling simulator that measures the impact of NUMA-aware scheduling under two scheduling algorithms, Best Fit and Enhanced PVM (E-PVM). While our results highlight that a holistic NUMA-aware scheduling policy substantially increases the proportion of NUMA-fit tasks, by 22.0% and 25.6% for the Best Fit and E-PVM scheduling algorithms respectively, there is a non-trivial tradeoff between cluster job packing efficiency and NUMA-fitness for the E-PVM algorithm under certain circumstances.
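    The paper's exact policy is not reproduced here; the following is a hypothetical sketch of what NUMA-fit placement under Best Fit could look like, where a task counts as NUMA-fit when its whole CPU and memory request fits within a single NUMA node of the chosen machine. The data structures and the tiebreak rule are illustrative assumptions.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NumaNode:
    free_cpus: float
    free_mem_gb: float

@dataclass
class Machine:
    name: str
    nodes: List[NumaNode]

def numa_fit(cpus, mem_gb, machine):
    """A task is NUMA-fit if its full request fits inside one NUMA node."""
    return any(n.free_cpus >= cpus and n.free_mem_gb >= mem_gb
               for n in machine.nodes)

def pick_machine(cpus, mem_gb, machines):
    """Best Fit with a NUMA-aware preference: among machines with enough
    total capacity, prefer NUMA-fit placements, then the tightest fit."""
    def free(m):
        return (sum(n.free_cpus for n in m.nodes),
                sum(n.free_mem_gb for n in m.nodes))
    feasible = [m for m in machines
                if free(m)[0] >= cpus and free(m)[1] >= mem_gb]
    if not feasible:
        return None
    # Sort key: NUMA-fit machines first (False < True), then least slack.
    return min(feasible, key=lambda m: (not numa_fit(cpus, mem_gb, m),
                                        free(m)[0] - cpus))

    The tradeoff the abstract mentions shows up in exactly this kind of tiebreak: favoring NUMA-fit placements can leave fragmented capacity that a pure packing objective would have used.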
    HaPPy: Hyperthread-aware Power Profiling Dynamically
    Yan Zhai
    Stephane Eranian
    Lingjia Tang
    Jason Mars
    USENIX Annual Technical Conference 2014
    Optimizing Google's Warehouse Scale Computers: The NUMA Experience
    Lingjia Tang
    Jason Mars
    Robert Hagmann
    The 19th IEEE International Symposium on High Performance Computer Architecture (2013)
    CPI^2: CPU performance isolation for shared compute clusters
    Robert Hagmann
    Rohit Jnagal
    Vrigo Gokhale
    SIGOPS European Conference on Computer Systems (EuroSys), ACM, Prague, Czech Republic (2013), pp. 379-391
    Abstract: Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs' behavior. Our solution, CPI^2, uses cycles-per-instruction (CPI) data obtained from hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job. We have rolled out CPI^2 to all of Google's shared compute clusters. The paper presents the analysis that led us to that outcome, including both case studies and a large-scale evaluation of its ability to solve real production issues.
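    A minimal sketch of the CPI-based detection loop the abstract outlines: sample cycles and instructions per task with hardware performance counters (via perf stat here), then flag tasks whose CPI is anomalously high relative to the other tasks of the same job. The perf invocation and the z-score rule are illustrative assumptions; the paper learns per-job CPI distributions over time.

import statistics
import subprocess

def sample_cpi(pid, seconds=10):
    """Measure one task's cycles-per-instruction with `perf stat`."""
    result = subprocess.run(
        ["perf", "stat", "-e", "cycles,instructions", "-x", ",",
         "-p", str(pid), "--", "sleep", str(seconds)],
        capture_output=True, text=True)
    counts = {}
    for line in result.stderr.splitlines():  # perf writes counts to stderr
        fields = line.split(",")
        if len(fields) > 2 and fields[2] in ("cycles", "instructions"):
            counts[fields[2]] = float(fields[0])
    return counts["cycles"] / counts["instructions"]

def flag_victims(cpi_by_task, k=2.0):
    """Flag tasks of one job whose CPI sits more than k standard deviations
    above the job mean; these are likely victims of interference. A simple
    z-score stands in for the learned per-job CPI distribution."""
    mean = statistics.mean(cpi_by_task.values())
    stdev = statistics.pstdev(cpi_by_task.values())
    return [task for task, cpi in cpi_by_task.items()
            if cpi > mean + k * stdev]

    High-CPI outliers identify the victims; the perpetrators are then found by correlating victims' CPI spikes with co-located tasks' CPU usage before any throttling is applied.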
    Power Containers: An OS Facility for Fine-Grained Power and Energy Management on Multicore Servers
    Kai Shen
    Arrvindh Shriraman
    Sandhya Dwarkadas
    Zhuan Chen
    Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (2013)
    A Flexible Framework for Throttling-Enabled Multicore Management
    Rongrong Zhong
    Sandhya Dwarkadas
    Kai Shen
    The 41st International Conference on Parallel Processing (2012)