Xiao Zhang
Xiao Zhang received his PhD in Computer Science from the University of Rochester in 2010. Before that, he earned a BS in Computer Science from the University of Science and Technology of China. His research focuses on operating systems and computer architecture; he currently works primarily on systems for accelerating domain-specific computing, as well as on power management systems.
Authored Publications
GASS: GPU Automated Sharing at Scale
Dragos Sbirlea
Jiafan Zhu
Konstantinos Menychtas
Yuang Liu
Zhijing Gene Qin
The IEEE International Conference on Cloud Computing (CLOUD) (2024)
Abstract
General-purpose GPUs, with their powerful numerical computing capacity, are popular platforms for accelerating machine-learning workloads. However, our experience with a large-scale production deployment shows that typical GPU workloads often fail to keep the GPU pipeline fully occupied, resulting in low overall resource utilization. To address this inefficiency, we have designed and implemented GPU Automated Sharing at Scale (GASS). GASS relies on fine-grained time-multiplexing to let GPU compute resources be shared among different tasks, and on on-demand paging to let GPU memory be shared among them. GASS mitigates sharing performance anomalies by using real-time performance monitoring to drive adaptive rescheduling. Our cluster-level evaluation shows that GASS increases aggregate GPU throughput by 50% and that sharing enables the cluster to support 19% more GPU jobs.
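For illustration only, here is a minimal Python sketch (not from the paper) of the general idea of time-multiplexing plus monitoring-driven rescheduling: tasks share a device in round-robin time slices, and a task whose measured throughput falls too far below its solo rate is evicted to run elsewhere. The `SharedTask` structure, slice length, and slowdown threshold are all hypothetical.

```python
# Illustrative sketch only: a toy time-multiplexing loop with monitoring-driven
# rescheduling, loosely inspired by the sharing idea described in the abstract.
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SharedTask:
    name: str
    run_slice: Callable[[float], float]   # runs for a time slice, returns work done
    solo_rate: float                      # baseline work/sec when run alone (assumed known)
    shared_work: float = 0.0
    shared_time: float = 0.0

def share_gpu(tasks: List[SharedTask], slice_sec: float = 0.05,
              max_slowdown: float = 2.0, rounds: int = 100) -> List[SharedTask]:
    """Round-robin the device among tasks; evict (reschedule) tasks whose measured
    throughput drops below 1/max_slowdown of their solo rate."""
    evicted: List[SharedTask] = []
    for _ in range(rounds):
        for task in list(tasks):
            start = time.monotonic()
            task.shared_work += task.run_slice(slice_sec)
            task.shared_time += time.monotonic() - start
            shared_rate = task.shared_work / max(task.shared_time, 1e-9)
            if shared_rate < task.solo_rate / max_slowdown:
                tasks.remove(task)        # stop sharing; move to a dedicated device
                evicted.append(task)
    return evicted
```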
Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale
Shaohong Li
Sreekumar Kodakara
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1241-1255
Abstract
As the demand for data center capacity continues to grow, hyperscale providers have used power oversubscription to increase efficiency and reduce costs. Power oversubscription requires power capping systems to smooth out the spikes that risk overloading power equipment by throttling workloads. Modern compute clusters run latency-sensitive serving and throughput-oriented batch workloads on the same servers, provisioning resources to ensure low latency for the former while using the latter to achieve high server utilization. When power capping occurs, it is desirable to maintain low latency for serving tasks and throttle the throughput of batch tasks. To achieve this, we seek a system that can gracefully throttle batch workloads and has task-level quality-of-service (QoS) differentiation.
In this paper we present Thunderbolt, a hardware-agnostic power capping system that ensures safe power oversubscription while minimizing the impact on both long-running throughput-oriented tasks and latency-sensitive tasks. It uses a two-threshold, randomized unthrottling/multiplicative decrease control policy to ensure power safety with minimal performance degradation. It leverages the Linux kernel's CPU bandwidth control feature to achieve task-level QoS-aware throttling. It is robust even in the face of power telemetry unavailability. Evaluation results at the node and cluster levels demonstrate the system's responsiveness, its effectiveness at reducing power, its capability for QoS differentiation, and its minimal impact on latency and task health. We have deployed this system at scale in multiple production clusters. As a result, we enabled power oversubscription gains of 9%-25% where none were previously possible.
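As a rough illustration of the control policy named in the abstract (not the paper's implementation), below is a minimal Python sketch of a two-threshold, randomized-unthrottling/multiplicative-decrease loop that caps batch tasks through the cgroup CPU bandwidth interface. The cgroup path, power thresholds, and step sizes are assumptions made up for the example.

```python
# Illustrative sketch only: a two-threshold, randomized-unthrottling /
# multiplicative-decrease loop in the spirit of the Thunderbolt abstract.
# read_power telemetry, the cgroup path, and all constants are assumptions.
import random

CGROUP_CPU_MAX = "/sys/fs/cgroup/batch/cpu.max"   # cgroup v2 bandwidth file; assumes a "batch" cgroup exists
PERIOD_US = 100_000

def set_batch_quota(fraction: float) -> None:
    """Cap batch tasks at `fraction` of one period via CPU bandwidth control."""
    quota_us = max(int(PERIOD_US * fraction), 1000)
    with open(CGROUP_CPU_MAX, "w") as f:
        f.write(f"{quota_us} {PERIOD_US}")

def control_step(power_w: float, quota: float,
                 high_w: float = 9000.0, low_w: float = 8000.0,
                 decrease: float = 0.5, step_up: float = 0.1,
                 unthrottle_prob: float = 0.25) -> float:
    """One iteration of the two-threshold policy: multiplicative decrease above
    the high power threshold, randomized gradual unthrottling below the low one."""
    if power_w > high_w:
        quota = max(quota * decrease, 0.05)       # throttle batch sharply
    elif power_w < low_w and random.random() < unthrottle_prob:
        quota = min(quota + step_up, 1.0)         # slowly, randomly unthrottle
    set_batch_quota(quota)
    return quota
```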
Abstract
Non-uniform memory access (NUMA) has been extensively studied at the machine level, but few studies have examined NUMA optimizations at the cluster level. This paper introduces a holistic NUMA-aware scheduling policy that combines both machine-level and cluster-level NUMA-aware optimizations. We evaluate our holistic NUMA-aware scheduling policy on Google's production cluster trace with a cluster scheduling simulator that measures the impact of NUMA-aware scheduling under two scheduling algorithms, Best Fit and Enhanced PVM (E-PVM). While our results highlight that a holistic NUMA-aware scheduling policy substantially increases the proportion of NUMA-fit tasks, by 22.0% and 25.6% for the Best Fit and E-PVM scheduling algorithms, respectively, there is a non-trivial tradeoff between cluster job packing efficiency and NUMA-fitness for the E-PVM algorithm under certain circumstances.
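For illustration only, a minimal Python sketch of the kind of NUMA-fitness test a cluster scheduler could apply when scoring candidate machines; the data model and the best-fit tie-break are assumptions, not the policy from the paper.

```python
# Illustrative sketch only: check whether a task can be placed "NUMA-fit",
# i.e. with all of its CPU and memory on a single NUMA node.
from typing import List, Optional, Tuple

NumaNode = Tuple[float, float]   # (free CPU cores, free memory in GiB) -- assumed data model

def pick_numa_fit_node(nodes: List[NumaNode],
                       cpu_req: float, mem_req: float) -> Optional[int]:
    """Return the index of a node that can hold the whole task, or None.
    A task placed this way avoids cross-node memory accesses."""
    best, best_slack = None, None
    for i, (cpu_free, mem_free) in enumerate(nodes):
        if cpu_free >= cpu_req and mem_free >= mem_req:
            slack = (cpu_free - cpu_req) + (mem_free - mem_req)
            if best_slack is None or slack < best_slack:   # best fit within a node
                best, best_slack = i, slack
    return best
```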
HaPPy: Hyperthread-aware Power Profiling Dynamically
Yan Zhai
Stephane Eranian
Lingjia Tang
Jason Mars
USENIX Annual Technical Conference (2014)
CPI²: CPU performance isolation for shared compute clusters
Robert Hagmann
Rohit Jnagal
Vrigo Gokhale
SIGOPS European Conference on Computer Systems (EuroSys), ACM, Prague, Czech Republic (2013), pp. 379-391
Abstract
Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs' behavior.
Our solution, CPI², uses cycles-per-instruction (CPI) data obtained from hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job.
We have rolled out CPI² to all of Google's shared compute clusters. The paper presents the analysis that led us to this outcome, including both case studies and a large-scale evaluation of its ability to solve real production issues.
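The detection idea sketched in the abstract can be illustrated in a few lines of Python. This is a toy outline under assumed inputs, not the production logic: it aggregates CPI samples across a job's tasks to learn a baseline and flags tasks whose mean CPI is unusually high. The 2-sigma cutoff and the data layout are assumptions.

```python
# Illustrative sketch only: flag CPI outliers within one job by comparing each
# task's mean CPI against statistics aggregated over all of the job's tasks.
import statistics
from typing import Dict, List

def flag_cpi_outliers(task_cpi: Dict[str, List[float]],
                      sigmas: float = 2.0) -> List[str]:
    """Learn 'normal' CPI from all tasks of a job, then flag tasks whose own
    mean CPI exceeds mean + sigmas * stddev of the aggregate."""
    all_samples = [s for samples in task_cpi.values() for s in samples]
    if len(all_samples) < 2:
        return []
    mean = statistics.mean(all_samples)
    stdev = statistics.pstdev(all_samples)
    threshold = mean + sigmas * stdev
    return [task for task, samples in task_cpi.items()
            if samples and statistics.mean(samples) > threshold]

# Flagged tasks are potential interference victims; a follow-on step (not shown)
# would look for co-located antagonists and optionally throttle them.
```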
Optimizing Google's Warehouse Scale Computers: The NUMA Experience
Lingjia Tang
Jason Mars
Robert Hagmann
The 19th IEEE International Symposium on High Performance Computer Architecture (2013)