Vasileios Kontorinis
Vasileios is leading data center oversubscription efforts in Data Center Software Power Team. His interests include load shedding actuators, power&cooling profiling, power-aware scheduling and novel power delivery architectures. He joined Google after receiving MS and Ph.D. from University of California San Diego in computer architecture.
Authored Publications
Sort By
Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale
Shaohong Li
Sreekumar Kodakara
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), {USENIX} Association (2020), pp. 1241-1255
Preview abstract
As the demand for data center capacity continues to grow, hyperscale providers have used power oversubscription to increase efficiency and reduce costs. Power oversubscription requires power capping systems to smooth out the spikes that risk overloading power equipment by throttling workloads. Modern compute clusters run latency-sensitive serving and throughput-oriented batch workloads on the same servers, provisioning resources to ensure low latency for the former while using the latter to achieve high server utilization. When power capping occurs, it is desirable to maintain low latency for serving tasks and throttle the throughput of batch tasks. To achieve this, we seek a system that can gracefully throttle batch workloads and has task-level quality-of-service (QoS) differentiation.
In this paper we present Thunderbolt, a hardware-agnostic power capping system that ensures safe power oversubscription while minimizing impact on both long-running throughput-oriented tasks and latency-sensitive tasks. It uses a two-threshold, randomized unthrottling/multiplicative decrease control policy to ensure power safety with minimized performance degradation. It leverages the Linux kernel's CPU bandwidth control feature to achieve task-level QoS-aware throttling. It is robust even in the face of power telemetry unavailability. Evaluation results at the node and cluster levels demonstrate the system's responsiveness, effectiveness for reducing power, capability of QoS differentiation, and minimal impact on latency and task health. We have deployed this system at scale, in multiple production clusters. As a result, we enabled power oversubscription gains of 9%--25%, where none was previously possible.
View details
Data Center Power Oversubscription with a Medium Voltage Power Plane and Priority-Aware Capping
David Landhuis
Shaohong Li
Darren De Ronde
Thomas Blooming
Anand Ramesh
James Kennedy
Christopher Malone
Jimmy Clidaras
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, USA (2020), 497–511
Preview abstract
As major web and cloud service providers continue to accelerate the demand for new data center capacity worldwide, the importance of power oversubscription as a lever to reduce provisioning costs has never been greater. Building on insights from Google-scale deployments, we design and deploy a new architecture across hardware and software to improve power oversubscription significantly. Our design includes (1) a new medium voltage power plane to enable larger power sharing domains (across tens of MW of equipment) and (2) a scalable, fast, and robust power capping service coordinating multiple priorities of workload on every node. Over several years of production deployment, our co-design has enabled power oversubscription of 25% or higher, saving hundreds of millions of dollars of data center costs, while preserving the desired availability and performance of all workloads.
View details