Take it to the Limit: Peak Prediction-driven Resource Overcommitment in Datacenters

Noman Bashir

Nan Deng

Krzysiek Michał Rządca

David Irwin

Sree Kodakara

Rohit Jnagal

EuroSys 2021

Abstract

To increase utilization, datacenter schedulers often overcommit resources, so that the sum of resources allocated to the tasks on a machine exceeds its physical capacity. Setting the right level of overcommitment is a challenging problem: low overcommitment leads to wasted resources, while high overcommitment leads to task performance degradation. In this paper, we take a first-principles approach to designing and evaluating overcommit policies by asking a basic question: assuming complete knowledge of each task’s future resource usage, what is the safest overcommit policy that yields the highest utilization? We call this policy the peak oracle. We then devise practical overcommit policies that mimic this peak oracle by predicting future machine resource usage. We simulate our overcommit policies using the recently released Google cluster trace, and show that they result in higher utilization and fewer overcommit errors than policies based on per-task allocations. We also deploy these policies to machines inside Google’s datacenters serving its internal production workload. We show that our overcommit policies increase these machines’ usable CPU capacity by 10-16% compared to no overcommitment.
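As a rough illustration of the idea in the abstract, the sketch below contrasts a peak-oracle style estimate with the sum of per-task allocations. It is not the paper's implementation: the function names (`peak_oracle`, `free_capacity`), the data layout, and all numbers are assumptions made up for this example, and the oracle is approximated as the maximum over future time steps of the machine's total usage.

```python
from typing import Dict, List


def peak_oracle(future_usage: Dict[str, List[float]]) -> float:
    """Peak of the machine's total usage over the known future window.

    Assumption: future_usage maps a (hypothetical) task name to its CPU
    usage at each future time step; shorter series mean the task has ended.
    """
    if not future_usage:
        return 0.0
    horizon = max(len(series) for series in future_usage.values())
    totals = [0.0] * horizon
    for series in future_usage.values():
        for t, used in enumerate(series):
            totals[t] += used
    return max(totals) if totals else 0.0


def free_capacity(capacity: float, reserved: float) -> float:
    """Capacity still available to new tasks, never negative."""
    return max(0.0, capacity - reserved)


# Made-up numbers for a single machine with 8 CPUs.
capacity = 8.0
future_usage = {
    "task_a": [2.0, 3.0, 1.0],
    "task_b": [1.0, 1.0, 4.0],
}
limits = {"task_a": 4.0, "task_b": 5.0}  # per-task allocations

oracle_peak = peak_oracle(future_usage)   # total usage per step: 3, 4, 5
allocation_sum = sum(limits.values())     # 4 + 5

free_with_oracle = free_capacity(capacity, oracle_peak)
free_with_limits = free_capacity(capacity, allocation_sum)

print(f"oracle peak: {oracle_peak}, free: {free_with_oracle}")
print(f"allocation sum: {allocation_sum}, free: {free_with_limits}")
```

In this toy case the tasks never peak at the same time, so reserving the peak of the summed usage leaves room for more work, while reserving the sum of allocations leaves none. A practical policy has no access to future usage and would have to predict the peak instead.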

Research Areas

Software systems
