Autopilot: Workload Autoscaling at Google Scale

Krzysztof Rzadca; Paweł Findeisen; Jacek Świderski; Przemyslaw Zych; Przemyslaw Broniek; Jarek Kusmierek; Paweł Krzysztof Nowak; Beata Strack; Piotr Witusowski; Steven Hand; John Wilkes

Proceedings of the Fifteenth European Conference on Computer Systems, Association for Computing Machinery (2020) (to appear)

Abstract

In many public and private cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, which delays or drops end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage.

To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack – the difference between the limit and the actual resource usage – while minimizing the risk that a task is killed with an out-of-memory (OOM) error or that its performance is degraded because of CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely tuned heuristics, to walk this line. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10.
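To make the notion of slack concrete, the following is a minimal sketch of how a slack percentage could be computed for a single task. It assumes that the percentages quoted above are slack normalized by the limit, averaged over usage samples; the function name `relative_slack` and the sample values are hypothetical and are not taken from the paper or from Autopilot itself.

```python
from typing import Sequence


def relative_slack(limit: float, usage: Sequence[float]) -> float:
    """Return the average slack as a fraction of the limit.

    Slack for one sample is (limit - usage). Dividing by the limit
    is an assumed normalization that yields a unitless fraction.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if not usage:
        raise ValueError("usage must contain at least one sample")
    mean_usage = sum(usage) / len(usage)
    return (limit - mean_usage) / limit


# Hypothetical memory usage samples (GiB) for one task; not data from the paper.
samples_gib = [3.0, 3.5, 4.0, 3.5]

# A cautious, manually chosen limit versus a tighter one.
manual = relative_slack(limit=7.0, usage=samples_gib)
tighter = relative_slack(limit=5.0, usage=samples_gib)

print(f"manual limit slack:  {manual:.0%}")
print(f"tighter limit slack: {tighter:.0%}")

# A tighter limit is only safe if it still covers the observed peak;
# otherwise the task would risk an OOM kill.
assert max(samples_gib) <= 5.0
```

The sketch illustrates the trade-off described above: lowering the limit reduces slack, but only limits that stay above the observed peak avoid the risk of an OOM kill.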

Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
