Krzysztof Rzadca
I am currently a visiting researcher at Google, where I work on scheduling and resource management in Google Cloud. I am on leave from my faculty position: I am an associate professor in the Institute of Informatics of the Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland, where I completed my habilitation (HDR) in 2015 (my academic website).
Before joining UW, I worked as a research fellow (post-doc) in Anwitaman Datta's SANDS working group in the School of Computer Engineering (SCE) of the Nanyang Technological University (NTU), Singapore. I did my PhD on resource management in grids jointly at the Laboratoire d'Informatique de Grenoble of the Institut national polytechnique de Grenoble, France, and the Polish-Japanese Institute of Information Technology, Warsaw, Poland, as a French government fellow (co-tutelle grant). I graduated with an MSc from the Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland.
I'm interested in resource management and scheduling in large-scale distributed systems, such as clouds, datacenters or supercomputers.
Authored Publications
Take it to the Limit: Peak Prediction-driven Resource Overcommitment in Datacenters
Noman Bashir
David Irwin
Sree Kodakara
Rohit Jnagal
EuroSys 2021 (2021)
Abstract
To increase utilization, datacenter schedulers often overcommit resources, where the sum of resources allocated to the tasks on a machine exceeds its physical capacity. Setting the right level of overcommitment is a challenging problem: low overcommitment leads to wasted resources, while high overcommitment leads to task performance degradation. In this paper, we take a first-principles approach to designing and evaluating overcommit policies by asking a basic question: assuming complete knowledge of each task's future resource usage, what is the safest overcommit policy that yields the highest utilization? We call this policy the peak oracle. We then devise practical overcommit policies that mimic this peak oracle by predicting future machine resource usage. We simulate our overcommit policies using the recently-released Google cluster trace, and show that they result in higher utilization and fewer overcommit errors than policies based on per-task allocations. We also deploy these policies to machines inside Google's datacenters serving its internal production workload. We show that our overcommit policies increase these machines' usable CPU capacity by 10-16% compared to no overcommitment.
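As a rough illustration of the peak-oracle idea described above (a sketch, not the policies evaluated in the paper), the Python snippet below contrasts scheduling against the sum of task limits with scheduling against the true future peak of machine usage, and against a simple predictor that approximates that peak with a high percentile of recent usage. The function names, the 98th-percentile predictor, and the toy usage trace are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): compare a
# limit-based view of a machine's load with a "peak oracle" computed from
# per-task usage traces, and with a simple predictor that mimics the oracle
# by taking a high percentile of recent machine-level usage.
import numpy as np

def peak_oracle(task_usage, horizon):
    """True peak of total machine usage over the next `horizon` time steps."""
    machine_usage = task_usage.sum(axis=0)        # total usage per time step
    return machine_usage[:horizon].max()

def predicted_peak(past_machine_usage, percentile=98):
    """Practical stand-in for the oracle: a high percentile of past usage."""
    return float(np.percentile(past_machine_usage, percentile))

# Toy example: 3 tasks, each with a limit of 1.0 CPU, on a 2-CPU machine.
rng = np.random.default_rng(0)
task_usage = rng.uniform(0.1, 0.5, size=(3, 200))  # actual usage well below limits
limits = np.array([1.0, 1.0, 1.0])
capacity = 2.0

sum_of_limits = limits.sum()                         # 3.0 > capacity: overcommitted
oracle_peak = peak_oracle(task_usage, horizon=200)   # true future peak demand
estimate = predicted_peak(task_usage.sum(axis=0)[:100])

print(f"sum of limits: {sum_of_limits:.2f}, capacity: {capacity:.2f}")
print(f"oracle peak:   {oracle_peak:.2f}, predicted peak: {estimate:.2f}")
# Scheduling against the predicted peak (rather than the sum of limits)
# reclaims the gap between limits and actual usage, provided the prediction
# stays above the true future peak.
```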
Autopilot: Workload Autoscaling at Google Scale
Paweł Findeisen
Jacek Świderski
Przemyslaw Broniek
Beata Strack
Piotr Witusowski
Proceedings of the Fifteenth European Conference on Computer Systems, Association for Computing Machinery (2020)
Abstract
In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, resulting in delaying or dropping end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage.
To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack – the difference between the limit and the actual resource usage – while minimizing the risk that a task is killed with an out-of-memory (OOM) error or its performance degraded because of CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to walk this line. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10.
Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
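As a hedged sketch of the vertical-scaling idea described above (not Autopilot's actual recommenders), the snippet below derives a CPU limit from a high percentile of a task's historical usage plus a safety margin, and compares the resulting slack with a cautious hand-set limit. The percentile, margin, and toy workload are illustrative assumptions.

```python
# Hedged sketch of the idea behind vertical autoscaling: derive a resource
# limit from a task's historical usage instead of a hand-picked value.
# The percentile and safety margin below are illustrative parameters,
# not Autopilot's actual recommenders.
import numpy as np

def recommend_limit(usage_history, percentile=95, safety_margin=1.15):
    """Limit = high percentile of observed usage, plus headroom against OOMs/throttling."""
    return float(np.percentile(usage_history, percentile)) * safety_margin

def slack(limit, usage_history):
    """Fraction of the limit that sits unused on average."""
    return 1.0 - float(np.mean(usage_history)) / limit

# Toy example: a task whose CPU usage hovers around 2 cores.
rng = np.random.default_rng(1)
usage = rng.normal(loc=2.0, scale=0.2, size=1_000).clip(min=0.5)

manual_limit = 4.0                       # a cautious, hand-set limit
auto_limit = recommend_limit(usage)      # usage-driven recommendation

print(f"manual limit: {manual_limit:.2f} cores, slack: {slack(manual_limit, usage):.0%}")
print(f"auto limit:   {auto_limit:.2f} cores, slack: {slack(auto_limit, usage):.0%}")
```

The paper describes Autopilot as combining statistics over prior executions of a job with machine learning and finely-tuned heuristics; the percentile rule above only conveys the general limit-follows-usage principle.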