John Wilkes

See my personal page for more information and a full paper listing.
Authored Publications
    Physical Deployability Matters
    Proc. HotNets 2023: Twenty-Second ACM Workshop on Hot Topics in Networks
    While many network research papers address issues of deployability, with a few exceptions, this has been limited to protocol compatibility or switch-resource constraints, such as flow table sizes. We argue that good network designs must also consider the costs and complexities of deploying the design within the constraints of the physical environment in a datacenter: physical deployability. The traditional metrics of network "goodness" mostly do not account for these costs and constraints, and this may partially explain why some otherwise attractive designs have not been deployed in real-world datacenters.
    We (Google's networking teams) would like to increase our collaborations with academic researchers on data-driven networking research. There are some significant constraints on our ability to directly share data, and in case not everyone in the community understands these, this document provides a brief summary. There are some models that can work (primarily, interns and visiting scientists), and we describe some specific areas where we would welcome proposals to work within those models.
    Autopilot: Workload Autoscaling at Google Scale
    Paweł Findeisen
    Jacek Świderski
    Przemyslaw Broniek
    Beata Strack
    Piotr Witusowski
    Proceedings of the Fifteenth European Conference on Computer Systems, Association for Computing Machinery (2020)
    In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, resulting in delaying or dropping end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage. To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack (the difference between the limit and the actual resource usage) while minimizing the risk that a task is killed with an out-of-memory (OOM) error or its performance degraded because of CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to walk this line. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10. Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
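    A minimal sketch of the idea behind percentile-based vertical scaling, not Autopilot's actual algorithm (the usage history, percentile, and safety margin below are illustrative assumptions): pick a limit just above a high percentile of historical usage and compare the resulting slack against a cautious hand-set limit.

        def recommend_limit(usage_samples, percentile=0.98, safety_margin=1.1):
            """Pick a limit just above the given percentile of observed usage."""
            ordered = sorted(usage_samples)
            idx = min(int(percentile * len(ordered)), len(ordered) - 1)
            return ordered[idx] * safety_margin

        def average_slack(usage_samples, limit):
            """Slack = (limit - usage) / limit, averaged over the samples."""
            return sum((limit - u) / limit for u in usage_samples) / len(usage_samples)

        # Hypothetical per-task memory usage samples (GiB) from prior runs of a job.
        history = [1.2, 1.3, 1.1, 1.4, 1.6, 1.5, 1.3, 1.2, 1.8, 1.4]
        manual_limit = 4.0                     # a cautious human-chosen limit
        auto_limit = recommend_limit(history)  # data-driven limit
        print(f"manual slack: {average_slack(history, manual_limit):.0%}")
        print(f"auto slack:   {average_slack(history, auto_limit):.0%}")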
    Borg: the Next Generation
    Muhammad Tirmazi
    Adam Barker
    Md Ehtesam Haque
    Zhijing Gene Qin
    Mor Harchol-Balter
    EuroSys'20, ACM, Heraklion, Crete (2020)
    This paper analyzes a newly-published trace that covers 8 different Borg clusters for the month of May 2019. The trace enables researchers to explore how scheduling works in large-scale production compute clusters. We highlight how Borg has evolved and perform a longitudinal comparison of the newly-published 2019 trace against the 2011 trace, which has been highly cited within the research community. Our findings show that Borg features such as alloc sets are used for resource-heavy workloads; automatic vertical scaling is effective; job dependencies account for much of the high failure rates reported by prior studies; the workload arrival rate has increased, as has the use of resource over-commitment; the workload mix has changed: jobs have migrated from the free tier into the best-effort batch tier; the workload exhibits an extremely heavy-tailed distribution in which the top 1% of jobs consume over 99% of resources; and there is a great deal of variation between different clusters.
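    The heavy-tail observation in the abstract is easy to recompute on any per-job usage data; a toy sketch of the statistic (synthetic numbers, not the published trace):

        import random

        def top_share(per_job_usage, top_fraction=0.01):
            """Fraction of total resource usage consumed by the top jobs."""
            ordered = sorted(per_job_usage, reverse=True)
            k = max(1, int(len(ordered) * top_fraction))
            return sum(ordered[:k]) / sum(ordered)

        # Synthetic, heavy-tailed per-job usage values, only to exercise the function.
        random.seed(0)
        usage = [random.paretovariate(0.7) for _ in range(10_000)]
        print(f"top 1% of jobs account for {top_share(usage):.1%} of total usage")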
    Nines are Not Enough: Meaningful Metrics for Clouds
    Proc. 17th Workshop on Hot Topics in Operating Systems (HotOS) (2019)
    Cloud customers want reliable, understandable promises from cloud providers that their applications will run reliably and with adequate performance, but today, providers offer only limited guarantees, which creates uncertainty for customers. Providers also must define internal metrics to allow them to operate their systems without violating customer promises or expectations. We explore why these guarantees are hard to define. We show that this problem shares some similarities with the challenges of applying statistics to make decisions based on sampled data. We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems. Overall, we offer a partial framework for thinking about Service Level Objectives (SLOs), and discuss some unsolved challenges.
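    One of the statistical difficulties the paper alludes to can be seen in a few lines of simulation (a sketch under assumed numbers, not taken from the paper): with a modest request count, observed availability bounces across a 99.9% target even when the true success rate equals it.

        import random

        def observed_availability(true_success_rate, num_requests, rng):
            """Empirical success ratio over a finite sample of requests."""
            successes = sum(rng.random() < true_success_rate for _ in range(num_requests))
            return successes / num_requests

        rng = random.Random(42)
        target = 0.999
        for trial in range(5):
            obs = observed_availability(target, num_requests=2000, rng=rng)
            verdict = "meets" if obs >= target else "misses"
            print(f"trial {trial}: observed {obs:.4%} -> {verdict} the 99.9% target")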
    Borg, Omega, and Kubernetes
    Brendan Burns
    Brian Grant
    David Oppenheimer
    ACM Queue, 14 (2016), pp. 70-93
    Lessons learned from three container management systems over a decade.
    Service Level Objectives
    Niall Murphy
    Cody Smith
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    DieHard: reliable scheduling to survive correlated failures in cloud data centers
    Mina Sedaghat
    Eddie Wadbro
    Sara De Luna
    Oleg Seleznjev
    Erik Elmroth
    International Symposium on Cluster, Cloud and Grid Computing (CCGrid), IEEE/ACM, Cartagena, Colombia (2016), pp. 52-59
    In large-scale data centers, a single fault can simultaneously cause correlated failures of several physical machines and of the tasks running on them. Such correlated failures can severely damage the reliability of a service or a job. This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, and their effect on jobs running multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job's reliability in the presence of correlated failures. In addition, we address the problem of scheduling a job with reliability constraints. We formulate the scheduling problem as an optimization problem, with the aim of achieving the desired reliability with the minimum number of extra tasks. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement that achieves the desired job reliability. We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job's reliability.
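    The reliability question the paper solves analytically can also be estimated by simulation; a Monte Carlo sketch (not the authors' method; the failure probabilities and placements below are made up) comparing a packed placement with a spread-out one:

        import random

        def job_reliability(placement, domain_failure_prob, machine_failure_prob,
                            needed, trials=100_000, seed=1):
            """placement maps a failure domain id to the number of tasks placed there."""
            rng = random.Random(seed)
            ok = 0
            for _ in range(trials):
                survivors = 0
                for domain, tasks in placement.items():
                    if rng.random() < domain_failure_prob:
                        continue  # correlated failure: every task in this domain is lost
                    survivors += sum(rng.random() >= machine_failure_prob
                                     for _ in range(tasks))
                if survivors >= needed:
                    ok += 1
            return ok / trials

        # Same 6-task job needing 4 live replicas: packed into 2 domains vs. spread over 6.
        packed = {0: 3, 1: 3}
        spread = {d: 1 for d in range(6)}
        for name, placement in [("packed", packed), ("spread", spread)]:
            r = job_reliability(placement, domain_failure_prob=0.01,
                                machine_failure_prob=0.005, needed=4)
            print(f"{name}: estimated reliability {r:.4f}")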
    Large-scale cluster management at Google with Borg
    Luis Pedrosa
    Madhukar R. Korupolu
    David Oppenheimer
    Proceedings of the European Conference on Computer Systems (EuroSys), ACM, Bordeaux, France (2015)
    Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
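    As an aside on one of the mechanisms named in the abstract above, over-commitment, a toy first-fit placement (not Borg's scheduler; the task sizes and the 0.6 usage factor are assumptions) shows why scheduling against estimated usage rather than requested limits can save machines:

        def first_fit(demands, machine_capacity):
            """Number of machines a naive first-fit placement needs."""
            machines = []  # remaining capacity per machine
            for d in demands:
                for i, free in enumerate(machines):
                    if d <= free:
                        machines[i] -= d
                        break
                else:
                    machines.append(machine_capacity - d)
            return len(machines)

        requested = [4, 4, 4, 3, 3, 3]             # CPU limits the users asked for
        estimated = [r * 0.6 for r in requested]   # assumed typical usage
        print("machines needed, scheduling on limits:", first_fit(requested, 8))
        print("machines needed, over-committing:     ", first_fit(estimated, 8))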
    One of the key factors in selecting a good scheduling algorithm is using an appropriate metric for comparing schedulers. But which metric should be used when evaluating schedulers for warehouse-scale (cloud) clusters, which have machines of different types and sizes, heterogeneous workloads with dependencies and constraints on task placement, and long-running services that consume a large fraction of the total resources? Traditional scheduler evaluations that focus on metrics such as queuing delay, makespan, and running time fail to capture important behaviors, and ones that rely on workload synthesis and scaling often ignore important factors such as constraints. This paper explains some of the complexities and issues in evaluating warehouse-scale schedulers, focusing on what we find to be the single most important aspect in practice: how well they pack long-running services into a cluster. We describe and compare four metrics for evaluating the packing efficiency of schedulers, in increasing order of sophistication: aggregate utilization, hole filling, workload inflation, and cluster compaction.
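    The simpler of these metrics are easy to state concretely; a sketch of a hole-filling style measure (as paraphrased here from the abstract, not the authors' exact definition, with made-up machine states):

        def hole_filling(free_resources, filler=(1.0, 2.0)):
            """free_resources: (free_cpu, free_ram_gib) per machine.
            Count how many fixed-size filler tasks fit in the leftover space."""
            filler_cpu, filler_ram = filler
            return sum(int(min(cpu // filler_cpu, ram // filler_ram))
                       for cpu, ram in free_resources)

        # Two packings leaving the same total free CPU (5.0 cores), fragmented differently.
        packing_a = [(4.0, 8.0), (0.5, 1.0), (0.5, 1.0)]
        packing_b = [(1.9, 4.0), (1.7, 3.0), (1.4, 3.0)]
        print("scheduler A leaves room for", hole_filling(packing_a), "filler tasks")
        print("scheduler B leaves room for", hole_filling(packing_b), "filler tasks")

    Aggregate utilization would score both packings identically; the hole-filling count rewards the less fragmented one.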