Jump to Content
John Wilkes

John Wilkes

See my personal page for more information, and a full paper listing.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Physical Deployability Matters
    Proc. HotNets 2023: Twenty-Second ACM Workshop on Hot Topics in Networks
    Preview abstract While many network research papers address issues of deployability, with a few exceptions, this has been limited to protocol compatibility or switch-resource constraints, such as flow table sizes. We argue that good network designs must also consider the costs and complexities of deploying the design within the constraints of the physical environment in a datacenter: \emph{physical} deployability. The traditional metrics of network ``goodness'' mostly do not account for these costs and constraints, and this may partially explain why some otherwise attractive designs have not been deployed in real-world datacenters. View details
    Preview abstract We (Google's networking teams) would like to increase our collaborations with academic researchers related to data-driven networking research. There are some significant constraints on our ability to directly share data, and in case not everyone in the community understands these, this document provides a brief summary. There are some models which can work (primarily, interns and visiting scientists). We describe some specific areas where we would welcome proposals to work within those models View details
    Borg: the Next Generation
    Muhammad Tirmazi
    Adam Barker
    Md Ehtesam Haque
    Zhijing Gene Qin
    Mor Harchol-Balter
    EuroSys'20, ACM, Heraklion, Crete (2020)
    Preview abstract This paper analyzes a newly-published trace that covers 8 different Borg clusters for the month of May 2019. The trace enables researchers to explore how scheduling works in large-scale production compute clusters. We highlight how Borg has evolved and perform a longitudinal comparison of the newly-published 2019 trace against the 2011 trace, which has been highly cited within the research community. Our findings show that Borg features such as alloc sets are used for resource-heavy workloads; automatic vertical scaling is effective; job-dependencies account for much of the high failure rates reported by prior studies; the workload arrival rate has increased, as has the use of resource over-commitment; the workload mix has changed, jobs have migrated from the free tier into the best-effort batch tier; the workload exhibits an extremely heavy-tailed distribution where the top 1% of jobs consume over 99% of resources; and there is a great deal of variation between different clusters. View details
    Autopilot: Workload Autoscaling at Google Scale
    Paweł Findeisen
    Jacek Świderski
    Przemyslaw Broniek
    Beata Strack
    Piotr Witusowski
    Proceedings of the Fifteenth European Conference on Computer Systems, Association for Computing Machinery (2020) (to appear)
    Preview abstract In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, resulting in delaying or dropping end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage. To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack – the difference between the limit and the actual resource usage – while minimizing the risk that a task is killed with an out-of-memory (OOM) error or its performance degraded because of CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to walk this line. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10. Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage. View details
    Nines are Not Enough: Meaningful Metrics for Clouds
    Proc. 17th Workshop on Hot Topics in Operating Systems (HoTOS) (2019)
    Preview abstract Cloud customers want reliable, understandable promises from cloud providers that their applications will run reliably and with adequate performance, but today, providers offer only limited guarantees, which creates uncertainty for customers. Providers also must define internal metrics to allow them to operate their systems without violating customer promises or expectations. We explore why these guarantees are hard to define. We show that this problem shares some similarities with the challenges of applying statistics to make decisions based on sampled data. We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems. Overall, we offer a partial framework for thinking about Service Level Objectives (SLOs), and discuss some unsolved challenges. View details
    DieHard: reliable scheduling to survive correlated failures in cloud data centers
    Mina Sedaghat
    Eddie Wadbro
    Sara De Luna
    Oleg Seleznjev
    Erik Elmroth
    International Symposium on Cluster, Cloud and Grid Computing (CCGrid), IEEE/ACM, Cartagena, Colombia (2016), pp. 52-59
    Preview abstract In large scale data centers, a single fault can lead to correlated failures of several physical machines and the tasks running on them, simultaneously. Such correlated failures can severely damage the reliability of a service or a job. This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, on jobs running multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job’s reliability in the presence of correlated failures. In addition, we address the problem of scheduling a job with reliability constraints. We formulate the scheduling problem as an optimization problem, with the aim being to achieve the desired reliability with the minimum number of extra tasks. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement to achieve a desired job reliability. We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job’s reliability. View details
    Borg, Omega, and Kubernetes
    Brendan Burns
    Brian Grant
    David Oppenheimer
    ACM Queue, vol. 14 (2016), pp. 70-93
    Preview abstract Lessons learned from three container management systems over a decade. View details
    Service Level Objectives
    Niall Murphy
    Cody Smith
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    Preview
    Large-scale cluster management at Google with Borg
    Luis Pedrosa
    Madhukar R. Korupolu
    David Oppenheimer
    Proceedings of the European Conference on Computer Systems (EuroSys), ACM, Bordeaux, France (2015)
    Preview abstract Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it. View details
    Preview abstract One of the key factors in selecting a good scheduling algorithm is using an appropriate metric for comparing schedulers. But which metric should be used when evaluating schedulers for warehouse-scale (cloud) clusters, which have machines of different types and sizes, heterogeneous workloads with dependencies and constraints on task placement, and long-running services that consume a large fraction of the total resources? Traditional scheduler evaluations that focus on metrics such as queuing delay, makespan, and running time fail to capture important behaviors – and ones that rely on workload synthesis and scaling often ignore important factors such as constraints. This paper explains some of the complexities and issues in evaluating warehouse scale schedulers, focusing on what we find to be the single most important aspect in practice: how well they pack long-running services into a cluster. We describe and compare four metrics for evaluating the packing efficiency of schedulers in increasing order of sophistication: aggregate utilization, hole filling, workload inflation and cluster compaction. View details
    Long-term SLOs for reclaimed cloud computing resources
    Marcus Carvalho
    Franciso Brasileiro
    ACM Symposium on Cloud Computing (SoCC), ACM, Seattle, WA, USA (2014), 20:1-20:13
    Preview abstract The elasticity promised by cloud computing does not come for free. Providers need to reserve resources to allow users to scale on demand, and cope with workload variations, which results in low utilization. The current response to this low utilization is to re-sell unused resources with no Service Level Objectives (SLOs) for availability. In this paper, we show how to make some of these reclaimable resources more valuable by providing strong, long-term availability SLOs for them. These SLOs are based on forecasts of how many resources will remain unused during multi-month periods, so users can do capacity planning for their long-running services. By using confidence levels for the predictions, we give service providers control over the risk of violating the availability SLOs, and allow them trade increased risk for more resources to make available. We evaluated our approach using 45 months of workload data from 6 production clusters at Google, and show that 6--17% of the resources can be re-offered with a long-term availability of 98.9% or better. A conservative analysis shows that doing so may increase the profitability of selling reclaimed resources by 22--60%. View details
    CPI^2: CPU performance isolation for shared compute clusters
    Robert Hagmann
    Rohit Jnagal
    Vrigo Gokhale
    SIGOPS European Conference on Computer Systems (EuroSys), ACM, Prague, Czech Republic (2013), pp. 379-391
    Preview abstract Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other program's behavior. Our solution, CPI2, uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job. We have rolled out CPI2 to all of Google's shared compute clusters. The paper presents the analysis that lead us to that outcome, including both case studies and a large-scale evaluation of its ability to solve real production issues. View details
    Omega: flexible, scalable schedulers for large compute clusters
    Malte Schwarzkopf
    Andy Konwinski
    Michael Abd-El-Malek
    SIGOPS European Conference on Computer Systems (EuroSys), ACM, Prague, Czech Republic (2013), pp. 351-364
    Preview abstract Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control. We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach -- all driven by real-life Google production workloads. View details
    AGILE: elastic distributed resource scaling for Infrastructure-as-a-Service
    Hiep Nguyen
    Zhiming Shen
    Xiaohui Gu
    Sethuraman Subbiah
    10th International Conference on Autonomic Computing (ICAC), USENIX, San Jose, CA, USA (2013), pp. 69-82
    Preview abstract Dynamically adjusting the number of virtual machines (VMs) assigned to a cloud application to keep up with load changes and interference from other uses typically requires detailed application knowledge and an ability to know the future, neither of which are readily available to infrastructure service providers or application owners. The result is that systems need to be over-provisioned (costly), or risk missing their performance Service Level Objectives (SLOs) and have to pay penalties (also costly). AGILE deals with both issues: it uses wavelets to provide a medium-term resource demand prediction with enough lead time to start up new application server instances before performance falls short, and it uses dynamic VM cloning to reduce application startup times. Tests using RUBiS and Google cluster traces show that AGILE can predict varying resource demands over the medium-term with up to 3.42× better true positive rate and 0.34× the false positive rate than existing schemes. Given a target SLO violation rate, AGILE can efficiently handle dynamic application workloads, reducing both penalties and user dissatisfaction. View details
    Preview abstract Cloud providers such as Google are interested in fostering research on the daunting technical challenges they face in supporting planetary-scale distributed systems, but no academic organizations have similar scale systems on which to experiment. Fortunately, good research can still be done using traces of real-life production workloads, but there are risks in releasing such data, including inadvertently disclosing confidential or proprietary information, as happened with the Netflix Prize data. This paper discusses these risks, and our approach to them, which we call {\em systematic obfuscation}. It protects proprietary and personal data while leaving it possible to answer some interesting research questions. We explain and motivate some of the risks and concerns and propose how they can best be mitigated, using as an example our recent publication of a month-long trace of a production system workload on a 11k-machine cluster. View details
    CloudScale: elastic resource scaling for multi-tenant cloud systems
    Zhiming Shen
    Sethuraman Subbiah
    Xiaohui Gu
    Symposium on Cloud Computing (SoCC), ACM, Cascais, Portugal (2011)
    Preview
    PRESS: PRedictive Elastic ReSource Scaling for cloud systems
    Zhenhuan Gong
    Xiaohui Gu
    6th IEEE/IFIP International Conference on Network and Service Management (CNSM 2010), Niagara Falls, Canada
    Preview abstract Cloud systems require elastic resource allocation to minimize resource provisioning costs while meeting service level objectives (SLOs). In this paper, we present a novel PRedictive Elastic reSource Scaling (PRESS) scheme for cloud systems. PRESS unobtrusively extracts fine-grained dynamic patterns in application resource demands and adjust their resource allocations automatically. Our approach leverages light-weight signal processing and statistical learning algorithms to achieve online predictions of dynamic application resource requirements. We have implemented the PRESS system on Xen and tested it using RUBiS and an application load trace from Google. Our experiments show that we can achieve good resource prediction accuracy with less than 5% over-estimation error and near zero under-estimation error, and elastic resource scaling can both significantly reduce resource waste and SLO violations. View details
    Do you know your IQ? A research agenda for information quality in systems
    Kimberley Keeton, HP Labs
    Pankaj Mehra, HP Labs
    HotMETRICS'09 (2009)
    Preview abstract Information quality (IQ) is a measure of how fit information is for a purpose. Sometimes called Quality of Information (QoI) by analogy with Quality of Service (QoS), it quantifies whether the right information is being used to make a decision or take an action. Failure to understand whether information is of adequate quality can lead to bad decisions and catastrophic effects. The results can include system outages, increased costs, lost revenue -- and worse. Quantifying information quality can help improve decision making, but the ultimate goal should be to select or construct information sources that have the appropriate balance between information quality and the cost of providing it. In this paper, we provide a brief introduction to the field, argue the case for applying information quality metrics in the systems domain, and propose a research agenda to explore this space. View details
    Storage, data, and information systems
    Christopher Hoover
    Beth Keer
    Pankaj Mehra
    Alistair Veitch
    HP Laboratories, Palo Alto, CA, USA (2008)
    Hibernator: helping disk arrays sleep through the winter
    Qingbo Zhu
    Lin Tan
    Yuanyuan Zhou
    Kimberly Keeton
    SOSP (2005), pp. 177-190
    Lessons and challenges in automating data dependability
    Kimberly Keeton
    Jeffrey S. Chase
    Dirk Beyer
    Cipriano A. Santos
    ACM SIGOPS European Workshop (2004), pp. 4
    Utilification
    Jaap Suermondt
    ACM SIGOPS European Workshop (2004), pp. 13
    Seneca: remote mirroring done write
    Minwen Ji
    Alistair C. Veitch
    USENIX Annual Technical Conference (2003), pp. 253-268
    Selecting RAID levels for disk arrays
    Eric Anderson
    Ram Swaminathan
    Alistair C. Veitch
    Guillermo A. Alvarez
    FAST (2002), pp. 189-201
    Back to the future: dependable computing = dependable services
    Jeffrey S. Chase
    Amin Vahdat
    ACM SIGOPS European Workshop, Saint-Émilion, France (2002), pp. 170-173
    Minerva: An automated resource provisioning tool for large-scale storage systems
    Guillermo A. Alvarez
    Elizabeth Borowsky
    Susie Go
    Theodore H. Romer
    Ralph A. Becker-Szendy
    Richard A. Golding
    Mirjana Spasojevic
    Alistair C. Veitch
    ACM Trans. Comput. Syst., vol. 19 (2001), pp. 483-518
    Towards Global Storage Management and Data Placement
    Alistair C. Veitch
    Erik Riedel
    Simon J. Towers
    HotOS (2001), pp. 184
    Storage Service Providers: a Solution for Storage Management? (Panel)
    Banu Ozden
    Bruce Hillyer
    Wee Teck Ng
    Elizabeth A. M. Shriver
    David J. DeWitt
    Bruce Gordon
    Jim Gray
    VLDB (2001), pp. 618-619
    An experimental study of data migration algorithms
    Eric J. Anderson
    Joseph Hall
    Jason D. Hartline
    Michael Hobbs
    Anna R. Karlin
    Jared Saia
    Ram Swaminathan
    Algorithm Engineering (2001), pp. 145-158
    EOS - the dawn of the resource economy (position summary)
    Patrick Goldsack
    G. John Janakiraman
    Lance Russell
    Sharad Singhal
    Andrew Thomas
    HotOS, Elmau/Oberbayern, Germany (2001), pp. 188
    An Analytic Behavior Model for Disk Drives with Readahead Caches and Request Reordering
    Elizabeth A. M. Shriver
    SIGMETRICS (1998), pp. 182-191
    Capacity planning with phased workloads
    Elizabeth Borowsky
    Richard A. Golding
    P. Jacobson
    L. Schreier
    Mirjana Spasojevic
    WOSP (1998), pp. 199-207
    The HP AutoRAID hierarchical storage system
    Richard Golding
    Tim Sullivan
    ACM Transactions on Computer Systems (TOCS), vol. 14 (1996), pp. 108-136
    Idleness is not sloth
    Richard A. Golding
    Peter Bosch II
    Tim Sullivan
    USENIX Winter (1995), pp. 201-212
    The TickerTAIP parallel RAID architecture
    Pei Cao
    Swee Boon Lim
    Shivakumar Venkataraman
    ACM Transactions on Computer Systems (TOCS), vol. 12 (1994), pp. 236-269
    The TickerTAIP parallel RAID architecture
    Pei Cao
    Swee Boon Lim
    Shivakumar Venkataraman
    ISCA (1993), pp. 52-63
    DataMesh parallel storage servers
    Chia Chao
    Robert English
    David Jacobson
    Bart Sears
    Alex Stepanov
    SIGOPS Oper. Syst. Rev., vol. 26 (1992), pp. 11