Authored Publications
While many network research papers address issues of deployability, with
a few exceptions, this has been limited to protocol compatibility or
switch-resource constraints, such as flow table sizes.
We argue that good network designs must also consider the costs and
complexities of deploying the design within the constraints of the physical
environment in a datacenter: physical deployability.
The traditional metrics of network "goodness" mostly do not account
for these costs and constraints, and this may partially explain why some
otherwise attractive designs have not been deployed in real-world datacenters.
Data-driven Networking Research: models for academic collaboration with Industry (a Google point of view)
Priya Mahadevan
Christophe Diot
Amin Vahdat
Computer Communication Review, 51:4(2021), pp. 47-49
We (Google's networking teams) would like to increase our collaborations with academic researchers related to data-driven networking research.
There are some significant constraints on our ability to directly share data, and in case not everyone in the community understands these, this document provides a brief summary.
There are some models which can work (primarily, interns and visiting scientists).
We describe some specific areas where we would welcome proposals to work within those models.
Autopilot: Workload Autoscaling at Google Scale
Paweł Findeisen
Jacek Świderski
Przemyslaw Broniek
Beata Strack
Piotr Witusowski
Proceedings of the Fifteenth European Conference on Computer Systems, Association for Computing Machinery(2020) (to appear)
In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, resulting in delaying or dropping end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage.
To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack – the difference between the limit and the actual resource usage – while minimizing the risk that a task is killed with an out-of-memory (OOM) error or its performance degraded because of CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to walk this line. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10.
Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
Borg: the Next Generation
Muhammad Tirmazi
Adam Barker
Md Ehtesam Haque
Zhijing Gene Qin
Mor Harchol-Balter
EuroSys'20, ACM, Heraklion, Crete(2020)
This paper analyzes a newly-published trace that covers 8
different Borg clusters for the month of May 2019. The
trace enables researchers to explore how scheduling works in
large-scale production compute clusters. We highlight how
Borg has evolved and perform a longitudinal comparison of
the newly-published 2019 trace against the 2011 trace, which
has been highly cited within the research community.
Our findings show that Borg features such as alloc sets
are used for resource-heavy workloads; automatic vertical
scaling is effective; job-dependencies account for much of
the high failure rates reported by prior studies; the workload arrival rate has increased, as has the use of resource over-commitment; the workload mix has changed, with jobs having
migrated from the free tier into the best-effort batch tier;
the workload exhibits an extremely heavy-tailed distribution
where the top 1% of jobs consume over 99% of resources; and
there is a great deal of variation between different clusters.
Cloud customers want reliable, understandable promises from cloud providers that their applications will run reliably and with adequate performance, but today, providers offer only limited guarantees, which creates uncertainty for customers. Providers also must define internal metrics to allow them to operate their systems without violating customer promises or expectations. We explore why these guarantees are hard to define. We show that this problem shares some similarities with the challenges of applying statistics to make decisions based on sampled data. We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems. Overall, we offer a partial framework for thinking about Service Level Objectives (SLOs), and discuss some unsolved challenges.
Lessons learned from three container management systems over a decade.
DieHard: reliable scheduling to survive correlated failures in cloud data centers
Mina Sedaghat
Eddie Wadbro
Sara De Luna
Oleg Seleznjev
Erik Elmroth
International Symposium on Cluster, Cloud and Grid Computing (CCGrid), IEEE/ACM, Cartagena, Colombia(2016), pp. 52-59
In large scale data centers, a single fault can lead to correlated failures of several physical machines and the tasks running on them, simultaneously. Such correlated failures can severely damage the reliability of a service or a job.
This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, on jobs running multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job’s reliability in the presence of correlated failures.
In addition, we address the problem of scheduling a job with reliability constraints. We formulate the scheduling problem as an optimization problem, with the aim being to achieve the desired reliability with the minimum number of extra tasks. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement to achieve a desired job reliability.
We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job’s reliability.
Service Level Objectives
Niall Murphy
Cody Smith
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly(2016)
Large-scale cluster management at Google with Borg
Luis Pedrosa
Madhukar R. Korupolu
David Oppenheimer
Proceedings of the European Conference on Computer Systems (EuroSys), ACM, Bordeaux, France(2015)
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.
It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.
We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
Long-term SLOs for reclaimed cloud computing resources
Marcus Carvalho
Francisco Brasileiro
ACM Symposium on Cloud Computing (SoCC), ACM, Seattle, WA, USA(2014), 20:1-20:13
The elasticity promised by cloud computing does not come for free. Providers need to reserve resources to allow users to scale on demand, and cope with workload variations, which results in low utilization. The current response to this low utilization is to re-sell unused resources with no Service Level Objectives (SLOs) for availability. In this paper, we show how to make some of these reclaimable resources more valuable by providing strong, long-term availability SLOs for them. These SLOs are based on forecasts of how many resources will remain unused during multi-month periods, so users can do capacity planning for their long-running services. By using confidence levels for the predictions, we give service providers control over the risk of violating the availability SLOs, and allow them to trade increased risk for more resources to make available. We evaluated our approach using 45 months of workload data from 6 production clusters at Google, and show that 6–17% of the resources can be re-offered with a long-term availability of 98.9% or better. A conservative analysis shows that doing so may increase the profitability of selling reclaimed resources by 22–60%.
One of the key factors in selecting a good scheduling
algorithm is using an appropriate metric for comparing
schedulers. But which metric should be used when evaluating
schedulers for warehouse-scale (cloud) clusters, which have
machines of different types and sizes, heterogeneous workloads
with dependencies and constraints on task placement, and long-running
services that consume a large fraction of the total
resources? Traditional scheduler evaluations that focus on metrics
such as queuing delay, makespan, and running time fail to
capture important behaviors – and ones that rely on workload
synthesis and scaling often ignore important factors such as
constraints. This paper explains some of the complexities and
issues in evaluating warehouse-scale schedulers, focusing on what
we find to be the single most important aspect in practice: how
well they pack long-running services into a cluster. We describe
and compare four metrics for evaluating the packing efficiency
of schedulers in increasing order of sophistication: aggregate
utilization, hole filling, workload inflation and cluster compaction.
CPI²: CPU performance isolation for shared compute clusters
Xiao Zhang
Robert Hagmann
Rohit Jnagal
Vrigo Gokhale
SIGOPS European Conference on Computer Systems (EuroSys), ACM, Prague, Czech Republic(2013), pp. 379-391
Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs' behavior.
Our solution, CPI², uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job.
We have rolled out CPI² to all of Google's shared compute clusters. The paper presents the analysis that led us to that outcome, including both case studies and a large-scale evaluation of its ability to solve real production issues.
Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf
Andy Konwinski
Michael Abd-El-Malek
SIGOPS European Conference on Computer Systems (EuroSys), ACM, Prague, Czech Republic(2013), pp. 351-364
Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control.
We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach – all driven by real-life Google production workloads.
AGILE: elastic distributed resource scaling for Infrastructure-as-a-Service
Hiep Nguyen
Zhiming Shen
Xiaohui Gu
Sethuraman Subbiah
10th International Conference on Autonomic Computing (ICAC), USENIX, San Jose, CA, USA(2013), pp. 69-82
Dynamically adjusting the number of virtual machines
(VMs) assigned to a cloud application to keep up with
load changes and interference from other uses typically
requires detailed application knowledge and an ability to
know the future, neither of which is readily available
to infrastructure service providers or application owners.
The result is that systems need to be over-provisioned
(costly), or risk missing their performance Service Level
Objectives (SLOs) and have to pay penalties (also
costly). AGILE deals with both issues: it uses wavelets
to provide a medium-term resource demand prediction
with enough lead time to start up new application server
instances before performance falls short, and it uses
dynamic VM cloning to reduce application startup times.
Tests using RUBiS and Google cluster traces show that
AGILE can predict varying resource demands over the
medium-term with up to 3.42× better true positive rate
and 0.34× the false positive rate than existing schemes.
Given a target SLO violation rate, AGILE can efficiently
handle dynamic application workloads, reducing both
penalties and user dissatisfaction.
Cloud providers such as Google are interested in
fostering research on the daunting technical challenges they
face in supporting planetary-scale distributed systems, but no
academic organizations have similar scale systems on which
to experiment. Fortunately, good research can still be done
using traces of real-life production workloads, but there are
risks in releasing such data, including inadvertently disclosing
confidential or proprietary information, as happened with the
Netflix Prize data. This paper discusses these risks, and our
approach to them, which we call systematic obfuscation. It protects proprietary and personal data while leaving it possible to answer some interesting research questions. We explain and motivate some of the risks and concerns and propose how they can best be mitigated, using as an example our recent publication of a month-long trace of a production system workload on an 11k-machine cluster.
CloudScale: elastic resource scaling for multi-tenant cloud systems
Zhiming Shen
Sethuraman Subbiah
Xiaohui Gu
Symposium on Cloud Computing (SoCC), ACM, Cascais, Portugal(2011)
PRESS: PRedictive Elastic ReSource Scaling for cloud systems
Zhenhuan Gong
Xiaohui Gu
6th IEEE/IFIP International Conference on Network and Service Management (CNSM 2010), Niagara Falls, Canada
Cloud systems require elastic resource allocation to
minimize resource provisioning costs while meeting service level
objectives (SLOs). In this paper, we present a novel PRedictive
Elastic reSource Scaling (PRESS) scheme for cloud systems.
PRESS unobtrusively extracts fine-grained dynamic patterns
in application resource demands and adjusts their resource
allocations automatically. Our approach leverages light-weight
signal processing and statistical learning algorithms to achieve
online predictions of dynamic application resource requirements.
We have implemented the PRESS system on Xen and tested it
using RUBiS and an application load trace from Google. Our
experiments show that we can achieve good resource prediction
accuracy with less than 5% over-estimation error and near zero
under-estimation error, and elastic resource scaling can
significantly reduce both resource waste and SLO violations.
Information quality (IQ) is a measure of how fit information is for a purpose. Sometimes called Quality of Information (QoI) by analogy with Quality of Service (QoS), it quantifies whether the right information is being used to make a decision or take an action. Failure to understand whether information is of adequate quality can lead to bad decisions and catastrophic effects. The results can include system outages, increased costs, lost revenue – and worse. Quantifying information quality can help improve decision making, but the ultimate goal should be to select or construct information sources that have the appropriate balance between information quality and the cost of providing it. In this paper, we provide a brief introduction to the field, argue the case for applying information quality metrics in the systems domain, and propose a research agenda to explore this space.
Storage, data, and information systems
Christopher Hoover
Beth Keer
Pankaj Mehra
Alistair Veitch
HP Laboratories, Palo Alto, CA, USA(2008)
Hibernator: helping disk arrays sleep through the winter
Lessons and challenges in automating data dependability
Kimberly Keeton
Jeffrey S. Chase
Dirk Beyer
Cipriano A. Santos
ACM SIGOPS European Workshop(2004), pp. 4
Utilification
Seneca: remote mirroring done write
Selecting RAID levels for disk arrays
Eric Anderson
Ram Swaminathan
Alistair C. Veitch
Guillermo A. Alvarez
FAST(2002), pp. 189-201
Back to the future: dependable computing = dependable services
Jeffrey S. Chase
Amin Vahdat
ACM SIGOPS European Workshop, Saint-Émilion, France(2002), pp. 170-173
An experimental study of data migration algorithms
Eric J. Anderson
Joseph Hall
Jason D. Hartline
Michael Hobbs
Anna R. Karlin
Jared Saia
Ram Swaminathan
Algorithm Engineering(2001), pp. 145-158
Towards Global Storage Management and Data Placement
Storage Service Providers: a Solution for Storage Management? (Panel)
Banu Ozden
Bruce Hillyer
Wee Teck Ng
Elizabeth A. M. Shriver
David J. DeWitt
Bruce Gordon
Jim Gray
VLDB(2001), pp. 618-619
Minerva: An automated resource provisioning tool for large-scale storage systems
Guillermo A. Alvarez
Elizabeth Borowsky
Susie Go
Theodore H. Romer
Ralph A. Becker-Szendy
Richard A. Golding
Mirjana Spasojevic
Alistair C. Veitch
ACM Trans. Comput. Syst., 19(2001), pp. 483-518
EOS - the dawn of the resource economy (position summary)
Patrick Goldsack
G. John Janakiraman
Lance Russell
Sharad Singhal
Andrew Thomas
HotOS, Elmau/Oberbayern, Germany(2001), pp. 188
An Analytic Behavior Model for Disk Drives with Readahead Caches and Request Reordering
Capacity planning with phased workloads
Elizabeth Borowsky
Richard A. Golding
P. Jacobson
L. Schreier
Mirjana Spasojevic
WOSP(1998), pp. 199-207
The HP AutoRAID hierarchical storage system
Richard Golding
Tim Sullivan
ACM Transactions on Computer Systems (TOCS), 14(1996), pp. 108-136
Idleness is not sloth
Richard A. Golding
Peter Bosch II
Tim Sullivan
USENIX Winter(1995), pp. 201-212
The TickerTAIP parallel RAID architecture
Pei Cao
Swee Boon Lim
Shivakumar Venkataraman
ACM Transactions on Computer Systems (TOCS), 12(1994), pp. 236-269
The TickerTAIP parallel RAID architecture
DataMesh parallel storage servers
Chia Chao
Robert English
David Jacobson
Bart Sears
Alex Stepanov
SIGOPS Oper. Syst. Rev., 26(1992), pp. 11