Walfredo Cirne
Authored Publications
Long-term SLOs for reclaimed cloud computing resources
Marcus Carvalho
Francisco Brasileiro
ACM Symposium on Cloud Computing (SoCC), ACM, Seattle, WA, USA (2014), 20:1-20:13
The elasticity promised by cloud computing does not come for free. Providers need to reserve resources to allow users to scale on demand and to cope with workload variations, which results in low utilization. The current response to this low utilization is to re-sell unused resources with no Service Level Objectives (SLOs) for availability. In this paper, we show how to make some of these reclaimable resources more valuable by providing strong, long-term availability SLOs for them. These SLOs are based on forecasts of how many resources will remain unused during multi-month periods, so users can do capacity planning for their long-running services. By using confidence levels for the predictions, we give service providers control over the risk of violating the availability SLOs, and allow them to trade increased risk for more resources to make available. We evaluated our approach using 45 months of workload data from 6 production clusters at Google, and show that 6–17% of the resources can be re-offered with a long-term availability of 98.9% or better. A conservative analysis shows that doing so may increase the profitability of selling reclaimed resources by 22–60%.
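The trade-off between risk and offered resources described in this abstract can be illustrated with a rough sketch. It assumes a plain empirical quantile over historical samples of unused capacity, which is a stand-in and not the forecasting method used in the paper. The function name `reclaimable_capacity`, its parameters, and the sample values are hypothetical.

```python
import math


def reclaimable_capacity(unused_history, confidence):
    """Largest capacity that was unused in at least a `confidence`
    fraction of the historical samples (empirical quantile).

    Hypothetical helper for illustration only.
    """
    if not unused_history:
        raise ValueError("need at least one sample")
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must be in (0, 1]")
    ordered = sorted(unused_history)  # ascending order
    n = len(ordered)
    # Number of samples that may fall below the offered capacity.
    allowed_violations = n - math.ceil(confidence * n)
    return ordered[allowed_violations]


# Made-up samples of unused capacity (arbitrary units).
history = [30, 10, 25, 40, 35, 20, 15, 45, 50, 5]
conservative = reclaimable_capacity(history, 0.9)
aggressive = reclaimable_capacity(history, 0.5)
```

Lowering the confidence level raises the capacity that can be offered, at the cost of a higher chance that the offered capacity is not actually free.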
We design a novel prediction method based on a Bayes model to predict a load fluctuation pattern over a long-term interval, in the context of Google data centers. We exploit a set of features that capture the expectation, trend, stability and patterns of recent host loads. We also investigate the correlations among these features and explore the most effective combinations of features with various training periods. All of the prediction methods are evaluated using a Google trace with 10,000+ heterogeneous hosts. Experiments show that our Bayes method improves the long-term load prediction accuracy by 5.6%–50%, compared to other state-of-the-art methods based on moving average, auto-regression, and/or noise filters. The mean squared error of pattern prediction with the Bayes method is approximately bounded within $[10^{-8}, 10^{-5}]$. Through a load balancing scenario, we confirm that the precision of pattern prediction in finding a set of idlest/busiest hosts from among 10,000+ hosts can be improved by about 7% on average.
Web datacenters and clusters can be larger than the world’s largest supercomputers, and run workloads that are at least as heterogeneous and complex as their high-performance computing counterparts. And yet little is known about the unique job scheduling challenges of these environments. This article aims to ameliorate this situation. It discusses the challenges of running web infrastructure and describes several techniques to address them. It also presents some of the problems that remain open in the field.
Prediction of host load in Cloud systems is critical for achieving service-level agreements. However, accurate prediction of host load in Clouds is extremely challenging because it fluctuates drastically at small timescales. We design a prediction method based on a Bayes model to predict the mean load over a long-term time interval, as well as the mean load in consecutive future time intervals. We identify novel predictive features of host load that capture the expectation, predictability, trends and patterns of host load. We also determine the most effective combinations of these features for prediction. We evaluate our method using a detailed one-month trace of a Google data center with thousands of machines. Experiments show that the Bayes method achieves high accuracy with a mean squared error of 0.0014. Moreover, the Bayes method improves the load prediction accuracy by 5.6–50% compared to other state-of-the-art methods based on moving averages, auto-regression, and/or noise filters.
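The general idea of predicting a future mean load level with a Bayes model can be sketched as follows. This is a minimal illustration, not the method or feature set from the paper: it assumes a single feature (the mean of the most recent window), loads normalized to [0, 1] and discretized into equal-width levels, and Laplace smoothing. All function names and parameters are hypothetical.

```python
from collections import Counter


def level(x, n_levels):
    """Map a load in [0, 1] to one of n_levels equal-width bins."""
    return min(int(x * n_levels), n_levels - 1)


def mean(xs):
    return sum(xs) / len(xs)


def train(series, window, n_levels):
    """Count future levels and (recent level, future level) pairs."""
    prior, joint = Counter(), Counter()
    for t in range(window, len(series) - window + 1):
        recent = level(mean(series[t - window:t]), n_levels)
        future = level(mean(series[t:t + window]), n_levels)
        prior[future] += 1
        joint[(recent, future)] += 1
    return prior, joint


def predict(recent_loads, prior, joint, n_levels):
    """Return the future level with the highest posterior score,
    i.e. P(future) * P(recent | future), both Laplace-smoothed."""
    recent = level(mean(recent_loads), n_levels)
    total = sum(prior.values())
    best, best_score = None, -1.0
    for future in range(n_levels):
        p_prior = (prior[future] + 1) / (total + n_levels)
        p_like = (joint[(recent, future)] + 1) / (prior[future] + n_levels)
        score = p_prior * p_like
        if score > best_score:
            best, best_score = future, score
    return best


# Made-up, lightly loaded host: every window averages to the lowest level.
low_series = [0.1, 0.2] * 20
low_prior, low_joint = train(low_series, window=4, n_levels=2)
low_prediction = predict([0.1, 0.2, 0.1, 0.2], low_prior, low_joint, 2)
```

A realistic predictor would combine several features and be compared against baselines such as a moving average using mean squared error, as the abstract describes.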
A new era of Cloud Computing has emerged, but the characteristics of Cloud load in data centers are not perfectly clear. Yet this characterization is critical for the design of novel Cloud job and resource management systems. In this paper, we comprehensively characterize the job/task load and host load in a real-world production data center at Google Inc. We use a detailed trace of over 25 million tasks across over 12,500 hosts. We study the differences between a Google data center and other Grid/HPC systems, from the perspective of both workload (w.r.t. jobs and tasks) and host load (w.r.t. machines). In particular, we study the job length, job submission frequency, and the resource utilization of jobs in the different systems, and also investigate valuable statistics of machines’ maximum load, queue state and relative usage levels, with different job priorities and resource attributes. We find that the Google data center exhibits finer resource allocation with respect to CPU and memory than that of Grid/HPC systems. Google jobs are always submitted with much higher frequency and they are much shorter than Grid jobs. As such, Google host load exhibits higher variance and noise.
Perspectives on cloud computing: interviews with five leading scientists from the cloud community
Gordon Blair
Fabio Kon
Dejan Milojicic
Raghu Ramakrishnan
Dan Reed
Dilma Silva
Journal of Internet Services and Applications (2011)
Cloud computing is currently one of the major topics in distributed systems, with large numbers of papers being written on the topic, with major players in the industry releasing a range of software platforms offering novel Internet-based services and, most importantly, evidence of real impact on end user communities in terms of approaches to provisioning software services. Cloud computing though is at a formative stage, with a lot of hype surrounding the area, and this makes it difficult to see the true contribution and impact of the topic.
Cloud computing is a central topic for the Journal of Internet Services and Applications (JISA) and indeed the most downloaded paper from the first year of JISA is concerned with the state-of-the-art and research challenges related to cloud computing [1]. The Editors-in-Chief, Fabio Kon and Gordon Blair, therefore felt it was timely to seek clarification on the key issues around cloud computing and hence invited five leading scientists from industrial organizations central to cloud computing to answer a series of questions on the topic.
The five scientists taking part are:
• Walfredo Cirne, from Google’s infrastructure group in California, USA
• Dejan Milojicic, Senior Researcher and Director of the Open Cirrus Cloud Computing testbed at HP Labs
• Raghu Ramakrishnan, Chief Scientist for Search and Cloud Platforms at Yahoo!
• Dan Reed, Microsoft’s Corporate Vice President for Technology Strategy and Policy and Extreme Computing
• Dilma Silva, researcher at the IBM T.J. Watson Research Center, in New York
Towards Characterizing Cloud Backend Workloads: Insights from Google Compute Clusters
Asit Mishra
Joseph L. Hellerstein
Sigmetrics Performance Evaluation Review, ACM (2009)