Walfredo Cirne

Authored Publications
    Google hostload prediction based on Bayesian model with optimized feature combination
    Sheng Di
    Derrick Kondo
    Journal of Parallel and Distributed Computing (2014)
    Abstract: We design a novel prediction method with a Bayes model to predict a load fluctuation pattern over a long-term interval, in the context of Google data centers. We exploit a set of features that capture the expectation, trend, stability and patterns of recent host loads. We also investigate the correlations among these features and explore the most effective combinations of features with various training periods. All of the prediction methods are evaluated using a Google trace with 10,000+ heterogeneous hosts. Experiments show that our Bayes method improves the long-term load prediction accuracy by 5.6%–50%, compared to other state-of-the-art methods based on moving averages, auto-regression, and/or noise filters. The mean squared error of pattern prediction with the Bayes method is approximately bounded within [10⁻⁸, 10⁻⁵]. Through a load balancing scenario, we confirm that the precision of pattern prediction in finding a set of idlest/busiest hosts from among 10,000+ hosts can be improved by about 7% on average.
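The core idea of the abstract above — estimate a posterior over future mean load from features of recent load — can be illustrated with a minimal sketch. This is not the paper's actual model: the single "recent mean" feature, the uniform discretization, and the Laplace smoothing are simplifying assumptions made for illustration.

```python
import numpy as np

def discretize(x, n_levels=10):
    """Map load values in [0, 1] to discrete levels 0..n_levels-1."""
    return np.minimum((np.asarray(x) * n_levels).astype(int), n_levels - 1)

def train_bayes(loads, window=12, horizon=12, n_levels=10):
    """Estimate P(mean future load level | mean recent load level)
    by counting transitions in a load history (Laplace-smoothed)."""
    counts = np.ones((n_levels, n_levels))  # Laplace prior: one pseudo-count per cell
    loads = np.asarray(loads, dtype=float)
    for t in range(window, len(loads) - horizon):
        f = discretize(loads[t - window:t].mean(), n_levels)   # feature: recent mean load
        y = discretize(loads[t:t + horizon].mean(), n_levels)  # target: future mean load
        counts[f, y] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # rows are conditional distributions

def predict(model, recent, n_levels=10):
    """Posterior mean of the future load level, mapped back to [0, 1]."""
    f = discretize(np.mean(recent), n_levels)
    level_centers = (np.arange(n_levels) + 0.5) / n_levels
    return float(model[f] @ level_centers)
```

The paper's method uses a richer feature set (expectation, trend, stability, patterns) and selects effective feature combinations; the sketch shows only the Bayesian conditioning step.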
    Long-term SLOs for reclaimed cloud computing resources
    Marcus Carvalho
    Francisco Brasileiro
    ACM Symposium on Cloud Computing (SoCC), ACM, Seattle, WA, USA (2014), 20:1-20:13
    Abstract: The elasticity promised by cloud computing does not come for free. Providers need to reserve resources to allow users to scale on demand, and cope with workload variations, which results in low utilization. The current response to this low utilization is to re-sell unused resources with no Service Level Objectives (SLOs) for availability. In this paper, we show how to make some of these reclaimable resources more valuable by providing strong, long-term availability SLOs for them. These SLOs are based on forecasts of how many resources will remain unused during multi-month periods, so users can do capacity planning for their long-running services. By using confidence levels for the predictions, we give service providers control over the risk of violating the availability SLOs, and allow them to trade increased risk for more resources to make available. We evaluated our approach using 45 months of workload data from 6 production clusters at Google, and show that 6–17% of the resources can be re-offered with a long-term availability of 98.9% or better. A conservative analysis shows that doing so may increase the profitability of selling reclaimed resources by 22–60%.
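The trade-off the abstract describes — a confidence level controls both the SLO-violation risk and how much capacity can be offered — can be sketched with an empirical quantile over historical unused capacity. This is a generic illustration, not the paper's forecasting model, which works over multi-month horizons with proper time-series forecasts.

```python
import numpy as np

def reclaimable_capacity(unused, confidence=0.989):
    """Capacity that was free in at least `confidence` of past samples:
    the (1 - confidence) lower quantile of the unused-capacity series."""
    return float(np.quantile(np.asarray(unused, dtype=float), 1.0 - confidence))

def realized_availability(unused, offered):
    """Fraction of samples in which the offered capacity was actually free."""
    unused = np.asarray(unused, dtype=float)
    return float((unused >= offered).mean())
```

Lowering the confidence level raises the quantile and hence the amount of capacity that can be re-offered, at the cost of a higher chance of violating the availability SLO — exactly the risk/quantity trade the abstract mentions.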
    Web-Scale Job Scheduling
    Eitan Frachtenberg
    Lecture Notes in Computer Science, 7698 (2013)
    Abstract: Web datacenters and clusters can be larger than the world’s largest supercomputers, and run workloads that are at least as heterogeneous and complex as their high-performance computing counterparts. And yet little is known about the unique job scheduling challenges of these environments. This article aims to ameliorate this situation. It discusses the challenges of running web infrastructure and describes several techniques to address them. It also presents some of the problems that remain open in the field.
    Abstract: A new era of Cloud Computing has emerged, but the characteristics of Cloud load in data centers are not perfectly clear. Yet this characterization is critical for the design of novel Cloud job and resource management systems. In this paper, we comprehensively characterize the job/task load and host load in a real-world production data center at Google Inc. We use a detailed trace of over 25 million tasks across over 12,500 hosts. We study the differences between a Google data center and other Grid/HPC systems, from the perspective of both workload (w.r.t. jobs and tasks) and host load (w.r.t. machines). In particular, we study the job length, job submission frequency, and the resource utilization of jobs in the different systems, and also investigate valuable statistics of machines’ maximum load, queue state and relative usage levels, with different job priorities and resource attributes. We find that the Google data center exhibits finer resource allocation with respect to CPU and memory than that of Grid/HPC systems. Google jobs are always submitted with much higher frequency and they are much shorter than Grid jobs. As such, Google host load exhibits higher variance and noise.
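The observation above that Google host load is noisier than Grid/HPC load can be quantified with a simple per-host statistic; this is a generic sketch (not code from the paper) computing the coefficient of variation, a standard scale-free measure of load variability:

```python
import numpy as np

def host_load_cv(samples):
    """Coefficient of variation (std/mean) of one host's load samples;
    higher values indicate noisier, more variable host load."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    return float(samples.std() / mean) if mean > 0 else 0.0
```

Comparing the distribution of this statistic across hosts of two traces is one simple way to make a "higher variance and noise" claim concrete.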
    Abstract: Prediction of host load in Cloud systems is critical for achieving service-level agreements. However, accurate prediction of host load in Clouds is extremely challenging because it fluctuates drastically at small timescales. We design a prediction method based on a Bayes model to predict the mean load over a long-term time interval, as well as the mean load in consecutive future time intervals. We identify novel predictive features of host load that capture the expectation, predictability, trends and patterns of host load. We also determine the most effective combinations of these features for prediction. We evaluate our method using a detailed one-month trace of a Google data center with thousands of machines. Experiments show that the Bayes method achieves high accuracy with a mean squared error of 0.0014. Moreover, the Bayes method improves the load prediction accuracy by 5.6%–50% compared to other state-of-the-art methods based on moving averages, auto-regression, and/or noise filters.
    Perspectives on cloud computing: interviews with five leading scientists from the cloud community
    Gordon Blair
    Fabio Kon
    Dejan Milojicic
    Raghu Ramakrishnan
    Dan Reed
    Dilma Silva
    Journal of Internet Services and Applications (2011)
    Abstract: Cloud computing is currently one of the major topics in distributed systems, with large numbers of papers being written on the topic, with major players in the industry releasing a range of software platforms offering novel Internet-based services and, most importantly, evidence of real impact on end user communities in terms of approaches to provisioning software services. Cloud computing though is at a formative stage, with a lot of hype surrounding the area, and this makes it difficult to see the true contribution and impact of the topic. Cloud computing is a central topic for the Journal of Internet Services and Applications (JISA) and indeed the most downloaded paper from the first year of JISA is concerned with the state-of-the-art and research challenges related to cloud computing [1]. The Editors-in-Chief, Fabio Kon and Gordon Blair, therefore felt it was timely to seek clarification on the key issues around cloud computing and hence invited five leading scientists from industrial organizations central to cloud computing to answer a series of questions on the topic. The five scientists taking part are:
      • Walfredo Cirne, from Google’s infrastructure group in California, USA
      • Dejan Milojicic, Senior Researcher and Director of the Open Cirrus Cloud Computing testbed at HP Labs
      • Raghu Ramakrishnan, Chief Scientist for Search and Cloud Platforms at Yahoo!
      • Dan Reed, Microsoft’s Corporate Vice President for Technology Strategy and Policy and Extreme Computing
      • Dilma Silva, researcher at the IBM T.J. Watson Research Center, in New York
    The Best of CCGrid'2007: A Snapshot of an 'Adolescent' Area
    Bruno Schulze
    Concurrency and Computation: Practice and Experience, 21 (2009)
    Towards Characterizing Cloud Backend Workloads: Insights from Google Compute Clusters
    Asit Mishra
    Joseph L. Hellerstein
    Sigmetrics Performance Evaluation Review, ACM (2009)