DieHard: reliable scheduling to survive correlated failures in cloud data centers

Mina Sedaghat; Eddie Wadbro; John Wilkes; Sara De Luna; Oleg Seleznjev; Erik Elmroth

DieHard: reliable scheduling to survive correlated failures in cloud data centers

Mina Sedaghat

Eddie Wadbro

John Wilkes

Sara De Luna

Oleg Seleznjev

Erik Elmroth

International Symposium on Cluster, Cloud and Grid Computing (CCGrid), IEEE/ACM, Cartagena, Colombia (2016), pp. 52-59

Download Google Scholar

Abstract

In large scale data centers, a single fault can lead to correlated failures of several physical machines and the tasks running on them, simultaneously. Such correlated failures can severely damage the reliability of a service or a job.

This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, on jobs running multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job’s reliability in the presence of correlated failures.

In addition, we address the problem of scheduling a job with reliability constraints. We formulate the scheduling problem as an optimization problem, with the aim being to achieve the desired reliability with the minimum number of extra tasks. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement to achieve a desired job reliability.

We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job’s reliability.

Research Areas

Software systems

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

DieHard: reliable scheduling to survive correlated failures in cloud data centers

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs