SRE Principles

DevOps Days Zürich, Winterthur (2018)

Abstract

As Ben Treynor (VP of 24x7 at Google and founding father of SRE) puts it, "SRE, fundamentally, it’s what happens when you ask a software engineer to design an operations function". What differentiates SRE (Site Reliability Engineering) from DevOps? Aren't they the same?

SRE is a job function that focuses on the reliability and maintainability of systems. It is also a mindset and a set of engineering practices for running better production services. An SRE has to be able to engineer creative solutions to problems, strike the right balance between reliability and feature velocity, and target appropriate levels of service quality.

This talk covers the principles under which all SRE teams at Google operate: consistency, design of systems, monitoring, automation, error budgets, blameless postmortems, etc.