Google Research

Metrics That Matter

  • Ben Treynor
  • Shylaja Nukala
  • Vivek Rau
ACM Queue (2018)

Abstract

Site Reliability Engineering, or SRE, is a software engineering specialization that focuses on the reliability and maintainability of large systems. Google has previously published Site Reliability Engineering: How Google Runs Production Systems (hereafter, the SRE book) to explain our approach to product reliability. In this article, we discuss critical but oft-neglected metrics that the Google SRE organization has found to be important for running reliable services. This article is for product development and SRE teams, managers of such teams, and anyone else who cares about the reliability of web products or infrastructure. It is based on Ben Treynor’s talk at the Google Cloud Next 2017 conference.
