Metrics That Matter

Ben Treynor
Shylaja Nukala
Vivek Rau
ACM Queue (2018)

Abstract

Site Reliability Engineering, or SRE, is a software engineering specialization that focuses on the reliability and maintainability of large systems. Google previously published Site Reliability Engineering: How Google Runs Production Systems (hereafter, the SRE book) to explain our approach to product reliability. In this article, we discuss critical but often neglected metrics that the Google SRE organization has found important for running reliable services. This article is for product development and SRE teams, their managers, and anyone else who cares about the reliability of web products or infrastructure. It is based on Ben Treynor’s talk at the Google Cloud Next 2017 conference.