Debugging Incidents in Google's Distributed Systems

Beth Cooper
ACM Queue (2020)

Abstract

Google has written the book on Site Reliability Engineering best practices, but how teams actually respond to production incidents often differs from the ideal practices we put on paper. This article covers the reality of debugging production issues at Google, including the types of tools, high-level strategies, and low-level tasks that engineers combine in varying ways to debug effectively. We will: 1) detail the research approach taken to capture this data and surface patterns of behavior, 2) share findings on the common engineering pathways, processes, and attitudes in this space, and 3) share examples of how experts have debugged complex distributed systems, highlighting where best practices were followed or broken.