Google Research

Debugging Incidents in Google's Distributed Systems

ACM Queue (2020)

Abstract

Google has written the book on Site Reliability Engineering best practices, but how teams actually respond to production incidents often differs from the ideal practices we put on paper. This article covers the reality of debugging production issues at Google, including the types of tools, high-level strategies, and low-level tasks that engineers combine in varying ways to debug effectively. We will: 1) detail the research approach taken to capture this data and surface patterns of behavior, 2) share findings on the common engineering pathways, processes, and attitudes in this space, and 3) share examples of how experts have debugged complex distributed systems, highlighting where best practices were followed or broken.
