Debugging Incidents in Google's Distributed Systems

Charisma Chan; Beth Cooper

ACM Queue (2020)

Abstract

Google has written the book on Site Reliability Engineering best practices, but how teams actually respond to production incidents often differs from the ideal practices we put on paper.
This article covers the reality of debugging production issues at Google, including the types of tools, high-level strategies, and low-level tasks that engineers combine in varying ways to debug effectively. We will: 1) detail the research approach used to capture this data and surface patterns of behavior, 2) share findings on the common engineering pathways, processes, and attitudes in this space, and 3) share examples of how experts have debugged complex distributed systems, highlighting where best practices were followed or broken.

Research Areas

Software systems
