John Lunney
Site Reliability Engineer for G Suite
Authored Publications
Reliable Data Processing with Minimal Toil
Athena Vawda
Julia Lee
Rita Sodt
(2021)
This paper discusses an approach for making data pipelines both safer and less manual. We detail how we applied well-known reliability best practices from user-facing services to the batch jobs that underpin many of the services that make up Google Workspace. Using validation steps, canarying, and target populations for data pipelines, we ensure that only stable versions are promoted to the next environment stage. By moving to a single, standardized platform, we minimized duplicate effort across services. We also touch on how we optimized batch jobs for both correctness and freshness SLOs, and on the benefits of batch jobs vs. async event-based processing.
Meaningful availability
Dan Ardelean
Philipp Emanuel Hoffmann
Tamás Hauer
17th USENIX Symposium on Networked Systems Design and Implementation (NSDI'20) (2020)
Accurate measurement of service availability is the cornerstone of good service management: it quantifies the gap between user expectation and system performance, and provides actionable data to prioritize development and operational tasks. We propose a novel metric, user-uptime, which is event-based but is time-sensitive and which approximates aggregated user-perceived reliability better than current metrics. For a holistic view of availability across timescales from minutes to months or quarters, we augment user-uptime with a novel aggregation and visualization paradigm: windowed uptime. Using an example from G Suite we demonstrate its effectiveness in differentiating between unreliability caused by flakiness and an extended outage.
The Site Reliability Engineering Workbook Chapter: Simplicity
Niall Richard Murphy
Robert van Gent
Scott Ritchie
The Site Reliability Engineering Workbook: Practical Ways to Implement SRE (2018)
Simplicity is an important goal for SREs, as it strongly correlates with reliability: simple software breaks less often and is easier and faster to fix when it does break. Simple systems are easier to understand, easier to maintain, and easier to test.
For SREs, simplicity is end-to-end: it includes the code itself, the system architecture, and also the tools and processes used to manage the software lifecycle. In this chapter, we explore some examples that demonstrate how SREs can measure, think about, and encourage simplicity.
This article follows up on the SRE Book chapter “Postmortem Culture: Learning from Failure.” Here, we address the challenges in designing an appropriate action item plan and then executing that plan. We discuss best practices for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented.
Postmortem Culture: Learning from Failure
Gary O'Connor
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)