
The Site Reliability Engineering Workbook Chapter: Introducing Non-Abstract Large System Design (NALSD)

James Youngman
Richard Bondi
Tanya Reilly
The Site Reliability Engineering Workbook: Practical Ways to Implement SRE (2018)

Abstract

In the first SRE Book, we described building Cron at Large Scale to illustrate techniques for decoupling processes from individual machines. In this chapter, we take a step back and describe the principles behind these techniques, and the outcome of an actual large system’s design for a common workload: Log Processing. 
This chapter will also illustrate the importance of involving SRE throughout the entire design process, not only at the deployment phase. Our experience has shown that once key design decisions are in development, we cannot easily revert or modify defining aspects of the system without significant additional engineering effort (if at all!). The benefit of involving SREs in the design process comes from the SRE focus on key non-functional system properties, such as reliability, availability, and performance; on the efficient use of resources; and on the reuse of existing components. Consider the last requirements document you read. It likely focused on feature requests and user journeys rather than on reliability or scale. We assert that reliability and SLOs are actually the most critical features of any system, and that deferring them to a later phase is akin to accepting fewer features for higher costs. Following the style of review and evaluation described in this chapter leads to more robust, higher-performance designs with lower costs over time.