The Site Reliability Engineering Workbook Chapter: Introducing Non-Abstract Large System Design (NALSD)

James Youngman

Richard Bondi

Salim Virji

Tanya Reilly

The Site Reliability Engineering Workbook: Practical Ways to Implement SRE (2018)


Abstract

In the first SRE Book, we described building Cron at Large Scale to illustrate techniques for decoupling processes from individual machines. In this chapter, we take a step back and describe the principles behind these techniques, and the outcome of an actual large system’s design for a common workload: Log Processing. This chapter will also illustrate the importance of involving SRE throughout the entire design process, not only at the deployment phase. Our experience has shown that once key design decisions are in development, we cannot easily revert or modify defining aspects of the system without significant additional engineering effort (if at all!). The benefit of involving SREs in the design process comes from the SRE focus on key non-functional system properties, such as reliability, availability, and performance; on the efficient use of resources; and on the reuse of existing components. Consider the last requirements document you read. It likely focused on feature requests and user journeys, rather than aspects of reliability or scale. We assert that reliability and SLOs are actually the most critical feature of any system, and not addressing them until a later phase is akin to accepting fewer features for higher costs. Following the style of review and evaluation described in this chapter leads to more robust and higher-performance designs with lower costs over time.

Research Areas

Distributed Systems and Parallel Computing
