Salim Virji
Salim Virji is a Site Reliability Engineer at Google.
Authored Publications
Architecting for Reliability
Implementing Service Level Objectives, O'Reilly Media (2020), 209–226
If you have been able to establish an SLO culture within your organization, and you have the appropriate buy-in, you can now make sure that you are considering SLIs and SLOs for services during their design phase. This chapter will discuss how to incorporate SLO best practices into your design doc and user story portions of development, and how this helps you build better architectures from day one.
This article discusses principles and best practices for DevOps and SRE practitioners who are deploying and operating ML systems. This article draws on our experiences running production services for the past 15 years as well as from discussions with Google engineers working on diverse ML systems. We will use specific incidents to illustrate where ML-based systems did not behave as expected for developers of traditional systems, and examine the outcomes in light of the recommended practices.
As an end-of-the-year treat, we present to the SRE community some of the more beautiful images we have seen in our monitoring system. These images offer a glimpse into the visual patterns that appear in our variables and time-series, and the beauty that emerges from chaos. (Truly: some of these images appeared during difficult rollouts, or even during incidents.)
The Site Reliability Engineering Workbook Chapter: Introducing Non-Abstract Large System Design (NALSD)
James Youngman
Richard Bondi
Tanya Reilly
The Site Reliability Engineering Workbook: Practical Ways to Implement SRE (2018)
In the first SRE Book, we described building Cron at Large Scale to illustrate techniques for decoupling processes from individual machines. In this chapter, we take a step back and describe the principles behind these techniques, and the outcome of an actual large system’s design for a common workload: Log Processing.
This chapter will also illustrate the importance of involving SRE throughout the entire design process, not only at the deployment phase. Our experience has shown that once key design decisions are in development, we cannot easily revert or modify defining aspects of the system without significant additional engineering effort (if at all!).
The benefit of involving SREs in the design process comes from the SRE focus on key non-functional system properties, such as reliability, availability, and performance; on the efficient use of resources; and on the reuse of existing components. Consider the last requirements document you read. It likely focused on feature requests and user journeys, rather than aspects of reliability or scale. We assert that reliability and SLOs are actually the most critical feature of any system, and not addressing them until a later phase is akin to accepting fewer features for higher costs. Following the style of review and evaluation described in this chapter leads to more robust and higher-performance designs with lower costs over time.