 
                Sebastian Kirsch
            Sebastian Kirsch has been a Site Reliability Engineer at Google since 2006.
Based in Sunnyvale, California, his nearly two decades of experience span
three countries and over ten teams, contributing to critical systems from
Google's web crawler to Google Maps, Google Calendar, and sensitive
infrastructure like payment processing and authentication. A seasoned speaker
at international conferences, Sebastian also contributed to the foundational
2016 book "Site Reliability Engineering: How Google Runs Production Systems."
          
        
      Authored Publications
    
  
  
  
    
    
  
      
        
        
    
    
        
          
            
              Designing and Operating Highly Available Software Systems at Scale
            
          
        
        
          
            
              
                
                  
                    
                
              
            
              
                
                  
                    
                    
    
    
    
    
    
                      
                        Michael Wildpaner
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
          
          
          
          
            Escuela Politécnica de Ingeniería de Gijón, Gijón (2019)
          
          
        
        
        
          
          
          
              The talk explains what Site Reliability Engineering (SRE) is and how it is used at Google, and gives an overview of the challenges of scaling a regular LAMP-style small service to support 100M users. It also covers monitoring and other SRE dimensions, from capacity planning to design reviews.
              
  
          
        
      
    
        
          
            
              Interviewing for Systems Design Skills
            
          
        
        
          
            
              
                
                  
                    
                
              
            
          
          
          
          
    
    
    
    
    
            SRECon18 Asia, USENIX Association, Singapore (2018)
          
          
        
        
        
          
          
          
              Google SRE has developed a special interview format called "Non-Abstract Large Systems Design" or NALSD. The focus of this interview is developing a credible approach for solving a specific problem at large scale. Going beyond coding and algorithm skills, candidates demonstrate their skills in designing for scalability, reliability and robustness, estimating provisioning needs, and managing change. All candidates for SRE positions at Google participate in one NALSD interview as part of their recruiting process.
Attendees will learn why Google has developed this interview format and which aspects of a candidate's skill set are covered in the format. They will see an example of this interview type, and learn how to come up with their own interview questions. Tips and tricks derived from practical experience in conducting this interview type will help attendees avoid common pitfalls when interviewing candidates.
              
  
          
        
      
    
        
          
            
              Reliable Launches at Scale
            
          
        
        
          
            
              
                
                  
                    
                
              
            
          
          
          
          
    
    
    
    
    
            SRECon17 Asia, USENIX Association, Singapore (2017)
          
          
        
        
        
          
          
          
              How do you perform up to 70 product and feature launches per week safely, reliably and reproducibly? Google staffed a dedicated team of Site Reliability Engineers to answer this question: Launch Coordination Engineers work across Google's service space to audit new products and features for reliability, act as liaisons between teams involved in a launch, and serve as gatekeepers.
              
  
          
        
      
    
        
          
            
              Safe Client Behaviour
            
          
        
        
          
            
              
                
                  
                    
                
              
            
          
          
          
          
    
    
    
    
    
            Shenzhen, China (2017)
          
          
        
        
        
          
          
          
              Ubiquitous compute power has created frequent impedance mismatches between client capabilities and server capacity in distributed systems. Careful client behaviour design protects the server from unintended load and enables safe recovery after outages. These techniques improve resiliency both in microservice environments (where they protect microservices from each other) and in more traditional client-server environments (where a large number of clients such as mobile phone apps might be stacked against a comparatively small number of servers).
              
  
          
        
      
    
        
          
            
              The Many Ways Your Monitoring Is Lying To You
            
          
        
        
          
            
              
                
                  
                    
                
              
            
          
          
          
          
    
    
    
    
    
            SRECon16 Europe, USENIX Association, Dublin, Ireland (2016)
          
          
        
        
        
          
          
          
              This talk looks at various failure modes of monitoring systems, with the goal of making attendees more aware of the difference between the monitoring system's view of the world and the system itself.
              
  
          
        
      
    
        
          
            
              Building Blocks for Site Reliability
            
          
        
        
          
            
              
                
                  
                    
                
              
            
          
          
          
          
    
    
    
    
    
            International Industry-Academia Workshop on Cloud Reliability and Resilience, EIT Digital, Berlin, Germany (2016)
          
          
        
        
        
          
          
          
              How does Google run reliable systems? At the heart of Site Reliability Engineering is the idea of treating reliability as a software problem and asking software engineers to design an operations function. This talk will examine the organizational, conceptual and technological building blocks that together comprise the concept of site reliability engineering at Google.
              
  
          
        
      
    
        
          
            
              Reliable Product Launches at Scale
            
          
        
        
          
            
              
                
                  
                    
    
    
    
        
         
          
  
        
    
  
                      
                        Rhandeev Singh
                      
                    
                
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Vivek Rau
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
          
          
          
          
            Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
          
          
        