Sebastian Kirsch

Sebastian is a Site Reliability Engineer for Google in Zürich. He joined Google in 2006 in Dublin, Ireland, and has worked on both internal systems, such as Google's web crawler, and external products, such as Google Maps and Google Calendar. He specializes in the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high reliability bar as every other Google service.

Authored Publications
    Designing and Operating Highly Available Software Systems at Scale
    Michael Wildpaner
    Escuela Politécnica de Ingeniería de Gijón, Gijón (2019)
    The talk explains what Site Reliability Engineering (SRE) is and how it is used at Google, and gives an overview of the challenges of growing a regular LAMP-style small service to support 100M users. It also covers monitoring and other SRE dimensions, from capacity planning to design reviews.
    Google SRE has developed a special interview format called "Non-Abstract Large Systems Design" or NALSD. The focus of this interview is developing a credible approach for solving a specific problem at large scale. Going beyond coding and algorithm skills, candidates demonstrate their skills in designing for scalability, reliability and robustness, estimating provisioning needs, and managing change. All candidates for SRE positions at Google participate in one NALSD interview as part of their recruiting process. Attendees will learn why Google has developed this interview format and which aspects of a candidate's skill set are covered in the format. They will see an example of this interview type, and learn how to come up with their own interview questions. Tips and tricks derived from practical experience in conducting this interview type will help attendees avoid common pitfalls when interviewing candidates.
    Ubiquitous compute power has created frequent impedance mismatches between client capabilities and server capacity in distributed systems. Careful client behaviour design protects the server from unintended load and enables safe recovery after outages. These techniques improve resiliency both in microservice environments (where they protect microservices from each other) and in more traditional client-server environments (where a large number of clients, such as mobile phone apps, might be stacked against a comparatively small number of servers).
    Reliable Launches at Scale
    SRECon17 Asia, USENIX Association, Singapore (2017)
    How do you perform up to 70 product and feature launches per week safely, reliably and reproducibly? Google staffed a dedicated team of Site Reliability Engineers to answer this question: Launch Coordination Engineers work across Google's service space to audit new products and features for reliability, act as liaisons between the teams involved in a launch, and serve as gatekeepers.
    Reliable Product Launches at Scale
    Rhandeev Singh
    Vivek Rau
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    Building Blocks for Site Reliability
    International Industry-Academia Workshop on Cloud Reliability and Resilience, EIT Digital, Berlin, Germany (2016)
    How does Google run reliable systems? At the heart of Site Reliability Engineering is the idea of treating reliability as a software problem and asking software engineers to design an operations function. This talk will examine the organizational, conceptual and technological building blocks that together comprise the concept of site reliability engineering at Google.
    The Many Ways Your Monitoring Is Lying To You
    SRECon16 Europe, USENIX Association, Dublin, Ireland (2016)
    This talk looks at various failure modes of monitoring systems, with a goal of making readers more aware of the difference between the monitoring system's view of the world and the system itself.