Sebastian Kirsch
Sebastian is a Site Reliability Engineer for Google in Zürich. He joined Google in 2006 in Dublin, Ireland, and has worked on internal systems such as Google's web crawler as well as on external products such as Google Maps and Google Calendar. He specializes in the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high reliability bar as every other Google service.
Authored Publications
Designing and Operating Highly Available Software Systems at Scale
Michael Wildpaner
Escuela Politécnica de Ingeniería de Gijón, Gijón (2019)
The talk explains what Site Reliability Engineering (SRE) is and how it is used at Google, and gives an overview of the challenges of taking a regular LAMP-style small service to supporting 100M users. It also covers monitoring and other SRE dimensions, from capacity planning to design reviews.
Interviewing for Systems Design Skills
(2018) (to appear)
Google SRE has developed a special interview format called "Non-Abstract Large Systems Design" or NALSD. The focus of this interview is developing a credible approach for solving a specific problem at large scale. Going beyond coding and algorithm skills, candidates demonstrate their skills in designing for scalability, reliability and robustness, estimating provisioning needs, and managing change. All candidates for SRE positions at Google participate in one NALSD interview as part of their recruiting process.
Attendees will learn why Google has developed this interview format and which aspects of a candidate's skill set are covered in the format. They will see an example of this interview type, and learn how to come up with their own interview questions. Tips and tricks derived from practical experience in conducting this interview type will help attendees avoid common pitfalls when interviewing candidates.
Safe Client Behaviour
Shenzhen, China (2017)
Ubiquitous compute power has created frequent impedance mismatches between client capabilities and server capacity in distributed systems. Careful client behaviour design protects the server from unintended load and enables safe recovery after outages. These techniques improve resiliency both in microservice environments (where they protect microservices from each other) and in more traditional client-server environments (where a large number of clients, such as mobile phone apps, might be stacked against a comparatively small number of servers).
Reliable Launches at Scale
SRECon17 Asia, USENIX Association, Singapore (2017)
How do you perform up to 70 product and feature launches per week safely, reliably and reproducibly? Google staffed a dedicated team of Site Reliability Engineers to answer this question: Launch Coordination Engineers work across Google's service space to audit new products and features for reliability, act as liaisons between teams involved in a launch, and serve as gatekeepers.
Reliable Product Launches at Scale
Rhandeev Singh
Vivek Rau
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Building Blocks for Site Reliability
International Industry-Academia Workshop on Cloud Reliability and Resilience, EIT Digital, Berlin, Germany (2016)
How does Google run reliable systems? At the heart of Site Reliability Engineering is the idea of treating reliability as a software problem and asking software engineers to design an operations function. This talk will examine the organizational, conceptual and technological building blocks that together comprise the concept of site reliability engineering at Google.
The Many Ways Your Monitoring Is Lying To You
SRECon16 Europe, USENIX Association, Dublin, Ireland (2016)
This talk looks at various failure modes of monitoring systems, with the goal of making the audience more aware of the difference between the monitoring system's view of the world and the monitored system itself.