Luis Quesada
Luis Quesada is a Site Reliability Engineer and Manager at Google, where he is responsible for keeping Google Cloud’s Artificial Intelligence products running reliably and efficiently.
Authored Publications
As an SRE, you're responsible for determining the initial resource requirements of your service and ensuring your service behaves reasonably even in the face of unexpected demand. Capacity management is the process of ensuring you have the appropriate amount of resources for your service to be scalable, efficient, and reliable. Both user-facing and company-internal services must accommodate expected and unexpected growth. We define utilization as the percentage of a resource that is being used. It's difficult to determine initial resource utilization and to predict future needs. We present ways to estimate utilization and identify blind spots, and we discuss the benefits of building in redundancy to avoid failures. You'll use this information to design your architecture so that increasing the resource allocation for each component of the service increases the capacity of the entire service linearly.
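The definition of utilization in this abstract can be illustrated with a short sketch. The function names, the target-utilization parameter, and the spare-replica count below are hypothetical and chosen only for illustration; they are not taken from the publication, and the sizing rule assumes capacity scales linearly with the number of replicas.

```python
import math


def utilization_percent(used: float, capacity: float) -> float:
    """Utilization as the percentage of a resource that is being used."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


def replicas_needed(
    peak_demand: float,
    per_replica_capacity: float,
    target_utilization_percent: float,
    spare_replicas: int = 1,
) -> int:
    """Estimate a replica count (hypothetical sizing rule).

    Assumes each added replica adds the same capacity, then adds
    spare replicas as redundancy against failures.
    """
    if not 0 < target_utilization_percent <= 100:
        raise ValueError("target utilization must be in (0, 100]")
    # Capacity each replica may serve while staying at the target utilization.
    usable = per_replica_capacity * target_utilization_percent / 100.0
    return math.ceil(peak_demand / usable) + spare_replicas
```

Under these assumptions, a service with a peak demand of 1000 requests per second, replicas that each handle 100 requests per second, and a 50% utilization target would be sized at 20 replicas plus one spare.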
The Site Reliability Engineering Workbook Chapter: Identifying and Recovering from Overload
Maria-Hendrike Peetz
Marilia Melo
Diane Bates
The Site Reliability Engineering Workbook: Practical Ways to Implement SRE, O'Reilly (2018)
When an SRE team is running smoothly, team members should feel like they can comfortably handle all of their work. They should be able to work on tickets and still have time to work on long-term projects that make it easier to manage the service in the future.
But sometimes circumstances get in the way of a team’s work goals. Team members take time off for long-term illnesses or move to new teams. Organizations hand down new production-wide programs for SRE. Changes to the service or the larger system introduce new technical challenges. As workload increases, team members start working longer hours to handle tickets and pages and spend less time on engineering work. The whole team starts to feel stressed and frustrated as they work harder but don’t feel like they are making progress. Stress, in turn, causes people to make more mistakes, impacting reliability and, ultimately, end users. In short, the team loses its ability to regulate its daily work and effectively manage the service.
At this point, the team needs to find a way out of this overloaded state. They need to rebalance their workload so that team members can focus on essential engineering work.