Ramón Medrano Llamas
Ramón is a Site Reliability Engineer working on the Identity team. He started in 2011 as an intern and has since become the team's Technical Lead (TL) and Engineering Manager, and recently moved into a üTL role for all the Privacy, Safety and Security teams. Our role is to store, manage and safeguard user accounts, from account creation to credential management, including account security such as hijacking and phishing protection.
Prior to Google, Ramón worked at CERN, being part of the Physics Department and the ATLAS Collaboration, where he developed the ROOT framework for data analysis and then the functional testing framework to validate and ensure the reliability of the distributed computing facilities that allowed for the Higgs Boson discovery in 2012.
He holds an MSc in Computer Engineering and, for the last decade, has been researching part time on autonomic computing and the management of computer fleets in data centers and enterprises, with the aim of optimising and reducing their power usage.
Authored Publications
How we use GenAI in SRE
CommitConf, Madrid (2024)
Google services are powered by the largest network of computers in the world. Site Reliability Engineers (SREs) make sure that the whole stack is cool: that datacenters are safe and well provisioned, that we have fallback mechanisms and data integrity, and that we design our stack properly, using the right storage, replication and software trade-offs.
Generative AI is a great tool to make us super-effective: having access to tools to generate our most toily configurations, to classify risks and events, to manage large swaths of machines with agents or to automate complex workflows cheaply.
This talk will cover the journey that SRE started years ago to become a truly AI-First discipline and the latest advancements in tooling, practices and workflows.
Swinging the Engineer/Manager Pendulum
T3chFest, Madrid (2023)
Both coding and people need continuous, intense and conscious focus and attention. Being a manager teaches you the business; being an IC teaches you the possibilities of the tech. Being at the intersection makes you invaluable.
When you move to one side, your skills on the other begin to deteriorate. What should you do? If you move into management, are you going to be a manager forever? What is management in tech anyway? Can you go back to writing code?
This talk will show the process for on-ramping and off-ramping, and the pitfalls along the way.
Autonomic power management of a PC fleet
Ph.D. Thesis, University of Oviedo (2022)
Both the transition to green energy to reduce CO2 emissions to net-zero by 2050 and the increase in energy prices suggest that we must find ways to reduce electricity consumption in all sectors, and in particular in the ICT sector. A large majority of companies have fleets of computers for their employees, of variable size, but growing. While in the operation of these fleets one of the biggest costs is energy consumption, many of the computers spend long periods of time turned on, but idling, thus wasting large amounts of electricity.
Dynamic Power Management (DPM) is a set of techniques and methods that are applied at different levels to reduce the consumption and heat dissipation of a computer. It includes techniques as varied as microprocessor dynamic frequency scaling (DFS) or turning off devices that are not in use. The different DPM techniques are directed by a series of energy management policies, which establish the operating guidelines of the different components. These policies are generated using different methods, adapted to the component being managed and the objectives to be achieved.
This thesis presents a DPM technique applied to a complete computer fleet. The goal is to reduce fleet consumption by proactively shutting down computers, while maintaining high levels of user satisfaction. The policies that direct the energy management system are generated from data collected from the fleet under study and management. Utilisation models are generated from that data and allow representing and predicting the behavior of each user, thus making it possible to generate fully customized policies for each user.
One of the main contributions of this thesis is the use of satisfaction as a central metric in order to solve the optimisation problem that is the generation of energy policies. New metrics are defined that allow user satisfaction to be measured when the fleet is being optimized by the energy management system and, most importantly, to generate energy policies that guarantee a certain level of satisfaction for each user.
In order to verify and apply the proposed energy management method, a tool has been implemented that allows obtaining policies for a given fleet, studying variations and, using a simulation method, generating synthetic fleet records.
Finally, a validation of the presented work has been carried out, with results showing that it is possible to save up to 90 % of the energy otherwise wasted.
Incident Management at Scale
At Scale Conferences (2022)
This talk is an introduction to our IMAG protocol, explaining the topics that the SRE book introduced around incident management.
SRE & Python
PyConES, Granada (2022)
We talk about the principles of SRE and the multitude of Python software we employ in the task of orchestrating machines and containers.
Modelling user satisfaction for power-usage optimisation of computer fleets
Joaquín Entrialgo
Daniel F. García
Simulation Modelling Practice and Theory, 108 (2021), pp. 102263
Power consumption costs of computer fleets can be one of the main operational costs of medium or large-sized office sites. In order to optimise the power consumption of the fleet, a set of optimal power management policies must be generated and enforced.
Generating these policies is an optimisation problem: finding the power-off timeout value for each computer that maximises energy savings while guaranteeing user satisfaction. To solve this problem, understanding the computer utilisation patterns of the users and defining a metric of user satisfaction are fundamental.
This paper presents a method to analyse user activity and inactivity, extract models from previously recorded utilisation logs and use them to manage a whole computer fleet. A tool that implements this method is also introduced. This tool generates power management policies from utilisation logs. It analyses the effects of variations in fleet characteristics on policies by means of discrete event simulation. It also seeks to understand the behavioural patterns of users over weekly periods. Finally, it generates utilisation logs from high level descriptions of fleets. This tool offers a simulation to study diverse fleet configurations and generation of synthetic fleets.
Autoscaling Services On All Dimensions
All Day DevOps (2019)
Why do toil if the machine can do it for you? This talk covers the multitude of autoscaling mechanisms applicable to service meshes made of containers managed by systems like Borg, Kubernetes, Swarm or DC/OS, including vertical and horizontal scaling, auto turnup, load shifting, etc.
When deploying containerized stateless services on clusters managed by Kubernetes, for example, the most efficient way to run them is with the minimal number of replicas possible to cover the load, maximizing the utilization of resources. Calculating the number of replicas needed to maintain a reliable service can be tricky because of Pod restarts, traffic imbalances, load shifts, etc.
Further, vertically scaling services is a multidimensional problem, and services based on virtual machines like the JVM present specific challenges for autoscaling.
Configuring the autoscaler for the right utilization levels, using the right metrics and the right decaying factors is key for successfully scaling services.
Engineering Reliability
PC Microservices, Dortmund (2019)
How do you scale up a service, so it can serve millions (or billions!) of users around the globe, make it reliable and fast while maintaining development speed and change safety?
This talk introduces Site Reliability Engineering (SRE) at Google, explaining its purpose and describing the techniques it uses and the challenges it addresses. SRE teams manage Google's many services and properties, plus all the brand new Cloud infrastructure, from our offices worldwide. They draw upon Linux-based computing resources that are distributed across several data centres around the globe to deploy, manage and serve globally available services for billions of users.
Designing and Operating Highly Available Software Systems at Scale
Michael Wildpaner
Escuela Politécnica de Ingeniería de Gijón, Gijón (2019)
The talk explains what Site Reliability Engineering (SRE) is and how it is used at Google, and gives an overview of the challenges of growing a regular LAMP-style small service to support 100M users. It also covers monitoring and other SRE dimensions, from capacity planning to design reviews.
SRE Principles
DevOps Days Zürich, Winterthur (2018)
As Ben Treynor (VP of 24x7 at Google and founding father of SRE) puts it, "SRE, fundamentally, it’s what happens when you ask a software engineer to design an operations function". What differentiates SRE (Site Reliability Engineering) from DevOps? Aren't they the same?
SRE is a job function that focuses on the reliability and maintainability of systems. It is also a mindset and a set of engineering practices to run better production services. An SRE has to be able to engineer creative solutions to problems, strike the right balance between reliability and feature velocity and target appropriate levels of service quality.
This talk covers the principles under which all SRE teams operate at Google: consistency, design of systems, monitoring, automation, error budgets, blameless postmortems, etc.