Jennifer Petoff
Jennifer Petoff is Director of Google Cloud Platform (GCP) & Technical Infrastructure (TI) Education and is based in Lisbon, Portugal. She leads training programs for Google's GCP and TI Engineering Teams. Jennifer is one of the co-editors of the best-selling book, Site Reliability Engineering: How Google Runs Production Systems and is a regular speaker at DevOps and SRE conferences around the world. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester in the United States.
Authored Publications
Preview abstract
Real-world experience and things that go wrong are two of life’s best teachers. This talk will explore key elements of scalable large-system design and Site Reliability Engineering (SRE) principles* through anti-patterns encountered in real life. Find out what lessons can be gleaned from watching the dynamics in a crowded cafe or dealing with a security issue during a hotel stay. Learn about fundamental site reliability engineering principles and practices, including:
-Avoiding cascading failures
-Not feeding the machines with human toil
-Writing blameless postmortems
-Engineering solutions to eliminate classes of errors rather than implementing point fixes
These principles will be framed through a lens of the suboptimal while demonstrating the impact of SRE anti-patterns on user trust.
* SRE is often thought of as a specific implementation of the DevOps interface.
Why Training Matters to an SRE Practice and Why SRE Matters To Your Training Program
97 Things Every SRE Should Know, O'Reilly (2021), pp. 162-163
Preview abstract
This contribution explores why training matters to a successful and inclusive SRE practice. On the flip side, I’ll share what learning and development practitioners can learn from SRE principles, practices, and culture to deliver a consistent and reliable program.
Preview abstract
COVID-19 changed work and the workplace as we know it around the world. The need for social distancing meant that onboarding new team members also had to change. Google's SRE EDU team had to react and evolve in the face of rapidly changing conditions, pivoting from an in-person orientation for new hires, with team members flying in from different locations to meet in a classroom, to a fully remote experience. This talk will cover how Google's SRE EDU team delivered a work-from-home onboarding experience in 13 days, avoiding disruptions to training operations by applying SRE principles and best practices. We’ll share lessons learned from our Live → Remote postmortem that we expect to apply to organizations of all sizes, along with recommendations for how to make the most of difficult circumstances and set new hires up for success.
Swim! Don't Sink. Why Training Matters to an SRE Practice
Feedback Loops: Voices of All Day DevOps, Volume 2, All Day DevOps Press (2020), pp. 127-132
Preview abstract
Do you offer training to the engineers in your organization or do you throw them off the deep end to “sink or swim”? Providing training and education is universally important to set team members up for success in your organization and is critical for establishing a thriving Site Reliability Engineering (SRE) or DevOps practice and culture in the first place. The specific training needs of each engineer vary depending on several factors, including:
-The maturity of your organization in adopting DevOps / SRE principles, practices, and culture
-The knowledge those individuals have about your organization and infrastructure
-The experience of the individuals being trained, both in terms of technical skill and familiarity with the SRE / DevOps model
This talk will explore the business case for training, the trade-offs between cost and effectiveness, and best practices for training design and deployment depending on where your organization lies on the spectrum of size and maturity. Learn why training is not about unleashing a fire hose of information upon unsuspecting engineers but about giving those engineers the confidence to run production systems at scale.
Preview abstract
Do you offer training to the engineers in your organization or do you throw them off the deep end to “sink or swim”? Providing training and education is universally important to set team members up for success in your organization and is critical for establishing a thriving Site Reliability Engineering (SRE) or DevOps practice and culture in the first place.
The specific training needs of each engineer vary depending on several factors, including:
-The maturity of your organization in adopting DevOps / SRE principles, practices, and culture
-The knowledge those individuals have about your organization and infrastructure
-The experience of the individuals being trained, both in terms of technical skill and familiarity with the SRE / DevOps model
This talk will explore the business case for training, the trade-offs between cost and effectiveness, and best practices for training design and deployment depending on where your organization lies on the spectrum of size and maturity.
Learn why training is not about unleashing a fire hose of information upon unsuspecting engineers but about giving those engineers the confidence to run production systems at scale.
Preview abstract
The DevOps Institute publishes a "Humans of DevOps" podcast. Jennifer Petoff answers a series of foundational questions about SRE principles and practices plus some information on SRE Training.
Preview abstract
This talk addresses how to apply SRE principles and best practices in running a consistent and reliable training program for an SRE team. We’ll look at this from both a technical and an operations perspective. We’ll share the importance of giving new SREs hands-on experience with production infrastructure early, in an environment that is real but safe for them to learn in. We’ll also share some challenges that we encountered in building an educational stack and associated curriculum that can be induced to break on demand (e.g., SRE-managed platforms are resilient and sometimes you *can’t* easily break them in the ways you want) and approaches to solving those challenges.
Preview abstract
Readers of this report will understand the state of the art for training Site Reliability Engineers in both general and domain-specific techniques. This report addresses SRE development and operations practices, along with discussion on how to sustain SRE practices through individual and organizational change. The report will look at training best practices within Google SRE, and also how some Google Customer Reliability Engineering (CRE) partners approach SRE training.
Preview abstract
Short Description
This talk addresses what we learned when scaling training best practices globally at Google. Along the way, we’ll share tips for small and large organizations alike on how you can learn from our experience and ensure that you deliver an effective training experience for your SREs.
Full Description
In 2015, Andrew Widdowson gave a talk at SREcon Americas titled “From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams”. His recommendations were based on nearly a decade of personal experience ramping up new SREs at Google.
Fast forward to 2018. Google SRE now has a global training organization called SRE EDU. In many ways, SRE EDU was charged with developing a formal program to deploy these training best practices into production. Our goal? Spin up a globally consistent and reliable education program for Site Reliability Engineering.
Of course, a cornerstone of SRE practice is the blameless postmortem. This talk addresses what we learned when scaling training best practices globally. Along the way, we’ll share tips for small and large organizations alike on how you can learn from our experience and ensure that you deliver an effective training experience for your SREs.
Embracing Failure
(2018)
Preview abstract
The key Site Reliability Engineering principle of embracing failure is discussed on the Red Hat Command Line Heroes Podcast.