Betsy (Adrienne Elizabeth) Beyer
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.
Authored Publications
Preview abstract
As with most large-scale migration efforts, the last 20% of Alphabet's BeyondCorp migration required disproportionate effort. After successfully transitioning most of the company's workflows to BeyondCorp, we still had a long tail of specific, oddball, or challenging situations to resolve. This article examines how we created processes, tools, and solutions to handle use cases that were not easily adapted to our core HTTPS-based workflow.
Reliable Data Processing with Minimal Toil
Athena Vawda
Julia Lee
Rita Sodt
(2021)
Preview abstract
This paper discusses an approach for making data pipelines both safer and less manual. We detail how we applied well-known reliability best practices from user-facing services to the batch jobs that underpin many of the services that make up Google Workspace. Using validation steps, canarying, and target populations for data pipelines, we ensure that only stable versions are promoted to the next environment stage. By moving to a single, standardized platform, we minimized duplicate effort across services. We also touch on how we optimized batch jobs for both correctness and freshness SLOs, and the benefits of batch jobs vs. async event-based processing.
Preview abstract
“Canarying” is a colloquial term originating from bringing a caged canary into a mine to find dangerous gases. John Scott Haldane proposed the idea around 1913.
In this article, canarying is a partial and time-limited deployment of a change in a service, followed by an evaluation of whether the service change is safe. The production change process may then roll forward, roll back, alert a human, or do something else. Effective canarying involves many decisions—for example, how to deploy the partial service change or choose meaningful metrics—and deserves a separate discussion.
Canary Analysis Service (CAS) is a shared centralized service at Google that offers automatic (and often auto-configured) analysis of key metrics during a production change. We use CAS to analyze new versions of binaries, configuration changes, data set changes, and other production changes. CAS evaluates hundreds of thousands of production changes per day.
Making it Last: Achieving Digital Permanence
Raymond 'Princess Sparklefists' Blum
ACM Queue, Nov-Dec 2018 (2018)
Preview abstract
The amount of information added to the corpus of humanity’s knowledge grows at an increasing rate. Meanwhile, the apparent “concreteness” of the datastore, and thus our confidence in the permanence and integrity of the data, is reduced with every technological leap. This presents challenges at many levels, the most basic of which is guaranteeing that the content that we retrieve is in fact the same information that we previously stored away for today’s use.
This article will:
* Examine the challenges in ensuring the integrity of our datastore
* Identify classes of failure for data integrity
* Share some techniques to counter or reduce the risk presented by each type of failure—whether encountered singly or in a perfect storm—brought about by a conspiring world.
Preview abstract
In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:
* How to run reliable services in environments you don’t completely control—like cloud
* Practical applications of how to create, monitor, and run your services via Service Level Objectives
* How to convert existing ops teams to SRE—including how to dig out of operational overload
* Methods for starting SRE from either greenfield or brownfield
Preview abstract
Until recently, GDesktop was hosted on commercially available hardware on our corporate network using a homegrown open-source virtual cluster management system called Ganeti. Today, this substantial and Google-critical workload runs on GCP. This article discusses why we moved to GCP and how we carried out the migration.
Preview abstract
What does a healthy fleet look like in a modern enterprise? How does one go from an unhealthy, or unknown, fleet to a healthy fleet? What tools and policies are essential? We dive into these topics as they formed a core part of our BeyondCorp journey at Google.
How SRE relates to DevOps
Niall Richard Murphy
Liz Fong-Jones
Todd Underwood
Laura Nolan
O'Reilly and Associates (2018)
Preview abstract
DevOps and Site Reliability Engineering (SRE) have emerged in recent years as solutions for managing operations in IT and software development. Is one method better than the other? Will one of them eventually win out? This article explains why these two disciplines—in both practice and philosophy—are much more alike than you may think.
Humans have been thinking about better ways to operate things for millennia, but despite all of this effort and thought, running enterprise software operations well remains elusive for many organizations. In this article, IT operations experts provide the key tenets of DevOps and SRE, compare and contrast the two, and explain the incentives necessary to successfully adopt either approach.
Migrating to BeyondCorp: Maintaining Productivity While Improving Security
Jeff Peck
Login, Summer 2017, Vol. 42, No. 2 (2017)
Preview abstract
If you've read the three previous installments in the series about Google's BeyondCorp network security model, you may be thinking: “That all sounds good...but how does my organization move from where we are today to a similar model? What do I need to do? What's the potential impact on my company and my employees?” This article discusses how we moved from our legacy network to the BeyondCorp model, changing the fundamentals of network access, without breaking the company’s productivity.
Preview abstract
This article follows up on the SRE Book chapter “Postmortem Culture: Learning from Failure.” Here, we address the challenges in designing an appropriate action item plan and then executing that plan. We discuss best practices for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented.