Betsy (Adrienne Elizabeth) Beyer

Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.
Authored Publications
    As with most large-scale migration efforts, the last 20% of Alphabet's BeyondCorp migration required disproportionate effort. After successfully transitioning most of the company's workflows to BeyondCorp, we still had a long tail of specific, oddball, or challenging situations to resolve. This article examines how we created processes, tools, and solutions to handle use cases that were not easily adapted to our core HTTPS-based workflow.
    This paper discusses an approach for making data pipelines both safer and less manual. We detail how we applied well-known reliability best practices from user-facing services to batch jobs that underpin many of the services that make up Google Workspace. Using validation steps, canarying, and target populations for data pipelines, we ensure that only stable versions are promoted to the next environment stage. By moving to a single, standardized platform, we minimized duplicate effort across services. We also touch on how we optimized batch jobs for both correctness and freshness SLOs, and the benefits of batch jobs vs. async event-based processing.
    What does a healthy fleet look like in a modern enterprise? How does one go from an unhealthy, or unknown, fleet to a healthy fleet? What tools and policies are essential? We dive into these topics as they formed a core part of our BeyondCorp journey at Google.
    From Corp to Cloud: Google's Virtual Desktops
    Matt Fata
    Patrick Hahn
    Philippe-Joseph Arida
    ACM Queue (2018)
    Until recently, GDesktop was hosted on commercially available hardware on our corporate network using a homegrown open-source virtual cluster management system called Ganeti. Today, this substantial and Google-critical workload runs on GCP. This article discusses why we moved to GCP, and how we carried out the migration.
    How SRE relates to DevOps
    Niall Richard Murphy
    Liz Fong-Jones
    Todd Underwood
    Laura Nolan
    O'Reilly and Associates (2018)
    DevOps and Site Reliability Engineering (SRE) have emerged in recent years as solutions for managing operations in IT and software development. Is one method better than the other? Will one of them eventually win out? This article explains why these two disciplines—in both practice and philosophy—are much more alike than you may think. Humans have been thinking about better ways to operate things for millennia, but despite all of this effort and thought, running enterprise software operations well remains elusive for many organizations. In this article, IT operations experts provide the key tenets of DevOps and SRE, compare and contrast the two, and explain the incentives necessary to successfully adopt either approach.
    The Site Reliability Workbook
    Niall Murphy
    Kent Kawahara
    O'Reilly and Associates (2018)
    In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment. This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t. Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is. You’ll learn:
      * How to run reliable services in environments you don’t completely control—like cloud
      * Practical applications of how to create, monitor, and run your services via Service Level Objectives
      * How to convert existing ops teams to SRE—including how to dig out of operational overload
      * Methods for starting SRE from either greenfield or brownfield
    “Canarying” is a colloquial term originating from bringing a caged canary into a mine to find dangerous gases. John Scott Haldane proposed the idea around 1913. In this article, canarying is a partial and time-limited deployment of a change in a service, followed by an evaluation of whether the service change is safe. The production change process may then roll forward, roll back, alert a human, or do something else. Effective canarying involves many decisions—for example, how to deploy the partial service change or choose meaningful metrics—and deserves a separate discussion. Canary Analysis Service (CAS) is a shared centralized service at Google that offers automatic (and often auto-configured) analysis of key metrics during a production change. We use CAS to analyze new versions of binaries, configuration changes, data set changes, and other production changes. CAS evaluates hundreds of thousands of production changes per day.
    Making it Last: Achieving Digital Permanence
    Raymond 'Princess Sparklefists' Blum
    ACM Queue, Nov-Dec 2018 (2018)
    The amount of information added to the corpus of humanity’s knowledge grows at an increasing rate. Meanwhile, the apparent “concreteness” of the datastore, and thus our confidence in the permanence and integrity of the data, is reduced with every technological leap. This presents challenges at many levels, the most basic of which is guaranteeing that the content that we retrieve is in fact the same information that we previously stored away for today’s use. This article will:
      * Examine the challenges in ensuring the integrity of our datastore
      * Identify classes of failure for data integrity
      * Share some techniques to counter or reduce the risk presented by each type of failure—whether encountered singly or in a perfect storm—brought about by a conspiring world.
    The Calculus of Service Availability
    Ben Treynor
    Benjamin Lutch
    Mike Dahlin
    Vivek Rau
    ACM Queue (2017)
    You're only as available as the sum of your dependencies.
    This article follows up on the SRE Book chapter “Postmortem Culture: Learning from Failure.” Here, we address the challenges in designing an appropriate action item plan and then executing that plan. We discuss best practices for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented.