Betsy (Adrienne Elizabeth) Beyer
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.
Authored Publications
Preview abstract
As with most large-scale migration efforts, the last 20% of Alphabet's BeyondCorp migration required disproportionate effort. After successfully transitioning most of the company's workflows to BeyondCorp, we still had a long tail of specific, oddball, or challenging situations to resolve. This article examines how we created processes, tools, and solutions to handle use cases that were not easily adapted to our core HTTPS-based workflow.
Reliable Data Processing with Minimal Toil
Athena Vawda
Julia Lee
Rita Sodt
(2021)
Preview abstract
This paper discusses an approach for making data pipelines both safer and less manual. We detail how we applied well-known reliability best practices from user-facing services to the batch jobs that underpin many of the services that make up Google Workspace. Using validation steps, canarying, and target populations for data pipelines, we ensure that only stable versions are promoted to the next environment stage. By moving to a single, standardized platform, we minimized duplicate effort across services. We also touch on how we optimized batch jobs for both correctness and freshness SLOs, and the benefits of batch jobs vs. async event-based processing.
Making it Last: Achieving Digital Permanence
Raymond 'Princess Sparklefists' Blum
ACM Queue, vol. Nov-Dec 2018 (2018)
Preview abstract
The amount of information added to the corpus of humanity’s knowledge grows at an increasing rate. Meanwhile, the apparent “concreteness” of the datastore, and thus our confidence in the permanence and integrity of the data, is reduced with every technological leap. This presents challenges at many levels, the most basic of which is guaranteeing that the content that we retrieve is in fact the same information that we previously stored away for today’s use.
This article will:
* Examine the challenges in ensuring the integrity of our datastore
* Identify classes of failure for data integrity
* Share some techniques to counter or reduce the risk presented by each type of failure—whether encountered singly or in a perfect storm—brought about by a conspiring world.
BeyondCorp 6: Building a Healthy Fleet
Michael Janosko
Hunter King
;login:, vol. 43 (2018)
Preview abstract
What does a healthy fleet look like in a modern enterprise? How does one go from an unhealthy, or unknown, fleet to a healthy fleet? What tools and policies are essential? We dive into these topics as they formed a core part of our BeyondCorp journey at Google.
How SRE relates to DevOps
Niall Richard Murphy
Liz Fong-Jones
Todd Underwood
Laura Nolan
O'Reilly and Associates (2018)
Preview abstract
DevOps and Site Reliability Engineering (SRE) have emerged in recent years as solutions for managing operations in IT and software development. Is one method better than the other? Will one of them eventually win out? This article explains why these two disciplines—in both practice and philosophy—are much more alike than you may think.
Humans have been thinking about better ways to operate things for millennia, but despite all of this effort and thought, running enterprise software operations well remains elusive for many organizations. In this article, IT operations experts provide the key tenets of DevOps and SRE, compare and contrast the two, and explain the incentives necessary to successfully adopt either approach.
Preview abstract
In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:
* How to run reliable services in environments you don’t completely control—like cloud
* Practical applications of how to create, monitor, and run your services via Service Level Objectives
* How to convert existing ops teams to SRE—including how to dig out of operational overload
* Methods for starting SRE from either greenfield or brownfield
Preview abstract
Until recently, GDesktop was hosted on commercially available hardware on our corporate network using a homegrown open-source virtual cluster management system called Ganeti. Today, this substantial and Google-critical workload runs on GCP. This article discusses why we moved to GCP and how we carried out the migration.
Preview abstract
“Canarying” is a colloquial term originating from bringing a caged canary into a mine to find dangerous gases. John Scott Haldane proposed the idea around 1913.
In this article, canarying is a partial and time-limited deployment of a change in a service, followed by an evaluation of whether the service change is safe. The production change process may then roll forward, roll back, alert a human, or do something else. Effective canarying involves many decisions—for example, how to deploy the partial service change or choose meaningful metrics—and deserves a separate discussion.
Canary Analysis Service (CAS) is a shared centralized service at Google that offers automatic (and often auto-configured) analysis of key metrics during a production change. We use CAS to analyze new versions of binaries, configuration changes, data set changes, and other production changes. CAS evaluates hundreds of thousands of production changes per day.
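The roll forward / roll back / alert decision described above can be illustrated with a minimal sketch. This is not how CAS works internally; the names `Sample`, `Verdict`, and `evaluate_canary`, the single error-rate metric, and both thresholds are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ROLL_FORWARD = "roll_forward"
    ROLL_BACK = "roll_back"
    ALERT_HUMAN = "alert_human"


@dataclass
class Sample:
    """Traffic observed during the canary window (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def evaluate_canary(control: Sample, canary: Sample,
                    min_requests: int = 1000,
                    max_rate_increase: float = 0.01) -> Verdict:
    """Compare the canary population against the unchanged control population."""
    # Too little canary traffic to judge: hand the decision to a person.
    if canary.requests < min_requests:
        return Verdict.ALERT_HUMAN
    # The canary's error rate exceeds the control's by more than the
    # assumed tolerance: treat the change as unsafe.
    if canary.error_rate - control.error_rate > max_rate_increase:
        return Verdict.ROLL_BACK
    return Verdict.ROLL_FORWARD


if __name__ == "__main__":
    control = Sample(requests=100_000, errors=100)
    canary = Sample(requests=5_000, errors=200)
    print(evaluate_canary(control, canary))
```

A real evaluation would likely look at several metrics and use a statistical comparison rather than a fixed threshold; the sketch only shows the shape of the decision.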
Migrating to BeyondCorp: Maintaining Productivity While Improving Security
Jeff Peck
;login:, Summer 2017, vol. 42, no. 2 (2017)
Preview abstract
If you've read the three previous installments in the series about Google's BeyondCorp network security model, you may be thinking: “That all sounds good...but how does my organization move from where we are today to a similar model? What do I need to do? What's the potential impact on my company and my employees?” This article discusses how we moved from our legacy network to the BeyondCorp model--changing the fundamentals of network access--without breaking the company’s productivity.
Preview abstract
You're only as available as the sum of your dependencies.
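As a rough worked example of this claim: if a service needs every one of its critical dependencies to be up, and failures are assumed to be independent, its best-case availability is the product of its own availability and theirs. The function name and the figures below are hypothetical and chosen only for illustration.

```python
import math


def serial_availability(own: float, dependencies: list[float]) -> float:
    """Best-case availability of a service that requires all of its
    dependencies, assuming failures are independent of one another."""
    return own * math.prod(dependencies)


def downtime_minutes_per_30_days(availability: float) -> float:
    """Expected downtime over a 30-day window at the given availability."""
    return (1.0 - availability) * 30 * 24 * 60


if __name__ == "__main__":
    # Hypothetical figures: a 99.99% service with two critical dependencies.
    combined = serial_availability(0.9999, [0.9999, 0.999])
    print(f"combined availability: {combined:.6f}")
    print(f"downtime per 30 days: {downtime_minutes_per_30_days(combined):.1f} min")
```

Under these assumptions the combined figure is lower than that of the weakest dependency, so each added critical dependency reduces the availability the service can offer.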
Preview abstract
This article follows up on the SRE Book chapter “Postmortem Culture: Learning from Failure.” Here, we address the challenges in designing an appropriate action item plan and then executing that plan. We discuss best practices for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented.
BeyondCorp: The User Experience
Filip Zyzniewski
;login:, vol. tbd (2017), tbd
Preview abstract
Previous articles in the BeyondCorp series discuss aspects of the technical challenges we solved along the way (see BeyondCorp: Design to Deployment at Google and BeyondCorp: The Access Proxy). Beyond its purely technical features, the migration also had a human element: it was vital to keep our users constantly in mind throughout this process. Our goal was to keep the end user experience as seamless as possible. When things did go wrong, we wanted users to know exactly how to proceed and where to go for help. This article describes the experience of Google employees as they work within the BeyondCorp model, some new processes that BeyondCorp enabled, and how we help users when they run into issues.
Preview abstract
Improving security and usability at Google through an access model with dynamic tiers of trust for devices.
Reliable Product Launches at Scale
Rhandeev Singh
Vivek Rau
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Preview abstract
Reducing interrupts using the methodology taken from Bigtable SRE: Relieving technical debt through short projects.
This article begins by describing the landscape of work faced by Site Reliability Engineering (SRE) teams at Google: the types of work we undertake, the logistics of how SRE teams are organized across sites, and the inevitable toil we incur. Within this discussion, we focus on interrupts: how teams initially approached tickets, and why and how we implemented a better strategy. After providing a case study of how the ticket funnel was one such successful initiative, we offer practical advice about mapping what we learned to other organizations.
Service Level Objectives
Niall Murphy
Cody Smith
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
The Evolution of Automation at Google
Niall Murphy
John Looney
Michael Kacirek
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Data Integrity: What You Read Is What You Wrote
Raymond Blum
Rhandeev Singh
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Preview abstract
This article details the implementation of BeyondCorp's front end infrastructure. It focuses on the Access Proxy, the challenges we encountered in its implementation, and the resulting lessons we learned in its design and rollout. We also touch on some of the projects we're currently undertaking to improve the overall user experience for employees accessing internal applications.
In migrating to the BeyondCorp model (previously discussed in BeyondCorp: A New Approach to Enterprise Security and BeyondCorp: Design to Deployment at Google), Google had to solve a number of problems. One particular challenge was figuring out how to enforce company policy across all our internal-only services. A conventional approach might integrate each back end with the device Trust Inferer in order to evaluate applicable policies; however, this approach would significantly slow the rate at which we're able to launch and change products.
To address this challenge, Google implemented a centralized policy enforcement front end--the Access Proxy (AP)--to handle coarse-grained company policies. Our implementation of the AP is generic enough to let us implement logically different gateways using the same AP codebase. At the moment, the Access Proxy implements both the Web Proxy and the SSH gateway components, according to the terminology used in the previous article. As the AP was the only mechanism that allowed employees to access internal HTTP services, all internal services were required to migrate behind the AP.
Unsurprisingly, attempting to deal with only HTTP requests proved inadequate, so we had to provide solutions for additional protocols, many of which required end-to-end encryption (e.g., SSH). These additional protocols necessitated a number of client-side changes to ensure that the device was properly identified to the AP.
The combination of the AP and an Access Control Engine (a shared ACL evaluator) for all entry points provided a couple of main benefits: a common logging point for all requests allowed us to perform forensic analysis more effectively, and we were also able to make changes to enforcement policies much more quickly and consistently.
The Evolving SRE Engagement Model
Acacio Cruz
Tim Harvey
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
The Production Environment at Google, from the Viewpoint of an SRE
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Release Engineering
Dinah McNutt
Tim Harvey
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Invent More, Toil Less
Brendan Gleason
Dave O'Connor
Vivek Rau
;login:, vol. 41, issue 3 (2016), pp. 44-48
Preview abstract
This article is a follow-up to Vivek Rau's chapter "Eliminating Toil" in Site Reliability Engineering: How Google Runs Production Systems. We begin by recapping Vivek's definition of toil and Google's approach to balancing operational work with engineering project work. The Bigtable SRE case study then presents a concrete example of how one team at Google went about reducing toil. Finally, we leave readers with a series of best practices that should be helpful in reducing toil no matter the size or makeup of the organization.
Eliminating Toil
Vivek Rau
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Communication and Collaboration in SRE
Niall Richard Murphy
Alex Rodriguez
Carl Crous
Dylan Curley
Lorenzo Blanco
Todd Underwood
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Monitoring Distributed Systems
Rob Ewaschuk
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Lessons Learned from Other Industries
Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
Preview abstract
The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?
In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.
This book is divided into four sections:
* Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
* Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
* Practices: Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
* Management: Explore Google's best practices for training, communication, and meetings that your organization can use
Preview abstract
To run the company’s numerous services as efficiently and reliably as possible, Google’s Site Reliability Engineering (SRE) organization leverages the expertise of two main disciplines: Software Engineering and Systems Engineering. The roles of Software Engineer (SWE) and Systems Engineer (SE) lie at the two poles of the SRE continuum of skills and interests. While Site Reliability Engineers tend to be assigned to one of these two buckets, the roles overlap considerably, and knowledge flows freely between them.
Preview abstract
Virtually every company today uses firewalls to enforce perimeter security. However, this security model is problematic because, when that perimeter is breached, an attacker has relatively easy access to a company’s privileged intranet. As companies adopt mobile and cloud technologies, the perimeter is becoming increasingly difficult to enforce. Google is taking a different approach to network security. We are removing the requirement for a privileged intranet and moving our corporate applications to the Internet.
Also see https://cloud.google.com/beyondcorp/