Google Research

Help protect your datacenters with safety constraints

  • Christina Schulman
  • Etienne Perot
(2018)

Abstract

Running a multi-tenant, multi-datacenter compute infrastructure requires automating the management of machines across their lifecycles. We look at how Google keeps its own infrastructure safe in the face of rogue automation, human error, and ever-changing machine management software.

We’ll discuss common failure patterns that we’ve encountered in Google’s automation systems, and ways to avoid and mitigate them. We’ll also cover principles of a good production safety constraint checking service: when to use it, what constraints it should have, and how to make that system safe from itself.

These principles apply at any scale, and it’s easier to apply them if you start early.
