- Christina Schulman
- Etienne Perot
Running a multi-tenant, multi-datacenter compute infrastructure requires automating the management of machines throughout their lifecycles. We look at how Google keeps its own infrastructure safe in the face of rogue automation, human error, and ever-changing machine management software.
We’ll discuss common failure patterns that we’ve encountered in Google’s automation systems, and ways to avoid and mitigate them. We’ll also cover the principles of a good production safety constraint-checking service: when to use it, what constraints it should enforce, and how to keep the service safe from itself.
These principles apply at any scale, and it’s easier to apply them if you start early.