CAPA: An Architecture For Operating Cluster Networks With High Availability

Bingzhe Liu; Colin Scott; Mukarram Tariq; Andrew Ferguson; Phillipa Gill; Omid Alipourfard; Rich Alimi; Deepak Arulkannan; Virginia Beauregard; Patrick Conner; Brighten Godfrey; Xander Lin; Mayur Patel; Joon Ong; Amr Sabaa; Arjun Singh; Alex Smirnov; Manish Verma; Prerepa Viswanadham; Amin Vahdat

Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 (2023)


Abstract

Management operations are a major source of network outages. Best practices for reducing and mitigating such outages are well known, but enforcing them has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps that repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the set of systems where strict adherence to best practices is required, so that the availability of the network does not depend on the good intentions of every engineer and operator. We enumerate the features of CAPA that we have found necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have decreased substantially in both frequency and severity, with an 82% reduction in the cumulative duration of incidents, normalized to fleet size, over five years.

Research Areas

Networking
