Andrew D. Ferguson

Authored Publications
    CAPA: An Architecture For Operating Cluster Networks With High Availability
    Bingzhe Liu
    Mukarram Tariq
    Omid Alipourfard
    Rich Alimi
    Deepak Arulkannan
    Virginia Beauregard
    Patrick Conner
    Brighten Godfrey
    Xander Lin
    Mayur Patel
    Joon Ong
    Amr Sabaa
    Alex Smirnov
    Manish Verma
    Prerepa Viswanadham
    Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 (2023)
    Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have been substantially reduced in both frequency and severity, with an 82% reduction in cumulative duration of incidents normalized to fleet size over five years.
    Orion: Google’s Software-Defined Networking Control Plane
    Amr Sabaa
    Henrik Muehe
    Joon Suan Ong
    Karthik Swaminathan Nagaraj
    KondapaNaidu Bollineni
    Lorenzo Vicisano
    Mike Conley
    Min Zhu
    Rich Alimi
    Shawn Chen
    Shidong Zhang
    Waqar Mohsin
    (2021)
    We present Orion, a distributed Software-Defined Networking platform deployed globally in Google’s datacenter (Jupiter) as well as Wide Area (B4) networks. Orion was designed around a modular, micro-service architecture with a central publish-subscribe database to enable a distributed, yet tightly coupled, software-defined network control system. Orion enables intent-based management and control, is highly scalable, and is amenable to global control hierarchies. Over the years, Orion has matured with continuously improving performance in convergence (up to 40x faster), throughput (handling up to 1.16 million network updates per second), system scalability (supporting 16x larger networks), and data plane availability (50x and 100x reduction in unavailable time in Jupiter and B4, respectively) while maintaining high development velocity with a bi-weekly release cadence. Today, Orion robustly enables all of Google’s Software-Defined Networks, defending against failure modes that are both generic to large-scale production networks and unique to SDN systems.