Cloud networking

Our team explores all aspects of networking and distributed-systems research to design and build the fastest, most reliable, and most secure cloud networks.

About the team

We constantly evolve our cloud networking solutions to provide a great cloud experience to billions of users. Our focus spans customer-facing networking API design through network data- and control-plane programming, including hardware programming. We exercise the Hybrid Research model by deploying our solutions in Google Cloud Platform, one of the largest and fastest-growing cloud providers in the industry.

Our researchers come from diverse backgrounds: networking, distributed systems, network security, kernel programming, and algorithms. Many members of our team have extensive research experience, with publications in conferences such as SIGCOMM, NSDI, SOSP, and OSDI.

Team focus summaries

Software defined networking (SDN)

Cloud networking employs SDN extensively. We are developing next-generation SDN controller platforms that meet Google's scale and reliability requirements while providing network functions such as routing, load balancing, and firewalling.

High performance and secure data plane

We are building the next-generation cloud virtual network data plane, which provides low-latency, CPU-efficient, secure communication at line rate. We are exploring both software and hardware techniques for fast, flexible, safe packet processing, including hardware onload, offload, and more.

Network measurement and analytics

We are investing significantly in building our own network measurement and monitoring platform to detect traffic anomalies accurately, in near real time, at global scale. These signals are also used for ML-driven abuse prevention.
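As a simple illustration of the kind of streaming anomaly detection such a platform might perform on a traffic counter, here is a generic EWMA-based sketch (this is not Google's system; the smoothing factor, threshold, and sample values are invented for illustration):

```python
class EwmaAnomalyDetector:
    """Flags a sample as anomalous when it deviates from an
    exponentially weighted moving average by more than k standard
    deviations. Illustrative only, not a production detector."""

    def __init__(self, alpha=0.3, k=3.0):
        self.alpha = alpha      # EWMA smoothing factor
        self.k = k              # deviation threshold in std-devs
        self.mean = None
        self.var = 0.0

    def observe(self, x):
        if self.mean is None:   # first sample seeds the baseline
            self.mean = x
            return False
        diff = x - self.mean
        anomalous = self.var > 0 and abs(diff) > self.k * self.var ** 0.5
        # Update the running mean and variance after the check, so a
        # spike does not immediately absorb itself into the baseline.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return anomalous

detector = EwmaAnomalyDetector()
samples = [100, 102, 99, 101, 100, 500]  # sudden traffic spike at the end
flags = [detector.observe(s) for s in samples]
```

Only the final spike is flagged; the small fluctuations before it stay within the learned band.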

Congestion control, quality of service and traffic management

We employ state-of-the-art congestion control and traffic management schemes, over the WAN and in data centers, to deliver high performance as well as isolation across multiple tenants.

Network modeling and verification

We are working increasingly towards intent-based networking, using techniques such as topology modeling and verification for packet reachability analysis.
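A minimal sketch of what packet reachability analysis over a modeled topology can look like, assuming a toy forwarding table in which every node name and prefix is hypothetical:

```python
# Toy topology model: each node maps a destination prefix (matched
# exactly here, for simplicity) to a next-hop node. A node forwarding
# to itself models local delivery. All names are invented.
forwarding = {
    "edge-a": {"10.0.2.0/24": "core-1"},
    "core-1": {"10.0.2.0/24": "edge-b"},
    "edge-b": {"10.0.2.0/24": "edge-b"},  # self = delivered locally
}

def reachable(src, dst_prefix):
    """Return the path a packet to dst_prefix takes from src, or None
    if it is dropped or loops (a toy form of reachability analysis)."""
    path, node, seen = [src], src, {src}
    while True:
        nxt = forwarding.get(node, {}).get(dst_prefix)
        if nxt is None:
            return None           # no matching rule: packet dropped
        if nxt == node:
            return path           # delivered at this node
        if nxt in seen:
            return None           # forwarding loop detected
        seen.add(nxt)
        path.append(nxt)
        node = nxt
```

Real verification operates over symbolic header spaces rather than single packets, but the core question is the same: for a given model of the forwarding state, where can a packet actually go?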

WAN design

We continue to innovate in high-performance, fault-tolerant, and cost-effective data-driven WAN designs for connectivity from the data center edge to the Internet, supporting everything from L3 connectivity to various cloud-level products. We aim to build solutions that are reliable and provide seamless connectivity to GCP customers regardless of their location.

Featured publications

Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
Mike Dalton
David Schultz
Ahsan Arefin
Alex Docauer
Anshuman Gupta
Brian Matthew Fahs
Dima Rubinstein
Enrique Cauich Zermeno
Erik Rubow
Jake Adriaens
Jesse L Alpert
Jing Ai
Jon Olson
Kevin P. DeCabooter
Nan Hua
Nathan Lewis
Nikhil Kasinadhuni
Riccardo Crepaldi
Srinivas Krishnan
Subbaiah Venkata
Yossi Richter
15th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2018
Abstract: This paper presents our design and experience with Andromeda, Google Cloud Platform’s network virtualization stack. Our production deployment poses several challenging requirements, including performance isolation among customer virtual networks, scalability, rapid provisioning of large numbers of virtual hosts, bandwidth and latency largely indistinguishable from the underlying hardware, and high feature velocity combined with high availability. Andromeda is designed around a flexible hierarchy of flow processing paths. Flows are mapped to a programming path dynamically based on feature and performance requirements. We introduce the Hoverboard programming model, which uses gateways for the long tail of low bandwidth flows, and enables the control plane to program network connectivity for tens of thousands of VMs in seconds. The on-host dataplane is based around a high-performance OS bypass software packet processing path. CPU-intensive per packet operations with higher latency targets are executed on coprocessor threads. This architecture allows Andromeda to decouple feature growth from fast path performance, as many features can be implemented solely on the coprocessor path. We demonstrate that the Andromeda datapath achieves performance that is competitive with hardware while maintaining the flexibility and velocity of a software-based architecture.
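The Hoverboard idea from the abstract, in which new flows default to shared gateways and only high-bandwidth flows get direct host routes programmed, can be sketched as a small dispatcher. This is a greatly simplified illustration of the concept; the class name, flow representation, and threshold are all invented:

```python
class HoverboardSketch:
    """Illustrative sketch of Hoverboard-style routing: flows start on
    a shared gateway path, and a flow whose measured bandwidth crosses
    a threshold is promoted to a direct host route. Not the real
    Andromeda control plane; names and numbers are hypothetical."""

    def __init__(self, threshold_mbps=20.0):
        self.threshold = threshold_mbps
        self.direct_routes = set()   # flows offloaded to the fast path

    def route(self, flow, measured_mbps):
        # Promote high-bandwidth ("elephant") flows to a direct route;
        # once installed, the route persists for the flow.
        if measured_mbps >= self.threshold:
            self.direct_routes.add(flow)
        return "direct" if flow in self.direct_routes else "gateway"

vn = HoverboardSketch()
vn.route(("vm-a", "vm-b"), 0.5)    # long-tail flow stays on the gateway
vn.route(("vm-a", "vm-c"), 80.0)   # elephant flow gets a direct route
```

The payoff described in the paper is that the control plane only has to program routes for the small number of elephant flows, which is what makes provisioning tens of thousands of VMs in seconds feasible.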
BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing
Björn Carlin
C. Stephen Gunn
Enrique Cauich Zermeno
Jing Ai
Mathieu Robin
Nikhil Kasinadhuni
Sushant Jain
ACM SIGCOMM 2015
Abstract: WAN bandwidth remains a constrained resource that is economically infeasible to substantially overprovision. Hence, it is important to allocate capacity according to service priority and based on the incremental value of additional allocation in particular bandwidth regions. For example, it may be highest priority for one service to receive 10Gb/s of bandwidth but upon reaching such an allocation, incremental priority may drop sharply favoring allocation to other services. Motivated by the observation that individual flows with fixed priority may not be the ideal basis for bandwidth allocation, we present the design and implementation of Bandwidth Enforcer (BwE), a global, hierarchical bandwidth allocation infrastructure. BwE supports: i) service-level bandwidth allocation following prioritized bandwidth functions where a service can represent an arbitrary collection of flows, ii) independent allocation and delegation policies according to user-defined hierarchy, all accounting for a global view of bandwidth and failure conditions, iii) multi-path forwarding common in traffic-engineered networks, and iv) a central administrative point to override (perhaps faulty) policy during exceptional conditions. BwE has delivered more service-efficient bandwidth utilization and simpler management in production for multiple years.
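The prioritized bandwidth functions the abstract describes can be approximated by a toy priority-ordered max-min allocator: higher-priority demand bands are satisfied first, and within a band, scarce capacity is split fairly. This is greatly simplified relative to the real BwE (no hierarchy, delegation, or multi-path), and the demand format is invented for illustration:

```python
def allocate(capacity, demands):
    """Toy priority-ordered max-min allocation.
    demands: {service: [(priority, amount), ...]} where higher
    priority numbers are served first. Not the real BwE algorithm."""
    alloc = {s: 0.0 for s in demands}
    bands = sorted({p for reqs in demands.values() for p, _ in reqs},
                   reverse=True)
    for band in bands:
        # Collect each service's demand at this priority level.
        remaining = {s: a for s, reqs in demands.items()
                     for p, a in reqs if p == band}
        # Max-min fair split of what's left of the capacity.
        while remaining and capacity > 1e-9:
            share = capacity / len(remaining)
            done = {}
            for s, want in remaining.items():
                give = min(want, share)
                alloc[s] += give
                capacity -= give
                if give >= want - 1e-9:
                    done[s] = True   # fully satisfied at this band
            remaining = {s: w - min(w, share)
                         for s, w in remaining.items() if s not in done}
    return alloc
```

With 10 units of capacity and equal-priority demands of 10 and 4, the smaller demand is fully satisfied and the larger one gets the remainder, which is the max-min behavior the bandwidth functions generalize.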
Maglev: A Fast and Reliable Software Network Load Balancer
Carlo Contavalli
Cody Smith
Roman Kononov
Eric Mann-Hielscher
Ardas Cilingiroglu
Bin Cheyney
Wentao Shang
Jinnah Dylan Hosein
13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), USENIX Association, Santa Clara, CA (2016), pp. 523-535
Abstract: Maglev is Google’s network load balancer. It is a large distributed software system that runs on commodity Linux servers. Unlike traditional hardware network load balancers, it does not require a specialized physical rack deployment, and its capacity can be easily adjusted by adding or removing servers. Network routers distribute packets evenly to the Maglev machines via Equal Cost Multipath (ECMP); each Maglev machine then matches the packets to their corresponding services and spreads them evenly to the service endpoints. To accommodate high and ever-increasing traffic, Maglev is specifically optimized for packet processing performance. A single Maglev machine is able to saturate a 10Gbps link with small packets. Maglev is also equipped with consistent hashing and connection tracking features, to minimize the negative impact of unexpected faults and failures on connection-oriented protocols. Maglev has been serving Google's traffic since 2008. It has sustained the rapid global growth of Google services, and it also provides network load balancing for Google Cloud Platform.
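The consistent hashing the abstract mentions is described in the paper as a permutation-based lookup-table population: each backend walks its own permutation of table slots and claims the next empty one in turn, which yields near-perfect balance and minimal disruption when backends change. A compact sketch (the table size, hash seeds, and backend names are chosen for illustration; the real system uses a much larger prime table size):

```python
import hashlib

def _h(key, seed):
    # Stand-in hash; the paper leaves the hash function choice open.
    return int(hashlib.md5(f"{seed}:{key}".encode()).hexdigest(), 16)

def maglev_table(backends, m=13):
    """Build a Maglev-style lookup table of prime size m, following
    the population algorithm from the NSDI '16 paper."""
    n = len(backends)
    # Each backend's preference order over slots: offset + j * skip mod m.
    perms = []
    for b in backends:
        offset = _h(b, "offset") % m
        skip = _h(b, "skip") % (m - 1) + 1
        perms.append([(offset + j * skip) % m for j in range(m)])
    table, nexts, filled = [None] * m, [0] * n, 0
    while filled < m:
        for i in range(n):
            # Advance backend i to its next preferred empty slot.
            while table[perms[i][nexts[i]]] is not None:
                nexts[i] += 1
            table[perms[i][nexts[i]]] = backends[i]
            nexts[i] += 1
            filled += 1
            if filled == m:
                break
    return table

table = maglev_table(["b0", "b1", "b2"])

def lookup(five_tuple):
    # A packet's connection 5-tuple hashes to a table slot.
    return table[_h(five_tuple, "pkt") % len(table)]
```

Because backends claim slots round-robin, a table of size 13 over 3 backends splits 5/4/4, and because m is prime, each skip value generates a full permutation of the slots.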
Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure
Ramesh Govindan
Ina Minei
Mahesh Kallahalla
Amin Vahdat
ACM SIGCOMM (2016)
Abstract: Maintaining the highest levels of availability for content providers is challenging in the face of scale, network evolution, and complexity. Little, however, is known about the network failures large content providers are susceptible to, and what mechanisms they employ to ensure high availability. From a detailed analysis of over 100 high-impact failure events within Google’s network, encompassing many data centers and two WANs, we quantify several dimensions of availability failures. We find that failures are evenly distributed across different network types and across data, control, and management planes, but that a large number of failures happen when a network management operation is in progress within the network. We discuss some of these failures in detail, and also describe our design principles for high availability motivated by these failures. These include using defense in depth, maintaining consistency across planes, failing open on large failures, carefully preventing and avoiding failures, and assessing root cause quickly. Our findings suggest that, as networks become more complicated, failures lurk everywhere, and, counter-intuitively, continuous incremental evolution of the network can, when applied together with our design principles, result in a more robust network.