Network infrastructure

We design and build the world's most innovative and efficient datacenter networks and end-host networking stacks, to enable compute and storage not available anywhere else.

About the team

Our team brings together experts in networking, distributed systems, kernel and systems programming, end-host stacks, and advanced algorithms to create the datacenter networks that power Google. Our networks are among the world’s largest and fastest, and we design them to be reliable, cheap, and easy to evolve. We often use new technologies unavailable outside Google.

We exemplify Google’s Hybrid Approach to Research: we deploy real-world systems at global scale. Many members of our team have extensive research experience, we publish papers in conferences such as SIGCOMM, NSDI, SOSP, and OSDI, and we work closely with interns and faculty from leading universities.

Every Google product relies on the technologies we develop. Our networks support complex, highly-available, planetary-scale distributed systems with billions of users. We constantly evolve our networks to meet the requirements of, and create opportunities for, new and better Google products, especially the rapidly-growing Google Cloud.

Our team works in many locations: Sunnyvale CA, New York City, Madison WI, Boulder CO, Reston VA, and Seattle WA.

Team focus summaries

Congestion control, network measurement, and traffic management

All networks are subject to congestion; we want to operate ours at high utilization levels (to reduce costs) while meeting strict performance objectives. We’re inventing new congestion avoidance protocols, and improving our global-scale, near-real-time, automated traffic engineering system. We’re building better ways to measure our networks, accurately and at scale, to drive our evaluation of congestion-control techniques, and as real-time input to automated traffic management.

Data-center network design

We continue to innovate in designs for scalable, fast, cheap, reliable, and evolvable data-center networks. When necessary, we design our own hardware, and innovate in network topology and routing protocols. We use automatic techniques to optimize network designs.

Network management

We’re building automated network management systems, enabling us to rapidly repair and improve our networks with little or no downtime. We’re using techniques such as formal modeling of network topologies and highly-available distributed systems, while working closely with Google’s network engineers and operators to implement automated workflows.

Programmable packet processing

We’re developing new mechanisms for low-latency, CPU-efficient communication. We want our network switches and endpoints to implement novel packet-processing functions without compromising on cost or performance. We’re exploring hardware and software techniques for fast, flexible, safe packet processing, including onload, offload, RDMA, P4, and more.

Software-Defined networking (SDN)

We employ SDN extensively. We were early users of, and contributors to, OpenFlow, and continue, with P4, to raise the level of abstraction for silicon-agnostic switching. We are developing SDN controller platforms that can handle Google’s needs for scale and reliability, and SDN applications for routing, traffic management, and other functions.

High velocity development and testing

To introduce network innovations into production as rapidly as possible, without compromising availability, we test our designs and implementations early, often, and extensively. We’re developing advanced software validation techniques, we embrace automation in all aspects of testing and qualification, and we build powerful infrastructure for testing, debugging, and root-causing, in both physical and emulated testbeds.

Featured publications

Orion: Google’s Software-Defined Networking Control Plane
Amr Sabaa
Henrik Muehe
Joon Suan Ong
KondapaNaidu Bollineni
Lorenzo Vicisano
Mike Conley
Min Zhu
Rich Alimi
Shawn Chen
Shidong Zhang
Waqar Mohsin
(2021)
Preview abstract We present Orion, a distributed Software-Defined Networking platform deployed globally in Google’s datacenter (Jupiter) as well as Wide Area (B4) networks. Orion was designed around a modular, micro-service architecture with a central publish-subscribe database to enable a distributed, yet tightly-coupled, software-defined network control system. Orion enables intent-based management and control, is highly scalable and amenable to global control hierarchies. Over the years, Orion has matured with continuously improving performance in convergence (up to 40x faster), throughput (handling up to 1.16 million network updates per second), system scalability (supporting 16x larger networks), and data plane availability (50x, 100x reduction in unavailable time in Jupiter and B4, respectively) while maintaining high development velocity with bi-weekly release cadence. Today, Orion robustly enables all of Google’s Software-Defined Networks defending against failure modes that are both generic to large scale production networks as well as unique to SDN systems. View details
1RMA: Re-Envisioning Remote Memory Access for Multi-Tenant Datacenters
Aditya Akella
Arjun Singhvi
Joel Scherpelz
Monica C Wong-Chan
Moray Mclaren
Prashant Chandra
Rob Cauble
Sean Clark
Simon Sabato
Thomas F. Wenisch
Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, Association for Computing Machinery, New York, NY, USA (2020), 708–721
Preview abstract Remote Direct Memory Access (RDMA) plays a key role in supporting performance-hungry datacenter applications. However, existing RDMA technologies are ill-suited to multi-tenant datacenters, where applications run at massive scales, tenants require isolation and security, and the workload mix changes over time. Our experiences seeking to operationalize RDMA at scale indicate that these ills are rooted in standard RDMA's basic design attributes: connection-orientedness and complex policies baked into hardware. We describe a new approach to remote memory access -- One-Shot RMA (1RMA) -- suited to the constraints imposed by our multi-tenant datacenter settings. The 1RMA NIC is connection-free and fixed-function; it treats each RMA operation independently, assisting software by offering fine-grained delay measurements and fast failure notifications. 1RMA software provides operation pacing, congestion control, failure recovery, and inter-operation ordering, when needed. The NIC, deployed in our production datacenters, supports encryption at line rate (100Gbps and 100M ops/sec) with minimal performance/availability disruption for encryption key rotation. View details
Preview abstract We report on experiences deploying Swift congestion control in Google datacenters. Swift relies on hardware timestamps in modern NICs, and is based on AIMD control with a specified end-to-end delay target. This simple design is an evolution of earlier protocols used at Google. It has emerged as a foundation for excellent performance, when network distances are well-known, that helps to meet operational challenges. Delay is easy to decompose into fabric and host components to separate concerns, and effortless to deploy and maintain as a signal from switches in changing datacenter environments. With Swift, we obtain low flow completion times for short RPCs, even at the 99th-percentile, while providing high throughput for long RPCs. At datacenter scale, Swift achieves 50$\mu$s tail latencies for short RPCs while sustaining a 100Gbps throughput per-server, a load close to 100\%. This is much better than protocols such as DCTCP that degrade latency and loss at high utilization. View details
Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
Mike Dalton
David Schultz
Ahsan Arefin
Alex Docauer
Anshuman Gupta
Brian Matthew Fahs
Dima Rubinstein
Enrique Cauich Zermeno
Erik Rubow
Jake Adriaens
Jesse L Alpert
Jing Ai
Jon Olson
Kevin P. DeCabooter
Nan Hua
Nathan Lewis
Nikhil Kasinadhuni
Riccardo Crepaldi
Srinivas Krishnan
Subbaiah Venkata
Yossi Richter
15th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2018
Preview abstract This paper presents our design and experience with Andromeda, Google Cloud Platform’s network virtualization stack. Our production deployment poses several challenging requirements, including performance isolation among customer virtual networks, scalability, rapid provisioning of large numbers of virtual hosts, bandwidth and latency largely indistinguishable from the underlying hardware, and high feature velocity combined with high availability. Andromeda is designed around a flexible hierarchy of flow processing paths. Flows are mapped to a programming path dynamically based on feature and performance requirements. We introduce the Hoverboard programming model, which uses gateways for the long tail of low bandwidth flows, and enables the control plane to program network connectivity for tens of thousands of VMs in seconds. The on-host dataplane is based around a high-performance OS bypass software packet processing path. CPU-intensive per packet operations with higher latency targets are executed on coprocessor threads. This architecture allows Andromeda to decouple feature growth from fast path performance, as many features can be implemented solely on the coprocessor path. We demonstrate that the Andromeda datapath achieves performance that is competitive with hardware while maintaining the flexibility and velocity of a software-based architecture. View details
Snap: a Microkernel Approach to Host Networking
Jacob Adriaens
Sean Bauer
Carlo Contavalli
Mike Dalton
William C. Evans
Nicholas Kidd
Roman Kononov
Carl Mauer
Emily Musick
Lena Olson
Mike Ryan
Erik Rubow
Kevin Springborn
Valas Valancius
In ACM SIGOPS 27th Symposium on Operating Systems Principles, ACM, New York, NY, USA (2019) (to appear)
Preview abstract This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google’s rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service. Snap has been running in production for over three years, supporting the extensible communication needs of several large and critical systems. Snap enables fast development and deployment of new networking features, leveraging the benefits of address space isolation and the productivity of userspace software development together with support for transparently upgrading networking services without migrating applications off of a machine. At the same time, Snap achieves compelling performance through a modular architecture that promotes principled synchronization with minimal state sharing, and supports real-time scheduling with dynamic scaling of CPU resources through a novel kernel/userspace CPU scheduler co-design. Our evaluation demonstrates over 3x Gbps/core improvement compared to a kernel networking stack for RPC workloads, software-based RDMA-like performance of up to 5M IOPS/core, and transparent upgrades that are largely imperceptible to user applications. Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams. View details
Preview abstract Network management is becoming increasingly automated, and automation depends on detailed, explicit representations of data about both the state of a network, and about an operator’s intent for its networks. In particular, we must explicitly represent the desired and actual topology of a network; almost all other network-management data either derives from its topology, constrains how to use a topology, or associates resources (e.g., addresses) with specific places in a topology. We describe MALT, a Multi-Abstraction-Layer Topology representation, which supports virtually all of our network management phases: design, deployment, configuration, operation, measurement, and analysis. MALT provides interoperability across software systems, and its support for abstraction allows us to explicitly tie low-level network elements to high-level design intent. MALT supports a declarative style that simplifies what-if analysis and testbed support. We also describe the software base that supports efficient use of MALT, as well as numerous, sometimes painful lessons we have learned about curating the taxonomy for a comprehensive, and evolving, representation for topology. View details
Nines are Not Enough: Meaningful Metrics for Clouds
Proc. 17th Workshop on Hot Topics in Operating Systems (HoTOS) (2019)
Preview abstract Cloud customers want reliable, understandable promises from cloud providers that their applications will run reliably and with adequate performance, but today, providers offer only limited guarantees, which creates uncertainty for customers. Providers also must define internal metrics to allow them to operate their systems without violating customer promises or expectations. We explore why these guarantees are hard to define. We show that this problem shares some similarities with the challenges of applying statistics to make decisions based on sampled data. We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems. Overall, we offer a partial framework for thinking about Service Level Objectives (SLOs), and discuss some unsolved challenges. View details
Minimal Rewiring: Efficient Live Expansion for Clos Data Center Networks
Shizhen Zhao
Joon Ong
Proc. 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2019), USENIX Association (to appear)
Preview abstract Clos topologies have been widely adopted for large-scale data center networks (DCNs), but it has been difficult to support incremental expansions of Clos DCNs. Some prior work has assumed that it is impossible to design DCN topologies that are both well-structured (non-random) and incrementally expandable at arbitrary granularities. We demonstrate that it is indeed possible to design such networks, and to expand them while they are carrying live traffic, without incurring packet loss. We use a layer of patch panels between blocks of switches in a Clos network, which makes physical rewiring feasible, and we describe how to use integer linear programming (ILP) to minimize the number of patch-panel connections that must be changed, which makes expansions faster and cheaper. We also describe a block-aggregation technique that makes our ILP approach scalable. We tested our "minimal-rewiring" solver on two kinds of fine-grained expansions using 2250 synthetic DCN topologies, and found that the solver can handle 99% of these cases while changing under 25% of the connections. Compared to prior approaches, this solver (on average) reduces the number of "stages" per expansion by about 3.1X -- a significant improvement to our operational costs, and to our exposure (during expansions) to capacity-reducing faults. View details
BBR: Congestion-Based Congestion Control
C. Stephen Gunn
Van Jacobson
Communications of the ACM, 60 (2017), pp. 58-66
Preview abstract By all accounts, today’s Internet is not moving data as well as it should. Most of the world’s cellular users experience delays of seconds to minutes; public Wi-Fi in airports and conference venues is often worse. Physics and climate researchers need to exchange petabytes of data with global collaborators but find their carefully engineered multi-Gbps infrastructure often delivers at only a few Mbps over intercontinental distances.6 These problems result from a design choice made when TCP congestion control was created in the 1980s—interpreting packet loss as “congestion.”13 This equivalence was true at the time but was because of technology limitations, not first principles. As NICs (network interface controllers) evolved from Mbps to Gbps and memory chips from KB to GB, the relationship between packet loss and congestion became more tenuous. Today TCP’s loss-based congestion control—even with the current best of breed, CUBIC11—is the primary cause of these problems. When bottleneck buffers are large, loss-based congestion control keeps them full, causing bufferbloat. When bottleneck buffers are small, loss-based congestion control misinterprets loss as a signal of congestion, leading to low throughput. Fixing these problems requires an alternative to loss-based congestion control. Finding this alternative requires an understanding of where and how network congestion originates. View details
Preview abstract We increasingly depend on the availability of online services, either directly as users, or indirectly, when cloud-provider services support directly-accessed services. The availability of these "visible services" depends in complex ways on the availability of a complex underlying set of invisible infrastructure services. In our experience, most software engineers lack useful frameworks to create and evaluate designs for individual services that support end-to-end availability in these infrastructures, especially given cost, performance, and other constraints on viable commercial services. Even given the extensive research literature on techniques for replicated state machines and other fault-tolerance mechanisms, we found little help in this literature for addressing infrastructure-wide availability. Past research has often focused on point solutions, rather than end-to-end ones. In particular, it seems quite difficult to define useful targets for infrastructure-level availability, and then to translate these to design requirements for individual services. We argue that, in many but not all ways, one can think about availability with the mindset that we have learned to use for security, and we discuss some general techniques that appear useful for implementing and operating high-availability infrastructures. We encourage a shift in emphasis for academic research into availability. View details

Join our team

Internships

We have a vigorous internship program, with a strong focus on PhD-level students who would like to understand how large-scale networks are designed, built, and operated. We also hire Bachelors and Masters interns. Most of our internship projects are focused on building software, especially distributed systems and kernels, and do not necessarily require a prior background in networking.

Please check again in September or October 2024 to find out about internships for 2025.

Open role(s)

  • Software Engineer, Systems and Infrastructure, PhD University Graduate : Learn more
    • PhD-level software engineers in Network Infrastructure apply their research training to the toughest problems of designing and building large-scale, high-performance, high-availability distributed systems to design, manage, measure, and control our datacenter, WAN, and peering-edge SDN networks (each of which has been the subject of at least one SIGCOMM paper). We're also creating innovative end-host stacks, to support CPU-efficient, low-latency, congestion-aware communication, with secure isolation between users. You'll work with other skillful, creative people, including people who wrote research papers you've read, and you'll keep connected with the academic research community.
    • Note that this job opening covers teams besides Network Infrastructure; we have several teams looking for a candidates with a mix of various "Systems" skills.