Jump to Content
Amin Vahdat

Amin Vahdat

Amin Vahdat is an Engineering Fellow and Vice President for the Machine Leaning, Systems, and Cloud AI team. The team is responsible for product and engineering across:
  • Compute (Google Compute Engine, Borg/Cluster Scheduling, Operating Systems and Kernel)
  • Platforms (TPUs, GPUs, Servers, Storage, and Networking)
  • Cloud AI and Core ML (Vertex AI, training, serving, compilers, frameworks)
  • Network Infrastructure (Datacenter, Campus, RPC, and End Host network software)
  • Vahdat is active in Computer Science research, with more than 54,000 citations to over 200 refereed publications across cloud infrastructure, software defined networking, data consistency, operating systems, storage systems, data center architecture, and optical networking.

    In the past, he was the SAIC Professor of Computer Science and Engineering at UC San Diego. Vahdat received his PhD from UC Berkeley in Computer Science. Vahdat is an ACM Fellow and a member of the National Academy of Engineering. He has been recognized with a number of awards, including the National Science Foundation CAREER award, the Alfred P. Sloan Fellowship, the Duke University David and Janet Vaughn Teaching Award, the UC Berkeley Distinguished EECS Alumni Award, and the SIGCOMM Lifetime Achievement Award.

    Authored Publications
    Google Publications
    Other Publications
    Sort By
    • Title
    • Title, desc
    • Year
    • Year, desc
      CAPA: An Architecture For Operating Cluster Networks With High Availability
      Bingzhe Liu
      Mukarram Tariq
      Omid Alipourfard
      Rich Alimi
      Deepak Arulkannan
      Virginia Beauregard
      Patrick Conner
      Brighten Godfrey
      Xander Lin
      Mayur Patel
      Joon Ong
      Amr Sabaa
      Alex Smirnov
      Manish Verma
      Prerepa Viswanadham
      Google, Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 (2023)
      Preview abstract Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation“ layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have substantially reduced both in frequency and severity, with a 82% reduction in cumulative duration of incidents normalized to fleet size over five years View details
      Change Management in Physical Network Lifecycle Automation
      Virginia Beauregard
      Kevin Grant
      Angus Griffith
      Jahangir Hasan
      Chen Huang
      Quan Leng
      Jiayao Li
      Alexander Lin
      Zhoutao Liu
      Ahmed Mansy
      Bill Martinusen
      Nikil Mehta
      Andrew Narver
      Anshul Nigham
      Melanie Obenberger
      Sean Smith
      Kurt Steinkraus
      Sheng Sun
      Edward Thiele
      Proc. 2023 USENIX Annual Technical Conference (USENIX ATC 23)
      Preview abstract Automated management of a physical network's lifecycle is critical for large networks. At Google, we manage network design, construction, evolution, and management via multiple automated systems. In our experience, one of the primary challenges is to reliably and efficiently manage change in this domain -- additions of new hardware and connectivity, planning and sequencing of topology mutations, introduction of new architectures, new software systems and fixes to old ones, etc. We especially have learned the importance of supporting multiple kinds of change in parallel without conflicts or mistakes (which cause outages) while also maintaining parallelism between different teams and between different processes. We now know that this requires automated support. This paper describes some of our network lifecycle goals, the automation we have developed to meet those goals, and the change-management challenges we encountered. We then discuss in detail our approaches to several specific kinds of change management: (1) managing conflicts between multiple operations on the same network; (2) managing conflicts between operations spanning the boundaries between networks; (3) managing representational changes in the models that drive our automated systems. These approaches combine both novel software systems and software-engineering practices. While this paper reports on our experience with large-scale datacenter network infrastructures, we are also applying the same tools and practices in several adjacent domains, such as the management of WAN systems, of machines, and of datacenter physical designs. Our approaches are likely to be useful at smaller scales, too. View details
      Improving Network Availability with Protective ReRoute
      Abdul Kabbani
      Van Jacobson
      Jim Winget
      Brad Morrey
      Uma Parthavi Moravapalle
      Steven Knight
      SIGCOMM 2023
      Preview abstract We present PRR (Protective ReRoute), a transport technique for shortening user-visible outages that complements routing repair. It can be added to any transport to provide benefits in multipath networks. PRR responds to flow connectivity failure signals, e.g., retransmission timeouts, by changing the FlowLabel on packets of the flow, which causes switches and hosts to choose a different network path that may avoid the outage. To enable it, we shifted our IPv6 network architecture to use the FlowLabel, so that hosts can change the paths of their flows without application involvement. PRR is deployed fleetwide at Google for TCP and Pony Express, where it has been protecting all production traffic for several years. It is also available to our Cloud customers. We find it highly effective for real outages. In a measurement study on our network backbones, adding PRR reduced the cumulative region-pair outage time for RPC traffic by 63--84%. This is the equivalent of adding 0.4--0.8 "nines'" of availability. View details
      Preview abstract We describe our experience with Fathom, a system for identifying the network performance bottlenecks of any service running in the Google fleet. Fathom passively samples RPCs, the principal unit of work for services. It segments the overall latency into host and network components with kernel and RPC stack instrumentation. It records these detailed latency metrics, along with detailed transport connection state, for every sampled RPC. This lets us determine if the completion is constrained by the client, network or server. To scale while enabling analysis, we also aggregate samples into distributions that retain multi-dimensional breakdowns. This provides us with a macroscopic view of individual services. Fathom runs globally in our datacenters for all production traffic, where it monitors billions of TCP connections 24x7. For five years Fathom has been our primary tool for troubleshooting service network issues and assessing network infrastructure changes. We present case studies to show how it has helped us improve our production services. View details
      Aquila: A unified, low-latency fabric for datacenter networks
      Hema Hariharan
      Eric Lance
      Moray Mclaren
      Stephen Wang
      Zhehua Wu
      Sunghwan Yoo
      Raghuraman Balasubramanian
      Prashant Chandra
      Michael Cutforth
      Peter James Cuy
      David Decotigny
      Rakesh Gautam
      Rick Roy
      Zuowei Shen
      Ming Tan
      Ye Tang
      Monica C Wong-Chan
      Joe Zbiciak
      Aquila: A unified, low-latency fabric for datacenter networks (2022)
      Preview abstract Datacenter workloads have evolved from the data intensive, loosely-coupled workloads of the past decade to more tightly coupled ones, wherein ultra-low latency communication is essential for resource disaggregation over the network and to enable emerging programming models. We introduce Aquila, an experimental datacenter network fabric built with ultra-low latency support as a first-class design goal, while also supporting traditional datacenter traffic. Aquila uses a new Layer 2 cell-based protocol, GNet, an integrated switch, and a custom ASIC with low-latency Remote Memory Access (RMA) capabilities co-designed with GNet. We demonstrate that Aquila is able to achieve under 40 μs tail fabric Round Trip Time (RTT) for IP traffic and sub-10 μs RMA execution time across hundreds of host machines, even in the presence of background throughput-oriented IP traffic. This translates to more than 5x reduction in tail latency for a production quality key-value store running on a prototype Aquila network. View details
      Carbink: Fault-tolerant Far Memory
      Yang Zhou
      Sihang Liu
      Jiaqi Gao
      James Mickens
      Minlan Yu
      Hank Levy
      Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, Usenix (2022)
      Preview abstract Memory-intensive applications would benefit from using available memory from other machines (ie, remote memory or far memory). However, there is a missing piece in recent far memory proposals -- cost-efficient fault tolerance for far memory. In this paper, we motivate the strong need for fault tolerance for far memory using machine/task failure statistics from a major internet service provider. Then we describe the design and implementation off a Fault-Tolerant application-integrated Far Memory (i.e., FTFM) framework. We compare several candidate fault tolerance schemes, and discuss their pros and cons. Finally, we test FTFM using several X-internal applications, including graph processing, globally-distributed database, and in-memory database. Our results show that FTFM has little impact on application performance (~x.x%), while achieving xx% performance of running applications purely in local memory. View details
      Understanding Host Interconnect Congestion
      Khaled Elmeleegy
      Masoud Moshref
      Rachit Agarwal
      Saksham Agarwal
      Sylvia Ratnasamy
      Association for Computing Machinery, New York, NY, USA (2022), 198–204
      Preview abstract We present evidence and characterization of host congestion in production clusters: adoption of high-bandwidth access links leading to emergence of bottlenecks within the host interconnect (NIC-to-CPU data path). We demonstrate that contention on existing IO memory management units and/or the memory subsystem can significantly reduce the available NIC-to-CPU bandwidth, resulting in hundreds of microseconds of queueing delays and eventual packet drops at hosts (even when running a state-of-the-art congestion control protocol that accounts for CPU-induced host congestion). We also discuss implications of host interconnect congestion to design of future host architecture, network stacks and network protocols. View details
      Preview abstract Traffic load balancing across multiple paths is a critical task for modern networks to reduce network congestion and improve network efficiency. Hashing which is the foundation of traffic load balancing still faces practical challenges. The key problem is there is a growing need for more hash functions because networks are getting larger with more switches, more stages and increased path diversity. Meanwhile topology and routing becomes more agile in order to efficiently serve traffic demands with stricter throughput and latency SLAs. On the other hand, current generation switch chips only provide a limited number of uncorrelated hash functions. We first demonstrate why the limited number of hashing functions is a practical challenge in today's datacenter network (DCN) and wide-area network (WAN) designs. Then, to mitigate the problem, we propose a novel approach named \textsl{color recombining} which enables hash functions reuse via leveraging topology traits of multi-stage DCN networks. We also describe a novel framework based on \textsl{\coprime} theory to mitigate hash correlation in generic mesh topologies (i.e., spineless DCN and WAN). Our evaluation on real network trace data and topologies demonstrate that we can reduce the extent of load imbalance (measured by coefficient of variation) by an order of magnitude. View details
      Preview abstract A modern datacenter hosts thousands of services with a mix of latency-sensitive, throughput-intensive, and best-effort traffic with high degrees of fan-out and fan-in patterns. Maintaining low tail latency under high overload conditions is difficult, especially for latency-sensitive (LS) RPCs. In this paper, we consider the challenging case of providing service-level objectives (SLO) to LS RPCs when there are unpredictable surges in demand. We present Aequitas, a distributed sender-driven admission control scheme that is anchored on the key conceptual insight: Weighted-Fair Quality of Service (QoS) queues, found in standard NICs and switches, can be used to guarantee RPC level latency SLOs by a judicious selection of QoS weights and traffic-mix across QoS queues. Aequitas installs cluster-wide RPC latency SLOs by mapping LS RPCs to higher weight QoS queues, and coping with overloads by adaptively apportioning LS RPCs amongst QoS queues based on measured completion times for each queue. When the network demand spikes unexpectedly to 25× of provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than the state-of-art congestion control at the 99.9th-p and admits 15× more RPCs meeting SLO target compared to pFabric when RPC sizes are not aligned with priorities. View details
      Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking
      Joon Ong
      Arjun Singh
      Mukarram Tariq
      Rui Wang
      Jianan Zhang
      Virginia Beauregard
      Patrick Conner
      Rishi Kapoor
      Stephen Kratzer
      Nanfang Li
      Hong Liu
      Karthik Nagaraj
      Jason Ornstein
      Samir Sawhney
      Ryohei Urata
      Lorenzo Vicisano
      Kevin Yasumura
      Shidong Zhang
      Junlan Zhou
      Proceedings of ACM SIGCOMM 2022
      Preview abstract We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, 30% reduction in capex, 41% reduction in power, incremental deployment and technology refresh all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes for this include: A datacenter interconnection layer employing Micro-ElectroMechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves similar throughput as Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes direct path from source to destination aggregation blocks, while the remaining transits one additional block, achieving an average blocklevel path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch panel based interconnect. View details
      Preview abstract We (Google's networking teams) would like to increase our collaborations with academic researchers related to data-driven networking research. There are some significant constraints on our ability to directly share data, and in case not everyone in the community understands these, this document provides a brief summary. There are some models which can work (primarily, interns and visiting scientists). We describe some specific areas where we would welcome proposals to work within those models View details
      CliqueMap: Productionizing an RMA-Based Distributed Caching System
      Aditya Akella
      Amanda Strominger
      Arjun Singhvi
      Maggie Anderson
      Rob Cauble
      Thomas F. Wenisch
      SIGCOMM 2021 (2021) (to appear)
      Preview abstract Distributed caching is a key component in the design of performant, scalable Internet services, but accessing such caches via RPC incurs high cost. Remote Memory Access (RMA) offers a promising, less costly alternative, but achieving a rich production feature set with RMA-based systems is a significant challenge, as the rich abstraction of RPC lends itself to solutions for interoperability and upgradeability requirements of real systems. This work describes CliqueMap, a fully productionized RMA/RPC hybrid serving and caching system, and the production experience derived from three years of operation in Google’s datacenters. Building on internal technologies, CliqueMap serves multiple internal product areas and underlies several end-user-visible services. View details
      Orion: Google’s Software-Defined Networking Control Plane
      Amr Sabaa
      Henrik Muehe
      Joon Suan Ong
      KondapaNaidu Bollineni
      Lorenzo Vicisano
      Mike Conley
      Min Zhu
      Rich Alimi
      Shawn Chen
      Shidong Zhang
      Waqar Mohsin
      (2021)
      Preview abstract We present Orion, a distributed Software-Defined Networking platform deployed globally in Google’s datacenter (Jupiter) as well as Wide Area (B4) networks. Orion was designed around a modular, micro-service architecture with a central publish-subscribe database to enable a distributed, yet tightly-coupled, software-defined network control system. Orion enables intent-based management and control, is highly scalable and amenable to global control hierarchies. Over the years, Orion has matured with continuously improving performance in convergence (up to 40x faster), throughput (handling up to 1.16 million network updates per second), system scalability (supporting 16x larger networks), and data plane availability (50x, 100x reduction in unavailable time in Jupiter and B4, respectively) while maintaining high development velocity with bi-weekly release cadence. Today, Orion robustly enables all of Google’s Software-Defined Networks defending against failure modes that are both generic to large scale production networks as well as unique to SDN systems. View details
      Preview abstract We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent'': the only symptom is an erroneous computation. We refer to a core that develops such behavior as "mercurial.'' Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem -- one that will require collaboration between hardware designers, processor vendors, and systems software architects. This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolating mechanisms, to methods for tolerating the silent data corruption they cause. Please watch our short video summarizing the paper. View details
      Preview abstract To reduce cost, datacenter network operators are exploring blocking network designs. An example of such a design is a "spine-free" form of a Fat-Tree, in which pods directly connect to each other, rather than via spine blocks. To maintain application-perceived performance in the face of dynamic workloads, these new designs must be able to reconfigure routing and the inter-pod topology. Gemini is a system designed to achieve these goals on commodity hardware while reconfiguring the network infrequently, rendering these blocking designs practical enough for deployment in the near future. The key to Gemini is the joint optimization of topology and routing, using as input a robust estimation of future traffic derived from multiple historical traffic matrices. Gemini “hedges” against unpredicted bursts, by spreading these bursts across multiple paths, to minimize packet loss in exchange for a small increase in path lengths. It incorporates a robust decision algorithm to determine when to reconfigure, and whether to use hedging. Data from tens of production fabrics allows us to categorize these as either low- or high-volatility; these categories seem stable. For the former, Gemini finds topologies and routing with near-optimal performance and cost. For the latter, Gemini’s use of multi-traffic-matrix optimization and hedging avoids the need for frequent topology reconfiguration, with only marginal increases in path length. As a result, Gemini can support existing workloads on these production fabrics using a spine-free topology that is half the cost of the existing topology on these fabrics. View details
      1RMA: Re-Envisioning Remote Memory Access for Multi-Tenant Datacenters
      Aditya Akella
      Arjun Singhvi
      Joel Scherpelz
      Monica C Wong-Chan
      Moray Mclaren
      Prashant Chandra
      Rob Cauble
      Sean Clark
      Simon Sabato
      Thomas F. Wenisch
      Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, Association for Computing Machinery, New York, NY, USA (2020), 708–721
      Preview abstract Remote Direct Memory Access (RDMA) plays a key role in supporting performance-hungry datacenter applications. However, existing RDMA technologies are ill-suited to multi-tenant datacenters, where applications run at massive scales, tenants require isolation and security, and the workload mix changes over time. Our experiences seeking to operationalize RDMA at scale indicate that these ills are rooted in standard RDMA's basic design attributes: connection-orientedness and complex policies baked into hardware. We describe a new approach to remote memory access -- One-Shot RMA (1RMA) -- suited to the constraints imposed by our multi-tenant datacenter settings. The 1RMA NIC is connection-free and fixed-function; it treats each RMA operation independently, assisting software by offering fine-grained delay measurements and fast failure notifications. 1RMA software provides operation pacing, congestion control, failure recovery, and inter-operation ordering, when needed. The NIC, deployed in our production datacenters, supports encryption at line rate (100Gbps and 100M ops/sec) with minimal performance/availability disruption for encryption key rotation. View details
      Sundial: Fault-tolerant Clock Synchronization for Datacenters
      Hema Hariharan
      Dave Platt
      Simon Sabato
      Minlan Yu
      Prashant Chandra
      14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1171-1186
      Preview abstract Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications have increasing performance requirements and datacenter networks get into ultra-low latency, we need submicrosecond-level bound on time-uncertainty to reduce transaction delay and enable new network management applications (e.g., measuring one-way delay for congestion control). The state-of-the-art clock synchronization solutions focus on improving clock precision but may incur significant time-uncertainty bound due to the presence of failures. This significantly affects applications because in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic to failures. Through experiments in a >500-machine testbed and large-scale simulations, we show that Sundial can achieve ~100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than the state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as Spanner and Swift congestion control. View details
      Preview abstract We report on experiences deploying Swift congestion control in Google datacenters. Swift relies on hardware timestamps in modern NICs, and is based on AIMD control with a specified end-to-end delay target. This simple design is an evolution of earlier protocols used at Google. It has emerged as a foundation for excellent performance, when network distances are well-known, that helps to meet operational challenges. Delay is easy to decompose into fabric and host components to separate concerns, and effortless to deploy and maintain as a signal from switches in changing datacenter environments. With Swift, we obtain low flow completion times for short RPCs, even at the 99th-percentile, while providing high throughput for long RPCs. At datacenter scale, Swift achieves 50$\mu$s tail latencies for short RPCs while sustaining a 100Gbps throughput per-server, a load close to 100\%. This is much better than protocols such as DCTCP that degrade latency and loss at high utilization. View details
      Snap: a Microkernel Approach to Host Networking
      Jacob Adriaens
      Sean Bauer
      Carlo Contavalli
      Mike Dalton
      William C. Evans
      Nicholas Kidd
      Roman Kononov
      Carl Mauer
      Emily Musick
      Lena Olson
      Mike Ryan
      Erik Rubow
      Kevin Springborn
      Valas Valancius
      In ACM SIGOPS 27th Symposium on Operating Systems Principles, ACM, New York, NY, USA (2019) (to appear)
      Preview abstract This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google’s rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service. Snap has been running in production for over three years, supporting the extensible communication needs of several large and critical systems. Snap enables fast development and deployment of new networking features, leveraging the benefits of address space isolation and the productivity of userspace software development together with support for transparently upgrading networking services without migrating applications off of a machine. At the same time, Snap achieves compelling performance through a modular architecture that promotes principled synchronization with minimal state sharing, and supports real-time scheduling with dynamic scaling of CPU resources through a novel kernel/userspace CPU scheduler co-design. Our evaluation demonstrates over 3x Gbps/core improvement compared to a kernel networking stack for RPC workloads, software-based RDMA-like performance of up to 5M IOPS/core, and transparent upgrades that are largely imperceptible to user applications. Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams. View details
      Preview abstract Network virtualization stacks such as Andromeda and Virtual Filtering Platform are the linchpins of public clouds hosting Virtual Machines (VMs). The dataplane is based on a combination of high performance OS bypass software and hardware packet processing paths. A key goal is to provide network performance isolation such that workloads of one VM do not adversely impact the network experience of another VM. In this work, we characterize how isolation breakages occur in virtualization stacks and motivate predictable VM performance just as if they were operating on dedicated hardware. We formulate an abstraction of a Predictable Virtualized NIC for bandwidth, latency and packet loss. We propose three constructs to achieve predictability: egress traffic shaping, and a combination of congestion control and CPU-fair weighted fair queueing for ingress isolation. Using these constructs in coherence, we provide the illusion of a dedicated NIC to VMs, all while maintaining the raw performance of the fastpath dataplane. View details
      Minimal Rewiring: Efficient Live Expansion for Clos Data Center Networks
      Shizhen Zhao
      Joon Ong
      Proc. 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2019), USENIX Association (to appear)
      Preview abstract Clos topologies have been widely adopted for large-scale data center networks (DCNs), but it has been difficult to support incremental expansions of Clos DCNs. Some prior work has assumed that it is impossible to design DCN topologies that are both well-structured (non-random) and incrementally expandable at arbitrary granularities. We demonstrate that it is indeed possible to design such networks, and to expand them while they are carrying live traffic, without incurring packet loss. We use a layer of patch panels between blocks of switches in a Clos network, which makes physical rewiring feasible, and we describe how to use integer linear programming (ILP) to minimize the number of patch-panel connections that must be changed, which makes expansions faster and cheaper. We also describe a block-aggregation technique that makes our ILP approach scalable. We tested our "minimal-rewiring" solver on two kinds of fine-grained expansions using 2250 synthetic DCN topologies, and found that the solver can handle 99% of these cases while changing under 25% of the connections. Compared to prior approaches, this solver (on average) reduces the number of "stages" per expansion by about 3.1X -- a significant improvement to our operational costs, and to our exposure (during expansions) to capacity-reducing faults. View details
      Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
      Mike Dalton
      David Schultz
      Ahsan Arefin
      Alex Docauer
      Anshuman Gupta
      Brian Matthew Fahs
      Dima Rubinstein
      Enrique Cauich Zermeno
      Erik Rubow
      Jake Adriaens
      Jesse L Alpert
      Jing Ai
      Jon Olson
      Kevin P. DeCabooter
      Nan Hua
      Nathan Lewis
      Nikhil Kasinadhuni
      Riccardo Crepaldi
      Srinivas Krishnan
      Subbaiah Venkata
      Yossi Richter
      15th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2018
      Preview abstract This paper presents our design and experience with Andromeda, Google Cloud Platform’s network virtualization stack. Our production deployment poses several challenging requirements, including performance isolation among customer virtual networks, scalability, rapid provisioning of large numbers of virtual hosts, bandwidth and latency largely indistinguishable from the underlying hardware, and high feature velocity combined with high availability. Andromeda is designed around a flexible hierarchy of flow processing paths. Flows are mapped to a programming path dynamically based on feature and performance requirements. We introduce the Hoverboard programming model, which uses gateways for the long tail of low bandwidth flows, and enables the control plane to program network connectivity for tens of thousands of VMs in seconds. The on-host dataplane is based around a high-performance OS bypass software packet processing path. CPU-intensive per packet operations with higher latency targets are executed on coprocessor threads. This architecture allows Andromeda to decouple feature growth from fast path performance, as many features can be implemented solely on the coprocessor path. We demonstrate that the Andromeda datapath achieves performance that is competitive with hardware while maintaining the flexibility and velocity of a software-based architecture. View details
      B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN
      Min Zhu
      Rich Alimi
      Kondapa Naidu Bollineni
      Chandan Bhagat
      Sourabh Jain
      Jay Kaimal
      Jeffrey Liang
      Kirill Mendelev
      Faro Thomas Rabe
      Saikat Ray
      Malveeka Tewari
      Monika Zahn
      Joon Ong
      SIGCOMM'18 (2018)
      Preview abstract Private WANs are increasingly important to the operation of enterprises, telecoms, and cloud providers. For example, B4, Google’s private software-defined WAN, is larger and growing faster than our connectivity to the public Internet. In this paper, we present the five-year evolution of B4. We describe the techniques we employed to incrementally move from offering best-effort content-copy services to carrier-grade availability, while concurrently scaling B4 to accommodate 100x more traffic. Our key challenge is balancing the tension introduced by hierarchy required for scalability, the partitioning required for availability, and the capacity asymmetry inherent to the construction and operation of any large-scale network. We discuss our approach to managing this tension: i) we design a custom hierarchical network topology for both horizontal and vertical software scaling, ii) we manage inherent capacity asymmetry in hierarchical topologies using a novel traffic engineering algorithm without packet encapsulation, and iii) we re-architect switch forwarding rules via two-stage matching/hashing to deal with asymmetric network failures at scale. View details
      Preview abstract In this presentation, we will discuss Google’s intra-datacenter networks and interconnect. We will first review the evolution of datacenter interconnects and networking over the past decade, then outline future technology directions which will be needed to keep pace with the requirements and growth of the datacenter. View details
      Carousel: Scalable Traffic Shaping at End-Hosts
      Ahmed Saeed
      Valas Valancius
      Terry Lam
      Carlo Contavalli
      ACM SIGCOMM 2017
      Preview abstract Traffic shaping, including pacing and rate limiting, is fundamental to the correct and efficient operation of both datacenter and wide area networks. Sample use cases include policy-based bandwidth allocation to flow aggregates, rate-based congestion control algorithms, and packet pacing to avoid bursty transmissions that can overwhelm router buffers. Driven by the need to scale to millions of flows and to apply complex policies, traffic shaping is moving from network switches into the end hosts, typically implemented in software in the kernel networking stack. In this paper, we show that the performance overhead of end-host traffic shaping is substantial limits overall system scalability as we move to thousands of individual traffic classes per server. Measurements from production servers show that shaping at hosts consumes considerable CPU and memory, unnecessarily drops packets, suffers from head of line blocking and inaccuracy, and does not provide backpressure up the stack. We present Carousel, a framework that scales to tens of thousands of policies and flows per server, built from the synthesis of three key ideas: i) a single queue shaper using time as the basis for releasing packets, ii) fine-grained, just-in-time freeing of resources in higher layers coupled to actual packet departures, and iii) one shaper per CPU core, with lock-free coordination. Our production experience in serving video traffic at a Cloud service provider shows that Carousel shapes traffic accurately while improving overall machine CPU utilization by 8% (an improvement of 20% in the CPU utilization attributed to networking) relative to state-of-art deployments. It also conforms 10 times more accurately to target rates, and consumes two orders of magnitude less memory than existing approaches. View details
      Preview abstract In this presentation, we will review the evolution of Google’s intra-datacenter interconnects and networking over the past decade, then outline future technology directions which, along with a more holistic design approach, will be needed to keep pace with the requirements and growth of the datacenter. View details
      Taking the Edge off with Espresso: Scale, Reliability and Programmability for Global Internet Peering
      Matthew Holliman
      Gary Baldus
      Marcus Hines
      TaeEun Kim
      Ashok Narayanan
      Victor Lin
      Colin Rice
      Brian Rogan
      Bert Tanaka
      Manish Verma
      Puneet Sood
      Mukarram Tariq
      Dzevad Trumic
      Vytautas Valancius
      Calvin Ying
      Mahesh Kallahalla
      Sigcomm (2017)
      Preview abstract We present the design of Espresso, Google’s SDN-based Internet peering edge routing infrastructure. This architecture grew out of a need to exponentially scale the Internet edge cost-effectively and to enable application-aware routing at Internet-peering scale. Espresso utilizes commodity switches and host-based routing/packet processing to implement a novel fine-grained traffic engineering capability. Overall, Espresso provides Google a scalable peering edge that is programmable, reliable, and integrated with global traffic systems. Espresso also greatly accelerated deployment of new networking features at our peering edge. Espresso has been in production for two years and serves over 22% of Google’s total traffic to the Internet. View details
      Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network
      Joon Ong
      Amit Agarwal
      Glen Anderson
      Ashby Armistead
      Roy Bannon
      Seb Boving
      Gaurav Desai
      Bob Felderman
      Paulie Germano
      Anand Kanagala
      Jeff Provost
      Jason Simmons
      Eiichi Tanda
      Jim Wanderer
      Stephen Stuart
      Communications of the ACM, vol. Vol. 59, No. 9 (2016), pp. 88-97
      Preview abstract We present our approach for overcoming the cost, operational complexity, and limited scale endemic to datacenter networks a decade ago. Three themes unify the five generations of datacenter networks detailed in this paper. First, multi-stage Clos topologies built from commodity switch silicon can support cost-effective deployment of building-scale networks. Second, much of the general, but complex, decentralized network routing and management protocols supporting arbitrary deployment scenarios were overkill for single-operator, pre-planned datacenter networks. We built a centralized control mechanism based on a global configuration pushed to all datacenter switches. Third, modular hardware design coupled with simple, robust software allowed our design to also support inter-cluster and wide-area networks. Our datacenter networks run at dozens of sites across the planet, scaling in capacity by 100x over 10 years to more than 1 Pbps of bisection bandwidth. View details
      BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing
      Björn Carlin
      C. Stephen Gunn
      Enrique Cauich Zermeno
      Jing Ai
      Mathieu Robin
      Nikhil Kasinadhuni
      Sushant Jain
      ACM SIGCOMM 2015 (to appear)
      Preview abstract WAN bandwidth remains a constrained resource that is economically infeasible to substantially overprovision. Hence,it is important to allocate capacity according to service priority and based on the incremental value of additional allocation in particular bandwidth regions. For example, it may be highest priority for one service to receive 10Gb/s of bandwidth but upon reaching such an allocation, incremental priority may drop sharply favoring allocation to other services. Motivated by the observation that individual flows with fixed priority may not be the ideal basis for bandwidth allocation, we present the design and implementation of Bandwidth Enforcer (BwE), a global, hierarchical bandwidth allocation infrastructure. BwE supports: i) service-level bandwidth allocation following prioritized bandwidth functions where a service can represent an arbitrary collection of ows, ii) independent allocation and delegation policies according to user-defined hierarchy, all accounting for a global view of bandwidth and failure conditions, iii) multi-path forwarding common in traffic-engineered networks, and iv) a central administrative point to override (perhaps faulty) policy during exceptional conditions. BwE has delivered more service-efficient bandwidth utilization and simpler management in production for multiple years. View details
      WCMP: Weighted Cost Multipathing for Improved Fairness in Data Centers
      Malveeka Tewari
      Min Zhu
      Abdul Kabbani
      EuroSys '14: Proceedings of the Ninth European Conference on Computer Systems (2014), Article No. 5
      Preview abstract Data Center topologies employ multiple paths among servers to deliver scalable, cost-effective network capacity. The simplest and the most widely deployed approach for load balancing among these paths, Equal Cost Multipath (ECMP), hashes flows among the shortest paths toward a destination. ECMP leverages uniform hashing of balanced flow sizes to achieve fairness and good load balancing in data centers. However, we show that ECMP further assumes a balanced, regular, and fault-free topology, which are invalid assumptions in practice that can lead to substantial performance degradation and, worse, variation in flow bandwidths even for same size flows. We present a set of simple algorithms that achieve Weighted Cost Multipath (WCMP) to balance traffic in the data center based on the changing network topology. The state required for WCMP is already disseminated as part of standard routing protocols and it can be readily implemented in the current switch silicon without any hardware modifications. We show how to deploy WCMP in a production OpenFlow network environment and present experimental and simulation results to show that variation in flow bandwidths can be reduced by as much as 25X by employing WCMP relative to ECMP. View details
      No Results Found