Jump to Content
Nandita Dukkipati

Nandita Dukkipati

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Bolt is a congestion-control algorithm designed to providesingle-digit microsecond tail network-queuing at near-linerate utilization. Motivated by the need for ultra-low latencyto support applications such as NVMe, as line rates reach200G and beyond, most transfers fit within a single BDP en-tailing that transfer times predominantly become a functionof queuing and propagation delays. Bolt is an attempt topush congestion-control to its theoretical limits by harness-ing the power of programmable dataplanes such as Tofinoand Trident3+ chips. Bolt is founded on three key ideas, (i)Sub-RTT reaction (SRR): reacting to congestion faster thanRTT control-loop delay, (ii) Proactive Ramp-up (PRU): bytracking future flow-completions, and (iii) Supply matching(SM): leveraging Network Calculus concepts to maximizeutilization. Our current results achieve a 75% reduction inqueuing-delays over Swift with upto 3x improvement incompletion times for short transfers. View details
    Preview abstract The difficulty in gaining visibility into the fine-time scale hop-level congestion state of networks has been a key challenge faced by congestion control protocols for decades. How-ever, the emergence of commodity switches supporting in-network telemetry (INT) enables more advanced congestion control. In this paper, we presentPoseidon, a novel congestion control protocol that exploits INT to address blind spots of end-to-end algorithms and realize several fundamentally advantageous properties. Specifically, Poseidon realizes congestion control for the actual bottleneck hop. In the steady state,Poseidon realizes network-wide max-min fair bandwidth al-location. Furthermore, Poseidon decouples the bandwidth fairness requirement from the traditional AIMD control law, making it possible for Poseidon to converge fast and smooth out bandwidth oscillations. Equally important, Poseidon is de-signed to be amenable to incremental brownfield deployment in networks that mix INT and non-INT switches. Our testbed and simulation experiments show that compared to a widely-deployed state-of-the-art non-INT protocol, Swift, Poseidon improves op latency up to 10x in some percentiles (61% in average), lowers fabric RTT by more than 50%, reduces congestion window ramp up time by 40% while decreasing the throughput variation for flows with small windows by 94%.Finally, it is robust to reverse-path and multi-hop congestion. View details
    Preview abstract A modern datacenter hosts thousands of services with a mix of latency-sensitive, throughput-intensive, and best-effort traffic with high degrees of fan-out and fan-in patterns. Maintaining low tail latency under high overload conditions is difficult, especially for latency-sensitive (LS) RPCs. In this paper, we consider the challenging case of providing service-level objectives (SLO) to LS RPCs when there are unpredictable surges in demand. We present Aequitas, a distributed sender-driven admission control scheme that is anchored on the key conceptual insight: Weighted-Fair Quality of Service (QoS) queues, found in standard NICs and switches, can be used to guarantee RPC level latency SLOs by a judicious selection of QoS weights and traffic-mix across QoS queues. Aequitas installs cluster-wide RPC latency SLOs by mapping LS RPCs to higher weight QoS queues, and coping with overloads by adaptively apportioning LS RPCs amongst QoS queues based on measured completion times for each queue. When the network demand spikes unexpectedly to 25× of provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than the state-of-art congestion control at the 99.9th-p and admits 15× more RPCs meeting SLO target compared to pFabric when RPC sizes are not aligned with priorities. View details
    Preview abstract This document presents the RACK-TLP loss detection algorithm for TCP. RACK-TLP uses per-segment transmit timestamps and selective acknowledgments (SACKs) and has two parts. Recent Acknowledgment (RACK) starts fast recovery quickly using time-based inferences derived from acknowledgment (ACK) feedback, and Tail Loss Probe (TLP) leverages RACK and sends a probe packet to trigger ACK feedback to avoid retransmission timeout (RTO) events. Compared to the widely used duplicate acknowledgment (DupAck) threshold approach, RACK-TLP detects losses more efficiently when there are application-limited flights of data, lost retransmissions, or data packet reordering events. It is intended to be an alternative to the DupAck threshold approach. View details
    Sundial: Fault-tolerant Clock Synchronization for Datacenters
    Hema Hariharan
    Dave Platt
    Simon Sabato
    Minlan Yu
    Prashant Chandra
    14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1171-1186
    Preview abstract Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications have increasing performance requirements and datacenter networks get into ultra-low latency, we need submicrosecond-level bound on time-uncertainty to reduce transaction delay and enable new network management applications (e.g., measuring one-way delay for congestion control). The state-of-the-art clock synchronization solutions focus on improving clock precision but may incur significant time-uncertainty bound due to the presence of failures. This significantly affects applications because in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic to failures. Through experiments in a >500-machine testbed and large-scale simulations, we show that Sundial can achieve ~100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than the state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as Spanner and Swift congestion control. View details
    Preview abstract We report on experiences deploying Swift congestion control in Google datacenters. Swift relies on hardware timestamps in modern NICs, and is based on AIMD control with a specified end-to-end delay target. This simple design is an evolution of earlier protocols used at Google. It has emerged as a foundation for excellent performance, when network distances are well-known, that helps to meet operational challenges. Delay is easy to decompose into fabric and host components to separate concerns, and effortless to deploy and maintain as a signal from switches in changing datacenter environments. With Swift, we obtain low flow completion times for short RPCs, even at the 99th-percentile, while providing high throughput for long RPCs. At datacenter scale, Swift achieves 50$\mu$s tail latencies for short RPCs while sustaining a 100Gbps throughput per-server, a load close to 100\%. This is much better than protocols such as DCTCP that degrade latency and loss at high utilization. View details
    Snap: a Microkernel Approach to Host Networking
    Jacob Adriaens
    Sean Bauer
    Carlo Contavalli
    Mike Dalton
    William C. Evans
    Nicholas Kidd
    Roman Kononov
    Carl Mauer
    Emily Musick
    Lena Olson
    Mike Ryan
    Erik Rubow
    Kevin Springborn
    Valas Valancius
    In ACM SIGOPS 27th Symposium on Operating Systems Principles, ACM, New York, NY, USA (2019) (to appear)
    Preview abstract This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google’s rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service. Snap has been running in production for over three years, supporting the extensible communication needs of several large and critical systems. Snap enables fast development and deployment of new networking features, leveraging the benefits of address space isolation and the productivity of userspace software development together with support for transparently upgrading networking services without migrating applications off of a machine. At the same time, Snap achieves compelling performance through a modular architecture that promotes principled synchronization with minimal state sharing, and supports real-time scheduling with dynamic scaling of CPU resources through a novel kernel/userspace CPU scheduler co-design. Our evaluation demonstrates over 3x Gbps/core improvement compared to a kernel networking stack for RPC workloads, software-based RDMA-like performance of up to 5M IOPS/core, and transparent upgrades that are largely imperceptible to user applications. Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams. View details
    Preview abstract Network virtualization stacks such as Andromeda and Virtual Filtering Platform are the linchpins of public clouds hosting Virtual Machines (VMs). The dataplane is based on a combination of high performance OS bypass software and hardware packet processing paths. A key goal is to provide network performance isolation such that workloads of one VM do not adversely impact the network experience of another VM. In this work, we characterize how isolation breakages occur in virtualization stacks and motivate predictable VM performance just as if they were operating on dedicated hardware. We formulate an abstraction of a Predictable Virtualized NIC for bandwidth, latency and packet loss. We propose three constructs to achieve predictability: egress traffic shaping, and a combination of congestion control and CPU-fair weighted fair queueing for ingress isolation. Using these constructs in coherence, we provide the illusion of a dedicated NIC to VMs, all while maintaining the raw performance of the fastpath dataplane. View details
    Carousel: Scalable Traffic Shaping at End-Hosts
    Ahmed Saeed
    Valas Valancius
    Terry Lam
    Carlo Contavalli
    ACM SIGCOMM 2017
    Preview abstract Traffic shaping, including pacing and rate limiting, is fundamental to the correct and efficient operation of both datacenter and wide area networks. Sample use cases include policy-based bandwidth allocation to flow aggregates, rate-based congestion control algorithms, and packet pacing to avoid bursty transmissions that can overwhelm router buffers. Driven by the need to scale to millions of flows and to apply complex policies, traffic shaping is moving from network switches into the end hosts, typically implemented in software in the kernel networking stack. In this paper, we show that the performance overhead of end-host traffic shaping is substantial limits overall system scalability as we move to thousands of individual traffic classes per server. Measurements from production servers show that shaping at hosts consumes considerable CPU and memory, unnecessarily drops packets, suffers from head of line blocking and inaccuracy, and does not provide backpressure up the stack. We present Carousel, a framework that scales to tens of thousands of policies and flows per server, built from the synthesis of three key ideas: i) a single queue shaper using time as the basis for releasing packets, ii) fine-grained, just-in-time freeing of resources in higher layers coupled to actual packet departures, and iii) one shaper per CPU core, with lock-free coordination. Our production experience in serving video traffic at a Cloud service provider shows that Carousel shapes traffic accurately while improving overall machine CPU utilization by 8% (an improvement of 20% in the CPU utilization attributed to networking) relative to state-of-art deployments. It also conforms 10 times more accurately to target rates, and consumes two orders of magnitude less memory than existing approaches. View details
    TIMELY: RTT-based Congestion Control for the Datacenter
    Radhika Mittal
    Terry Lam
    Emily Blem
    Monia Ghobadi
    Amin Vahdat
    David Zats
    Sigcomm '15, Google Inc (2015)
    Preview abstract Datacenter transports aim to deliver low latency messaging together with high throughput. We show that simple packet delay, measured as round-trip times at hosts, is an effective congestion signal without the need for switch feedback. First, we show that advances in NIC hardware have made RTT measurement possible with microsecond accuracy, and that these RTTs are sufficient to estimate switch queueing. Then we describe how TIMELY can adjust transmission rates using RTT gradients to keep packet latency low while delivering high bandwidth. We implement our design in host software running over NICs with OS-bypass capabilities. We show using experiments with up to hundreds of machines on a Clos network topology that it provides excellent performance: turning on TIMELY for OS-bypass messaging over a fabric with PFC lowers 99 percentile tail latency by 9X while maintaining near line-rate throughput. Our system also outperforms DCTCP running in an optimized kernel, reducing tail latency by 13X. To the best of our knowledge, TIMELY is the first delay-based congestion control protocol for use in the datacenter, and it achieves its results despite having an order of magnitude fewer RTT signals (due to NIC offload) than earlier delay-based schemes such as Vegas. View details
    RFC6928 - Increasing TCP's Initial Window
    H.K. Jerry Chu
    Matt Mathis
    Internet Engineering Task Force (IETF) (2013)
    Preview abstract This document proposes an experiment to increase the permitted TCP initial window (IW) from between 2 and 4 segments, as specified in RFC 3390, to 10 segments with a fallback to the existing recommendation when performance issues are detected. It discusses the motivation behind the increase, the advantages and disadvantages of the higher initial window, and presents results from several large-scale experiments showing that the higher initial window improves the overall performance of many web services without resulting in a congestion collapse. The document closes with a discussion of usage and deployment for further experimental purposes recommended by the IETF TCP Maintenance and Minor Extensions (TCPM) working group. View details
    RFC 6937 - Proportional Rate Reduction for TCP
    Matt Mathis
    Internet Engineering Task Force (IETF) (2013)
    Preview abstract This document describes an experimental Proportional Rate Reduction (PRR) algorithm as an alternative to the widely deployed Fast Recovery and Rate-Halving algorithms. These algorithms determine the amount of data sent by TCP during loss recovery. PRR minimizes excess window adjustments, and the actual window size at the end of recovery will be as close as possible to the ssthresh, as determined by the congestion control algorithm. View details
    packetdrill: Scriptable Network Stack Testing, from Sockets to Packets
    Lawrence Brakmo
    Matt Mathis
    Barath Raghavan
    Hsiao-keng Jerry Chu
    Tom Herbert
    Proceedings of the USENIX Annual Technical Conference (USENIX ATC 2013), USENIX, 2560 Ninth Street, Suite 215, Berkeley, CA, 94710 USA, pp. 213-218
    Preview abstract Testing today’s increasingly complex network protocol implementations can be a painstaking process. To help meet this challenge, we developed packetdrill, a portable, open-source scripting tool that enables testing the correctness and performance of entire TCP/UDP/IP network stack implementations, from the system call layer to the hardware network interface, for both IPv4 and IPv6. We describe the design and implementation of the tool, and our experiences using it to execute 657 test cases. The tool was instrumental in our development of three new features for Linux TCP—Early Retransmit, Fast Open, and Loss Probes—and allowed us to find and fix 10 bugs in Linux. Our team uses packetdrill in all phases of the development process for the kernel used in one of the world’s largest Linux installations. View details
    Reducing Web Latency: the Virtue of Gentle Aggression
    Tobias Flach
    Barath Raghavan
    Shuai Hao
    Ethan Katz-Bassett
    Ramesh Govindan
    Proceedings of the ACM Conference of the Special Interest Group on Data Communication (SIGCOMM '13), ACM (2013)
    Preview abstract To serve users quickly, Web service providers build infrastructure closer to clients and use multi-stage transport connections. Although these changes reduce client-perceived round-trip times, TCP's current mechanisms fundamentally limit latency improvements. We performed a measurement study of a large Web service provider and found that, while connections with no loss complete close to the ideal latency of one round-trip time, TCP's timeout-driven recovery causes transfers with loss to take five times longer on average. In this paper, we present the design of novel loss recovery mechanisms for TCP that judiciously use redundant transmissions to minimize timeout-driven recovery. Proactive, Reactive, and Corrective are three qualitatively different, easily-deployable mechanisms that (1) proactively recover from losses, (2) recover from them as quickly as possible, and (3) reconstruct packets to mask loss. Crucially, the mechanisms are compatible both with middleboxes and with TCP's existing congestion control and loss recovery. Our large-scale experiments on Google's production network that serves billions of flows demonstrate a 23% decrease in the mean and 47% in 99th percentile latency over today's TCP. View details
    Proportional Rate Reduction for TCP
    Matt Mathis
    Monia Ghobadi
    Proceedings of the 11th ACM SIGCOMM Conference on Internet Measurement 2011, Berlin, Germany - November 2-4, 2011
    Preview abstract Packet losses increase latency for Web users. Fast recovery is a key mechanism for TCP to recover from packet losses. In this paper, we explore some of the weaknesses of the standard algorithm described in RFC 3517 and the non-standard algorithms implemented in Linux. We find that these algorithms deviate from their intended behavior in the real world due to the combined effect of short flows, application stalls, burst losses, acknowledgment (ACK) loss and reordering, and stretch ACKs. Linux suffers from excessive congestion window reductions while RFC 3517 transmits large bursts under high losses, both of which harm the rest of the flow and increase Web latency. Our primary contribution is a new design to control transmission in fast recovery called proportional rate reduction (PRR). PRR recovers from losses quickly, smoothly and accurately by pacing out retransmissions across received ACKs. In addition to PRR, we evaluate the TCP early retransmit (ER) algorithm which lowers the duplicate acknowledgment threshold for short transfers, and show that delaying early retransmissions for a short interval is effective in avoiding spurious retransmissions in the presence of a small degree of reordering. PRR and ER reduce the TCP latency of connections experiencing losses by 3-10% depending on the response size. Based on our instrumentation on Google Web and YouTube servers in U.S. and India, we also present key statistics on the nature of TCP retransmissions. View details
    An Argument for Increasing TCP's Initial Congestion Window
    Jerry Chu
    Tom Herbert
    Amit Agarwal
    Arvind Jain
    Natalia Sutin
    ACM SIGCOMM Computer Communications Review, vol. 40 (2010), pp. 27-33
    Preview abstract TCP flows start with an initial congestion window of at most four segments or approximately 4KB of data. Because most Web transactions are short-lived, the initial congestion window is a critical TCP parameter in determining how quickly flows can finish. While the global network access speeds increased dramatically on average in the past decade, the standard value of TCP’s initial congestion window has remained unchanged. In this paper, we propose to increase TCP’s initial congestion window to at least ten segments (about 15KB). Through large-scale Internet experiments, we quantify the latency benefits and costs of using a larger window, as functions of network bandwidth, round-trip time (RTT), bandwidthdelay product (BDP), and nature of applications. We show that the average latency of HTTP responses improved by approximately 10% with the largest benefits being demonstrated in high RTT and BDP networks. The latency of low bandwidth networks also improved by a significant amount in our experiments. The average retransmission rate increased by a modest 0.5%, with most of the increase coming from applications that effectively circumvent TCP’s slow start algorithm by using multiple concurrent connections. Based on the results from our experiments, we believe the initial congestion window should be at least ten segments and the same be investigated for standardization by the IETF. View details
    No Results Found