David Wetherall
David Wetherall is a Distinguished Engineer working on low-level networking issues in Google's datacenter networks, including congestion control, load balancing, reliability, offload, and performance for HPC and ML workloads. Before joining Google in 2014 he was a Professor at the University of Washington, and Director of Intel's Seattle Research Lab. He is a Fellow of the ACM and IEEE, and author of the textbook "Computer Networks".
Research Areas
Authored Publications
Sort By
Fathom: Understanding Datacenter Application Network Performance
Junhua Yan
Mubashir Adnan Qureshi
Van Jacobson
Yousuk Seung
Proceedings of ACM SIGCOMM 2023
Preview abstract
We describe our experience with Fathom, a system for identifying the network performance bottlenecks of any service running in the Google fleet. Fathom passively samples RPCs, the principal unit of work for services. It segments the overall latency into host and network components with kernel and RPC stack instrumentation. It records these detailed latency metrics, along with detailed transport connection state, for every sampled RPC. This lets us determine if the completion is constrained by the client, network or server. To scale while enabling analysis, we also aggregate samples into distributions that retain multi-dimensional breakdowns. This provides us with a macroscopic view of individual services. Fathom runs globally in our datacenters for all production traffic, where it monitors billions of TCP connections 24x7. For five years Fathom has been our primary tool for troubleshooting service network issues and assessing network infrastructure changes. We present case studies to show how it has helped us improve our production services.
View details
Improving Network Availability with Protective ReRoute
Abdul Kabbani
Van Jacobson
Jim Winget
Brad Morrey
Uma Parthavi Moravapalle
Steven Knight
SIGCOMM 2023
Preview abstract
We present PRR (Protective ReRoute), a transport technique for shortening user-visible outages that complements routing repair. It can be added to any transport to provide benefits in multipath networks. PRR responds to flow connectivity failure signals, e.g., retransmission timeouts, by changing the FlowLabel on packets of the flow, which causes switches and hosts to choose a different network path that may avoid the outage. To enable it, we shifted our IPv6 network architecture to use the FlowLabel, so that hosts can change the paths of their flows without application involvement. PRR is deployed fleetwide at Google for TCP and Pony Express, where it has been protecting all production traffic for several years. It is also available to our Cloud customers. We find it highly effective for real outages. In a measurement study on our network backbones, adding PRR reduced the cumulative region-pair outage time for RPC traffic by 63--84%. This is the equivalent of adding 0.4--0.8 "nines'" of availability.
View details
PLB: Congestion Signals are Simple and Effective for Network Load Balancing
Abdul Kabbani
Junhua Yan
Kira Yin
Masoud Moshref
Mubashir Adnan Qureshi
Qiaobin Fu
Van Jacobson
SIGCOMM (2022)
Preview abstract
We describe our experience with PLB, a host-based load balancing design for modern networks. PLB randomly changes the paths of connections that experience congestion, preferring idle periods to minimize transport interactions. It does so by changing the IPv6 FlowLabel on the packets of a connection, which switches include as part of the ECMP flow hash. Across many hosts, this action drives down the number of hotspots in the network, while separating short RPCs from elephant flows to keep their completion time low.
We use PLB fleetwide in Google networks for TCP and PonyExpress (RDMA-like) traffic. We find it to be simple, general, and effective. It was easy to deploy, co-existing with other traffic, requiring only small transport modifications and little of switches, and needing no application changes. And it has produced large gains across the board, for multiple transports and from datacenter through backbone networks. After deploying PLB, the median utilization imbalance of busy switches in Google datacenter networks fell by 60\% and packet drops correspondingly fell by 33\%. At hosts, the tail latency (99$^{th}$ percentile) of short RPCs fell by up to 25\%.
View details
Swift: Delay is Simple and Effective for Congestion Control in the Datacenter
Keon Jang
Kevin Springborn
Mike Ryan
SIGCOMM 2020 (2020)
Preview abstract
We report on experiences deploying Swift congestion control in Google datacenters. Swift relies on hardware timestamps in modern NICs, and is based on AIMD control with a specified end-to-end delay target. This simple design is an evolution of earlier protocols used at Google. It has emerged as a foundation for excellent performance, when network distances are well-known, that helps to meet operational challenges. Delay is easy to decompose into fabric and host components to separate concerns, and effortless to deploy and maintain as a signal from switches in changing datacenter environments. With Swift, we obtain low flow completion times for short RPCs, even at the 99th-percentile, while providing high throughput for long RPCs. At datacenter scale, Swift achieves 50$\mu$s tail latencies for short RPCs while sustaining a 100Gbps throughput per-server, a load close to 100\%. This is much better than protocols such as DCTCP that degrade latency and loss at high utilization.
View details
TIMELY: RTT-based Congestion Control for the Datacenter
Radhika Mittal
Terry Lam
Emily Blem
Monia Ghobadi
Amin Vahdat
David Zats
Sigcomm '15, Google Inc (2015)
Preview abstract
Datacenter transports aim to deliver low latency messaging together with high throughput. We show that simple packet delay, measured as round-trip times at hosts, is an effective congestion signal without the need for switch feedback. First, we show that advances in NIC hardware have made RTT measurement possible with microsecond accuracy, and that these RTTs are sufficient to estimate switch queueing. Then we describe how TIMELY can adjust transmission rates using RTT gradients to keep packet latency low while delivering high bandwidth. We implement our design in host software running over NICs with OS-bypass capabilities. We show using experiments with up to hundreds of machines on a Clos network topology that it provides excellent performance: turning on TIMELY for OS-bypass messaging over a fabric with PFC lowers 99 percentile tail latency by 9X while maintaining near line-rate throughput. Our system also outperforms DCTCP running in an optimized kernel, reducing tail latency by 13X. To the best of our knowledge, TIMELY is the first delay-based congestion control protocol for use in the datacenter, and it achieves its results despite having an order of magnitude fewer RTT signals (due to NIC offload) than earlier delay-based schemes such as Vegas.
View details