Xian Wu

Xian Wu worked on Google's Network Infrastructure team from July 2019 to July 2022.

Authored Publications

Aequitas: Admission Control for Performance-Critical RPCs in Datacenters (SIGCOMM 2022)

Abstract: A modern datacenter hosts thousands of services with a mix of latency-sensitive, throughput-intensive, and best-effort traffic, with high degrees of fan-out and fan-in. Maintaining low tail latency under heavy overload is difficult, especially for latency-sensitive (LS) RPCs. In this paper, we consider the challenging case of providing service-level objectives (SLOs) to LS RPCs when there are unpredictable surges in demand. We present Aequitas, a distributed, sender-driven admission control scheme anchored on a key conceptual insight: the weighted-fair Quality of Service (QoS) queues found in standard NICs and switches can be used to guarantee RPC-level latency SLOs through a judicious selection of QoS weights and of the traffic mix across QoS queues. Aequitas installs cluster-wide RPC latency SLOs by mapping LS RPCs to higher-weight QoS queues, and it copes with overload by adaptively apportioning LS RPCs among QoS queues based on measured completion times for each queue. When network demand spikes unexpectedly to 25× of provisioned capacity, Aequitas achieves a 99.9th-percentile latency SLO that is 3.8× lower than state-of-the-art congestion control, and it admits 15× more RPCs meeting the SLO target than pFabric when RPC sizes are not aligned with priorities.
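
The adaptive apportioning described in the abstract can be illustrated compactly. Below is a minimal Python sketch of a sender-side control loop in the spirit of that idea: latency-sensitive RPCs go to a high-weight QoS queue with a probability that shrinks when measured completion times exceed the SLO and recovers when they do not. The class name QosAdmission, the EWMA smoothing, the step size, and the queue labels are illustrative assumptions, not details taken from the paper.

import random

class QosAdmission:
    # Hypothetical sketch: NOT the Aequitas algorithm, only an illustration
    # of SLO-driven apportioning of LS RPCs across QoS queues.
    def __init__(self, slo_us: float, step: float = 0.01, alpha: float = 0.1):
        self.slo_us = slo_us              # target completion time (the SLO)
        self.step = step                  # how fast the admitted share adapts
        self.alpha = alpha                # EWMA weight for new latency samples
        self.admit_prob = 1.0             # share of LS RPCs sent at high QoS
        self.ewma_latency_us = 0.0        # smoothed completion-time estimate

    def on_completion(self, latency_us: float) -> None:
        # Fold a measured RPC completion time into the running estimate,
        # then nudge the admitted share toward meeting the SLO.
        self.ewma_latency_us = (self.alpha * latency_us
                                + (1 - self.alpha) * self.ewma_latency_us)
        if self.ewma_latency_us > self.slo_us:
            # Queue is overloaded: admit fewer LS RPCs at high QoS.
            self.admit_prob = max(0.0, self.admit_prob - self.step)
        else:
            # Meeting the SLO: cautiously admit more.
            self.admit_prob = min(1.0, self.admit_prob + self.step)

    def pick_queue(self) -> str:
        # Choose a QoS queue for the next latency-sensitive RPC.
        return "QoS_high" if random.random() < self.admit_prob else "QoS_low"

In a real deployment the decision is distributed across senders and spans multiple QoS levels and weights; this sketch keeps only the feedback loop between measured completion times and the admitted share.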

Swift: Delay is Simple and Effective for Congestion Control in the Datacenter (SIGCOMM 2020)

Abstract: We report on experiences deploying Swift congestion control in Google datacenters. Swift relies on hardware timestamps in modern NICs and is based on AIMD control with a specified end-to-end delay target. This simple design is an evolution of earlier protocols used at Google, and it has emerged as a foundation for excellent performance when network distances are well known, helping to meet operational challenges. Delay is easy to decompose into fabric and host components, which separates concerns, and it is effortless to deploy and maintain as a congestion signal in changing datacenter environments, with no support needed from switches. With Swift, we obtain low flow completion times for short RPCs, even at the 99th percentile, while providing high throughput for long RPCs. At datacenter scale, Swift achieves 50 µs tail latencies for short RPCs while sustaining 100 Gbps of throughput per server, a load close to 100%. This is much better than protocols such as DCTCP, which suffer degraded latency and loss at high utilization.
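
The delay-targeted AIMD loop named in the abstract can also be sketched briefly. The Python below is an illustrative approximation rather than Swift's actual algorithm: the class name DelayAimd, the 50 µs default target, and the increase/decrease constants are assumptions made for this sketch, and the fabric/host delay decomposition mentioned in the abstract is omitted.

class DelayAimd:
    # Hypothetical sketch of delay-based AIMD: grow the congestion window
    # additively while measured delay is under the target, and back off
    # multiplicatively when it is over.
    def __init__(self, target_delay_us: float = 50.0,
                 additive_increase: float = 1.0,
                 multiplicative_decrease: float = 0.8,
                 min_cwnd: float = 1.0):
        self.target_delay_us = target_delay_us
        self.ai = additive_increase          # packets added per RTT under target
        self.md = multiplicative_decrease    # window scale factor over target
        self.min_cwnd = min_cwnd
        self.cwnd = min_cwnd

    def on_ack(self, measured_delay_us: float) -> float:
        # Update the window from one hardware-timestamped delay sample.
        if measured_delay_us < self.target_delay_us:
            # Under target: additive increase, spread across the window so
            # the window grows by roughly `ai` packets per RTT.
            self.cwnd += self.ai / self.cwnd
        else:
            # Over target: multiplicative decrease, floored at min_cwnd.
            self.cwnd = max(self.min_cwnd, self.cwnd * self.md)
        return self.cwnd

# Example: feed delay samples as ACKs arrive.
cc = DelayAimd(target_delay_us=50.0)
for sample_us in (30.0, 42.0, 61.0, 48.0):
    cc.on_ack(sample_us)

The point of the sketch is the control structure: a single delay target drives both growth and backoff, which is what makes the signal easy to reason about compared with loss- or ECN-driven schemes.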