Yuliang Li
Research Areas
Authored Publications
Sort By
Poseidon: An Efficient Congestion Control using Deployable INT for Data Center Networks
Weitao Wang
Masoud Moshref
T. S. Eugene Ng
NSDI (2023)
Preview abstract
The difficulty in gaining visibility into the fine-time scale hop-level congestion state of networks has been a key challenge faced by congestion control protocols for decades. How-ever, the emergence of commodity switches supporting in-network telemetry (INT) enables more advanced congestion control. In this paper, we presentPoseidon, a novel congestion control protocol that exploits INT to address blind spots of end-to-end algorithms and realize several fundamentally advantageous properties. Specifically, Poseidon realizes congestion control for the actual bottleneck hop. In the steady state,Poseidon realizes network-wide max-min fair bandwidth al-location. Furthermore, Poseidon decouples the bandwidth fairness requirement from the traditional AIMD control law, making it possible for Poseidon to converge fast and smooth out bandwidth oscillations. Equally important, Poseidon is de-signed to be amenable to incremental brownfield deployment in networks that mix INT and non-INT switches. Our testbed and simulation experiments show that compared to a widely-deployed state-of-the-art non-INT protocol, Swift, Poseidon improves op latency up to 10x in some percentiles (61% in average), lowers fabric RTT by more than 50%, reduces congestion window ramp up time by 40% while decreasing the throughput variation for flows with small windows by 94%.Finally, it is robust to reverse-path and multi-hop congestion.
View details
Preview abstract
Bolt is a congestion-control algorithm designed to providesingle-digit microsecond tail network-queuing at near-linerate utilization. Motivated by the need for ultra-low latencyto support applications such as NVMe, as line rates reach200G and beyond, most transfers fit within a single BDP en-tailing that transfer times predominantly become a functionof queuing and propagation delays. Bolt is an attempt topush congestion-control to its theoretical limits by harness-ing the power of programmable dataplanes such as Tofinoand Trident3+ chips. Bolt is founded on three key ideas, (i)Sub-RTT reaction (SRR): reacting to congestion faster thanRTT control-loop delay, (ii) Proactive Ramp-up (PRU): bytracking future flow-completions, and (iii) Supply matching(SM): leveraging Network Calculus concepts to maximizeutilization. Our current results achieve a 75% reduction inqueuing-delays over Swift with upto 3x improvement incompletion times for short transfers.
View details
Sundial: Fault-tolerant Clock Synchronization for Datacenters
Hema Hariharan
Dave Platt
Simon Sabato
Minlan Yu
Prashant Chandra
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1171-1186
Preview abstract
Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications have increasing performance requirements and datacenter networks get into ultra-low latency, we need submicrosecond-level bound on time-uncertainty to reduce transaction delay and enable new network management applications (e.g., measuring one-way delay for congestion control). The state-of-the-art clock synchronization solutions focus on improving clock precision but may incur significant time-uncertainty bound due to the presence of failures. This significantly affects applications because in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic to failures. Through experiments in a >500-machine testbed and large-scale simulations, we show that Sundial can achieve ~100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than the state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as Spanner and Swift congestion control.
View details