Jump to Content
Hassan Wassel

Hassan Wassel

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Aquila: A unified, low-latency fabric for datacenter networks
    Hema Hariharan
    Eric Lance
    Moray Mclaren
    Stephen Wang
    Zhehua Wu
    Sunghwan Yoo
    Raghuraman Balasubramanian
    Prashant Chandra
    Michael Cutforth
    Peter James Cuy
    David Decotigny
    Rakesh Gautam
    Rick Roy
    Zuowei Shen
    Ming Tan
    Ye Tang
    Monica C Wong-Chan
    Joe Zbiciak
    Aquila: A unified, low-latency fabric for datacenter networks (2022)
    Preview abstract Datacenter workloads have evolved from the data intensive, loosely-coupled workloads of the past decade to more tightly coupled ones, wherein ultra-low latency communication is essential for resource disaggregation over the network and to enable emerging programming models. We introduce Aquila, an experimental datacenter network fabric built with ultra-low latency support as a first-class design goal, while also supporting traditional datacenter traffic. Aquila uses a new Layer 2 cell-based protocol, GNet, an integrated switch, and a custom ASIC with low-latency Remote Memory Access (RMA) capabilities co-designed with GNet. We demonstrate that Aquila is able to achieve under 40 μs tail fabric Round Trip Time (RTT) for IP traffic and sub-10 μs RMA execution time across hundreds of host machines, even in the presence of background throughput-oriented IP traffic. This translates to more than 5x reduction in tail latency for a production quality key-value store running on a prototype Aquila network. View details
    Carbink: Fault-tolerant Far Memory
    Yang Zhou
    Sihang Liu
    Jiaqi Gao
    James Mickens
    Minlan Yu
    Hank Levy
    Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, Usenix (2022)
    Preview abstract Memory-intensive applications would benefit from using available memory from other machines (ie, remote memory or far memory). However, there is a missing piece in recent far memory proposals -- cost-efficient fault tolerance for far memory. In this paper, we motivate the strong need for fault tolerance for far memory using machine/task failure statistics from a major internet service provider. Then we describe the design and implementation off a Fault-Tolerant application-integrated Far Memory (i.e., FTFM) framework. We compare several candidate fault tolerance schemes, and discuss their pros and cons. Finally, we test FTFM using several X-internal applications, including graph processing, globally-distributed database, and in-memory database. Our results show that FTFM has little impact on application performance (~x.x%), while achieving xx% performance of running applications purely in local memory. View details
    Preview abstract Traffic load balancing across multiple paths is a critical task for modern networks to reduce network congestion and improve network efficiency. Hashing which is the foundation of traffic load balancing still faces practical challenges. The key problem is there is a growing need for more hash functions because networks are getting larger with more switches, more stages and increased path diversity. Meanwhile topology and routing becomes more agile in order to efficiently serve traffic demands with stricter throughput and latency SLAs. On the other hand, current generation switch chips only provide a limited number of uncorrelated hash functions. We first demonstrate why the limited number of hashing functions is a practical challenge in today's datacenter network (DCN) and wide-area network (WAN) designs. Then, to mitigate the problem, we propose a novel approach named \textsl{color recombining} which enables hash functions reuse via leveraging topology traits of multi-stage DCN networks. We also describe a novel framework based on \textsl{\coprime} theory to mitigate hash correlation in generic mesh topologies (i.e., spineless DCN and WAN). Our evaluation on real network trace data and topologies demonstrate that we can reduce the extent of load imbalance (measured by coefficient of variation) by an order of magnitude. View details
    1RMA: Re-Envisioning Remote Memory Access for Multi-Tenant Datacenters
    Aditya Akella
    Arjun Singhvi
    Joel Scherpelz
    Monica C Wong-Chan
    Moray Mclaren
    Prashant Chandra
    Rob Cauble
    Sean Clark
    Simon Sabato
    Thomas F. Wenisch
    Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, Association for Computing Machinery, New York, NY, USA (2020), 708–721
    Preview abstract Remote Direct Memory Access (RDMA) plays a key role in supporting performance-hungry datacenter applications. However, existing RDMA technologies are ill-suited to multi-tenant datacenters, where applications run at massive scales, tenants require isolation and security, and the workload mix changes over time. Our experiences seeking to operationalize RDMA at scale indicate that these ills are rooted in standard RDMA's basic design attributes: connection-orientedness and complex policies baked into hardware. We describe a new approach to remote memory access -- One-Shot RMA (1RMA) -- suited to the constraints imposed by our multi-tenant datacenter settings. The 1RMA NIC is connection-free and fixed-function; it treats each RMA operation independently, assisting software by offering fine-grained delay measurements and fast failure notifications. 1RMA software provides operation pacing, congestion control, failure recovery, and inter-operation ordering, when needed. The NIC, deployed in our production datacenters, supports encryption at line rate (100Gbps and 100M ops/sec) with minimal performance/availability disruption for encryption key rotation. View details
    Sundial: Fault-tolerant Clock Synchronization for Datacenters
    Hema Hariharan
    Dave Platt
    Simon Sabato
    Minlan Yu
    Prashant Chandra
    14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1171-1186
    Preview abstract Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications have increasing performance requirements and datacenter networks get into ultra-low latency, we need submicrosecond-level bound on time-uncertainty to reduce transaction delay and enable new network management applications (e.g., measuring one-way delay for congestion control). The state-of-the-art clock synchronization solutions focus on improving clock precision but may incur significant time-uncertainty bound due to the presence of failures. This significantly affects applications because in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic to failures. Through experiments in a >500-machine testbed and large-scale simulations, we show that Sundial can achieve ~100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than the state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as Spanner and Swift congestion control. View details
    Preview abstract We report on experiences deploying Swift congestion control in Google datacenters. Swift relies on hardware timestamps in modern NICs, and is based on AIMD control with a specified end-to-end delay target. This simple design is an evolution of earlier protocols used at Google. It has emerged as a foundation for excellent performance, when network distances are well-known, that helps to meet operational challenges. Delay is easy to decompose into fabric and host components to separate concerns, and effortless to deploy and maintain as a signal from switches in changing datacenter environments. With Swift, we obtain low flow completion times for short RPCs, even at the 99th-percentile, while providing high throughput for long RPCs. At datacenter scale, Swift achieves 50$\mu$s tail latencies for short RPCs while sustaining a 100Gbps throughput per-server, a load close to 100\%. This is much better than protocols such as DCTCP that degrade latency and loss at high utilization. View details
    TIMELY: RTT-based Congestion Control for the Datacenter
    Radhika Mittal
    Terry Lam
    Emily Blem
    Monia Ghobadi
    Amin Vahdat
    David Zats
    Sigcomm '15, Google Inc (2015)
    Preview abstract Datacenter transports aim to deliver low latency messaging together with high throughput. We show that simple packet delay, measured as round-trip times at hosts, is an effective congestion signal without the need for switch feedback. First, we show that advances in NIC hardware have made RTT measurement possible with microsecond accuracy, and that these RTTs are sufficient to estimate switch queueing. Then we describe how TIMELY can adjust transmission rates using RTT gradients to keep packet latency low while delivering high bandwidth. We implement our design in host software running over NICs with OS-bypass capabilities. We show using experiments with up to hundreds of machines on a Clos network topology that it provides excellent performance: turning on TIMELY for OS-bypass messaging over a fabric with PFC lowers 99 percentile tail latency by 9X while maintaining near line-rate throughput. Our system also outperforms DCTCP running in an optimized kernel, reducing tail latency by 13X. To the best of our knowledge, TIMELY is the first delay-based congestion control protocol for use in the datacenter, and it achieves its results despite having an order of magnitude fewer RTT signals (due to NIC offload) than earlier delay-based schemes such as Vegas. View details
    No Results Found