Leon Poutievski

Authored Publications
    Three prominent traffic features (peak alignment, stable ranking, and the gravity model) have guided the design of current Google Jupiter fabrics in traffic engineering, topology engineering, and capacity planning.
    Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking
    Joon Ong
    Arjun Singh
    Mukarram Tariq
    Rui Wang
    Jianan Zhang
    Virginia Beauregard
    Patrick Conner
    Rishi Kapoor
    Stephen Kratzer
    Nanfang Li
    Hong Liu
    Karthik Nagaraj
    Jason Ornstein
    Samir Sawhney
    Ryohei Urata
    Lorenzo Vicisano
    Kevin Yasumura
    Shidong Zhang
    Junlan Zhou
    Proceedings of ACM SIGCOMM 2022
    We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, a 30% reduction in capex, a 41% reduction in power, and incremental deployment and technology refresh, all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes for this include a datacenter interconnection layer employing Micro-ElectroMechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves throughput similar to that of Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes a direct path from source to destination aggregation blocks, while the remainder transits one additional block, achieving an average block-level path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch-panel-based interconnect.
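The 1.4 figure follows from the two traffic shares quoted in the abstract. As a quick check, assuming a direct path counts as block-level length 1 and a path through one transit block counts as length 2 (a counting convention assumed here, not stated in the abstract), the traffic-weighted average is:

```latex
\[
\bar{\ell} \;=\; 0.6 \times 1 \;+\; 0.4 \times 2 \;=\; 1.4
\]
```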
    Traffic load balancing across multiple paths is a critical task for modern networks to reduce network congestion and improve network efficiency. Hashing, which is the foundation of traffic load balancing, still faces practical challenges. The key problem is a growing need for more hash functions, because networks are getting larger, with more switches, more stages, and increased path diversity. Meanwhile, topology and routing are becoming more agile in order to efficiently serve traffic demands with stricter throughput and latency SLAs. On the other hand, current-generation switch chips provide only a limited number of uncorrelated hash functions. We first demonstrate why the limited number of hash functions is a practical challenge in today's datacenter network (DCN) and wide-area network (WAN) designs. Then, to mitigate the problem, we propose a novel approach named color recombining, which enables hash function reuse by leveraging topology traits of multi-stage DCN networks. We also describe a novel framework based on coprime theory to mitigate hash correlation in generic mesh topologies (i.e., spineless DCN and WAN). Our evaluation on real network trace data and topologies demonstrates that we can reduce the extent of load imbalance (measured by coefficient of variation) by an order of magnitude.
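To make the hash-correlation problem concrete, the following is a minimal sketch, not the paper's color recombining or coprime method. It assumes a hypothetical two-stage network in which each switch picks one of four uplinks by `hash(flow) % 4`, and compares reusing the same hash function at both stages against using two independent ones. The helper names (`h`, `second_stage_loads`), the seeds, and the flow count are all illustrative assumptions. Load imbalance is measured with the coefficient of variation, as in the abstract.

```python
import hashlib
import statistics


def h(flow_id: int, seed: int) -> int:
    """Deterministic stand-in for a switch hash function (assumption: SHA-256 based)."""
    data = f"{seed}:{flow_id}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def coefficient_of_variation(loads):
    """Population standard deviation divided by the mean of per-link loads."""
    return statistics.pstdev(loads) / statistics.mean(loads)


def second_stage_loads(num_flows, fanout, seed1, seed2):
    """Per-uplink flow counts at the stage-2 switch reached via stage-1 uplink 0."""
    loads = [0] * fanout
    for flow in range(num_flows):
        # Stage 1: only flows hashed to uplink 0 arrive at the switch we observe.
        if h(flow, seed1) % fanout == 0:
            # Stage 2: that switch hashes the same flow again to pick its uplink.
            loads[h(flow, seed2) % fanout] += 1
    return loads


# Same hash function at both stages: every arriving flow lands on uplink 0.
reused = second_stage_loads(4000, 4, seed1=1, seed2=1)
# Independent hash functions: arriving flows spread across all four uplinks.
independent = second_stage_loads(4000, 4, seed1=1, seed2=2)

cv_reused = coefficient_of_variation(reused)
cv_independent = coefficient_of_variation(independent)
print("reused hash:     ", reused, round(cv_reused, 3))
print("independent hash:", independent, round(cv_independent, 3))
```

With the reused hash, three of the four uplinks carry nothing, which is the kind of imbalance that a limited supply of uncorrelated hash functions can cause in multi-stage topologies.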
    Orion: Google’s Software-Defined Networking Control Plane
    Amr Sabaa
    Henrik Muehe
    Joon Suan Ong
    Karthik Swaminathan Nagaraj
    KondapaNaidu Bollineni
    Lorenzo Vicisano
    Mike Conley
    Min Zhu
    Rich Alimi
    Shawn Chen
    Shidong Zhang
    Waqar Mohsin
    (2021)
    We present Orion, a distributed Software-Defined Networking platform deployed globally in Google’s datacenter (Jupiter) as well as Wide Area (B4) networks. Orion was designed around a modular, micro-service architecture with a central publish-subscribe database to enable a distributed, yet tightly-coupled, software-defined network control system. Orion enables intent-based management and control, is highly scalable, and is amenable to global control hierarchies. Over the years, Orion has matured with continuously improving performance in convergence (up to 40x faster), throughput (handling up to 1.16 million network updates per second), system scalability (supporting 16x larger networks), and data plane availability (50x and 100x reductions in unavailable time in Jupiter and B4, respectively) while maintaining high development velocity with a bi-weekly release cadence. Today, Orion robustly enables all of Google’s Software-Defined Networks, defending against failure modes that are both generic to large-scale production networks and unique to SDN systems.
    WCMP: Weighted Cost Multipathing for Improved Fairness in Data Centers
    Malveeka Tewari
    Min Zhu
    Abdul Kabbani
    EuroSys '14: Proceedings of the Ninth European Conference on Computer Systems (2014), Article No. 5
    Data Center topologies employ multiple paths among servers to deliver scalable, cost-effective network capacity. The simplest and the most widely deployed approach for load balancing among these paths, Equal Cost Multipath (ECMP), hashes flows among the shortest paths toward a destination. ECMP leverages uniform hashing of balanced flow sizes to achieve fairness and good load balancing in data centers. However, we show that ECMP further assumes a balanced, regular, and fault-free topology, which are invalid assumptions in practice that can lead to substantial performance degradation and, worse, variation in flow bandwidths even for same-size flows. We present a set of simple algorithms that achieve Weighted Cost Multipath (WCMP) to balance traffic in the data center based on the changing network topology. The state required for WCMP is already disseminated as part of standard routing protocols, and it can be readily implemented in current switch silicon without any hardware modifications. We show how to deploy WCMP in a production OpenFlow network environment and present experimental and simulation results to show that variation in flow bandwidths can be reduced by as much as 25X by employing WCMP relative to ECMP.
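The idea of weighting next hops can be sketched as follows. This is an illustration only, not the paper's algorithms: it assumes weights are realized by replicating each next hop in a multipath table in proportion to its weight, and that a flow is mapped to an entry by hashing a flow key. The port names, the weights, and the helpers `build_wcmp_table` and `pick_port` are hypothetical.

```python
import hashlib
from collections import Counter


def build_wcmp_table(weights: dict[str, int]) -> list[str]:
    """Replicate each next-hop port `weight` times (assumed realization of weights)."""
    table = []
    for port, weight in sorted(weights.items()):
        table.extend([port] * weight)
    return table


def pick_port(table: list[str], flow_key: str) -> str:
    """Hash the flow key and index into the table, as ECMP does with equal entries."""
    digest = hashlib.sha256(flow_key.encode()).digest()
    return table[int.from_bytes(digest[:8], "big") % len(table)]


# Hypothetical case: uplink_a has twice the downstream capacity of uplink_b,
# for example because a link behind uplink_b has failed.
weights = {"uplink_a": 2, "uplink_b": 1}
table = build_wcmp_table(weights)

# Count how 3000 synthetic flows are distributed across the two uplinks.
counts = Counter(pick_port(table, f"flow-{i}") for i in range(3000))
print(table)
print(dict(counts))
```

With equal weights the table degenerates to plain ECMP; with unequal weights, roughly twice as many flows land on `uplink_a` as on `uplink_b`, matching the assumed capacity ratio.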
    B4: Experience with a Globally Deployed Software Defined WAN
    Sushant Jain
    Joon Ong
    Subbaiah Venkata
    Jim Wanderer
    Junlan Zhou
    Min Zhu
    Amin Vahdat
    Proceedings of the ACM SIGCOMM Conference, Hong Kong, China (2013)
    Onix: a distributed control platform for large-scale production networks
    Teemu Koponen
    Martin Casado
    Natasha Gude
    Jeremy Stribling
    Min Zhu
    Rajiv Ramanathan
    Yuichiro Iwata NEC
    Hiroaki Inoue NEC
    Takayuki Hama NEC
    Scott Shenker
    Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, Berkeley, CA, USA (2010), pp. 351-364
    Computer networks lack a general control paradigm, as traditional networks do not provide any network-wide management abstractions. As a result, each new function (such as routing) must provide its own state distribution, element discovery, and failure recovery mechanisms. We believe this lack of a common control platform has significantly hindered the development of flexible, reliable and feature-rich network control planes. To address this, we present Onix, a platform on top of which a network control plane can be implemented as a distributed system. Control planes written within Onix operate on a global view of the network, and use basic state distribution primitives provided by the platform. Thus Onix provides a general API for control plane implementations, while allowing them to make their own trade-offs among consistency, durability, and scalability.