Arjun Singh

Arjun is an Engineering Fellow and Technical Lead for networking at Google. During his tenure at Google, he has worked on developing solutions for Google’s data center, wide area, and edge/peering networks, with a focus on software-defined networking. Arjun has contributed to over five generations of data center and wide area networking infrastructure at Google over 19 years, and his work has been recognized with the ACM SIGCOMM Networking Systems Award and a Test of Time Paper Award. Before joining Google, Arjun received a PhD and an M.S. in Electrical Engineering from Stanford University and a Bachelor of Technology in Computer Science and Engineering from the Indian Institute of Technology (IIT), Kharagpur.
Authored Publications
    CAPA: An Architecture For Operating Cluster Networks With High Availability
    Bingzhe Liu
    Mukarram Tariq
    Omid Alipourfard
    Rich Alimi
    Deepak Arulkannan
    Virginia Beauregard
    Patrick Conner
    Brighten Godfrey
    Xander Lin
    Mayur Patel
    Joon Ong
    Amr Sabaa
    Alex Smirnov
    Manish Verma
    Prerepa Viswanadham
    Google (2023)
    Abstract: Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network does not depend on the good intentions of every engineer and operator. We enumerate the features of CAPA that we have found necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have been substantially reduced in both frequency and severity, with an 82% reduction in the cumulative duration of incidents, normalized to fleet size, over five years.
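    The abstract describes CAPA as a thin “regulation” layer that admits or rejects management operations based on best-practice policies. The sketch below illustrates that idea with a toy admission check; the operation fields, policies, and thresholds are hypothetical and are not CAPA’s actual API.

        # Toy admission check in the spirit of a "containment and prevention" layer.
        # All names, fields, and policy thresholds are invented for illustration.
        from dataclasses import dataclass

        @dataclass
        class ManagementOp:
            target_fabric: str        # cluster fabric the operation touches
            capacity_impact_pct: int  # capacity temporarily lost while the op runs
            has_rollback_plan: bool   # whether an automated rollback is attached

        def admit(op: ManagementOp, capacity_in_drain_pct: dict) -> bool:
            """Return True only if the operation satisfies containment policies."""
            # Policy 1: never allow an operation without an automated rollback plan.
            if not op.has_rollback_plan:
                return False
            # Policy 2: bound the total concurrent capacity impact per fabric.
            already_drained = capacity_in_drain_pct.get(op.target_fabric, 0)
            return already_drained + op.capacity_impact_pct <= 25

        # Example: a further drain on an already heavily drained fabric is rejected.
        in_drain = {"fabric-a": 20}
        print(admit(ManagementOp("fabric-a", 10, True), in_drain))  # False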
    Aquila: A unified, low-latency fabric for datacenter networks
    Hema Hariharan
    Eric Lance
    Moray Mclaren
    Stephen Wang
    Zhehua Wu
    Sunghwan Yoo
    Raghuraman Balasubramanian
    Prashant Chandra
    Michael Cutforth
    Peter James Cuy
    David Decotigny
    Rakesh Gautam
    Rick Roy
    Zuowei Shen
    Ming Tan
    Ye Tang
    Monica C Wong-Chan
    Joe Zbiciak
    (2022)
    Abstract: Datacenter workloads have evolved from the data intensive, loosely-coupled workloads of the past decade to more tightly coupled ones, wherein ultra-low latency communication is essential for resource disaggregation over the network and to enable emerging programming models. We introduce Aquila, an experimental datacenter network fabric built with ultra-low latency support as a first-class design goal, while also supporting traditional datacenter traffic. Aquila uses a new Layer 2 cell-based protocol, GNet, an integrated switch, and a custom ASIC with low-latency Remote Memory Access (RMA) capabilities co-designed with GNet. We demonstrate that Aquila is able to achieve under 40 μs tail fabric Round Trip Time (RTT) for IP traffic and sub-10 μs RMA execution time across hundreds of host machines, even in the presence of background throughput-oriented IP traffic. This translates to more than a 5x reduction in tail latency for a production-quality key-value store running on a prototype Aquila network.
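    The abstract mentions GNet, a Layer 2 cell-based protocol. The sketch below illustrates the general idea of a cell-based fabric, where packets are segmented into small fixed-size cells and reassembled at the receiver; the cell size and header fields are invented for illustration and do not reflect GNet’s actual format.

        # Toy segmentation/reassembly for a cell-based fabric. Cell size and
        # header layout are hypothetical, not GNet's.
        CELL_PAYLOAD_BYTES = 256

        def to_cells(packet: bytes, flow_id: int):
            """Split one packet into (header, payload) cells."""
            total = (len(packet) + CELL_PAYLOAD_BYTES - 1) // CELL_PAYLOAD_BYTES
            cells = []
            for seq in range(total):
                payload = packet[seq * CELL_PAYLOAD_BYTES:(seq + 1) * CELL_PAYLOAD_BYTES]
                cells.append(({"flow": flow_id, "seq": seq, "last": seq == total - 1}, payload))
            return cells

        def reassemble(cells):
            """Reassemble the cells of one packet, regardless of arrival order."""
            ordered = sorted(cells, key=lambda c: c[0]["seq"])
            return b"".join(payload for _, payload in ordered)

        pkt = bytes(1500)                  # a 1500-byte packet
        cells = to_cells(pkt, flow_id=42)  # -> 6 cells of up to 256 bytes each
        assert reassemble(cells) == pkt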
    Orion: Google’s Software-Defined Networking Control Plane
    Amr Sabaa
    Henrik Muehe
    Joon Suan Ong
    Karthik Swaminathan Nagaraj
    KondapaNaidu Bollineni
    Lorenzo Vicisano
    Mike Conley
    Min Zhu
    Rich Alimi
    Shawn Chen
    Shidong Zhang
    Waqar Mohsin
    (2021)
    Abstract: We present Orion, a distributed Software-Defined Networking platform deployed globally in Google’s datacenter (Jupiter) as well as Wide Area (B4) networks. Orion was designed around a modular, micro-service architecture with a central publish-subscribe database to enable a distributed, yet tightly-coupled, software-defined network control system. Orion enables intent-based management and control, is highly scalable, and is amenable to global control hierarchies. Over the years, Orion has matured with continuously improving performance in convergence (up to 40x faster), throughput (handling up to 1.16 million network updates per second), system scalability (supporting 16x larger networks), and data plane availability (50x and 100x reduction in unavailable time in Jupiter and B4, respectively), while maintaining high development velocity with a bi-weekly release cadence. Today, Orion robustly enables all of Google’s Software-Defined Networks, defending against failure modes that are both generic to large-scale production networks and unique to SDN systems.
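    The abstract highlights a central publish-subscribe database through which Orion’s micro-services coordinate. The toy sketch below shows that coordination pattern in miniature; the class, key names, and callback API are illustrative only, not Orion’s actual schema.

        # Toy publish-subscribe store: services coordinate by writing intents
        # and reacting to changes. All names are hypothetical.
        from collections import defaultdict
        from typing import Callable

        class IntentStore:
            """Key-value store that notifies subscribers on every write."""
            def __init__(self):
                self._data = {}
                self._subs = defaultdict(list)

            def subscribe(self, key: str, callback: Callable) -> None:
                self._subs[key].append(callback)

            def publish(self, key: str, value) -> None:
                self._data[key] = value
                for cb in self._subs[key]:
                    cb(key, value)

        # A routing service reacts to topology state published by another service.
        store = IntentStore()
        store.subscribe("link/a-b/state", lambda k, v: print(f"recompute routes: {k}={v}"))
        store.publish("link/a-b/state", "DOWN")  # triggers route recomputation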
    Taking the Edge off with Espresso: Scale, Reliability and Programmability for Global Internet Peering
    Matthew Holliman
    Gary Baldus
    Marcus Hines
    TaeEun Kim
    Ashok Narayanan
    Victor Lin
    Colin Rice
    Brian Rogan
    Bert Tanaka
    Manish Verma
    Puneet Sood
    Mukarram Tariq
    Dzevad Trumic
    Vytautas Valancius
    Calvin Ying
    Mahesh Kallahalla
    SIGCOMM (2017)
    Abstract: We present the design of Espresso, Google’s SDN-based Internet peering edge routing infrastructure. This architecture grew out of a need to exponentially scale the Internet edge cost-effectively and to enable application-aware routing at Internet-peering scale. Espresso utilizes commodity switches and host-based routing/packet processing to implement a novel fine-grained traffic engineering capability. Overall, Espresso provides Google a scalable peering edge that is programmable, reliable, and integrated with global traffic systems. Espresso also greatly accelerated deployment of new networking features at our peering edge. Espresso has been in production for two years and serves over 22% of Google’s total traffic to the Internet.
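    The abstract describes host-based routing programmed with fine-grained, application-aware traffic engineering. The sketch below shows one way such a programmed egress-selection table could look on a host; the mapping, traffic classes, and peer names are hypothetical and not Espresso’s actual data model.

        # Toy application-aware egress selection with failover. All names and
        # the table layout are invented for illustration.
        egress_map = {
            # (destination prefix, traffic class) -> egress peers, best first
            ("203.0.113.0/24", "video"): ["peer-nyc-1", "peer-nyc-2"],
            ("203.0.113.0/24", "bulk"):  ["peer-chi-1"],
        }
        peer_healthy = {"peer-nyc-1": False, "peer-nyc-2": True, "peer-chi-1": True}

        def pick_egress(prefix: str, traffic_class: str):
            """Return the best healthy egress for this prefix and class, if any."""
            for egress in egress_map.get((prefix, traffic_class), []):
                if peer_healthy.get(egress):
                    return egress
            return None

        print(pick_egress("203.0.113.0/24", "video"))  # "peer-nyc-2" after failover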
    Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network
    Joon Ong
    Amit Agarwal
    Glen Anderson
    Ashby Armistead
    Roy Bannon
    Seb Boving
    Gaurav Desai
    Bob Felderman
    Paulie Germano
    Anand Kanagala
    Jeff Provost
    Jason Simmons
    Eiichi Tanda
    Jim Wanderer
    Stephen Stuart
    Communications of the ACM, Vol. 59, No. 9 (2016), pp. 88-97
    Abstract: We present our approach for overcoming the cost, operational complexity, and limited scale endemic to datacenter networks a decade ago. Three themes unify the five generations of datacenter networks detailed in this paper. First, multi-stage Clos topologies built from commodity switch silicon can support cost-effective deployment of building-scale networks. Second, much of the general, but complex, decentralized network routing and management protocols supporting arbitrary deployment scenarios were overkill for single-operator, pre-planned datacenter networks. We built a centralized control mechanism based on a global configuration pushed to all datacenter switches. Third, modular hardware design coupled with simple, robust software allowed our design to also support inter-cluster and wide-area networks. Our datacenter networks run at dozens of sites across the planet, scaling in capacity by 100x over 10 years to more than 1 Pbps of bisection bandwidth.
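    The abstract credits multi-stage Clos topologies built from commodity switch silicon for cost-effective scaling. The sketch below works through the arithmetic for a simple two-stage (leaf/spine) Clos built from identical radix-k switches; the radix value is illustrative.

        # Host count of a non-blocking two-stage Clos built from radix-k switches.
        def two_stage_clos(radix: int):
            hosts_per_leaf = radix // 2           # half of each leaf's ports face hosts
            num_spines = radix - hosts_per_leaf   # the other half are spine uplinks
            num_leaves = radix                    # each spine port connects one leaf
            return num_leaves, num_spines, num_leaves * hosts_per_leaf

        leaves, spines, hosts = two_stage_clos(radix=64)
        print(f"{leaves} leaves x {spines} spines -> {hosts} hosts at full bisection")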
    Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network
    Joon Ong
    Amit Agarwal
    Glen Anderson
    Ashby Armistead
    Roy Bannon
    Seb Boving
    Gaurav Desai
    Paulie Germano
    Jeff Provost
    Jason Simmons
    Eiichi Tanda
    Jim Wanderer
    Amin Vahdat
    SIGCOMM '15 (2015)
    Abstract: We present our approach for overcoming the cost, operational complexity, and limited scale endemic to datacenter networks a decade ago. Three themes unify the five generations of datacenter networks detailed in this paper. First, multi-stage Clos topologies built from commodity switch silicon can support cost-effective deployment of building-scale networks. Second, much of the general, but complex, decentralized network routing and management protocols supporting arbitrary deployment scenarios were overkill for single-operator, pre-planned datacenter networks. We built a centralized control mechanism based on a global configuration pushed to all datacenter switches. Third, modular hardware design coupled with simple, robust software allowed our design to also support inter-cluster and wide-area networks. Our datacenter networks run at dozens of sites across the planet, scaling in capacity by 100x over ten years to more than 1 Pbps of bisection bandwidth.
    WCMP: Weighted Cost Multipathing for Improved Fairness in Data Centers
    Malveeka Tewari
    Min Zhu
    Abdul Kabbani
    EuroSys '14: Proceedings of the Ninth European Conference on Computer Systems (2014), Article No. 5
    Abstract: Data center topologies employ multiple paths among servers to deliver scalable, cost-effective network capacity. The simplest and most widely deployed approach for load balancing among these paths, Equal Cost Multipath (ECMP), hashes flows among the shortest paths toward a destination. ECMP leverages uniform hashing of balanced flow sizes to achieve fairness and good load balancing in data centers. However, we show that ECMP further assumes a balanced, regular, and fault-free topology; these assumptions are invalid in practice and can lead to substantial performance degradation and, worse, variation in flow bandwidths even for flows of the same size. We present a set of simple algorithms that implement Weighted Cost Multipath (WCMP) to balance traffic in the data center based on the changing network topology. The state required for WCMP is already disseminated as part of standard routing protocols, and it can be readily implemented in current switch silicon without any hardware modifications. We show how to deploy WCMP in a production OpenFlow network environment and present experimental and simulation results showing that variation in flow bandwidths can be reduced by as much as 25x by employing WCMP relative to ECMP.
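    The abstract contrasts ECMP’s uniform hashing with WCMP’s weighted spreading of flows over paths. The sketch below shows the basic mechanism of realizing per-path weights as replicated next-hop entries in a hash-selected group; the weights and flow key are illustrative.

        # Weighted multipath selection: hash a flow onto a next hop with
        # probability proportional to its weight. Weights here are made up.
        import hashlib

        def pick_next_hop(flow_key: str, weighted_paths: dict) -> str:
            # Replicate each path according to its weight, mirroring how weights
            # can be expressed as repeated entries in a multipath group.
            table = [p for path, w in weighted_paths.items() for p in [path] * w]
            h = int(hashlib.sha256(flow_key.encode()).hexdigest(), 16)
            return table[h % len(table)]

        # Asymmetric topology: the path via s2 has half the usable capacity.
        paths = {"via-s1": 2, "via-s2": 1}
        print(pick_next_hop("10.0.0.1:443->10.0.1.9:34567", paths))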
    B4: Experience with a Globally Deployed Software Defined WAN
    Sushant Jain
    Joon Ong
    Subbaiah Venkata
    Jim Wanderer
    Junlan Zhou
    Min Zhu
    Amin Vahdat
    Proceedings of the ACM SIGCOMM Conference, Hong Kong, China (2013)
    Abstract: One of the goals of traffic engineering is to achieve a flexible trade-off between fairness and throughput, so that users are satisfied with their bandwidth allocation and the network operator is satisfied with the utilization of network resources. In this paper, we propose a novel way to balance the throughput and fairness objectives with linear programming. It allows the network operator to precisely control the trade-off by bounding the fairness degradation for each commodity compared to the max-min fair solution, or the throughput degradation compared to the optimal throughput. We also present improvements to a previous algorithm that achieves max-min fairness by solving a series of linear programs. We significantly reduce the number of steps needed when the access rate of commodities is limited. We extend the algorithm to two important practical use cases: importance weights and piece-wise linear utility functions for commodities. Our experiments on synthetic and real networks show that our algorithms achieve a significant speedup and provide practical insights on the trade-off between fairness and throughput.
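    The abstract describes bounding each commodity’s fairness degradation while maximizing throughput via linear programming. The sketch below sets up that trade-off for a toy topology (one commodity crossing two links, two commodities using one link each) with scipy; the topology, capacities, and degradation factor are illustrative, not the paper’s actual algorithm.

        # Maximize total throughput subject to each commodity keeping at least
        # a fraction alpha of its max-min fair share. Toy numbers throughout.
        from scipy.optimize import linprog

        link_capacity = 10.0
        maxmin_fair = [5.0, 5.0, 5.0]  # max-min fair allocation for this topology
        alpha = 0.5                    # allowed fairness degradation factor

        res = linprog(
            c=[-1.0, -1.0, -1.0],            # maximize x0 + x1 + x2
            A_ub=[[1, 1, 0], [1, 0, 1]],     # x0 shares link 1 with x1, link 2 with x2
            b_ub=[link_capacity, link_capacity],
            bounds=[(alpha * f, None) for f in maxmin_fair],
        )
        print(res.x, -res.fun)  # e.g. [2.5, 7.5, 7.5], total throughput 17.5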