Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 326 publications
Preview abstract
Despite the advent of legislation such as the General Data Protection Regulation (GDPR) with its associated "Right to be Forgotten" (RTBF), few, if any, studies have measured user reactions to realistic edge cases with public-interest content. Surveying both users covered by and excluded from RTBF, this vignette-based survey experiment sought to better understand how users think of delisting content from search engine results and what factors influence user perceptions. While leaving information accessible in search engine results generally leads to warmer feelings towards those search engines than delisting it, we find that users do prefer different outcomes depending on contextual elements specific to given cases. We also find that whether a country has active RTBF legislation does seem to be associated with both knowledge and attitudes about RTBF, but is unlikely to explain all of it. These results indicate a complex context around removing public-interest content from search engines’ results; it is essential that experts sensitive to local context perform the review in order to ensure that removal requests are handled in a way that meets users’ expectations.
View details
On the Benefits of Traffic “Reprofiling” The Single Hop Case
Jiayi Song
Jiaming Qiu
Roch Guerin
Henry Sariowan
IEEE/ACM Transactions on Networking (2024)
Preview abstract
Datacenters have become a significant source of traffic, much of which is carried over private networks. The operators of those networks commonly have access to detailed traffic profiles and performance goals, which they seek to meet as efficiently as possible. Of interest are solutions that guarantee latency while minimizing network bandwidth. The paper explores a basic building block towards realizing such solutions, namely, a single hop configuration. The main results are in the form of optimal solutions for meeting local deadlines under schedulers of varying complexity and therefore cost. The results demonstrate how judiciously modifying flows’ traffic profiles, i.e., reprofiling them, can help simple schedulers reduce the bandwidth they require, often performing nearly as well as more complex ones.
View details
Preview abstract
This is an invited OFC 2024 conference workshop talk regarding a new type of lower-power datacenter optics design choice: linear pluggable optics. In this talk I will discuss the fundamental performance constraints facing linear pluggable optics and their implications for DCN and ML use cases.
View details
KATch: A Fast Symbolic Verifier for NetKAT
Mark Moeller
David Darais
Jules Jacobs
Nate Foster
Cole Schlesinger
Olivier Savary Belanger
Alexandra Silva
Programming Language Design and Implementation (PLDI) (2024) (to appear)
Preview abstract
We develop new data structures and algorithms for checking verification queries in NetKAT, a domain-specific language for specifying the behavior of network data planes. Our results extend the techniques obtained in prior work on symbolic automata and provide a framework for building efficient and scalable verification tools. We present KATch, an implementation of these ideas in Scala, including extended logical operators that are useful for expressing network-wide specifications and optimizations that construct a bisimulation quickly or generate a counter-example showing that none exists. We evaluate the performance of our implementation on real-world and synthetic benchmarks, verifying properties such as reachability and slice isolation, typically returning a result in well under a second, which is orders of magnitude faster than previous approaches.
View details
On the Benefits of Traffic “Reprofiling” The Multiple Hops Case – Part I
Roch Guerin
Jiayi Song
Jiaming Qiu
Henry Sariowan
IEEE/ACM Transactions on Networking (2024)
Preview abstract
This paper considers networks where user traffic is regulated through deterministic traffic profiles, e.g. token buckets, and requires guaranteed hard delay bounds. The network’s goal is to minimize the resources it needs to meet those bounds. The paper explores how reprofiling, i.e. proactively modifying how user traffic enters the network, can be of benefit. Reprofiling produces “smoother” flows but introduces an up-front access delay that forces tighter network delays. The paper explores this trade-off and demonstrates that, unlike what holds in the single-hop case, reprofiling can be of benefit even when “optimal” sophisticated schedulers are available at each hop.
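The trade-off described above can be illustrated with a generic token-bucket reshaper. This is a minimal sketch, not the paper's method: the class `TokenBucket`, its `release_time` method, and all numeric values are hypothetical. It assumes a flow conforming to a (rate, burst) profile is reshaped to a smaller burst at the same rate, in which case the worst-case added access delay is (burst - new_burst) / rate.

```python
class TokenBucket:
    """Hypothetical FIFO reshaper enforcing a (rate, burst) token-bucket profile."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate        # token refill rate, bytes per second
        self.burst = burst      # bucket depth, bytes
        self.tokens = burst     # the bucket starts full
        self.last = 0.0         # time of the most recent departure

    def release_time(self, arrival: float, size: float) -> float:
        """Return the earliest time the packet may leave, and consume its tokens.

        Assumes size <= burst and that packets are offered in arrival order.
        """
        start = max(arrival, self.last)  # FIFO: wait for earlier packets
        self.tokens = min(self.burst, self.tokens + (start - self.last) * self.rate)
        wait = max(0.0, (size - self.tokens) / self.rate)
        self.tokens = min(self.burst, self.tokens + wait * self.rate) - size
        self.last = start + wait
        return self.last


# Illustrative numbers only: the original profile allows a 5000-byte burst
# at 1000 bytes/s, and the flow is reprofiled down to a 1000-byte burst.
RATE, BURST, NEW_BURST = 1000.0, 5000.0, 1000.0

# Worst case for the reshaper: the source sends its whole burst at t = 0.
arrivals = [(0.0, 1000.0)] * 5
shaper = TokenBucket(RATE, NEW_BURST)
departures = [shaper.release_time(t, size) for t, size in arrivals]

max_access_delay = max(d - t for d, (t, _) in zip(departures, arrivals))
print(departures, max_access_delay)
```

The reshaped flow leaves one packet per second instead of all at once, which is the "smoother" flow the abstract refers to; the price is the access delay of the last packet, which must be recovered through tighter delays inside the network.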
View details
A Decentralized SDN Architecture for the WAN
Hakim Weatherspoon
Sylvia Ratnasamy
Ashok Narayanan
Nitika Saran
Ankit Singla
2024 ACM Special Interest Group on Data Communication (SIGCOMM) (2024)
Preview abstract
Motivated by our experiences operating a global WAN, we argue that SDN’s reliance on infrastructure external to the data plane has significantly complicated the challenge of maintaining high availability. We propose a new decentralized SDN (dSDN) architecture in which SDN control logic instead runs within routers, eliminating the control plane’s reliance on external infrastructure and restoring fate sharing between control and data planes.
We present dSDN as a simpler approach to realizing the benefits of SDN in the WAN. Despite its much simpler design, we show that dSDN is practical from an implementation viewpoint, and outperforms centralized SDN in terms of routing convergence and SLO impact.
View details
Ubiquitous and Low-Cost Generation of Elevation Pseudo Ground Control Points
Moustafa Youssef
Etienne Le Grand
14th International Conference on Indoor Positioning and Indoor Navigation (IPIN). Hong Kong, China, 2024.
Preview abstract
In this paper, we design a system to generate Pseudo Ground Control Points (PGCPs) using standard, low-cost, widely available GNSS receivers in a crowd-sourcing manner. We propose a number of GNSS point filters that remove different causes of errors and biases, and design a linear regression height estimator leading to high-accuracy PGCP elevations. Evaluation of our system shows that the PGCPs can achieve a median accuracy of 22.5 cm in 25 metropolitan areas in the USA.
View details
Distributed Tracing for InterPlanetary File System
Haorui Guo
Rachel Han
Marshall David Miller
2024 International Symposium on Parallel Computing and Distributed Systems (PCDS), IEEE, pp. 1-5
Preview abstract
The InterPlanetary File System (IPFS) is on its way to becoming the backbone of the next generation of the web. However, it suffers from several performance bottlenecks, particularly on the content retrieval path, which are often difficult to debug. This is because content retrieval involves multiple peers on the decentralized network and the issue could lie anywhere in the network. Traditional debugging tools are insufficient to help web developers who face the challenge of slow loading websites and detrimental user experience. This limits the adoption and future scalability of IPFS.
In this paper, we aim to gain valuable insights into how content retrieval requests propagate within the IPFS network as well as identify potential performance bottlenecks which could lead to opportunities for improvement. We propose a custom tracing framework that generates and manages traces for crucial events that take place on each peer during content retrieval. The framework leverages event semantics to build a timeline of each protocol involved in the retrieval, helping developers pinpoint problems. Additionally, it is resilient to malicious behaviors of the peers in the decentralized environment.
We have implemented this framework on top of an existing IPFS implementation written in Java called Nabu. Our evaluation shows that the framework can identify network delays and issues with each peer involved in content retrieval requests at a very low overhead.
View details
(Invited) How Traffic Analytics Shapes Traffic Engineering, Topology Engineering, and Capacity Planning of Jupiter
Jianan Zhang
Optical Fiber Communication (OFC) Conference, IEEE (2023)
Preview abstract
Three prominent traffic features, namely peak alignment, stable ranking, and the gravity model, have guided the design of current Google Jupiter fabrics in traffic engineering, topology engineering, and capacity planning.
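For readers unfamiliar with the gravity model, the sketch below shows its textbook form, in which the demand from block i to block j is proportional to the product of block i's total egress and block j's total ingress. The abstract only names the model, so the function `gravity_matrix` and the sample volumes are assumptions for illustration and may not match how it is applied to Jupiter.

```python
from typing import List


def gravity_matrix(egress: List[float], ingress: List[float]) -> List[List[float]]:
    """Textbook gravity model: T[i][j] = egress[i] * ingress[j] / total.

    egress[i] is the total traffic sent by block i and ingress[j] the total
    traffic received by block j; both lists must sum to the same total.
    """
    total = sum(egress)
    if total <= 0 or abs(total - sum(ingress)) > 1e-9 * total:
        raise ValueError("egress and ingress must sum to the same positive total")
    return [[e * r / total for r in ingress] for e in egress]


# Illustrative volumes (arbitrary units) for two blocks.
matrix = gravity_matrix([30.0, 10.0], [20.0, 20.0])
print(matrix)
```

Each row of the result sums to that block's egress and each column to that block's ingress, so the per-block aggregates alone determine the estimated block-to-block demands.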
View details
Change Management in Physical Network Lifecycle Automation
Sean Smith
Melanie Obenberger
Bill Martinusen
Jahangir Hasan
Zhoutao Liu
Edward Thiele
Virginia Beauregard
Anshul Nigham
Chen Huang
Ahmed Mansy
Nikil Mehta
Quan Leng
Alexander Lin
Jiayao Li
Angus Griffith
Kevin Grant
Kurt Steinkraus
Sheng Sun
Andrew Narver
Proc. 2023 USENIX Annual Technical Conference (USENIX ATC 23)
Preview abstract
Automated management of a physical network's lifecycle is critical for large networks. At Google, we manage network design, construction, evolution, and management via multiple automated systems. In our experience, one of the primary challenges is to reliably and efficiently manage change in this domain -- additions of new hardware and connectivity, planning and sequencing of topology mutations, introduction of new architectures, new software systems and fixes to old ones, etc.
We especially have learned the importance of supporting multiple kinds of change in parallel without conflicts or mistakes (which cause outages) while also maintaining parallelism between different teams and between different processes. We now know that this requires automated support.
This paper describes some of our network lifecycle goals, the automation we have developed to meet those goals, and the change-management challenges we encountered. We then discuss in detail our approaches to several specific kinds of change management:
(1) managing conflicts between multiple operations on the same network;
(2) managing conflicts between operations spanning the boundaries between networks;
(3) managing representational changes in the models that drive our automated systems.
These approaches combine both novel software systems and software-engineering practices.
While this paper reports on our experience with large-scale datacenter network infrastructures, we are also applying the same tools and practices in several adjacent domains, such as the management of WAN systems, of machines, and of datacenter physical designs. Our approaches are likely to be useful at smaller scales, too.
View details
Preview abstract
We introduce logical synchrony, a framework that allows distributed computing to be coordinated as tightly as with pure synchrony without the distribution of a global clock or any reference to a universal time. We describe and prove the main properties of the framework and point to how processes can be executed on a logically synchronous system.
View details
CAPA: An Architecture For Operating Cluster Networks With High Availability
Bingzhe Liu
Brighten Godfrey
Omid Alipourfard
Joon Ong
Virginia Beauregard
Mukarram Tariq
Mayur Patel
Prerepa Viswanadham
Manish Verma
Xander Lin
Patrick Conner
Deepak Arulkannan
Amr Sabaa
Rich Alimi
Alex Smirnov
Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 (2023)
Preview abstract
Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have decreased substantially in both frequency and severity, with an 82% reduction in cumulative duration of incidents normalized to fleet size over five years.
View details
Preview abstract
Bolt is a congestion-control algorithm designed to provide single-digit microsecond tail network queuing at near-line-rate utilization. It is motivated by the need for ultra-low latency to support applications such as NVMe. As line rates reach 200G and beyond, most transfers fit within a single BDP, so transfer times become predominantly a function of queuing and propagation delays. Bolt is an attempt to push congestion control to its theoretical limits by harnessing the power of programmable dataplanes such as Tofino and Trident3+ chips. Bolt is founded on three key ideas: (i) Sub-RTT reaction (SRR): reacting to congestion faster than the RTT control-loop delay, (ii) Proactive Ramp-up (PRU): tracking future flow completions, and (iii) Supply matching (SM): leveraging Network Calculus concepts to maximize utilization. Our current results achieve a 75% reduction in queuing delays over Swift, with up to 3x improvement in completion times for short transfers.
View details
Preview abstract
We review state-of-the-art datacenter technologies for 800G, 1.6T and beyond interconnect speeds, focusing on 200G per-lane IM-DD (intensity modulated-direct detect) and 800G-LR1 coherent-lite transmissions.
View details
Improving Network Availability with Protective ReRoute
Abdul Kabbani
Brad Morrey
Uma Parthavi Moravapalle
Steven Knight
Van Jacobson
Jim Winget
SIGCOMM 2023
Preview abstract
We present PRR (Protective ReRoute), a transport technique for shortening user-visible outages that complements routing repair. It can be added to any transport to provide benefits in multipath networks. PRR responds to flow connectivity failure signals, e.g., retransmission timeouts, by changing the FlowLabel on packets of the flow, which causes switches and hosts to choose a different network path that may avoid the outage. To enable it, we shifted our IPv6 network architecture to use the FlowLabel, so that hosts can change the paths of their flows without application involvement. PRR is deployed fleetwide at Google for TCP and Pony Express, where it has been protecting all production traffic for several years. It is also available to our Cloud customers. We find it highly effective for real outages. In a measurement study on our network backbones, adding PRR reduced the cumulative region-pair outage time for RPC traffic by 63-84%. This is the equivalent of adding 0.4-0.8 "nines" of availability.
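The repathing idea and the "nines" arithmetic can be sketched as follows. This is a toy model under stated assumptions, not Google's implementation: `pick_path` stands in for a switch's hash-based multipath choice, and `Flow`, `on_retransmission_timeout`, `nines_gained`, the addresses, and the path count are all hypothetical names and values. The only facts relied on are that the IPv6 FlowLabel is a 20-bit field and that cutting outage time by a fraction x adds -log10(1 - x) nines.

```python
import hashlib
import math
import random


def pick_path(src: str, dst: str, flow_label: int, num_paths: int) -> int:
    """Stand-in for a switch's multipath hash: same inputs, same path."""
    key = f"{src}|{dst}|{flow_label}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % num_paths


class Flow:
    """Hypothetical transport flow that repaths by changing its FlowLabel."""

    def __init__(self, src: str, dst: str, rng: random.Random) -> None:
        self.src, self.dst, self.rng = src, dst, rng
        self.flow_label = rng.getrandbits(20)  # the IPv6 FlowLabel is 20 bits

    def on_retransmission_timeout(self) -> None:
        """Treat the timeout as a connectivity-failure signal and relabel."""
        old = self.flow_label
        while self.flow_label == old:
            self.flow_label = self.rng.getrandbits(20)

    def path(self, num_paths: int) -> int:
        return pick_path(self.src, self.dst, self.flow_label, num_paths)


def nines_gained(outage_reduction: float) -> float:
    """Availability 'nines' added by cutting outage time by this fraction."""
    return -math.log10(1.0 - outage_reduction)


NUM_PATHS = 4
flow = Flow("2001:db8::1", "2001:db8::2", random.Random(7))
failed_path = flow.path(NUM_PATHS)      # pretend the current path just failed

timeouts = 0
while flow.path(NUM_PATHS) == failed_path and timeouts < 100:
    flow.on_retransmission_timeout()    # a new label may hash to a new path
    timeouts += 1

print(timeouts, flow.path(NUM_PATHS) != failed_path)
print(round(nines_gained(0.63), 2), round(nines_gained(0.84), 2))
```

A new label is not guaranteed to land on a working path, which matches the abstract's wording that the new path "may avoid the outage"; repeated timeouts simply draw again. The last line shows where the 0.4 to 0.8 range comes from: a 63% reduction gives about 0.43 nines and an 84% reduction about 0.80.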
View details