Jump to Content

Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Publications

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 344 publications
    Preview abstract We discuss distributed reset control of bittide systems. In a bittide system, multiple processors communicate over a network. The processors remain in logical synchrony by controlling the timing of frame transmissions. The protocol for doing this relies upon an underlying dynamic control system, where each node makes only local observations and performs no direct coordination with other nodes. In this paper we develop a control algorithm based on the idea of reset control, which allows all nodes to maintain small buffer offsets while also requiring very little state information at each node. This offers the potential for simplified boot processes and failure handling. View details
    Progressive Partitioning for Parallelized Query Execution in Google’s Napa
    Junichi Tatemura
    Yanlai Huang
    Jim Chen
    Yupu Zhang
    Kevin Lai
    Divyakant Agrawal
    Brad Adelberg
    Shilpa Kolhar
    49th International Conference on Very Large Data Bases, VLDB (2023), pp. 3475-3487
    Preview abstract Napa powers Google's critical data warehouse needs. It utilizes Log-Structured Merge Tree (LSM) for real-time data ingestion and achieves sub-second query latency for billions of queries per day. Napa handles a wide variety of query workloads: from full-table scans, to range scans, and multi-key lookups. Our design challenge is to handle this diverse query workload that runs concurrently. In particular, a large percentage of our query volume consists of external reporting queries characterized by multi-key lookups with strict sub-second query latency targets. Query parallelization, which is achieved by processing a query in parallel by partitioning the input data (i.e., the SIMD model of computation), is an important technique to meet the low latency targets. Traditionally, the effectiveness of parallelization of a query is highly dependent on the alignment with the data partitioning established at write time. Unfortunately, such a write-time partitioning scheme cannot handle the highly variable parallelization requirements that are needed on a per-query basis. The key to Napa’s success is its ability to adapt its query parallelization requirements on a per-query basis. This paper describes an index-based approach to perform data partitioning for queries that have sub-second latency requirements. Napa’s approach is progressive in that it can provide good partitioning within the time budgeted for partitioning. Since the end-to-end query time also includes the time to perform partitioning there is a tradeoff in terms of the time spent for partitioning and the resulting evenness of the partitioning. Our approach balances these opposing considerations to provide sub-second querying for billions of queries each day. We use production data to establish the effectiveness of Napa’s approach across easy to handle workloads to the most pathological conditions. View details
    Opportunistic Package Delivery as a Service on Road Networks
    Debajyoti Ghosh
    Kiran Khatter
    Hanan Samet
    GeoInformatica (2023)
    Preview abstract In the new “gig” economy, a user plays the role of a consumer as well as a service provider. As a service provider, drivers travelling from a source to a destination may opportunistically pickup and drop-off packages along the way if that does not add significantly to their trip distance or time. This gives rise to a new business offering called Package Delivery as a Service (PDaaS) that brokers package pickups and deliveries at one end and connects them to drivers on the other end, thus creating an ecosystem of supply and demand. The dramatic cost savings of such a service model come from the fact that the driver is already en-route to their destination and the package delivery adds a small overhead to an already pre-planned trip. From a technical perspective, this problem introduces new technical challenges that are uncommon in the literature. The driver may want to optimise for distance or time. Furthermore, new packages arrive for delivery all the time and are assigned to various drivers continuously. This means that the algorithm has to work in an environment that is dynamic, thereby precluding most standard road network precomputation efforts. Furthermore, the number of packages that are available for delivery could be in the hundreds of thousands, which has to be quickly pruned down for the algorithm to scale. The paper proposes a variation called dual Dijkstra’s that combines a forward and a backward scan in order to find delivery options that satisfy the constraints specified by the driver. The new dual heuristic improves the standard single Dijkstra’s approach by narrowing down the search space, thus resulting in significant speed-ups over the standard algorithms. Furthermore, a combination of dual Dijkstra’s with a heuristic landmark approach results in a dramatic speed-up compared to the baseline algorithms. Experimental results show that the proposed approach can offer drivers a choice of packages to deliver under specified constraints of time or distance, and providing sub-second response time despite the complexity of the problem involved. As the number of packages in the system increases, the matchmaking process becomes easier resulting in faster response times. The scalability of the PDaaS infrastructure is demonstrated using extensive experimental results. View details
    Preview abstract We introduce logical synchrony, a framework that allows distributed computing to be coordinated as tightly as with pure synchrony without the distribution of a global clock or any reference to a universal time. We describe and prove the main properties of the framework and point to how processes can be executed on a logically synchronous system. View details
    CAPA: An Architecture For Operating Cluster Networks With High Availability
    Bingzhe Liu
    Mukarram Tariq
    Omid Alipourfard
    Rich Alimi
    Deepak Arulkannan
    Virginia Beauregard
    Patrick Conner
    Brighten Godfrey
    Xander Lin
    Mayur Patel
    Joon Ong
    Amr Sabaa
    Alex Smirnov
    Manish Verma
    Prerepa Viswanadham
    Google, Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 (2023)
    Preview abstract Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation“ layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have substantially reduced both in frequency and severity, with a 82% reduction in cumulative duration of incidents normalized to fleet size over five years View details
    Code Generation for Data-Dependent Stencils
    Mohammed Essadki
    Bertrand Michel
    Bruno Maugars
    Nicolas Vasilache
    CGO, IEEE (2023)
    Preview abstract Numerical simulation often resorts to iterative in-place stencils such as the Gauss-Seidel or Successive Overrelaxation (SOR) methods. Writing high performance implementations of such stencils requires significant effort and time; it also involves non-local transformations beyond the stencil kernel itself. While automated code generation is a mature technology for image processing stencils, convolutions and out-of place iterative stencils (such as the Jacobi method), the optimization of in-place stencils requires manual craftsmanship. Building on recent advances in tensor compiler construction, we propose the first domain-specific code generator for iterative in-place stencils. Starting from a generic tensor compiler implemented in the MLIR framework, tensor abstractions are incrementally refined and lowered down to parallel, tiled, fused and vectorized code. We used our generator to implement a realistic, implicit solver for structured meshes, and demonstrate results competitive with an industrial computational fluid dynamics framework. We also compare with stand-alone stencil kernels for dense tensors. View details
    Constant RMR System-wide Failure Resilient Durable Locks with Dynamic Joining
    Anup Joshi
    Prasad Jayanti
    ACM Symposium on Parallelism in Algorithms and Architectures (2023)
    Preview abstract We design a Recoverable Mutual Exclusion (RME) algorithm for the system-wide failure model. Our algorithm requires only O(1) space per process, and achieves O(1) worst-case RMR complexity in both the CC and DSM models. Furthermore, in contrast to existing RME algorithms which can only support a pre-declared set of n threads with names from 1 to n, our algorithm can be accessed by arbitrarily many dynamically allocated threads of arbitrary names. View details
    Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
    Shashwat Silas
    Dave Clausen
    File and Storage Technologies (FAST), USENIX (2023)
    Preview abstract Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work we study wide LRCs; LRCs with large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures. We conduct a practically-minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations, and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers. View details
    Brief Announcement: Efficient Recoverable Writable-CAS
    Prasad Jayanti
    Sucharita Jayanti
    Submitting to Principles of Distributed Computing (PODC) (2023)
    Preview abstract We present DuraCAS, a durable, i.e., recoverably linearizable and detectable implementation of the CAS (compare-and-swap) primitive. DuraCAS is writable, meaning it supports a Write() operation along with CAS() and Read(); has constant time complexity per operation; allows for dynamic joining, meaning newly created processes (a.k.a. threads) of arbitrary names can join the protocol and access our implementation; and has adaptive space complexity, meaning the space use scales in the number of processes n that actually use the objects, as opposed to previous protocols whose space complexity depends on N, the maximum number of processes that the protocol is designed for. Furthermore, DuraCAS, requires only O(m + n) space to support m objects that get accessed by n processes, improving on the state-of-the-art O(m + N2). To our knowledge, DuraCAS is the first durable CAS algorithm that allows for dynamic joining, and is the first to exhibit adaptive space complexity. View details
    RFC 9476 - The .alt Special-Use Top-Level Domain
    Paul Hoffman
    IETF Request For Comments, RFC Editor (2023), pp. 7
    Preview abstract This document reserves a Top-Level Domain (TLD) label "alt" to be used in non-DNS contexts. It also provides advice and guidance to developers creating alternative namespaces. View details
    Preview abstract We describe our experience with Fathom, a system for identifying the network performance bottlenecks of any service running in the Google fleet. Fathom passively samples RPCs, the principal unit of work for services. It segments the overall latency into host and network components with kernel and RPC stack instrumentation. It records these detailed latency metrics, along with detailed transport connection state, for every sampled RPC. This lets us determine if the completion is constrained by the client, network or server. To scale while enabling analysis, we also aggregate samples into distributions that retain multi-dimensional breakdowns. This provides us with a macroscopic view of individual services. Fathom runs globally in our datacenters for all production traffic, where it monitors billions of TCP connections 24x7. For five years Fathom has been our primary tool for troubleshooting service network issues and assessing network infrastructure changes. We present case studies to show how it has helped us improve our production services. View details
    HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube Content Delivery Network
    Zhenyu Song
    Kevin Chen
    Νikhil Sarda
    Eugene Brevdo
    Jimmy Coleman
    Xiao Ju
    Pawel Jurczyk
    Richard Schooler
    20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), USENIX Association, Boston, MA (2023), pp. 1149-1163
    Preview abstract Video streaming services are among the largest web applications in production, and a large source of downstream internet traffic. A large-scale video streaming service at Google, YouTube, leverages a Content Delivery Network (CDN) to serve its users. A key consideration in providing a seamless service is cache efficiency. In this work, we demonstrate machine learning techniques to improve the efficiency of YouTube's CDN DRAM cache. While many recently proposed learning-based caching algorithms show promising results, we identify and address three challenges blocking deployment of such techniques in a large-scale production environment: computation overhead for learning, robust byte miss ratio improvement, and measuring impact under production noise. We propose a novel caching algorithm, HALP, which achieves low CPU overhead and robust byte miss ratio improvement by augmenting a heuristic policy with machine learning. We also propose a production measurement method, impact distribution analysis, that can accurately measure the impact distribution of a new caching algorithm deployment in a noisy production environment. HALP has been running in YouTube CDN production as a DRAM level eviction algorithm since early 2022 and has reliably reduced the byte miss during peak by an average of 9.1% while expending a modest CPU overhead of 1.8%. View details
    A Model-based, Quality Attribute-guided Architecture Re-Design Process at Google
    Qin Jia
    Yuanfang Cai
    2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)
    Preview abstract Communicating and justifying design decisions are difficult, especially when the architecture design has to evolve. In this paper, we report our experiences of using formal but lightweight design models to communicate, justify, and analyze the quality trade-offs of an architecture revision plan for Monarch, a large-scale legacy system from Google. We started from a few critical user scenarios and their associated quality attribute scenarios, which makes these models lightweight and concise, expressing high-level abstractions only. We also separated static views from dynamic views so that each diagram can be precise and suitable for analyzing different types of quality attributes respectively. The combination of scenarios, quality attributes, and lightweight modeling was well accepted by the team as an effective way to analyze and communicate the trade-offs. A few days after we presented and shared this process, two new projects within the Monarch team adopted component and sequence diagrams in their design documents, and two other product areas within Google started to learn and to adopt the process as well. Our experience indicates that these architecture modeling and analysis techniques can be integrated into software development process to communicate and assess features, quality attributes, or design decisions continuously and iteratively. View details
    Adaptive Massively Parallel Connectivity in Optimal Space
    Jara Uitto
    Rustam Latypov
    Yannic Maus
    SPAA'23 (2023)
    Preview abstract We study the problem of finding connected components in the Adaptive Massively Parallel Computation (AMPC) model. We show that when the total available space is linear in the size of the input graph the problem can be solved in O(log* n) rounds in forests (with high probability) and 2^O(log* n) expected rounds in general graphs. This improves upon an existing O(log log_(m/n) n) round algorithm. For the case when the desired number of rounds is constant we show that both problems can be solved with only O(m + n log^(k) n) total space, where k is an arbitrarily large constant and log^(k) is the k-th iterate of the log2 function. This improves upon existing algorithms requiring Omega(m + n log n) total space. View details
    Durable Algorithms for Writable LL/SC and CAS with Dynamic Joining
    Prasad Jayanti
    Sucharita Jayanti
    International Symposium on Distributed Computing (2023)
    Preview abstract We present durable implementations for two well known universal primitives---CAS (compare-and-swap), and its ABA-free counter-part LLSC (load-linked, store-conditional). Our implementations satisfy method-based recoverable linearizability (MRL) and method-based detectability (M-detectability)---novel correctness conditions that require only a simple usage pattern to guarantee resilience to individual process crashes (and system-wide crashes), including in implementations with nesting. Additionally, our implementations are: writable, meaning they support a Write() operation; have constant time complexity per operation; allow for dynamic joining, meaning newly created processes (a.k.a. threads) of arbitrary names can join a protocol and access our implementations; and have adaptive space complexity, meaning the space use scales in the number of processes $n$ that actually use the objects, as opposed to previous protocols whose space complexity depends on $N$, the maximum number of processes that the protocol is designed for. Our durable Writable-CAS implementation, DuraCAS, requires $O(m + n)$ space to support $m$ objects that get accessed by $n$ processes, improving on the state-of-the-art $O(m + N^2)$. By definition, LLSC objects must store ``contexts'' in addition to object values. Our Writable-LLSC implementation, DuraLL, requires $O(m + n + C)$ space, where $C$ is the number of ``contexts'' stored across all the objects. While LLSC has an advantage over CAS due to being ABA-free, the object definition seems to require additional space usage. To address this trade-off, we define an External Context (EC) variant of LLSC. Our EC Writable-LLSC implementation is ABA-free and has a space complexity of just $O(m + n)$. To our knowledge, our algorithms are the first durable CAS algorithms that allow for dynamic joining, and are the first to exhibit adaptive space complexity. To our knowledge, we are the first to implement any type of durable LLSC objects. View details