Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 514 publications
    Federated Variational Inference: Towards Improved Personalization and Generalization
    Elahe Vedadi
    Josh Dillon
    Philip Mansfield
    Karan Singhal
    Arash Afkanpour
    Warren Morningstar
    AAAI Federated Learning on the Edge Symposium(2024)
    Preview abstract Conventional federated learning algorithms train a single global model by leveraging all participating clients' data. However, due to heterogeneity in client generative distributions and predictive models, these approaches may not appropriately approximate the predictive process, converge to an optimal state, or generalize to new clients. We study personalization and generalization in stateless cross-device federated learning setups assuming heterogeneity in client data distributions and predictive models. We first propose a hierarchical generative model and formalize it using Bayesian Inference. We then approximate this process using Variational Inference to train our model efficiently. We call this algorithm Federated Variational Inference (FedVI). We use PAC-Bayes analysis to provide generalization bounds for FedVI. We evaluate our model on FEMNIST and CIFAR-100 image classification and show that FedVI beats the state-of-the-art on both tasks. View details
    Preview abstract Google services are powered by the largest network of computers in the world. Site Reliabity Engineers (SRE) make sure that the whole stack is cool: datacenters are safe, well provisionedl; we have fallback mechanims, and data integrity; to making sure we design our stack properly, using the right storage, replication and software trade-offs. Generative AI is a great tool to make us super-effective: having access to tools to generate our most toily configurations, to classify risks and events, to manage large swaths of machines with agents or to automate complex workflows cheaply. This talk will cover the journey that SRE started years ago to become a truly AI-First discipline and the latest advancements in tooling, practices and workflows. View details
    Preview abstract Representative modeling of I/O activity is crucial when designing large-scale distributed storage systems. Particularly important use cases are counterfactual “what-if” analyses that assess the impact of anticipated or hypothetical new storage policies or hardware prior to deployment. We propose Thesios, a methodology to accurately synthesize such hypothetical full-resolution I/O traces by carefully combining down-sampled I/O traces collected from multiple disks attached to multiple storage servers. Applying this approach to real-world traces that a real ready routinely sampled at Google, we show that our synthesized traces achieve 95–99.5% accuracy in read/write request numbers, 90–97% accuracy in utilization, and 80–99.8% accuracy in read latency compared to metrics collected from actual disks. We demonstrate how The-sios enables diverse counterfactual I/O trace synthesis and analyses of hypothetical policy, hardware, and server changes through four case studies: (1) studying the effects of changing disk’s utilization, fullness, and capacity, (2) evaluating new data placement policy, (3) analyzing the impact on power and performance of deploying disks with reduced rotations-per-minute (RPM), and (4) understanding the impact of increased buffer cache size on a storage server. Without Thesios, such counterfactual analyses would require costly and potentially risky A/B experiments in production. View details
    Load is not what you should balance: Introducing Prequal
    Bartek Wydrowski
    Bobby Kleinberg
    Steve Rumble
    (2024)
    Preview abstract We present Prequal (\emph{Probing to Reduce Queuing and Latency}), a load balancer for distributed multi-tenant systems. Prequal aims to minimize real-time request latency in the presence of heterogeneous server capacities and non-uniform, time-varying antagonist load. It actively probes server load to leverage the \emph{power of $d$ choices} paradigm, extending it with asynchronous and reusable probes. Cutting against received wisdom, Prequal does not balance CPU load, but instead selects servers according to estimated latency and active requests-in-flight (RIF). We explore its major design features on a testbed system and evaluate it on YouTube, where it has been deployed for more than two years. Prequal has dramatically decreased tail latency, error rates, and resource use, enabling YouTube and other production systems at Google to run at much higher utilization. View details
    BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse
    Garrett Casto
    Mingge Deng
    Rushabh Desai
    Thibaud Hottelier
    Amir Hormati
    Jeff Johnson
    Dawid Kurzyniec
    Prem Ramanathan
    Gaurav Saxena
    Vidya Shanmugam
    Yuri Volobuev
    SIGMOD(2024)
    Preview abstract BigQuery’s cloud-native disaggregated architecture has allowed Google Cloud to evolve the system to meet several customer needs across the analytics and AI/ML workload spectrum. A key customer requirement for BigQuery centers around the unification of data lake and enterprise data warehousing workloads. This approach combines: (1) the need for core data management primitives, e.g., security, governance, common runtime metadata, performance acceleration, ACID transactions, provided by an enterprise data warehouses coupled with (2) harnessing the flexibility of the open source format and analytics ecosystem along with new workload types such as AI/ML over unstructured data on object storage. In addition, there is a strong requirement to support BigQuery as a multi-cloud offering given cloud customers are opting for a multi-cloud footprint by default. This paper describes BigLake, an evolution of BigQuery toward a multi-cloud lakehouse to address these customer requirements in novel ways. We describe three main innovations in this space. We first present BigLake tables, making open-source table formats (e.g., Apache Parquet, Iceberg) first class citizens, providing fine-grained governance enforcement and performance acceleration over these formats to BigQuery and other open-source analytics engines. Next, we cover the design and implementation of BigLake Object tables that allow BigQuery to integrate AI/ML for inferencing and processing over unstructured data. Finally, we present Omni, a platform for deploying BigQuery on non-GCP clouds, focusing on the infrastructure and operational innovations we made to provide an enterprise lakehouse product regardless of the cloud provider hosting the data. View details
    Preview abstract Vortex is an exabyte scale structured storage system built for streaming and batch analytics. It supports high-throughput batch and stream ingestion. For the user, it supports both batch-oriented and stream-based processing on the ingested data. View details
    Preview abstract We discuss distributed reset control of bittide systems. In a bittide system, multiple processors communicate over a network. The processors remain in logical synchrony by controlling the timing of frame transmissions. The protocol for doing this relies upon an underlying dynamic control system, where each node makes only local observations and performs no direct coordination with other nodes. In this paper we develop a control algorithm based on the idea of reset control, which allows all nodes to maintain small buffer offsets while also requiring very little state information at each node. This offers the potential for simplified boot processes and failure handling. View details
    Progressive Partitioning for Parallelized Query Execution in Google’s Napa
    Junichi Tatemura
    Yanlai Huang
    Jim Chen
    Yupu Zhang
    Kevin Lai
    Divyakant Agrawal
    Brad Adelberg
    Shilpa Kolhar
    49th International Conference on Very Large Data Bases, VLDB(2023), pp. 3475-3487
    Preview abstract Napa powers Google's critical data warehouse needs. It utilizes Log-Structured Merge Tree (LSM) for real-time data ingestion and achieves sub-second query latency for billions of queries per day. Napa handles a wide variety of query workloads: from full-table scans, to range scans, and multi-key lookups. Our design challenge is to handle this diverse query workload that runs concurrently. In particular, a large percentage of our query volume consists of external reporting queries characterized by multi-key lookups with strict sub-second query latency targets. Query parallelization, which is achieved by processing a query in parallel by partitioning the input data (i.e., the SIMD model of computation), is an important technique to meet the low latency targets. Traditionally, the effectiveness of parallelization of a query is highly dependent on the alignment with the data partitioning established at write time. Unfortunately, such a write-time partitioning scheme cannot handle the highly variable parallelization requirements that are needed on a per-query basis. The key to Napa’s success is its ability to adapt its query parallelization requirements on a per-query basis. This paper describes an index-based approach to perform data partitioning for queries that have sub-second latency requirements. Napa’s approach is progressive in that it can provide good partitioning within the time budgeted for partitioning. Since the end-to-end query time also includes the time to perform partitioning there is a tradeoff in terms of the time spent for partitioning and the resulting evenness of the partitioning. Our approach balances these opposing considerations to provide sub-second querying for billions of queries each day. We use production data to establish the effectiveness of Napa’s approach across easy to handle workloads to the most pathological conditions. View details
    Opportunistic Package Delivery as a Service on Road Networks
    Debajyoti Ghosh
    Kiran Khatter
    Hanan Samet
    GeoInformatica(2023)
    Preview abstract In the new “gig” economy, a user plays the role of a consumer as well as a service provider. As a service provider, drivers travelling from a source to a destination may opportunistically pickup and drop-off packages along the way if that does not add significantly to their trip distance or time. This gives rise to a new business offering called Package Delivery as a Service (PDaaS) that brokers package pickups and deliveries at one end and connects them to drivers on the other end, thus creating an ecosystem of supply and demand. The dramatic cost savings of such a service model come from the fact that the driver is already en-route to their destination and the package delivery adds a small overhead to an already pre-planned trip. From a technical perspective, this problem introduces new technical challenges that are uncommon in the literature. The driver may want to optimise for distance or time. Furthermore, new packages arrive for delivery all the time and are assigned to various drivers continuously. This means that the algorithm has to work in an environment that is dynamic, thereby precluding most standard road network precomputation efforts. Furthermore, the number of packages that are available for delivery could be in the hundreds of thousands, which has to be quickly pruned down for the algorithm to scale. The paper proposes a variation called dual Dijkstra’s that combines a forward and a backward scan in order to find delivery options that satisfy the constraints specified by the driver. The new dual heuristic improves the standard single Dijkstra’s approach by narrowing down the search space, thus resulting in significant speed-ups over the standard algorithms. Furthermore, a combination of dual Dijkstra’s with a heuristic landmark approach results in a dramatic speed-up compared to the baseline algorithms. Experimental results show that the proposed approach can offer drivers a choice of packages to deliver under specified constraints of time or distance, and providing sub-second response time despite the complexity of the problem involved. As the number of packages in the system increases, the matchmaking process becomes easier resulting in faster response times. The scalability of the PDaaS infrastructure is demonstrated using extensive experimental results. View details
    Preview abstract We introduce logical synchrony, a framework that allows distributed computing to be coordinated as tightly as with pure synchrony without the distribution of a global clock or any reference to a universal time. We describe and prove the main properties of the framework and point to how processes can be executed on a logically synchronous system. View details
    CAPA: An Architecture For Operating Cluster Networks With High Availability
    Bingzhe Liu
    Mukarram Tariq
    Omid Alipourfard
    Rich Alimi
    Deepak Arulkannan
    Virginia Beauregard
    Patrick Conner
    Brighten Godfrey
    Xander Lin
    Mayur Patel
    Joon Ong
    Amr Sabaa
    Alex Smirnov
    Manish Verma
    Prerepa Viswanadham
    Google, Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043(2023)
    Preview abstract Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation“ layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have substantially reduced both in frequency and severity, with a 82% reduction in cumulative duration of incidents normalized to fleet size over five years View details
    Code Generation for Data-Dependent Stencils
    Mohammed Essadki
    Bertrand Michel
    Bruno Maugars
    Oleksandr Zinenko
    Nicolas Vasilache
    CGO, IEEE(2023)
    Preview abstract Numerical simulation often resorts to iterative in-place stencils such as the Gauss-Seidel or Successive Overrelaxation (SOR) methods. Writing high performance implementations of such stencils requires significant effort and time; it also involves non-local transformations beyond the stencil kernel itself. While automated code generation is a mature technology for image processing stencils, convolutions and out-of place iterative stencils (such as the Jacobi method), the optimization of in-place stencils requires manual craftsmanship. Building on recent advances in tensor compiler construction, we propose the first domain-specific code generator for iterative in-place stencils. Starting from a generic tensor compiler implemented in the MLIR framework, tensor abstractions are incrementally refined and lowered down to parallel, tiled, fused and vectorized code. We used our generator to implement a realistic, implicit solver for structured meshes, and demonstrate results competitive with an industrial computational fluid dynamics framework. We also compare with stand-alone stencil kernels for dense tensors. View details
    Constant RMR System-wide Failure Resilient Durable Locks with Dynamic Joining
    Anup Joshi
    Prasad Jayanti
    ACM Symposium on Parallelism in Algorithms and Architectures(2023)
    Preview abstract We design a Recoverable Mutual Exclusion (RME) algorithm for the system-wide failure model. Our algorithm requires only O(1) space per process, and achieves O(1) worst-case RMR complexity in both the CC and DSM models. Furthermore, in contrast to existing RME algorithms which can only support a pre-declared set of n threads with names from 1 to n, our algorithm can be accessed by arbitrarily many dynamically allocated threads of arbitrary names. View details
    Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
    Shashwat Silas
    Dave Clausen
    File and Storage Technologies (FAST), USENIX(2023)
    Preview abstract Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work we study wide LRCs; LRCs with large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures. We conduct a practically-minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations, and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers. View details
    Brief Announcement: Efficient Recoverable Writable-CAS
    Prasad Jayanti
    Sucharita Jayanti
    Submitting to Principles of Distributed Computing (PODC)(2023)
    Preview abstract We present DuraCAS, a durable, i.e., recoverably linearizable and detectable implementation of the CAS (compare-and-swap) primitive. DuraCAS is writable, meaning it supports a Write() operation along with CAS() and Read(); has constant time complexity per operation; allows for dynamic joining, meaning newly created processes (a.k.a. threads) of arbitrary names can join the protocol and access our implementation; and has adaptive space complexity, meaning the space use scales in the number of processes n that actually use the objects, as opposed to previous protocols whose space complexity depends on N, the maximum number of processes that the protocol is designed for. Furthermore, DuraCAS, requires only O(m + n) space to support m objects that get accessed by n processes, improving on the state-of-the-art O(m + N2). To our knowledge, DuraCAS is the first durable CAS algorithm that allows for dynamic joining, and is the first to exhibit adaptive space complexity. View details