Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
 
        
1 - 15 of 364 publications

Fast ACS: Low-Latency File-Based Ordered Message Delivery at Scale
Anil Raghunath Iyer, Neel Bagora, Chang Yu, Olivier Pomerleau, Vivek Kumar, Prunthaban Kanthakumar
Usenix Annual Technical Conference (2025)

Abstract:
              Low-latency message delivery is crucial for real-time systems. Data originating from a producer must be delivered to consumers, potentially distributed in clusters across metropolitan and continental boundaries. With the growing scale of computing, there can be several thousand consumers of the data. Such systems require a robust messaging system capable of transmitting messages containing data across clusters and efficiently delivering them to consumers. The system must offer guarantees like ordering and at-least-once delivery while avoiding overload on consumers, allowing them to consume messages at their own pace.
This paper presents the design of Fast ACS (an abbreviation for Ads Copy Service), a file-based ordered message delivery system that leverages a combination of two-sided (inter-cluster) and one-sided (intra-cluster) communication primitives—namely, Remote Procedure Call and Remote Direct Memory Access, respectively—to deliver messages. The system has been successfully deployed to dozens of production clusters and scales to accommodate several thousand consumers within each cluster, which amounts to Tbps-scale intra-cluster consumer traffic at peak. Notably, Fast ACS delivers messages to consumers across the globe within a few seconds or even sub-seconds (p99) based on the message volume and consumer scale, at a low resource cost.
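
To make the ordering and at-least-once guarantees concrete, here is a minimal consumer-side sketch (not Fast ACS's actual implementation, whose transport uses files, RPC, and RDMA): messages carry sequence numbers, duplicates from retries are dropped, and out-of-order arrivals are buffered until the gap fills.

```python
# Hedged sketch of ordered, at-least-once consumption. The message format
# and callback interface are invented for illustration.

class OrderedConsumer:
    def __init__(self):
        self.next_seq = 0
        self.pending = {}   # out-of-order messages awaiting their turn

    def on_message(self, seq, payload):
        """Called for every delivery attempt; may see duplicates."""
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate from an at-least-once retry: ignore
        self.pending[seq] = payload
        while self.next_seq in self.pending:        # deliver in order
            self.process(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def process(self, payload):
        print("consumed:", payload)

c = OrderedConsumer()
for seq, msg in [(1, "b"), (0, "a"), (0, "a"), (2, "c")]:  # reordered + dup
    c.on_message(seq, msg)
```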
              
  
Continuous Evaluation: Using CI Techniques For Experimentation At Scale
IEEE, pp. 157-157

Abstract:
              Continuous Integration (CI) is an essential software development practice that establishes processes to minimize bugs and errors in production. In a similar vein, experimentation of software products is vital for evaluating user satisfaction, quality, performance and other key business metrics. Experimentation allows product owners to evaluate the user impact of changes. This can help make informed decisions regarding feature launches. Experimentation also allows developers to tweak internal processes and algorithms to maximize the impact of new features and changes. Additionally, it can sometimes detect errors not detected by CI.
Unlike CI systems, experimentation platforms are meant to closely imitate production and usually run the system under test (SUT) against a large scale of input. Despite this, experimentation platforms have a lot in common with CI systems. The mechanisms for continuously integrating and testing changes can be modified and applied to experimentation platforms.
Google Search's experimentation platform started as a command line tool many years ago. Over time, this tool has evolved into a platform that serves the evaluation needs of many of Google's products, such as Search, Assistant, YouTube, Play, and Lens, running thousands of large experiments every day.
In this workshop, we will present the evolution of Google Search's experimentation platform and how it was transformed from a simple CLI tool into a platform that works at scale, fulfills continuous experimentation needs and provides many CI-like functionalities to its users.
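
As an illustration of the CI-like gating described above, here is a hedged sketch of a continuous-evaluation check that compares a system under test against a baseline over the same query sample; the metric, threshold, and function names are invented for illustration.

```python
# Gate on metric regressions the way a CI presubmit gates on test failures.
from statistics import mean

def evaluate(run_query, queries):
    return {"latency_ms": mean(run_query(q)["latency_ms"] for q in queries)}

def continuous_eval(baseline, candidate, queries, max_regression=0.02):
    base = evaluate(baseline, queries)
    cand = evaluate(candidate, queries)
    verdicts = {}
    for metric, b in base.items():
        regression = (cand[metric] - b) / b   # relative change vs. baseline
        verdicts[metric] = regression <= max_regression
    return all(verdicts.values()), verdicts

# Toy SUTs: the candidate is 1% slower than baseline, within threshold.
baseline = lambda q: {"latency_ms": 100.0}
candidate = lambda q: {"latency_ms": 101.0}
ok, verdicts = continuous_eval(baseline, candidate, ["q1", "q2", "q3"])
print("PASS" if ok else "FAIL", verdicts)
```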
              
  
The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
Pratik Fegade
Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2025), pp. 5-17

Abstract:
              A chief enabler of large-scale deep learning is the distribution of computation across multiple interconnected hardware accelerators. In order to unlock the maximum possible performance, a compiler must first select a reasonable strategy to parallelize a model's operations. Since neural network architectures admit multiple flavors of parallelism, determining the proper strategy for each instruction is a critical (albeit non-trivial) task. To solicit new ideas toward solving this challenging combinatorial optimization problem, we organized the ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning, a multi-month competition focused on advancing the state-of-the-art for model partitioning algorithms. In this paper, we offer a retrospective of this event, including the basic problem formulation, key challenges & opportunities, our new benchmark suite, and the quality of submissions received.
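
To give a flavor of the underlying combinatorial problem, the toy dynamic program below assigns one parallelism strategy per operator in a chain so as to minimize compute plus resharding cost; the strategies and cost numbers are invented, and real contest instances are full graphs with far richer cost models.

```python
# Toy strategy assignment over a chain of operators.
STRATEGIES = ["data_parallel", "model_parallel", "replicated"]

def compute_cost(op, strat):          # invented per-strategy cost model
    return {"data_parallel": 1.0, "model_parallel": 1.2, "replicated": 3.0}[strat]

def reshard_cost(s_prev, s_next):     # cost of changing layouts between ops
    return 0.0 if s_prev == s_next else 0.8

def best_assignment(ops):
    # dp[s] = (total cost, assignment) for the prefix ending with strategy s
    dp = {s: (compute_cost(ops[0], s), [s]) for s in STRATEGIES}
    for op in ops[1:]:
        dp = {
            s: min(
                ((c + reshard_cost(p, s) + compute_cost(op, s), path + [s])
                 for p, (c, path) in dp.items()),
                key=lambda t: t[0],
            )
            for s in STRATEGIES
        }
    return min(dp.values(), key=lambda t: t[0])

cost, plan = best_assignment(["matmul", "attention", "matmul", "softmax"])
print(cost, plan)
```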
              
  
Federated Variational Inference: Towards Improved Personalization and Generalization
Elahe Vedadi, Josh Dillon, Philip Mansfield, Karan Singhal, Arash Afkanpour, Warren Morningstar
AAAI Federated Learning on the Edge Symposium (2024)

Abstract:
              Conventional federated learning algorithms train a single global model by leveraging all participating clients' data. However, due to heterogeneity in client generative distributions and predictive models, these approaches may not appropriately approximate the predictive process, converge to an optimal state, or generalize to new clients. We study personalization and generalization in stateless cross-device federated learning setups assuming heterogeneity in client data distributions and predictive models. We first propose a hierarchical generative model and formalize it using Bayesian Inference. We then approximate this process using Variational Inference to train our model efficiently. We call this algorithm Federated Variational Inference (FedVI). We use PAC-Bayes analysis to provide generalization bounds for FedVI. We evaluate our model on FEMNIST and CIFAR-100 image classification and show that FedVI beats the state-of-the-art on both tasks.
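
As a rough illustration of the hierarchical "global prior, client posterior" structure that FedVI builds on (this is not the FedVI algorithm itself), the sketch below runs federated rounds of conjugate Gaussian posterior updates on heterogeneous clients and aggregates them on a server.

```python
import numpy as np

# Toy federated Bayesian estimation of a shared global mean. Each client
# holds data from a client-specific distribution; the server maintains a
# Gaussian posterior over the global parameter and refines it from client
# summaries. Constants and the aggregation rule are invented.

rng = np.random.default_rng(0)
true_global_mean = 2.0
clients = [rng.normal(true_global_mean + rng.normal(0, 0.3), 1.0, size=50)
           for _ in range(10)]   # heterogeneous client data

mu, var = 0.0, 10.0   # server prior over the global mean: N(mu, var)
noise_var = 1.0       # assumed known observation noise

for round_ in range(5):
    # Each client computes a local Gaussian posterior given the server
    # prior (conjugate update) and sends back its mean and variance.
    client_posteriors = []
    for x in clients:
        prec = 1.0 / var + len(x) / noise_var
        post_var = 1.0 / prec
        post_mu = post_var * (mu / var + x.sum() / noise_var)
        client_posteriors.append((post_mu, post_var))
    # Server aggregates client posteriors (precision-weighted average),
    # playing the role of the update on the global variable.
    precs = np.array([1.0 / v for _, v in client_posteriors])
    mus = np.array([m for m, _ in client_posteriors])
    var = 1.0 / precs.mean()
    mu = (precs * mus).sum() / precs.sum()
    print(f"round {round_}: global mean estimate {mu:.3f}")
```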
              
  
How we use GenAI in SRE
CommitConf, Madrid (2024)

Abstract:
Google services are powered by the largest network of computers in the world. Site Reliability Engineers (SRE) make sure that the whole stack is healthy: datacenters are safe and well provisioned, fallback mechanisms and data integrity are in place, and the stack is designed properly, with the right storage, replication, and software trade-offs.
Generative AI is a great tool for making us super-effective: it gives us tools to generate our most toily configurations, classify risks and events, manage large swaths of machines with agents, and automate complex workflows cheaply.
This talk will cover the journey that SRE started years ago to become a truly AI-First discipline, along with the latest advancements in tooling, practices, and workflows.
              
  
Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples
Mangpo Phothilimthana, Soroush Ghodrati, Selene Moon
ASPLOS 2024, Association for Computing Machinery

Abstract:
Representative modeling of I/O activity is crucial when designing large-scale distributed storage systems. Particularly important use cases are counterfactual "what-if" analyses that assess the impact of anticipated or hypothetical new storage policies or hardware prior to deployment. We propose Thesios, a methodology to accurately synthesize such hypothetical full-resolution I/O traces by carefully combining down-sampled I/O traces collected from multiple disks attached to multiple storage servers. Applying this approach to real-world traces that are already routinely sampled at Google, we show that our synthesized traces achieve 95-99.5% accuracy in read/write request numbers, 90-97% accuracy in utilization, and 80-99.8% accuracy in read latency compared to metrics collected from actual disks. We demonstrate how Thesios enables diverse counterfactual I/O trace synthesis and analyses of hypothetical policy, hardware, and server changes through four case studies: (1) studying the effects of changing a disk's utilization, fullness, and capacity, (2) evaluating a new data placement policy, (3) analyzing the impact on power and performance of deploying disks with reduced rotations-per-minute (RPM), and (4) understanding the impact of increased buffer cache size on a storage server. Without Thesios, such counterfactual analyses would require costly and potentially risky A/B experiments in production.
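
A hedged sketch of the core idea of synthesizing a full-resolution trace from down-sampled ones follows; the event format, sampling model, and thinning knob are illustrative assumptions, not Thesios' actual methodology.

```python
import heapq, random

# Combine down-sampled I/O events from many real disks into one synthetic
# trace for a hypothetical disk. If each of k source disks keeps a 1/k
# sample of its I/Os, merging the k samples yields an approximately
# full-resolution trace for one disk of comparable load.

def synthesize(sampled_traces, keep=1.0):
    """Merge time-ordered sampled events; optionally thin to a target load."""
    merged = heapq.merge(*sampled_traces, key=lambda ev: ev["ts"])
    return [ev for ev in merged if random.random() < keep]

# Toy 1/3-sampled traces from three disks (timestamp, operation, size).
traces = [
    [{"ts": t, "op": "read", "bytes": 4096} for t in range(0, 300, 3)],
    [{"ts": t, "op": "write", "bytes": 8192} for t in range(1, 300, 3)],
    [{"ts": t, "op": "read", "bytes": 65536} for t in range(2, 300, 3)],
]
full = synthesize(traces)
print(len(full), "events;", full[:2])
```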
              
  
Abstract:
We present Prequal (Probing to Reduce Queuing and Latency), a load balancer for distributed multi-tenant systems. Prequal aims to minimize real-time request latency in the presence of heterogeneous server capacities and non-uniform, time-varying antagonist load. It actively probes server load to leverage the power of d choices paradigm, extending it with asynchronous and reusable probes. Cutting against received wisdom, Prequal does not balance CPU load, but instead selects servers according to estimated latency and active requests-in-flight (RIF). We explore its major design features on a testbed system and evaluate it on YouTube, where it has been deployed for more than two years. Prequal has dramatically decreased tail latency, error rates, and resource use, enabling YouTube and other production systems at Google to run at much higher utilization.
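
The sketch below illustrates probe-based replica selection in the spirit of the power-of-d-choices extension described above; the hot/cold cutoff and scoring rule are assumptions for illustration rather than Prequal's exact algorithm.

```python
import random

# Probe a few replicas, then pick by requests-in-flight (RIF) and
# estimated latency rather than CPU load.

class Replica:
    def __init__(self, name):
        self.name = name
        self.rif = random.randint(0, 20)             # requests in flight
        self.est_latency_ms = random.uniform(1, 50)  # recent latency estimate

    def probe(self):
        # In a real system this is an RPC; probes can be issued
        # asynchronously and their results reused across queries.
        return (self.rif, self.est_latency_ms)

def select(replicas, d=3, rif_quantile=0.7):
    candidates = random.sample(replicas, d)
    probes = [(r, *r.probe()) for r in candidates]
    # Treat replicas with RIF above a quantile as "hot"; avoid them.
    rifs = sorted(p[1] for p in probes)
    cutoff = rifs[int(rif_quantile * (len(rifs) - 1))]
    cold = [p for p in probes if p[1] <= cutoff]
    # Among cold replicas, pick the lowest estimated latency.
    return min(cold, key=lambda p: p[2])[0]

pool = [Replica(f"server-{i}") for i in range(10)]
print("chosen:", select(pool).name)
```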
              
  
BigLake: BigQuery's Evolution toward a Multi-Cloud Lakehouse
Justin Levandoski, Garrett Casto, Mingge Deng, Rushabh Desai, Thibaud Hottelier, Amir Hormati, Anoop Johnson, Jeff Johnson, Dawid Kurzyniec, Prem Ramanathan, Gaurav Saxena, Vidya Shanmugam, Yuri Volobuev
SIGMOD (2024)

Abstract:
BigQuery's cloud-native disaggregated architecture has allowed Google Cloud to evolve the system to meet several customer needs across the analytics and AI/ML workload spectrum. A key customer requirement for BigQuery centers around the unification of data lake and enterprise data warehousing workloads. This approach combines: (1) the need for core data management primitives, e.g., security, governance, common runtime metadata, performance acceleration, and ACID transactions, provided by an enterprise data warehouse, with (2) the flexibility of the open-source format and analytics ecosystem, along with new workload types such as AI/ML over unstructured data on object storage. In addition, there is a strong requirement to support BigQuery as a multi-cloud offering, given that cloud customers are opting for a multi-cloud footprint by default.
This paper describes BigLake, an evolution of BigQuery toward a multi-cloud lakehouse to address these customer requirements in novel ways. We describe three main innovations in this space. We first present BigLake tables, making open-source table formats (e.g., Apache Parquet, Iceberg) first class citizens, providing fine-grained governance enforcement and performance acceleration over these formats to BigQuery and other open-source analytics engines. Next, we cover the design and implementation of BigLake Object tables that allow BigQuery to integrate AI/ML for inferencing and processing over unstructured data. Finally, we present Omni, a platform for deploying BigQuery on non-GCP clouds, focusing on the infrastructure and operational innovations we made to provide an enterprise lakehouse product regardless of the cloud provider hosting the data.
              
  
Abstract:
              Vortex is an exabyte scale structured storage system built for streaming and batch analytics. It supports high-throughput batch and stream ingestion. For the user, it supports both batch-oriented and stream-based processing on the ingested data.
              
  
Distributed Tracing for InterPlanetary File System
Marshall David Miller, Rachel Han, Haorui Guo
2024 International Symposium on Parallel Computing and Distributed Systems (PCDS), IEEE, pp. 1-5

Abstract:
The InterPlanetary File System (IPFS) is on its way to becoming the backbone of the next generation of the web. However, it suffers from several performance bottlenecks, particularly on the content retrieval path, which are often difficult to debug because content retrieval involves multiple peers on the decentralized network and the issue could lie anywhere in the network. Traditional debugging tools are insufficient to help web developers who face the challenge of slow-loading websites and a detrimental user experience. This limits the adoption and future scalability of IPFS.
In this paper, we aim to gain valuable insights into how content retrieval requests propagate within the IPFS network and to identify potential performance bottlenecks that could lead to opportunities for improvement. We propose a custom tracing framework that generates and manages traces for crucial events that take place on each peer during content retrieval. The framework leverages event semantics to build a timeline of each protocol involved in the retrieval, helping developers pinpoint problems. Additionally, it is resilient to malicious behaviors of the peers in the decentralized environment.
We have implemented this framework on top of Nabu, an existing IPFS implementation written in Java. Our evaluation shows that the framework can identify network delays and issues with each peer involved in content retrieval requests at very low overhead.
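
To illustrate what such a framework's output might look like, the sketch below assembles a per-protocol timeline from trace events emitted by multiple peers; the event fields and protocol names are illustrative, not the framework's actual schema.

```python
from dataclasses import dataclass

# Assemble a cross-peer timeline for one retrieval. The real framework
# derives ordering from event semantics and guards against peers
# reporting bogus events; this sketch just sorts by local timestamps.

@dataclass
class TraceEvent:
    trace_id: str
    peer: str
    protocol: str   # e.g. "bitswap", "dht"
    name: str       # e.g. "want-block sent", "block received"
    ts_ms: float    # local timestamp; clocks may be skewed across peers

def build_timeline(events, trace_id):
    """Group one retrieval's events per protocol and order them."""
    relevant = [e for e in events if e.trace_id == trace_id]
    timeline = {}
    for e in sorted(relevant, key=lambda e: e.ts_ms):
        timeline.setdefault(e.protocol, []).append(e)
    return timeline

events = [
    TraceEvent("t1", "peerA", "dht", "provider lookup start", 0.0),
    TraceEvent("t1", "peerA", "dht", "provider found", 42.0),
    TraceEvent("t1", "peerA", "bitswap", "want-block sent", 43.0),
    TraceEvent("t1", "peerB", "bitswap", "block sent", 95.0),
    TraceEvent("t1", "peerA", "bitswap", "block received", 120.0),
]
for proto, evs in build_timeline(events, "t1").items():
    gap = evs[-1].ts_ms - evs[0].ts_ms
    print(f"{proto}: {len(evs)} events spanning {gap:.0f} ms")
```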
              
  
On-Chain Crypto-Secured Credential Verification On Permissioned Blockchain
2024 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS), IEEE, pp. 1-6

Abstract:
              Verifying credentials, such as educational degrees, professional licenses, and permits, is a crucial yet challenging task for organizations globally. Traditional verification methods often rely on third-party vendors, introducing vulnerabilities like bias, security breaches, and privacy risks. While blockchain technology offers a promising solution for credential management, existing approaches often store sensitive credential data off-chain in centralized databases or InterPlanetary File System (IPFS), leaving them susceptible to data breaches and loss.
This paper presents a novel, privacy-preserving credential verification system built on a permissioned blockchain network. This system, implemented using the Hyperledger Fabric framework, offers several key advantages over traditional methods, including enhanced security and improved privacy. By leveraging cryptographic techniques, the system ensures the robust and privacy-preserving storage of credentials directly on the blockchain. This eliminates the reliance on vulnerable off-chain storage and mitigates associated risks. Furthermore, our analysis of a common credential dataset demonstrates the practical feasibility and cost-effectiveness of our solution, suggesting its suitability for widespread adoption. By addressing the limitations of both traditional and existing blockchain-based approaches, our system provides a robust, secure, and efficient solution for credential management in diverse sectors.
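
One standard pattern for keeping credentials verifiable on-chain without exposing them is a salted hash commitment; the sketch below illustrates only that commit/verify pattern, and the paper's actual construction on Hyperledger Fabric is richer.

```python
import hashlib, os

# Store only a salted commitment on the ledger; reveal salt + credential
# off-band to a verifier. Illustrative pattern, not the paper's scheme.

def commit(credential: bytes, salt: bytes | None = None):
    """Return (salt, commitment) for a credential."""
    salt = salt or os.urandom(16)
    commitment = hashlib.sha256(salt + credential).hexdigest()
    return salt, commitment

def verify(credential: bytes, salt: bytes, on_chain_commitment: str) -> bool:
    """Check a presented credential against the stored commitment."""
    return hashlib.sha256(salt + credential).hexdigest() == on_chain_commitment

cred = b"B.Sc. Computer Science, 2020, University X"
salt, stored = commit(cred)          # 'stored' is what the ledger would hold
print(verify(cred, salt, stored))                # True
print(verify(b"forged degree", salt, stored))    # False
```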
              
  
Transparent Migration of Datastore to Firestore
Ed Davisson, Tilo Dickopp, David Gay, Eric Karasuda, Ram Kesavan, Vadim Yushprakh
2024

Abstract:
Datastore was one of Google's first cloud databases, launched initially as part of App Engine and built over Google's internal Megastore database system. Firestore, launched in 2019, is both a re-implementation of Datastore over Google's Spanner database system and a new, mobile- and web-friendly Firestore API. Spanner was chosen as Firestore's storage engine for technical reasons: it provides unrestricted transaction capabilities, strong consistency guarantees, and other improvements over Megastore.
To provide these improvements to all our customers, and to simplify our overall system, a non-disruptive, zero-downtime migration of all Datastore databases (stored in Megastore) to Firestore databases (stored in Spanner) was executed. This migration took a couple of years to design and plan, and about three to execute. This paper describes both the core engine for migrating databases and the various practical problems that were solved to make this journey successful. As of the writing of this paper, all (over one million) databases have been successfully migrated.
              
  
Abstract:
              We discuss distributed reset control of bittide systems. In a bittide system, multiple processors communicate over a network. The processors remain in logical synchrony by controlling the timing of frame transmissions. The protocol for doing this relies upon an underlying dynamic control system, where each node makes only local observations and performs no direct coordination with other nodes.  In this paper we develop a control algorithm based on the idea of reset control, which allows all nodes to maintain small buffer offsets while also requiring very little state information at each node. This offers the potential for simplified boot processes and failure handling.
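
As a toy illustration of reset control at a single node (the constants and dynamics are invented, and the paper's model and analysis are far more careful), the sketch below trims a node's frequency from its observed buffer offset and resets the controller's integrator state when the offset crosses zero.

```python
# Clegg-style reset of an integrating controller that keeps a bittide-like
# buffer offset small using only local observations.

KP, KI = 0.05, 0.01

def simulate(steps=200):
    offset = 8.0      # buffer occupancy relative to its midpoint
    integ = 0.0       # controller state (the part that gets reset)
    for t in range(steps):
        prev = offset
        integ += KI * offset
        freq_trim = KP * offset + integ
        # Frames drain faster when the node speeds up: offset shrinks.
        offset -= freq_trim
        if prev * offset < 0:   # offset crossed zero: reset the integrator
            integ = 0.0
        if t % 40 == 0:
            print(f"t={t:3d} offset={offset:+.3f} trim={freq_trim:+.3f}")

simulate()
```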
              
  
Progressive Partitioning for Parallelized Query Execution in Google's Napa
Junichi Tatemura, Yanlai Huang, Jim Chen, Yupu Zhang, Kevin Lai, Divyakant Agrawal, Brad Adelberg, Shilpa Kolhar, Indrajit Roy
49th International Conference on Very Large Data Bases, VLDB (2023), pp. 3475-3487

Abstract:
Napa powers Google's critical data warehouse needs. It utilizes a Log-Structured Merge Tree (LSM) for real-time data ingestion and achieves sub-second query latency for billions of queries per day. Napa handles a wide variety of query workloads, from full-table scans to range scans and multi-key lookups. Our design challenge is to handle this diverse query workload running concurrently. In particular, a large percentage of our query volume consists of external reporting queries characterized by multi-key lookups with strict sub-second query latency targets.
Query parallelization, achieved by processing a query in parallel over partitions of the input data (i.e., the SIMD model of computation), is an important technique for meeting these low latency targets. Traditionally, the effectiveness of parallelizing a query is highly dependent on its alignment with the data partitioning established at write time. Unfortunately, such a write-time partitioning scheme cannot handle the highly variable parallelization requirements that arise on a per-query basis.
The key to Napa's success is its ability to adapt its query parallelization on a per-query basis. This paper describes an index-based approach to data partitioning for queries with sub-second latency requirements. Napa's approach is progressive in that it can provide good partitioning within the time budgeted for partitioning. Since the end-to-end query time also includes the time spent partitioning, there is a tradeoff between the time spent partitioning and the resulting evenness of the partitions. Our approach balances these opposing considerations to provide sub-second querying for billions of queries each day. We use production data to establish the effectiveness of Napa's approach across workloads ranging from easy to handle to the most pathological.
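
The sketch below illustrates the budget/evenness tradeoff with a time-budgeted progressive range partitioner over a sorted key index; it is a simplification for illustration, not Napa's index-based algorithm.

```python
import time

# Start with coarse partitions and keep refining the largest one until the
# time budget expires; more budget yields more, and more even, partitions.

def progressive_partition(sorted_keys, n_parts, budget_s=0.001):
    deadline = time.monotonic() + budget_s
    # One partition covering everything: list of (lo, hi) index ranges.
    parts = [(0, len(sorted_keys))]
    while len(parts) < n_parts and time.monotonic() < deadline:
        # Split the currently largest partition at its midpoint.
        parts.sort(key=lambda p: p[1] - p[0], reverse=True)
        lo, hi = parts.pop(0)
        mid = (lo + hi) // 2
        parts += [(lo, mid), (mid, hi)]
    return sorted(parts)

keys = sorted(range(0, 100000, 7))
parts = progressive_partition(keys, n_parts=16)
sizes = [hi - lo for lo, hi in parts]
print(len(parts), "partitions; min/max size:", min(sizes), max(sizes))
```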
              
  
Abstract:
In the new "gig" economy, a user plays the role of a consumer as well as a service provider. As a service provider, drivers travelling from a source to a destination may opportunistically pick up and drop off packages along the way if doing so does not add significantly to their trip distance or time. This gives rise to a new business offering called Package Delivery as a Service (PDaaS) that brokers package pickups and deliveries at one end and connects them to drivers at the other end, thus creating an ecosystem of supply and demand. The dramatic cost savings of such a service model come from the fact that the driver is already en route to their destination and the package delivery adds a small overhead to an already pre-planned trip. From a technical perspective, this problem introduces new challenges that are uncommon in the literature. The driver may want to optimise for distance or time. Furthermore, new packages arrive for delivery all the time and are assigned to various drivers continuously, which means the algorithm has to work in a dynamic environment, precluding most standard road-network precomputation efforts. Furthermore, the number of packages available for delivery could be in the hundreds of thousands, which has to be quickly pruned down for the algorithm to scale. The paper proposes a variation called dual Dijkstra's that combines a forward and a backward scan in order to find delivery options that satisfy the constraints specified by the driver. The new dual heuristic improves on the standard single Dijkstra's approach by narrowing down the search space, resulting in significant speed-ups over the standard algorithms. Furthermore, combining dual Dijkstra's with a heuristic landmark approach results in a dramatic speed-up compared to the baseline algorithms. Experimental results show that the proposed approach can offer drivers a choice of packages to deliver under specified constraints of time or distance, while providing sub-second response times despite the complexity of the problem. As the number of packages in the system increases, the matchmaking process becomes easier, resulting in faster response times. The scalability of the PDaaS infrastructure is demonstrated using extensive experimental results.
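
A minimal sketch of the forward-plus-backward scan idea follows: one Dijkstra pass from the driver's source and one over reversed edges from the destination let every package's detour be scored without a per-package shortest-path search. The graph, packages, and detour budget are invented, and the paper's landmark heuristics and pruning are omitted.

```python
import heapq

# For a package with pickup p, drop-off q and known leg length d(p, q),
# the detour is d(s, p) + d(p, q) + d(q, t) - d(s, t), where d(s, .) comes
# from a forward scan and d(., t) from a backward scan on reversed edges.

def dijkstra(graph, src):
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def reverse(graph):
    rev = {}
    for u, edges in graph.items():
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))
    return rev

graph = {"s": [("a", 2), ("b", 5)], "a": [("b", 1), ("t", 6)],
         "b": [("t", 2)], "t": []}
fwd = dijkstra(graph, "s")            # d(s, x) for all x
bwd = dijkstra(reverse(graph), "t")   # d(x, t) for all x
base = fwd["t"]                       # driver's direct trip length

packages = [("a", "b", 1.0), ("b", "a", 4.0)]  # (pickup, drop, leg length)
budget = 3.0
for p, q, leg in packages:
    detour = fwd[p] + leg + bwd[q] - base
    print(p, "->", q, "detour", detour, "ok" if detour <= budget else "skip")
```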
              
  