Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 366 publications
Preview abstract
This talk addresses the challenges of operating Google's monitoring systems at scale, handling terabytes of telemetry data and preventing overload from diverse workloads. We'll explore how Google's internal client library and Monarch, its planet-scale time-series database, work together for cost-effective data collection. Key principles include a distributed push model, dynamic client-side data reduction, centralized retention, and periodic metric analysis. The session will then bridge these concepts to the open-source world, discussing our work with OpenTelemetry's OpAMP protocol to achieve similar scalable and efficient telemetry collection. Attendees will gain insights into adapting these principles for cost savings and learn about our collaboration with the OpAMP SIG to benefit the broader community.
View details
Fast ACS: Low-Latency File-Based Ordered Message Delivery at Scale
Anil Raghunath Iyer
Neel Bagora
Chang Yu
Olivier Pomerleau
Vivek Kumar
Prunthaban Kanthakumar
Usenix Annual Technical Conference (2025)
Preview abstract
Low-latency message delivery is crucial for real-time systems. Data originating from a producer must be delivered to consumers, potentially distributed in clusters across metropolitan and continental boundaries. With the growing scale of computing, there can be several thousand consumers of the data. Such systems require a robust messaging system capable of transmitting messages containing data across clusters and efficiently delivering them to consumers. The system must offer guarantees like ordering and at-least-once delivery while avoiding overload on consumers, allowing them to consume messages at their own pace.
This paper presents the design of Fast ACS (an abbreviation for Ads Copy Service), a file-based ordered message delivery system that leverages a combination of two-sided (inter-cluster) and one-sided (intra-cluster) communication primitives—namely, Remote Procedure Call and Remote Direct Memory Access, respectively—to deliver messages. The system has been successfully deployed to dozens of production clusters and scales to accommodate several thousand consumers within each cluster, which amounts to Tbps-scale intra-cluster consumer traffic at peak. Notably, Fast ACS delivers messages to consumers across the globe within a few seconds or even sub-seconds (p99) based on the message volume and consumer scale, at a low resource cost.
View details
Continuous Evaluation: Using CI Techniques For Experimentation At Scale
IEEE (2025), pp. 157-157
Preview abstract
Continuous Integration (CI) is an essential software development practice that establishes processes to minimize bugs and errors in production. In a similar vein, experimentation of software products is vital for evaluating user satisfaction, quality, performance and other key business metrics. Experimentation allows product owners to evaluate the user impact of changes. This can help make informed decisions regarding feature launches. Experimentation also allows developers to tweak internal processes and algorithms to maximize the impact of new features and changes. Additionally, it can sometimes detect errors not detected by CI.
Unlike CI systems, experimentation platforms are meant to closely imitate production and usually run the system under test (SUT) against a large scale of input. Despite this, experimentation platforms have a lot in common with CI systems. The mechanisms for continuously integrating and testing changes can be modified and applied to experimentation platforms.
Google Search's experimentation platform started as a command line tool many years ago. Over time, this tool has evolved into a platform that serves the evaluation needs for many of Google's products like Search, Assistant, YouTube, Play, Lens, etc., running thousands of large experiments every day.
In this workshop, we will present the evolution of Google Search's experimentation platform and how it was transformed from a simple CLI tool into a platform that works at scale, fulfills continuous experimentation needs and provides many CI-like functionalities to its users.
Note: This presentation was a part of CCIW workshop a ISCT 2025. Please download slides to see full presentation.
View details
The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
Pratik Fegade
Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2025), pp. 5-17
Preview abstract
A chief enabler of large-scale deep learning is the distribution of computation across multiple interconnected hardware accelerators. In order to unlock the maximum possible performance, a compiler must first select a reasonable strategy to parallelize a model's operations. Since neural network architectures admit multiple flavors of parallelism, determining the proper strategy for each instruction is a critical (albeit non-trivial) task. To solicit new ideas toward solving this challenging combinatorial optimization problem, we organized the ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning, a multi-month competition focused on advancing the state-of-the-art for model partitioning algorithms. In this paper, we offer a retrospective of this event, including the basic problem formulation, key challenges & opportunities, our new benchmark suite, and the quality of submissions received.
View details
Preview abstract
Optimizing large language model (LLM) training on dritibuted domain-specific accelerator systems presents significant challenges due to its complex optimization space and reliance on manual processes, resulting in slow development and underutilize resources. Existing optimization methods, however, rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain. To address this, we introduce the ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. It is a multi-agent system, featuring Coordinator, Analyzer, and Proposal agents, which integrates LLM reasoning with insights from performance profiling tools, analytical roofline models, and a knowledge base of best practices and successful past optimizations. Our proposed design can automate the diagnosis of performance bottlenecks and intelligently generates optimized sharding configurations with reasoning, and effectively improve the distributed LLM training efficiency. This approach promises to significantly reduce manual effort, shorten iteration cycles, and enhance accelerator utilization, offering a scalable and explanable methodology for AI-assisted performance engineering in large-scale machine learning.
View details
Federated Variational Inference: Towards Improved Personalization and Generalization
Elahe Vedadi
Josh Dillon
Philip Mansfield
Karan Singhal
Arash Afkanpour
Warren Morningstar
AAAI Federated Learning on the Edge Symposium (2024)
Preview abstract
Conventional federated learning algorithms train a single global model by leveraging all participating clients' data. However, due to heterogeneity in client generative distributions and predictive models, these approaches may not appropriately approximate the predictive process, converge to an optimal state, or generalize to new clients. We study personalization and generalization in stateless cross-device federated learning setups assuming heterogeneity in client data distributions and predictive models. We first propose a hierarchical generative model and formalize it using Bayesian Inference. We then approximate this process using Variational Inference to train our model efficiently. We call this algorithm Federated Variational Inference (FedVI). We use PAC-Bayes analysis to provide generalization bounds for FedVI. We evaluate our model on FEMNIST and CIFAR-100 image classification and show that FedVI beats the state-of-the-art on both tasks.
View details
How we use GenAI in SRE
CommitConf, Madrid (2024)
Preview abstract
Google services are powered by the largest network of computers in the world. Site Reliability Engineers (SRE) make sure that the whole stack is cool: datacenters are safe, well provisioned; we have fallback mechanisms, and data integrity; to make sure we design our stack properly, using the right storage, replication and software trade-offs.
Generative AI is a great tool to make us super-effective: having access to tools to generate our most toily configurations, to classify risks and events, to manage large swaths of machines with agents or to automate complex workflows cheaply.
This talk will cover the journey that SRE started years ago to become a truly AI-First discipline and the latest advancements in tooling, practices and workflows.
View details
Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples
Mangpo Phothilimthana
Soroush Ghodrati
Selene Moon
ASPLOS 2024, Association for Computing Machinery
Preview abstract
Representative modeling of I/O activity is crucial when designing large-scale distributed storage systems. Particularly important use cases are counterfactual “what-if” analyses that assess the impact of anticipated or hypothetical new storage policies or hardware prior to deployment. We propose Thesios, a methodology to accurately synthesize such hypothetical full-resolution I/O traces by carefully combining down-sampled I/O traces collected from multiple disks attached to multiple storage servers. Applying this approach to real-world traces that a real ready routinely sampled at Google, we show that our synthesized traces achieve 95–99.5% accuracy in read/write request numbers, 90–97% accuracy in utilization, and 80–99.8% accuracy in read latency compared to metrics collected from actual disks. We demonstrate how The-sios enables diverse counterfactual I/O trace synthesis and analyses of hypothetical policy, hardware, and server changes through four case studies: (1) studying the effects of changing disk’s utilization, fullness, and capacity, (2) evaluating new data placement policy, (3) analyzing the impact on power and performance of deploying disks with reduced rotations-per-minute (RPM), and (4) understanding the impact of increased buffer cache size on a storage server. Without Thesios, such counterfactual analyses would require costly and potentially risky A/B experiments in production.
View details
Preview abstract
We present Prequal (\emph{Probing to Reduce Queuing and Latency}), a load balancer
for distributed multi-tenant systems. Prequal aims to minimize
real-time request latency in the presence of heterogeneous server
capacities and non-uniform, time-varying antagonist load. It actively probes
server load to leverage the \emph{power of $d$ choices}
paradigm, extending it with asynchronous and reusable probes. Cutting
against received wisdom, Prequal does not balance CPU load, but instead
selects servers according to estimated latency and active requests-in-flight
(RIF). We explore its major design features on a testbed system
and evaluate it on YouTube, where it has been deployed for more than two years. Prequal has dramatically decreased tail latency, error rates, and resource use, enabling YouTube and
other production systems at Google to run at much higher utilization.
View details
BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse
Justin Levandoski
Garrett Casto
Mingge Deng
Rushabh Desai
Thibaud Hottelier
Amir Hormati
Anoop Johnson
Jeff Johnson
Dawid Kurzyniec
Prem Ramanathan
Gaurav Saxena
Vidya Shanmugam
Yuri Volobuev
SIGMOD (2024)
Preview abstract
BigQuery’s cloud-native disaggregated architecture has allowed Google Cloud to evolve the system to meet several customer needs across the analytics and AI/ML workload spectrum. A key customer requirement for BigQuery centers around the unification of data lake and enterprise data warehousing workloads. This approach combines: (1) the need for core data management primitives, e.g., security, governance, common runtime metadata, performance acceleration, ACID transactions, provided by an enterprise data warehouses coupled with (2) harnessing the flexibility of the open source format and analytics ecosystem along with new workload types such as AI/ML over unstructured data on object storage. In addition, there is a strong requirement to support BigQuery as a multi-cloud offering given cloud customers are opting for a multi-cloud footprint by default.
This paper describes BigLake, an evolution of BigQuery toward a multi-cloud lakehouse to address these customer requirements in novel ways. We describe three main innovations in this space. We first present BigLake tables, making open-source table formats (e.g., Apache Parquet, Iceberg) first class citizens, providing fine-grained governance enforcement and performance acceleration over these formats to BigQuery and other open-source analytics engines. Next, we cover the design and implementation of BigLake Object tables that allow BigQuery to integrate AI/ML for inferencing and processing over unstructured data. Finally, we present Omni, a platform for deploying BigQuery on non-GCP clouds, focusing on the infrastructure and operational innovations we made to provide an enterprise lakehouse product regardless of the cloud provider hosting the data.
View details
Preview abstract
Vortex is an exabyte scale structured storage system built for streaming and batch analytics. It supports high-throughput batch and stream ingestion. For the user, it supports both batch-oriented and stream-based processing on the ingested data.
View details
Distributed Tracing for InterPlanetary File System
Marshall David Miller
Rachel Han
Haorui Guo
2024 International Symposium on Parallel Computing and Distributed Systems (PCDS), IEEE, pp. 1-5
Preview abstract
The InterPlanetary File System (IPFS) is on its way to becoming the backbone of the next generation of the web. However, it suffers from several performance bottlenecks, particularly on the content retrieval path, which are often difficult to debug. This is because content retrieval involves multiple peers on the decentralized network and the issue could lie anywhere in the network. Traditional debugging tools are insufficient to help web developers who face the challenge of slow loading websites and detrimental user experience. This limits the adoption and future scalability of IPFS.
In this paper, we aim to gain valuable insights into how content retrieval requests propagate within the IPFS network as well as identify potential performance bottlenecks which could lead to opportunities for improvement. We propose a custom tracing framework that generates and manages traces for crucial events that take place on each peer during content retrieval. The framework leverages event semantics to build a timeline of each protocol involved in the retrieval, helping developers pinpoint problems. Additionally, it is resilient to malicious behaviors of the peers in the decentralized environment.
We have implemented this framework on top of an existing IPFS implementation written in Java called Nabu. Our evaluation shows that the framework can identify network delays and issues with each peer involved in content retrieval requests at a very low overhead.
View details
On-Chain Crypto-Secured Credential Verification On Permissioned Blockchain
2024 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS), IEEE, pp. 1-6
Preview abstract
Verifying credentials, such as educational degrees, professional licenses, and permits, is a crucial yet challenging task for organizations globally. Traditional verification methods often rely on third-party vendors, introducing vulnerabilities like bias, security breaches, and privacy risks. While blockchain technology offers a promising solution for credential management, existing approaches often store sensitive credential data off-chain in centralized databases or InterPlanetary File System (IPFS), leaving them susceptible to data breaches and loss.
This paper presents a novel, privacy-preserving credential verification system built on a permissioned blockchain network. This system, implemented using the Hyperledger Fabric framework, offers several key advantages over traditional methods, including enhanced security and improved privacy. By leveraging cryptographic techniques, the system ensures the robust and privacypreserving storage of credentials directly on the blockchain. This eliminates the reliance on vulnerable off-chain storage and mitigates associated risks. Furthermore, our analysis of a common credential dataset demonstrates the practical feasibility and cost-effectiveness of our solution, suggesting its widespread adoption. By addressing the limitations of both traditional and existing blockchain-based approaches, our system provides a robust, secure, and efficient solution for credential management in diverse sectors.
View details
Transparent Migration of Datastore to Firestore
Ed Davisson
Tilo Dickopp
David Gay
Eric Karasuda
Ram Kesavan
Vadim Yushprakh
2024
Preview abstract
Datastore was one of Google's first cloud databases, launched initially as part of App Engine, and built over Google's internal Megastore database system. Firestore was launched in 2019, both a re-implementation of Datastore over Google's Spanner database system and a new, mobile and web-friendly Firestore API. Spanner was chosen as the storage engine of Firestore in particular for technical reasons—it provides unrestricted transaction capabilities, strong consistency guarantees, and other improvements over Megastore.
To provide these improvements to all our customers, and simplify our overall system, a non-disruptive, zero-downtime migration was executed of all Datastore databases (stored in Megastore) to Firestore databases (stored in Spanner). This migration took a couple of years to design and plan, and about three to execute. This paper describes both the core engine for migrating databases, and various practical problems that were solved to make this journey successful. As of the writing of this paper, all (over one million) databases have been successfully migrated.
View details
Preview abstract
We discuss distributed reset control of bittide systems. In a bittide system, multiple processors communicate over a network. The processors remain in logical synchrony by controlling the timing of frame transmissions. The protocol for doing this relies upon an underlying dynamic control system, where each node makes only local observations and performs no direct coordination with other nodes. In this paper we develop a control algorithm based on the idea of reset control, which allows all nodes to maintain small buffer offsets while also requiring very little state information at each node. This offers the potential for simplified boot processes and failure handling.
View details