Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11309 publications
    Managing and Securing Google's Fleet of Multi-Node Servers
    Richard Hanley
    Havard Skinnemoen
    Andrés Lagar-Cavilla
    Michael Wong
    Jeff Andersen
    Kishan Prasad
    Patrick Leis
    Shiva Rao
    Chris Koch
    Jad Baydoun
    Anna Sapek
    Communications of the ACM, 69:3 (2026), pp. 82 - 92
    Preview abstract Server hardware and software co-design for a secure, efficient cloud. View details
    Preview abstract Voice activity detection (VAD) plays a vital role in enabling applications such as speech recognition. We analyze the impact of window size on the accuracy of three VAD algorithms: Silero, WebRTC, and Root Mean Square (RMS) across a set of diverse real-world digital audio streams. We additionally explore the use of hysteresis on top of each VAD output. Our results offer practical references for optimizing VAD systems. Silero significantly outperforms WebRTC and RMS, and hysteresis provides a benefit for WebRTC. View details
    Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
    Yu Jiang
    Hanwen Jiang
    Vincent Chu
    Brandon Y. Feng
    Zhangyang Wang
    Qixing Huang
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2026)
    Preview abstract With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach. View details
    Preview abstract Despite advances in high performance computing, accurate numerical simulations of global atmospheric dynamics remain a challenge. The resolution required to fully resolve the vast range scales as well as the strong coupling with—often not fully-understood—physics renders such simulations computationally infeasible over time horizons relevant for long-term climate risk assessment. While data-driven parameterizations have shown some promise of alleviating these obstacles, the scarcity of high-quality training data and their lack of long-term stability typically hinders their ability to capture the risk of rare extreme events. In this work we present a general strategy for training variational (probabilistic) neural network models to non-intrusively correct under-resolved long-time simulations of turbulent climate systems. The approach is based on the paradigm introduced by Barthel Sorensen et al. (2024, https://doi.org/10.1029/2023ms004122) which involves training a post-processing correction operator on under-resolved simulations nudged toward a high-fidelity reference. Our variational framework enables us to learn the dynamics of the underlying system from very little training data and thus drastically improve the extrapolation capabilities of the previous deterministic state-of-the art—even when the statistics of that training data are far from converged. We investigate and compare three recently introduced variational network architectures and illustrate the benefits of our approach on an anisotropic quasi-geostrophic flow. For this prototype model our approach is able to not only accurately capture global statistics, but also the anistropic regional variation and the statistics of multiple extreme event metrics—demonstrating significant improvement over previously introduced deterministic architectures. View details
    Preview abstract This study examines the psychological and ethical implications of generative-AI chatbot use among youth, introducing the CTRL framework (Cognitive Trust, Reliance, and Learning Diminution) to explain how repeated use fosters cognitive offloading and reduced verification behavior. Survey data from 420 participants analyzed through factor analysis and structural equation modeling reveal that higher trust predicts greater reliance and diminished critical evaluation, alongside elevated concerns around privacy and academic integrity. Findings highlight the need for AI literacy and responsible design to mitigate unintended cognitive impacts. View details
    Preview abstract Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL? In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy. We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data. We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL. Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL. In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL. View details
    From Correctness to Collaboration: A Human-Centered Taxonomy of AI Agent Behavior in Software Engineering
    Sherry Y. Shi
    Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA ’26), ACM, New York, NY, USA (2026)
    Preview abstract The ongoing transition of Large Language Models in software engineering from code generators into autonomous agents requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable agent behaviors, synthesized from 91 sets of user-defined rules for coding agents. We identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the User. These findings offer a concrete vocabulary for agent behavior, enabling researchers to move beyond correctness-only benchmarks and design evaluations that reflect the realities of professional software development in large enterprises. View details
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Victor May
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Preview abstract AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization. View details
    Preview abstract The rapid expansion of the Internet of Things (IoT) and smart home ecosystems has led to a fragmented landscape of user data management across consumer electronics (CE) such as Smart TVs, gaming consoles, and set-top boxes. Current onboarding processes on these devices are characterized by high friction due to manual data entry and opaque data-sharing practices. This paper introduces the User Data Sharing System (UDSS), a platform-agnostic framework designed to facilitate secure, privacy-first PII (Personally Identifiable Information) exchange between device platforms and third-party applications. Our system implements a Contextual Scope Enforcement (CSE) mechanism that programmatically restricts data exposure based on user intent—specifically distinguishing between Sign-In and Sign-Up workflows. Unlike cloud-anchored identity standards such as FIDO2/WebAuthn, UDSS is designed for shared, device-centric CE environments where persistent user-to-device bind-ing cannot be assumed. We further propose a tiered access model that balances developer needs with regulatory compliance (GDPR/CCPA). A proof-of-concept implementation on a reference ARMv8 Linux-based middleware demonstrates that UDSS reduces user onboarding latency by 65% and measurably reduces PII over-exposure risk through protocol-enforced data minimization. This framework provides a standardized approach to identity management in the heterogeneous CE market. View details
    Preview abstract This framework manages AI agents by establishing behavioral boundaries and a persistent identity. It uses a multi-layered stack, combining safety rules with brand guidelines, to shape an agent's reasoning. Features include authority decay to limit power if confidence drops and memory segmentation to prevent data tampering. Centralized oversight ensures these digital representatives remain aligned with company policies through continuous monitoring and testing. View details
    Compact Conformal Subgraphs
    Kamesh Munagala
    Aravindan Vijayaraghavan
    ICML (2026)
    Preview abstract Conformal prediction provides rigorous uncertainty guarantees for model outputs but can produce prohibitively large prediction sets in structured domains such as routing, planning, or sequential recommendation. We introduce \emph{graph-based conformal compression}, a framework for constructing compact subgraphs that preserve the statistical validity of conformal prediction while reducing structural complexity. We study a formulation that selects a smallest subgraph capturing a prescribed fraction of conditional probability mass, and reduce to a weighted version of densest $k$-subgraphs in hypergraphs, in the regime where the subgraph has a large fraction of edges. We design efficient approximation algorithms that achieve constant factor coverage and size trade-offs. Our results highlight an algorithmic regime, distinct from classical densest-$k$-subgraph hardness settings, where the problem can be approximated efficiently, bridging conformal prediction with combinatorial graph compression. We finally validate our algorithmic approach on synthetic and real-world instances of trip planning and navigation, showing in each case that our approach handily beats natural baselines. View details
    Preview abstract Source-to-source compilers may perform inefficiently by executing transpilation passes on scripts that do not contain the specific language features a pass is designed to transform, potentially leading to redundant processing. A compiler can analyze a script to generate a per-script feature map, for example, by identifying language features in its abstract syntax tree (AST). Before executing a transpilation pass, the compiler can check this map and may bypass the pass for that script if the specific feature targeted by the pass is not present. This feature map can also be dynamically updated throughout the compilation process as other passes transform the code. This method of conditional pass execution based on content-aware analysis may reduce redundant AST traversals, which could decrease overall compilation time and computational resource consumption. View details
    Preview abstract Optimizing large-language model (LLM) training and serving on large-sacle distributed systems with hundreds and thousands of accelerators is always a challenging task due to the fast evloving LLMs, strong domain expertise required, and various optimization goals from different worklaods. Existing methods rely on either handcrafted optimization performed by human experts, which is tedious and time-consuming or resource-intensive black-box searches, which lack the extensibility to keep pace with evolving models and hardware. To address this, we introduce PROMPTS, a novel multi-agent framework that complements traditional search methods with expert-informed reasoning. It automates the diagnosis of performance bottlenecks by synthesizing profiler data and leverages a knowledge base to propose optimized sharding configurations with detailed justifications. Across eight real-world production workloads, PROMPTS demonstrated remarkable efficiency and accuracy, delivering performance improvements of up to 434%. These workloads spanned diverse model architectures, hardware platforms, computational scales, and various stages of the machine learning lifecycle (pre-training, serving, and post-training). In every case, the configuration adopted by human engineers was identified within the agent's top three proposals from a single invocation. Furthermore, the agent's top-ranked recommendation was the one ultimately adopted in 87.5% of cases, showcasing its ability to not only find optimized solutions, but also to correctly prioritize them. Our work establishes PROMPTS as a scalable, extensible, and explainable methodology for AI-assisted performance engineering in large-scale ML systems. View details
    The Perfection Paradox: From Architect to Curator in AI-Assisted API Design
    JJ Geewax
    David R Karger
    Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26), ACM, Barcelona, Spain, TBD
    Preview abstract Enterprise API design is often bottlenecked by the tension between rapid feature delivery and the rigorous maintenance of usability standards. We present an industrial case study evaluating an AI-assisted design workflow trained on API Improvement Proposals(AIPs). Through a controlled study with 16 industry experts, we compared AI-generated API specifications against human-authored ones. While quantitative results indicated AI superiority in 10 of 11 usability dimensions and an 87% reduction in authoring time, qualitative analysis revealed a paradox: experts frequently misidentified AI work as human (19% accuracy) yet described the designs as unsettlingly “perfect.” We characterize this as a “Perfection Paradox”—where hyper-consistency signals a lack of pragmatic human judgment. We discuss the implications of this perfection paradox, proposing a shift in the human designer’s role from the “drafter” of specifications to the “curator” of AI-generated patterns. View details
    Preview abstract We introduce ALPS (Activation-based Length Prediction for Scheduling), a method for predicting LLM generation length from prefill activations before any tokens are generated. Unlike existing approaches that require model fine-tuning or complex entropy-weighted pooling, ALPS uses a simple linear probe on the last-token activation at intermediate layers. We discover that generation length is encoded in prefill representations: a ridge regression probe achieves R-squared > 0.85 across three model families. Validation across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B demonstrates: (1) intermediate layers generally perform well, with some architectural variation; (2) simple last-token extraction outperforms complex pooling strategies; (3) activations improve substantially over surface-feature baselines (24 percentage points over input length plus lexical features). The best models achieve R-squared = 0.943 (Gemma), R-squared = 0.880 (Llama), and R-squared = 0.857 (Qwen) with MAE of 38-80 tokens. All test prompts terminated naturally (100% EOS), eliminating truncation confounds. While our evaluation uses 200 curated prompts—sufficient for demonstrating the phenomenon but requiring broader validation—cross-validation confirms generalization beyond training data. ALPS enables practical applications including budget-constrained inference, request scheduling, and resource allocation. The probe adds negligible overhead (~16KB direction vector, single dot product), making ALPS practical for production deployment. View details
    ×