Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11321 publications
    Neural general circulation models for modeling precipitation
    Stephan Hoyer
    Dmitrii Kochkov
    Janni Yuval
    Ian Langmore
    Science Advances (2026)
    Preview abstract Climate models struggle to accurately simulate precipitation, particularly extremes and the diurnal cycle. While hybrid models combining machine learning and physics have emerged with the premise of improving precipitation simulations, none have proven sufficiently skillful or stable enough to outperform existing models in simulating precipitation. Here, we present the first hybrid model that is trained directly on precipitation observations. The model runs at 2.8 degrees resolution and is built on the differentiable NeuralGCM framework. This model is stable for decadal simulations and demonstrates significant improvements over existing GCMs, ERA5 reanalysis, and a Global Cloud-Resolving Model in simulating precipitation. Our approach yields reduced biases, a more realistic precipitation distribution, improved representation of extremes, and a more accurate diurnal cycle. Furthermore, it outperforms the ECMWF ensemble for mid-range weather forecasting. This advance paves the way for more reliable simulations of current climate and for the ability to fully utilize the abundance of existing observations to further improve GCMs. View details
    Preview abstract While non-verbal behaviors and expressive movements are essential for natural human-robot interaction, existing methods often overlook a crucial element: the human’s internal cognitive state. Consequently, proactive multi-agent systems frequently interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating “cognitively aligned” multi-agent interactions, enhancing the ability of robotic systems to contextually defer communications during moments of high human mental workload. We present the design and implementation of a closed-loop architecture that explores the interplay between autonomous task execution and real-time neurophysiological focus. Utilizing a consumer-grade Brain-Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs a cognitive-load-inducing task. We propose a workload-driven pipeline where an HTTP-based signaling mechanism places a primary agent’s sensory inputs and audio outputs into a holding state upon detecting high cognitive load. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human’s cognitive state returns to a baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real-time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create interrupt-aware, non-intrusive multi-agent systems. View details
    Preview abstract Object-Counting for remote-sensing (RS) imagery is raising increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - an adaptation of existing work for Open Vocabulary Counting (OVC) approach from general computer vision to the RS domain. We show that our model is capable of accurate counting of novel object classes, that are unseen during training, based solely on textual and/or visual conditioning. View details
    Improved Differentially Private Algorithms for Rank Aggregation
    Phanu Vajanopath
    Quentin Hillebrand
    Vorapong Suppakitpaisarn
    AAAI (2026)
    Preview abstract Rank aggregation is a task of combining the rankings of items from multiple users into a single ranking that best represents the users' rankings. Alabi et al. (AAAI'22) presents differentially-private (DP) polynomial-time approximation schemes (PTASes) and 5-approximation algorithms with certain additive errors for the Kemeny rank aggregation problem in both central and local models. In this paper, we present improved DP PTASes with smaller additive error in the central model. Furthermore, we are first to study the footrule rank aggregation problem under DP. We give a near-optimal algorithm for this problem; as a corollary, this leads to 2-approximation algorithms with the same additive error as the 5-approximation algorithms of Alabi et al. for the Kemeny rank aggregation problem in both central and local models. View details
    ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding
    Sunny Rajagopalan
    Alireza Golestaneh
    Shubhra Chandra
    Min Zhou
    Jonathan Vronsky
    Songbai Yan
    2026
    Preview abstract We present ALF (Advertiser Large Foundation model), a multi-modal transformer architecture for understanding advertiser behavior and intent across text, image, video and structured data modalities. Through contrastive learning and multi-task optimization, ALF creates unified advertiser representations that capture both content and behavioral patterns. Our model achieves state-of-the-art performance on critical tasks including fraud detection, policy violation identification, and advertiser similarity matching. In production deployment, ALF reduces false positives by 90\% while maintaining 99.8\% precision on abuse detection tasks. The architecture's effectiveness stems from its novel combination of multi-modal transformations, intersample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs. View details
    Preview abstract This framework manages AI agents by establishing behavioral boundaries and a persistent identity. It uses a multi-layered stack, combining safety rules with brand guidelines, to shape an agent's reasoning. Features include authority decay to limit power if confidence drops and memory segmentation to prevent data tampering. Centralized oversight ensures these digital representatives remain aligned with company policies through continuous monitoring and testing. View details
    Preview abstract Mid-air gestures in Extended Reality (XR) often lead to fatigue, discomfort and imprecision, limiting their suitability for extended use. Surface-based interactions offer a compelling alternative, providing improved accuracy, speed, and comfort. However, current egocentric vision-based methods struggle with reliable surface inputs due to challenges in hand tracking and surface-plane estimation from oblique and occluded viewing angles. To this extent, we introduce SurfaceXR, a novel sensor fusion approach that combines headset based hand tracking with micro-vibration data sampled from commodity smartwatch IMUs to enable precise and robust inputs on arbitrary surfaces. Our system is designed with flexibility in mind - it can function using only hand tracking, only IMU sensing, or optimally with both modalities combined. Our user study across 12 participants validates SurfaceXR's effectiveness in augmenting surface touch tracking and 8 class hand-surface gesture recognition, demonstrating significant improvements over single-modality approaches. Enabled by SurfaceXR, we demonstrate a series of interactive apps for both AR and VR, ranging from on-surface sketching, text entry and gesture based navigation. View details
    ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
    Guy Tennenholtz
    Jihwan Jeong
    The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026)
    Preview abstract LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development. View details
    Preview abstract This writeup defines the Hydration Proxy Pattern, a framework for building stateful conversational data systems over stateless LLM APIs. It describes a platform-agnostic approach to decoupling persistence from the AI provider through secure server-side intermediation and hybrid storage tiers. The abstract provides a blueprint for managing the "Persistence Gap" in enterprise AI integrations, detailing high-level strategies for session history management, streaming, and multi-stage semantic grounding without disclosing specific internal implementation details. View details
    Pixel Watch: Robust Heart Rate Sensing from Multipath PPG and On-Device Deep Learning Trained on 10,000 hours of Free-Living and Fitness Data
    Megan Walker
    Yojan Patel
    Shyam Tailor
    Matt Wimmer
    Brennan Garrett
    Dan Howe
    Hamed Vavadi
    Tien Le
    Steve Diamond
    Oleksiy Vyalov
    Vik Sharma
    Pete Richards
    Tracy Giest
    Erika Siegel
    Tuan Phan
    Sam Mravca
    Derrick Vickers
    Benjamin Stone
    Katarina Vukosavljević
    Justin Phillips
    YongSuk Cho
    Stefanie Hollidge
    Antony Siahaan
    Soren Brage
    Shwetak Patel
    Robert Harle
    IEEE Sensors Letters (2026)
    Preview abstract The Pixel Watch 2 (PW2) is the first Google smartwatch to combine multipath photoplethysmography (PPG) with deep learning-based heart rate inference, designed to significantly improve sensing accuracy during motion-heavy activities. The device processes 10 optical channels using an on-device, 15-layer temporally dilated convolutional neural network (~300K parameters) to yield a 1 Hz heart rate output. Crucial to this model's performance was its training on a massive dataset comprising 10,000 hours of data from 962 participants, curated from a broader corpus of controlled and free-living activities. We evaluated the PW2's sensing performance across two independent validation sets: an in-house fitness dataset (229 participants, 250 hours) and an external free-living dataset (27 participants, 1000+ hours). The system achieved 95% Limits of Agreement of -10.34 to 8.66 BPM during exercise and -6.57 to 7.48 BPM during free-living activities, demonstrating substantially tighter error margins than previous Google devices. Finally, we discuss key design lessons, emphasizing that large-scale deep learning was instrumental in fully leveraging multipath PPG hardware over traditional signal processing approaches. View details
    Taming the Variants Multi-Architecture Continuous Testing at Google
    Sushmita Azad
    Chandrakanth Chittappa
    Ali Esmaeeli
    Laura Macaddino
    Sam Manfreda
    David Margolin
    Dharma Naidu
    Sabuj Pattanayek
    Sachin Sable
    Ruslan Sakevych
    Dushyant Acharya
    Adrian Berding
    Kevin Crossan
    Wolff Dobson
    Abhay Singh
    19th IEEE International Conference on Software Testing, Verification and Validation (ICST) 2026, Daejeon, Republic of Korea, IEEE
    Preview abstract Enterprises are increasingly adopting multiple general-purpose computer architectures in the data center. This leads to new testing challenges as it creates demand to qualify the software for the additional architectures. Naively double-testing all software for both architectures is costly and unnecessary. Further, reconfiguring CI/CD to take advantage of the new architecture can be non-trivial at scale. This paper introduces CI/CD variants and an optimized testing cycle to solve these twin challenges. We empirically evaluate our solution's impact on human and machine expenses using 44k projects at Google on real production data. First, we estimate saving ~25% of machine expenses at the negligible cost of a few delayed breakage detections per day. Second, we estimate a 90+% reduction in human cost for migrating the configuration. All features described in this paper are now Generally Available at Google and we report this as an empirical case study in scaling CI/CD to new architectures. View details
    CrossCheck: Input Validation for WAN Control Systems
    Rishabh Iyer
    Isaac Keslassy
    Sylvia Ratnasamy
    Networked Systems Design and Implementation (NSDI) (2026) (to appear)
    Preview abstract We present CrossCheck, a system that validates inputs to the Software-Defined Networking (SDN) controller in a Wide Area Network (WAN). By detecting incorrect inputs—often stemming from bugs in the SDN control infrastructure—CrossCheck alerts operators before they trigger network outages. Our analysis at a large-scale WAN operator identifies invalid inputs as a leading cause of major outages, and we show how CrossCheck would have prevented those incidents. We deployed CrossCheck as a shadow validation system for four weeks in a production WAN, during which it accurately detected the single incident of invalid inputs that occurred while sustaining a 0% false positive rate under normal operation, hence imposing little additional burden on operators. In addition, we show through simulation that CrossCheck reliably detects a wide range of invalid inputs (e.g., detecting demand perturbations as small as 5% with 100% accuracy) and maintains a near-zero false positive rate for realistic levels of noisy, missing, or buggy telemetry data (e.g., sustaining zero false positives with up to 30% of corrupted telemetry data). View details
    Preview abstract We introduce AASE (Activation-based AI Safety Enforcement), a framework for post-perception safety monitoring in large language models. Unlike pre-perception approaches that analyze input or output text, AASE monitors the model's internal activation patterns—what the model "understands" rather than what text it processes or generates—enabling detection of safety-relevant states before harmful outputs are produced. The framework comprises three techniques: Activation Fingerprinting (AF) for harmful content detection, Agent Action Gating (AAG) for prompt injection defense, and Activation Policy Compliance (APC) for enterprise policy enforcement. We introduce paired contrastive training to isolate safety-relevant signals from confounding factors such as topic and style, addressing signal entanglement in polysemantic activations. Validation across 7 models from 3 architecture families shows strong class separation: Gemma-2-9B achieves AUC 1.00 with 7.2σ separation across all probes; AAG achieves AUC ≥0.88 across all models on the InjecAgent benchmark; APC achieves 0.97-1.00 AUC across three enterprise policies. Model size correlates with probe quality—Gemma-2-9B (7.2σ separation) outperforms Gemma-2-2B (4.3σ). All techniques survive INT4 quantization with minimal separation degradation. AASE is 9× faster than Llama Guard 3 (33ms vs 306ms) with higher TPR (88% vs 50%) at a tunable threshold that trades FPR for detection sensitivity, adding only 0.002ms probe overhead to existing inference. View details
    Preview abstract Modern user interfaces are complex composites, with elements originating from various sources, such as the operating system, apps, a web browser, or websites. Many security and privacy models implicitly depend on users correctly identifying an element's source, a concept we term ''surface attribution.'' Through two large-scale vignette-based surveys (N=4,400 and N=3,057), we present the first empirical measurement of this ability. We find that users struggle, correctly attributing UI source only 55% of the time on desktop and 53% on mobile. Familiarity and strong brand cues significantly improve accuracy, whereas UI positioning, a long-held security design concept especially for browsers, has minimal impact. Furthermore, simply adding a ''Security & Privacy'' brand cue to Android permission prompts failed to improve attribution. These findings demonstrate a fundamental gap in users' mental models, indicating that relying on them to distinguish trusted UI is a fragile security paradigm. View details
    Preview abstract Optical health sensing algorithms, such as SpO2, sleep monitoring, and metabolic health sensing, critically depend on the accurate measurement of optical emission from Light Emitting Diodes (LEDs) transmitted through user tissue and detected by a photodiode (PD). A significant challenge to the reliability of these measurements is the inherent degradation of LED optical emission intensity over time due to device aging. This degradation can confound the physiological changes being monitored. Our work quantifies the impact of LED aging on sensor signal integrity, specifically examining the Current Transfer Ratio (CTR), which is a key metric defining the ratio of received photocurrent to the LED drive current used for transmission in various health sensing algorithms. We investigate the degradation characteristics across LEDs of different wavelengths. Our findings indicate a relative CTR change due to degradation ranging from 1% to 8% within 100 hours of continuous operation which translates to approximately 3.5 to 7 years of device lifetime. Furthermore, we explore the non-linearity of this degradation and the observed initial ”overshoot” phenomenon in the CTR during aging. We discuss how understanding these dynamics could inform the development of robust specifications for different physiological sensing algorithms. Finally, we present several potential solutions to mitigate the effects of LED aging. During the product design phase, integrating a calibrating photodiode or compensating circuitry around the LED can help preemptively address degradation. In the application space, run-time calibration strategies employing two differently degraded optical paths offer a promising approach to maintain measurement accuracy. View details
    ×