Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 11322 publications
Reasoning-Driven Synthetic Data Generation and Evaluation
Tim R. Davidson
Benoit Seguin
Transactions on Machine Learning Research (2026)
Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution, limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.
CrossCheck: Input Validation for WAN Control Systems
Rishabh Iyer
Isaac Keslassy
Sylvia Ratnasamy
Networked Systems Design and Implementation (NSDI) (2026) (to appear)
We present CrossCheck, a system that validates inputs to the Software-Defined Networking (SDN) controller in a Wide Area Network (WAN). By detecting incorrect inputs, which often stem from bugs in the SDN control infrastructure, CrossCheck alerts operators before they trigger network outages. Our analysis at a large-scale WAN operator identifies invalid inputs as a leading cause of major outages, and we show how CrossCheck would have prevented those incidents. We deployed CrossCheck as a shadow validation system for four weeks in a production WAN, during which it accurately detected the single incident of invalid inputs that occurred while sustaining a 0% false positive rate under normal operation, hence imposing little additional burden on operators. In addition, we show through simulation that CrossCheck reliably detects a wide range of invalid inputs (e.g., detecting demand perturbations as small as 5% with 100% accuracy) and maintains a near-zero false positive rate for realistic levels of noisy, missing, or buggy telemetry data (e.g., sustaining zero false positives with up to 30% of corrupted telemetry data).
Neural general circulation models for modeling precipitation
Stephan Hoyer
Dmitrii Kochkov
Janni Yuval
Ian Langmore
Science Advances (2026)
Climate models struggle to accurately simulate precipitation, particularly extremes and the diurnal cycle. While hybrid models combining machine learning and physics have emerged with the promise of improving precipitation simulations, none has proven skillful or stable enough to outperform existing models in simulating precipitation. Here, we present the first hybrid model that is trained directly on precipitation observations. The model runs at 2.8-degree resolution and is built on the differentiable NeuralGCM framework. This model is stable for decadal simulations and demonstrates significant improvements over existing GCMs, ERA5 reanalysis, and a Global Cloud-Resolving Model in simulating precipitation. Our approach yields reduced biases, a more realistic precipitation distribution, improved representation of extremes, and a more accurate diurnal cycle. Furthermore, it outperforms the ECMWF ensemble for mid-range weather forecasting. This advance paves the way for more reliable simulations of current climate and for the ability to fully utilize the abundance of existing observations to further improve GCMs.
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Guy Tennenholtz
Jihwan Jeong
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026), pp. 5270-5304
LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator, trained exclusively on optimal interactions, realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development.
There are growing concerns about AI-generated image-based sexual abuse (AI-IBSA), also known as nonconsensual sexualized "deepfakes." Empirical research on AI-IBSA, however, remains very limited. This study surveyed 7231 respondents across Australia, the United Kingdom, and the United States to investigate community attitudes toward and perceptions of AI-IBSA. Through a vignette study, we explored the relationship between public familiarity with AI-IBSA, normative concerns about consent, and context-dependent judgments that vary based on the target's identity, relational status, and how the content was used. Our findings reveal strong condemnation of AI-IBSA, yet respondents demonstrated low familiarity with the technology and their views varied depending on particular contexts. AI-IBSA targeting intimate partners was viewed as more unacceptable than targeting celebrities, and content created solely for personal use was seen as less unacceptable than content intended for distribution. The study highlights the need for approaches that go beyond technical fixes and punitive measures, advocating for a multifaceted response that integrates ethical data governance, digital sexual literacy, and restorative justice approaches.
Object counting in remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches to counting novel objects that were not seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, we propose RS-OVC, an adaptation of an existing Open Vocabulary Counting (OVC) approach from general computer vision to the RS domain. We show that our model can accurately count novel object classes that are unseen during training, based solely on textual and/or visual conditioning.
Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization (APO) frameworks target task accuracy exclusively, at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines.
Validating conversational artificial intelligence (AI) for regulated medical software applications may present challenges, as static test datasets and manual review may be limited in identifying emergent, conversational anomalies. A multi-agent AI system may be configured in a closed loop for automated validation. The system can, for example, utilize an end user persona simulator agent to generate prompts for a target model and a domain/regulatory expert adjudicator agent to evaluate the target model’s responses against a configurable rubric. A meta-analysis agent can analyze anomalies to identify underlying vulnerabilities, which may then be used to programmatically synthesize new adversarial personas. This adaptive process can generate evidence to support regulatory compliance and continuous performance monitoring for medical software systems.
Large Language Models (LLMs) such as ChatGPT can infer personal attributes from seemingly innocuous text, raising privacy risks beyond memorized data leakage. While prior work has demonstrated these risks, little is known about how users estimate and respond to them. We conducted a survey with 240 U.S. participants who judged text snippets for inference risks, reported concern levels, and attempted rewrites to block inference. We compared their rewrites with those generated by ChatGPT and Rescriber, a state-of-the-art sanitization tool. Results show that participants struggled to anticipate inference, performing only slightly better than chance. User rewrites were effective in just 28% of cases, better than Rescriber but worse than ChatGPT. We examined our participants’ rewriting strategies and observed that while paraphrasing was the most common strategy, it was also the least effective; abstraction and adding ambiguity were more successful. Our work highlights the importance of inference-aware design in LLM interactions.
A Framework for Interactive Machine Learning and Enhanced Conversational Systems
Jerry Young
Richard Abisla
Sanjay Batra
Mikki Phan
Nature, Springer-Verlag (2026)
Conversational systems are increasingly prevalent, yet current versions often fail to support the full range of human speech, including variations in speed, rhythm, syntax, grammar, articulation, and resonance. This reduces their utility for individuals with dysarthria, apraxia, dysphonia, and other language and speech-related disabilities. Building on research that emphasizes the need for specialized datasets and model training tools, our study uses a scaffolded approach to understand the ideal model training and voice recording process. Our findings highlight two distinct user flows for improving model training and provide six guidelines for future conversational system-related co-design frameworks. This study offers important insights on creating more effective conversational systems by emphasizing the need to integrate interactive machine learning into training strategies.
Performance analysis of updated Sleep Tracking algorithms across Google and Fitbit wearable devices
Arno Charton
Linda Lei
Siddhant Swaroop
Marius Guerard
Michael Dixon
Logan Niehaus
Shao-Po Ma
Logan Schneider
Ross Wilkinson
Ryan Gillard
Conor Heneghan
Pramod Rudrapatna
Mark Malhotra
Shwetak Patel
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043 (2026) (to appear)
Background: The general public has increasingly adopted consumer wearables for sleep tracking over the past 15 years, but reports on performance versus gold standards such as polysomnogram (PSG), high quality sleep diaries and at-home portable EEG systems still show potential for improved performance. Two aspects in particular are worthy of consideration: (a) improved recognition of sleep sessions (times when a person is in bed and has attempted to sleep), and (b) improved accuracy on recognizing sleep stages relative to an accepted standard such as PSG. Aims: This study aimed to: 1) provide an update on the methodology and performance of a system for correctly recognizing valid sleep sessions, and 2) detail an updated description of how sleep stages are calculated using accelerometer and inter-beat intervals. Methods: Novel machine learning algorithms were developed to recognize sleep sessions and sleep stages using accelerometer sensors and inter-beat intervals derived from the watch or tracker photoplethysmogram. Algorithms were developed on over 3000 nights of human-scored free-living sleep sessions from a representative population of 122 subjects, and then tested on an independent validation set of 47 users. Within sleep sessions, an algorithm was developed to recognize periods when the user was attempting to sleep (Time-Attempting-To-Sleep = TATS). For sleep stage estimation, an algorithm was trained on human expert-scored polysomnograms, and then tested on 50 withheld subject nights for its ability to recognize Wake, Light (N1/N2), Deep (N3) and REM sleep relative to expert scored labels. Results: For sleep session estimation, the algorithm had at least 95% overlap on TATS with human consensus scoring for 94% of nights from healthy sleepers. For sleep stage estimation, compared with the current Fitbit algorithm, Cohen’s kappa for four-class determination of sleep stage increased from an average of 0.56 (std 0.13) to 0.63 (std 0.12), and average accuracy increased from 71% (std 0.10) to 77% (std 0.078). Conclusion: A set of new algorithms has been developed and tested on Fitbit and Pixel Watches and is capable of providing robust and accurate measurement of sleep in free-living environments.
The remarkable success of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in 2D computer vision has catalyzed significant research into their adaptation for the complex domain of 3D analysis. However, a fundamental dichotomy exists between the regular, dense grid of 2D images and the irregular, sparse nature of 3D data formats such as point clouds and meshes. This paper provides a comprehensive survey and a novel intellectual framework for navigating this burgeoning field. Our core contribution is a new taxonomy that organizes adaptation strategies into three distinct families: (1) Data-centric methods, which project 3D data into 2D formats to leverage off-the-shelf 2D models; (2) Architecture-centric methods, which design intrinsic network modules to directly process 3D data; and (3) Hybrid methods, which synergistically combine pre-trained 2D features with 3D modeling and processing pipelines to benefit from both rich visual priors and explicit geometric reasoning. Through this taxonomic lens, we conduct a systematic review and qualitative synthesis of the field. We illuminate the fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and the preservation of geometric inductive biases. Based on this analysis, we identify and discuss critical open challenges and chart promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning for geometric data, and the deeper integration of multi-modal signals. This survey serves as an essential resource and roadmap for researchers seeking to understand and advance the state of the art in 3D computer vision.
Type-Aware Ranking of Urban Similarity from Aerial Imagery
Idan Kligvasser
Yotam Intrator
Yuval Desheh
Aviad Barzilai
Niv Efron
Ehud Rivlin
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops (2026), pp. 821-829
Estimating and ranking cross-city similarity from aerial imagery is a fundamental challenge in remote sensing and geospatial representation learning. Urban environments differ widely in road layout, marking conventions, and infrastructure design, yet standard visual representations often struggle to disentangle these meaningful structural variations from superficial appearances. In this work, we propose a type-aware contrastive learning framework that measures urban similarity by explicitly modeling distinct infrastructure elements. Leveraging open-vocabulary retrieval, we construct a globally diverse dataset of road-related features, such as intersections, crosswalks, and bus lanes, and train a type-conditioned Vision Transformer that fuses visual features with CLIP-derived semantic embeddings. Crucially, we introduce an adaptive per-type contrastive loss that dynamically emphasizes infrastructure categories with high discriminative power while down-weighting less informative types. To quantify city-level similarity, we aggregate per-type cosine similarities via a lightweight classifier to generate a global city-to-city similarity matrix. Experiments demonstrate that this type-aware approach significantly improves clustering quality and successfully generalizes to unseen cities, establishing a scalable, interpretable foundation for comparative urban analysis.
While non-verbal behaviors and expressive movements are essential for natural human-robot interaction, existing methods often overlook a crucial element: the human’s internal cognitive state. Consequently, proactive multi-agent systems frequently interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating “cognitively aligned” multi-agent interactions, enhancing the ability of robotic systems to contextually defer communications during moments of high human mental workload. We present the design and implementation of a closed-loop architecture that explores the interplay between autonomous task execution and real-time neurophysiological focus. Utilizing a consumer-grade Brain-Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs a cognitive-load-inducing task. We propose a workload-driven pipeline where an HTTP-based signaling mechanism places a primary agent’s sensory inputs and audio outputs into a holding state upon detecting high cognitive load. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human’s cognitive state returns to a baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real-time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create interrupt-aware, non-intrusive multi-agent systems.