Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11244 publications
    Type-Aware Ranking of Urban Similarity from Aerial Imagery
    Idan Kligvasser
    Yotam Intrator
    Yuval Desheh
    Aviad Barzilai
    Niv Efron
    Ehud Rivlin
    Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops (2026), pp. 821-829
    Preview abstract Estimating and ranking cross-city similarity from aerial imagery is a fundamental challenge in remote sensing and geospatial representation learning. Urban environments differ widely in road layout, marking conventions, and infrastructure design, yet standard visual representations often struggle to disentangle these meaningful structural variations from superficial appearances. In this work, we propose a type-aware contrastive learning framework that measures urban similarity by explicitly modeling distinct infrastructure elements. Leveraging open-vocabulary retrieval, we construct a globally diverse dataset of road-related features, such as intersections, crosswalks, and bus lanes, and train a type-conditioned Vision Transformer that fuses visual features with CLIP-derived semantic embeddings. Crucially, we introduce an adaptive per-type contrastive loss that dynamically emphasizes infrastructure categories with high discriminative power while down-weighting less informative types. To quantify city-level similarity, we aggregate per-type cosine similarities via a lightweight classifier to generate a global city-to-city similarity matrix. Experiments demonstrate that this type-aware approach significantly improves clustering quality and successfully generalizes to unseen cities, establishing a scalable, interpretable foundation for comparative urban analysis. View details
    Preview abstract The management of a hybrid workforce comprising human and autonomous computational agents may be challenged by the use of separate systems for human capital and software assets, which can create a governance gap. A system can provide a unified framework for managing a hybrid workforce. For example, the system may utilize a labor service mesh to analyze and route tasks to either a human intent tier or an agentic execution tier. A potential principle of the system is structural symmetry, where computational agents can be assigned digital identities and managed through a lifecycle process that may parallel human resource functions, such as onboarding, performance evaluation, and structured offboarding. This integrated approach can facilitate a unified system of record and governance model for an organization's intelligence capacity. View details
    Preview abstract Deep-learning methods have boosted the analytical power of Raman spectroscopy, yet they still require large, task-specific, labeled datasets and often fail to transfer across application domains. The study explores pre-trained encoders as a solution. Pre-trained encoders have significantly impacted Natural Language Processing and Computer Vision with their ability to learn transferable representations that can be applied to a variety of datasets, significantly reducing the amount of time and data required to create capable models. The following work puts forward a new approach that applies these benefits to Raman Spectroscopy. The proposed approach, RSPTE (Raman Spectroscopy Pre-Trained Encoder), is designed to learn generalizable spectral representations without labels. RSPTE employs a novel domain adaptation strategy using unsupervised Barlow Twins decorrelation objectives to learn fundamental spectral patterns from multi-domain Raman Spectroscopy datasets containing samples from medicine, biology, and mineralogy. Transferability is demonstrated through evaluation on several models created by fine-tuning RSPTE for different application domains: Medicine (detection of Melanoma and COVID), Biology (Pathogen Identification), and Agriculture. As an example, using only 20% of the dataset, models trained with RSPTE achieve accuracies ranging 50%–86% (depending on the dataset used) while without RSPTE the range is 9%–57%. Using the full dataset, accuracies with RSPTE range 81%–97%, and without pretraining 51%–97%. Current methods and state-of-the-art models in Raman Spectroscopy are compared to RSPTE for context, and RSPTE exhibits competitive results, especially with less data as well. These results provide evidence that the proposed RSPTE model can effectively learn and transfer generalizable spectral features across different domains, achieving accurate results with less data in less time (both data collection time and training time). View details
    Performance analysis of updated Sleep Tracking algorithms across Google and Fitbit wearable devices
    Arno Charton
    Linda Lei
    Siddhant Swaroop
    Marius Guerard
    Michael Dixon
    Logan Niehaus
    Shao-Po Ma
    Logan Schneider
    Ross Wilkinson
    Ryan Gillard
    Conor Heneghan
    Pramod Rudrapatna
    Mark Malhotra
    Shwetak Patel
    Google, Google, 1600 Amphitheatre Parkway Mountain View, CA 94043 (2026) (to appear)
    Preview abstract Background: The general public has increasingly adopted consumer wearables for sleep tracking over the past 15 years, but reports on performance versus gold standards such as polysomnogram (PSG), high quality sleep diaries and at-home portable EEG systems still show potential for improved performance. Two aspects in particular are worthy of consideration: (a) improved recognition of sleep sessions (times when a person is in bed and has attempted to sleep), and (b) improved accuracy on recognizing sleep stages relative to an accepted standard such as PSG. Aims: This study aimed to: 1) provide an update on the methodology and performance of a system for correctly recognizing valid sleep sessions, and 2) detail an updated description of how sleep stages are calculated using accelerometer and inter-beat intervals Methods: Novel machine learning algorithms were developed to recognize sleep sessions and sleep stages using accelerometer sensors and inter-beat intervals derived from the watch or tracker photoplethysmogram. Algorithms were developed on over 3000 nights of human-scored free-living sleep sessions from a representative population of 122 subjects, and then tested on an independent validation set of 47 users. Within sleep sessions, an algorithm was developed to recognize periods when the user was attempting to sleep (Time-Attempting-To-Sleep = TATS). For sleep stage estimation, an algorithm was trained on human expert-scored polysomnograms, and then tested on 50 withheld subject nights for its ability to recognize Wake, Light (N1/N2), Deep (N3) and REM sleep relative to expert scored labels. Results: For sleep session estimation, the algorithm had at least 95% overlap on TATS with human consensus scoring for 94% of nights from healthy sleepers. For sleep stage estimation, comparing with the current Fitbit algorithm, Cohen’s kappa for four-class determination of sleep stage increased from an average of 0.56 (std 0.13) to 0.63 (std 0.12), and average accuracy increased from 71% (std 0.10) to 77% (std 0.078) Conclusion: A set of new algorithms has been developed and tested on Fitbit and Pixel Watches and is capable of providing robust and accurate measurement of sleep in free-living environments. View details
    Preview abstract Enterprise service centers, particularly in domains like People Operations, are critical hubs of organizational knowledge work. They face a persistent difficulty in disseminating the tacit, case-specific expertise of senior agents, which can lead to inconsistent service and slower onboarding for new hires. While existing Knowledge Management (KM) and Case-Based Reasoning (CBR) systems have improved the retrieval of historically similar cases, they inadvertently shift the cognitive burden of synthesizing this information to the time-constrained agent. This paper introduces the Dynamic Case Precedent (DCP) architecture, a novel socio-technical framework designed to address this gap. The DCP architecture moves beyond simple precedent recommendation to automated precedent synthesis. It achieves this by integrating a semantic retrieval model with the large-context reasoning capabilities of a generative Large Language Model (LLM). We propose a three-pillar framework—(1) Contextual Similarity Indexing, (2) Generative Insight Synthesis, and (3) Human-in-the-Loop Refinement. By analyzing multiple relevant historical cases to generate a concise summary of resolution patterns, the DCP architecture aims to reduce agent cognitive load, accelerate proficiency, and improve service consistency. This conceptual framework offers a new model for human-AI collaboration, framing the AI not as a mere information tool, but as an active partner in sensemaking. View details
    Preview abstract This whitepaper seeks to elucidate implications that the capabilities of developing quantum architectures have on blockchain vulnerabilities and mitigation strategies. First, we provide new resource estimates for breaking the 256-bit Elliptic Curve Discrete Logarithm Problem, the core of modern blockchain cryptography. We demonstrate that Shor's algorithm for this problem can execute with either <1200 logical qubits and <90 million Toffoli gates or <1450 logical qubits and <70 million Toffoli gates. In the interest of responsible disclosure, we use a zero-knowledge proof to validate these results without disclosing attack vectors. On superconducting architectures with 1e-3 physical error rates and planar connectivity, those circuits can execute in minutes using fewer than half a million physical qubits. We introduce a critical distinction between fast-clock (such as superconducting and photonic) and slow-clock (such as neutral atom and ion trap) architectures. Our analysis reveals that the first fast-clock CRQCs would enable on-spend attacks on public mempool transactions of some cryptocurrencies. We survey major cryptocurrency vulnerabilities through this lens, identifying systemic risks associated with advanced features in some blockchains such as smart contracts, Proof-of-Stake consensus, and Data Availability Sampling, as well as the enduring concern of abandoned assets. We argue that technical solutions would benefit from accompanying public policy and discuss various frameworks of digital salvage to regulate the recovery or destruction of dormant assets while preventing adversarial seizure. We also discuss implications for other digital assets and tokenization as well as challenges and successful examples of the ongoing transition to Post-Quantum Cryptography (PQC). Finally, we urge all vulnerable cryptocurrency communities to join the ongoing migration to PQC without delay. View details
    Preview abstract Object-Counting for remote-sensing (RS) imagery is raising increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - an adaptation of existing work for Open Vocabulary Counting (OVC) approach from general computer vision to the RS domain. We show that our model is capable of accurate counting of novel object classes, that are unseen during training, based solely on textual and/or visual conditioning. View details
    Preview abstract Voice activity detection (VAD) plays a vital role in enabling applications such as speech recognition. We analyze the impact of window size on the accuracy of three VAD algorithms: Silero, WebRTC, and Root Mean Square (RMS) across a set of diverse real-world digital audio streams. We additionally explore the use of hysteresis on top of each VAD output. Our results offer practical references for optimizing VAD systems. Silero significantly outperforms WebRTC and RMS, and hysteresis provides a benefit for WebRTC. View details
    Preview abstract The remarkable success of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in 2D computer vision has catalyzed significant research into their adaptation for the complex domain of 3D analysis. However, a fundamental dichotomy exists between the regular, dense grid of 2D images and the irregular, sparse nature of 3D data formats such as point clouds and meshes. This paper provides a comprehensive survey and a novel intellectual framework for navigating this burgeoning field. Our core contribution is a new taxonomy that organizes adaptation strategies into three distinct families: (1) Data-centric methods, which project 3D data into 2D formats to leverage off-the-shelf 2D models; (2) Architecture-centric methods, which design intrinsic network modules to directly process 3D data; and (3) Hybrid methods, which synergistically combine pre-trained 2D features with 3D modeling processing pipelines to benefit from both rich visual priors and explicit geometric reasoning. Through this taxonomic lens, we conduct a systematic review and qualitative synthesis of the field. We illuminate the fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and the preservation of geometric inductive biases. Based on this analysis, we identify and discuss critical open challenges and chart promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning for geometric data, and the deeper integration of multi-modal signals. This survey serves as an essential resource and roadmap for researchers seeking to understand and advance the state-of-the-art in 3D computer vision. View details
    Visual Planning: Let’s Think Only with Images
    Han Zhou
    Caiqi Zhang
    Anna Korhonen
    Chengzu Li
    Yi Xu
    Ivan Vulic
    International Conference on Learning Representations (ICLR) (2026)
    Preview abstract Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have significantly enhanced machine reasoning across diverse tasks. However, these models predominantly rely on language as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial, geometric, or physical dynamics. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of textual mediation. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We then introduce a novel two-stage reinforcement learning framework empowered by GRPO for post-training large vision models, resulting in substantial improvements in planning accuracy and generalization across both seen and novel scenarios, validated in representative visual navigation tasks, FrozenLake and Maze. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference. View details
    GenAI on Google Cloud: Enterprise Generative AI Systems and AI Agents
    Ayo Adedeji
    Lavi Nigam
    Stephanie Gervasi
    O'Reilly Media, Inc. (2026)
    Preview abstract In today's AI landscape, success depends not just on prompting large language models but on orchestrating them into intelligent systems that are scalable, compliant, and cost-effective. GenAI on Google Cloud is your hands-on guide to bridging that gap. Whether you're an ML engineer or an enterprise leader, this book offers a practical game plan for taking agentic systems from prototype to production. Written by practitioners with deep experience in AgentOps, data engineering, and GenAI infrastructure, this guide takes you through real-world workflows from data prep and deployment to orchestration and integration. With concrete examples, field-tested frameworks, and honest insights, you'll learn how to build agentic systems that deliver measurable business value. > Bridge the production gap that stalls 90% of vertical AI initiatives using systematic deployment frameworks > Navigate AgentOps complexities through practical guidance on orchestration, evaluation, and responsible AI practices > Build robust multimodal systems for text, images, and video using proven agent architectures > Optimize for scale with strategies for cost management, performance tuning, and production monitoring View details
    The Perfection Paradox: From Architect to Curator in AI-Assisted API Design
    JJ Geewax
    David R Karger
    Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26), ACM, Barcelona, Spain, TBD
    Preview abstract Enterprise API design is often bottlenecked by the tension between rapid feature delivery and the rigorous maintenance of usability standards. We present an industrial case study evaluating an AI-assisted design workflow trained on API Improvement Proposals(AIPs). Through a controlled study with 16 industry experts, we compared AI-generated API specifications against human-authored ones. While quantitative results indicated AI superiority in 10 of 11 usability dimensions and an 87% reduction in authoring time, qualitative analysis revealed a paradox: experts frequently misidentified AI work as human (19% accuracy) yet described the designs as unsettlingly “perfect.” We characterize this as a “Perfection Paradox”—where hyper-consistency signals a lack of pragmatic human judgment. We discuss the implications of this perfection paradox, proposing a shift in the human designer’s role from the “drafter” of specifications to the “curator” of AI-generated patterns. View details
    Robust Wireless Resource Allocation Against Adversarial Jamming
    Christos Tsoufis
    Dionysia Triantafyllopoulou
    Klaus Moessner
    ICC (2026)
    Preview abstract We study the problem of allocating access point bandwidth to users of a wireless network in the presence of adversarial jamming. Specifically, we consider a setting in which the network designer acts first and allocates access point bandwidth to the users of the network, before an adversary applies a jamming strategy to reduce the bandwidth of a subset (or all) of the access points. We consider a strong adversary who has complete information and can optimize the jamming strategy, subject to power budget constraints. In turn, the network designer must allocate the resources in anticipation of the adversary's actions. We explain that our model gives rise to a special network interdiction model, which differs from the standard setting in two ways: The first is that the interdictor is given the benefit of responding, rather than leading the game. The second is that the interdiction is fractional and performed at the node level of the network. The interdiction then propagates to all edges incident to the access point. In terms of technical results, we provide an allocation algorithm that is based on linear programming duality and show that the algorithm can solve the problem optimally, assuming knowledge of the adversary's budget constraints. We conduct experiments on synthetic data to show the extent to which the algorithm improves the total utilized bandwidth over the algorithm that optimizes bandwidth allocation while being oblivious to the adversary's existence. View details
    Preview abstract Global shared service centers are critical to modern enterprise operations but struggle to provide consistent, timely support across linguistic boundaries. This paper introduces the Glossary-Grounded Universal Queue (GGUQ), a socio-technical framework designed to bridge the gap between the operational goal of a unified global service queue and the reality of a multilingual workforce. The GGUQ is a real-time, workflow-embedded communication architecture that leverages Large Language Models (LLMs) to provide high-fidelity, two-way translation directly within an agent's enterprise platform. The framework's key innovation is a "glossary-grounded" approach, where translation prompts are programmatically injected with a curated repository of enterprise-specific terminology. This ensures a level of contextual and terminological integrity unachievable by generic machine translation tools. By detailing the GGUQ's three-pillar architecture—Dynamic Translation, Glossary-Grounded Integrity, and Resilient Operations—we propose a new model for computer-mediated communication in global enterprises. This framework aims to move beyond federated, language-siloed support models to enable a true "follow-the-sun" operational capability, promoting both organizational efficiency and a more inclusive employee experience. View details
    Preview abstract Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization(APO) frameworks target task accuracy exclusively at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines. View details
    ×