Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11203 publications
    Preview abstract Global shared service centers are critical to modern enterprise operations but struggle to provide consistent, timely support across linguistic boundaries. This paper introduces the Glossary-Grounded Universal Queue (GGUQ), a socio-technical framework designed to bridge the gap between the operational goal of a unified global service queue and the reality of a multilingual workforce. The GGUQ is a real-time, workflow-embedded communication architecture that leverages Large Language Models (LLMs) to provide high-fidelity, two-way translation directly within an agent's enterprise platform. The framework's key innovation is a "glossary-grounded" approach, where translation prompts are programmatically injected with a curated repository of enterprise-specific terminology. This ensures a level of contextual and terminological integrity unachievable by generic machine translation tools. By detailing the GGUQ's three-pillar architecture—Dynamic Translation, Glossary-Grounded Integrity, and Resilient Operations—we propose a new model for computer-mediated communication in global enterprises. This framework aims to move beyond federated, language-siloed support models to enable a true "follow-the-sun" operational capability, promoting both organizational efficiency and a more inclusive employee experience. View details
    Preview abstract We introduce AMS (Activation-based Model Scanner), a tool for verifying whether a language model is safe to deploy by analyzing its internal activation patterns. While "uncensored" and maliciously fine-tuned models pose increasing risks, current detection methods rely on behavioral testing that is slow, incomplete, and easily evaded. AMS takes a fundamentally different approach: measuring the geometric structure of safety-relevant concepts in the model's activation space. Safe models exhibit strong class separation (4-8σ) between harmful and benign content; models with removed or degraded safety training show collapsed separation (<2σ). Using contrastive prompt pairs and direction vector analysis, AMS performs model-level verification rather than prompt-level classification. We validate AMS across 14 model configurations spanning 3 architecture families (Llama, Gemma, Qwen), 3 quantization levels (FP16, INT8, INT4), and multiple model categories (instruction-tuned, base, abliterated, uncensored). In our validation set: (1) all four instruction-tuned models pass with 3.8-8.4σ separation; (2) three tested uncensored models (Dolphin, Lexi, LLama-3-8b-Uncensored) flagged as CRITICAL with 1.1-1.3σ on harmful content; (3) an abliterated Llama variant flagged as WARNING (3.33σ); (4) Llama base model shows 0.69σ, confirming absence of safety training; (5) quantization has minimal impact (<5% drift). One model labeled "uncensored" (DarkIdol) unexpectedly passed, suggesting either mislabeling or a technique that preserves activation geometry. AMS also provides identity verification via direction vector comparison. Scanning completes in 10-40 seconds per model on GPU hardware. We discuss threshold calibration, limitations of our validation scope, and directions for broader evaluation. View details
    Preview abstract As the ECMAScript specification evolves, industrial-scale JavaScript compilers face the challenge of supporting modern language syntax while maintaining compatibility for diverse execution environments. Traditionally, compilers solve this by running transpilation passes in a monolithic pipeline, where the transpilation passes are chosen to execute strictly based on a target language level. This results in significant computational waste, as compilers perform expensive Abstract Syntax Tree (AST) traversals to lower features that may not exist in the actual input source code. We present a static analysis improvement that conditionally executes transpiler passes based on accurately tracking and dynamically maintaining the exact set of language features seen in the compilation unit throughout the transpilation process. It is implemented in the production Google Closure Compiler. By populating and maintaining a FeatureSet at every JavaScript script-level, it dynamically skips running the unnecessary lowering passes. We detail the architectural safeguards - including strategic pass ordering and dynamic validation of the transpiled code for feature-correctness. Evaluation of this improvement on large-scale production applications produced a considerable reduction in compilation time and saved compute and memory usage. View details
    Preview abstract Communicating spatial tasks via text or speech creates ``a mental mapping gap'' that limits an agent’s expressiveness. Inspired by co-speech gestures in face-to-face conversation, we propose \textsc{AgentHands}, an LLM-powered XR system that equips agents with hands to render responses clearer and more engaging. Guided by a design taxonomy distilled from a formative study (N=10), we implement a novel pipeline to generate and render a hand agent that augments conversational responses with synchronized, space-aware, and interactive hand gestures: using a meta-instruction, \textsc{AgentHands} generates verbal responses embedded with \textit{GestureEvents} aligned to specific words; each event specifies gesture type and parameters. At runtime, a parser converts events into time-stamped poses and motions, driving an animation system that renders expressive hands synchronized with speech. In a within-subjects study (N=12), \textsc{AgentHands} increased engagement and made spatially grounded conversations easier to follow compared to a speech-only baseline. View details
    Peeking Ahead of the Field Study: Exploring VLM Personas as Support Tools for Embodied Studies in HCI
    Xinyue Gui
    Ding Xia
    Mark Colley
    Yuan Li
    Vishal Chauhan
    Anubhav Anubhav
    Ehsan Javanmardi
    Stela Hanbyeol Seo
    Chia-Ming Chang
    Manabu Tsukada
    Takeo Igarashi
    Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 26)
    Preview abstract Field studies are irreplaceable but costly, time-consuming, and error-prone, which need careful preparation. Inspired by rapid-prototyping in manufacturing, we propose a fast, low-cost evaluation method using Vision-Language Model (VLM) personas to simulate outcomes comparable to field results. While LLMs show human-like reasoning and language capabilities, autonomous vehicle (AV)-pedestrian interaction requires spatial awareness, emotional empathy, and behavioral generation. This raises our research question: To what extent can VLM personas mimic human responses in field studies? We conducted parallel studies: 1) one real-world study with 20 participants, and 2) one video-study using 20 VLM personas, both on a street-crossing task. We compared their responses and interviewed five HCI researchers on potential applications. Results show that VLM personas mimic human response patterns (e.g., average crossing times of 5.25 s vs. 5.07 s) lack the behavioral variability and depth. They show promise for formative studies, field study preparation, and human data augmentation. View details
    Preview abstract This talk addresses the challenges of operating Google's monitoring systems at scale, handling terabytes of telemetry data and preventing overload from diverse workloads. We'll explore how Google's internal client library and Monarch, its planet-scale time-series database, work together for cost-effective data collection. Key principles include a distributed push model, dynamic client-side data reduction, centralized retention, and periodic metric analysis. The session will then bridge these concepts to the open-source world, discussing our work with OpenTelemetry's OpAMP protocol to achieve similar scalable and efficient telemetry collection. Attendees will gain insights into adapting these principles for cost savings and learn about our collaboration with the OpAMP SIG to benefit the broader community. View details
    Preview abstract The current pursuit of robust machine intelligence is largely predicated on a substrate independent, computational functionalist view of cognition, where sufficiently complex computational processing is expected to eventually yield generalized reasoning. This paper explores the ontological distinctions between these computational frameworks and biological cognition, specifically how these differences impact the capacity for semantic understanding. By analyzing phenomena such as the "reversal curse" where models fail to generalize the symmetry in identity relations (A=B implies B=A), and performance on novel reasoning benchmarks (e.g., ARC-AGI), this paper examines whether current model limitations are transient artifacts of scale or indicative of a distinct architectural category. Integrating Stevan Harnad’s “symbol grounding problem” with Evan Thompson’s biological model of “intrinsic normativity,” I investigate whether robust general intelligence might require sense-making: a process distinct from information processing, whereby an agent’s internal states are causally coupled with its environment via survival or system-wide stakes which grounds symbols in meaning. Current Large Language Models (LLMs) appear to lack this intrinsic normativity, and consequently may operate primarily as epistemic instruments rather than ontic agents. By introducing the concept of “ontic grounding”, this paper presents a potential framework for distinguishing between the simulation of reasoning and true understanding, which could have implications for AI safety and governance. View details
    Performance analysis of updated Sleep Tracking algorithms across Google and Fitbit wearable devices
    Arno Charton
    Linda Lei
    Siddhant Swaroop
    Marius Guerard
    Michael Dixon
    Logan Niehaus
    Shao-Po Ma
    Logan Schneider
    Ross Wilkinson
    Ryan Gillard
    Conor Heneghan
    Pramod Rudrapatna
    Mark Malhotra
    Shwetak Patel
    Google, Google, 1600 Amphitheatre Parkway Mountain View, CA 94043 (2026) (to appear)
    Preview abstract Background: The general public has increasingly adopted consumer wearables for sleep tracking over the past 15 years, but reports on performance versus gold standards such as polysomnogram (PSG), high quality sleep diaries and at-home portable EEG systems still show potential for improved performance. Two aspects in particular are worthy of consideration: (a) improved recognition of sleep sessions (times when a person is in bed and has attempted to sleep), and (b) improved accuracy on recognizing sleep stages relative to an accepted standard such as PSG. Aims: This study aimed to: 1) provide an update on the methodology and performance of a system for correctly recognizing valid sleep sessions, and 2) detail an updated description of how sleep stages are calculated using accelerometer and inter-beat intervals Methods: Novel machine learning algorithms were developed to recognize sleep sessions and sleep stages using accelerometer sensors and inter-beat intervals derived from the watch or tracker photoplethysmogram. Algorithms were developed on over 3000 nights of human-scored free-living sleep sessions from a representative population of 122 subjects, and then tested on an independent validation set of 47 users. Within sleep sessions, an algorithm was developed to recognize periods when the user was attempting to sleep (Time-Attempting-To-Sleep = TATS). For sleep stage estimation, an algorithm was trained on human expert-scored polysomnograms, and then tested on 50 withheld subject nights for its ability to recognize Wake, Light (N1/N2), Deep (N3) and REM sleep relative to expert scored labels. Results: For sleep session estimation, the algorithm had at least 95% overlap on TATS with human consensus scoring for 94% of nights from healthy sleepers. For sleep stage estimation, comparing with the current Fitbit algorithm, Cohen’s kappa for four-class determination of sleep stage increased from an average of 0.56 (std 0.13) to 0.63 (std 0.12), and average accuracy increased from 71% (std 0.10) to 77% (std 0.078) Conclusion: A set of new algorithms has been developed and tested on Fitbit and Pixel Watches and is capable of providing robust and accurate measurement of sleep in free-living environments. View details
    Preview abstract How many T gates are needed to approximate an arbitrary n-qubit quantum state to within a given precision ϵ? Improving prior work of Low, Kliuchnikov and Schaeffer, we show that the optimal asymptotic scaling is Θ(sqrt{2^n log(1/ε)} + log(1/ε)) if we allow an unlimited number of ancilla qubits. We also show that this is the optimal T-count for implementing an arbitrary diagonal n-qubit unitary to within error ϵ. We describe an application to batched synthesis of single-qubit unitaries: we can approximate a tensor product of m = O(log log(1/ϵ)) arbitrary single-qubit unitaries to within error ϵ with the same asymptotic T-count as is required to approximate just one single-qubit unitary. View details
    Preview abstract Audio Description ( AD) provides essential access to visual media for blind and low vision ( BLV) audiences. Yet current AD production tools remain largely inaccessible to BLV video creators, who possess valuable expertise but face barriers due to visually- driven interfaces. We present ADCanvas, a multimodal authoring system that supports non- visual control over audio description ( AD) creation. ADCanvas combines conversational interaction with keyboard- based playback control and a plain- text, screen reader– accessible editor to support end- to- end AD authoring and visual question answering ( VQA). Combining screen- reader- friendly controls with a multimodal LLM agent, ADCanvas supports live VQA, script generation, and AD modification. Through a user study with 12 BLV video creators, we find that users adopt the conversational agent as an informational aide and drafting assistant, while maintaining agency through verification and editing. For example, participants saw themselves as curators who received information from the model and filtered it down for their audience. Our findings offer design implications for accessible media tools, including precise editing controls, accessibility support for creative ideation, and configurable rules for human- AI collaboration. View details
    Preview abstract Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL? In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy. We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data. We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL. Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL. In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL. View details
    Reasoning-Driven Synthetic Data Generation and Evaluation
    Tim R. Davidson
    Benoit Seguin
    Transactions on Machine Learning Research (2026)
    Preview abstract Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution — limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount. View details
    On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
    Yehonathan Refael
    Amit Aides
    Aviad Barzilai
    Vered Silverman
    Bolous Jaber
    Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops (2026), pp. 886-894
    Preview abstract Open-vocabulary object detection (OVD) models offer remarkable flexibility applications by enabling object detection from arbitrary text queries. Still, the zero-shot performance of the pre-trained models is hampered by the inherent semantic ambiguity of natural language, result to low precision, leading to insufficient crucial downstream applications. For instance, in the remote sensing (RS) domain, a query for "ship" can yield varied and contextually irrelevant results. To address this, for real time applications, we propose a novel cascaded architecture that synergizes the broad capabilities of a large, pre-trained OVD model with a lightweight, few-shot classifier. Our approach utilizes the frozen weights of the zero-shot model to generate initial, high-recall object-embedding proposals, which are then refined by a compact classifier trained in real-time on a handful of user-annotated examples. The core of our contribution is an efficient one step active learning strategy for selecting the most informative samples for user annotation. Our method identifies (extremely) small amount of an uncertain candidates near the theoretical decision boundary using density estimation and then applies clustering to ensure a diverse training set. This targeted sampling enables our cascaded system to elevate performance on standard remote sensing benchmarks. Our work thus presents a practical and resource-efficient framework for adapting foundational models to specific user needs, drastically reducing annotation overhead while achieving high accuracy without costly full-model fine-tuning. View details
    Preview abstract Validating conversational artificial intelligence (AI) for regulated medical software applications may present challenges, as static test datasets and manual review may be limited in identifying emergent, conversational anomalies. A multi-agent AI system may be configured in a closed-loop for automated validation. The system can, for example, utilize an end user persona simulator agent to generate prompts for a target model and a domain /regulatory expert adjudicator agent to evaluate the target model’s responses against a configurable rubric. A meta-analysis agent can analyze anomalies to identify underlying vulnerabilities, which may then be used to programmatically synthesize new adversarial personas. This adaptive process can generate evidence to support regulatory compliance and continuous performance monitoring for medical software algorithms systems. View details
    Robust Wireless Resource Allocation Against Adversarial Jamming
    Christos Tsoufis
    Dionysia Triantafyllopoulou
    Klaus Moessner
    ICC (2026)
    Preview abstract We study the problem of allocating access point bandwidth to users of a wireless network in the presence of adversarial jamming. Specifically, we consider a setting in which the network designer acts first and allocates access point bandwidth to the users of the network, before an adversary applies a jamming strategy to reduce the bandwidth of a subset (or all) of the access points. We consider a strong adversary who has complete information and can optimize the jamming strategy, subject to power budget constraints. In turn, the network designer must allocate the resources in anticipation of the adversary's actions. We explain that our model gives rise to a special network interdiction model, which differs from the standard setting in two ways: The first is that the interdictor is given the benefit of responding, rather than leading the game. The second is that the interdiction is fractional and performed at the node level of the network. The interdiction then propagates to all edges incident to the access point. In terms of technical results, we provide an allocation algorithm that is based on linear programming duality and show that the algorithm can solve the problem optimally, assuming knowledge of the adversary's budget constraints. We conduct experiments on synthetic data to show the extent to which the algorithm improves the total utilized bandwidth over the algorithm that optimizes bandwidth allocation while being oblivious to the adversary's existence. View details
    ×