Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11349 publications
Preview abstract Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL? In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy. We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data. We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL. Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL. In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL. View details
Beyond PII: How Users Perceive and Attempt to Mitigate Implicit LLM Inference
Synthia Wang
Nick Feamster
Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI), Association for Computing Machinery
Preview abstract Large Language Models (LLMs) such as ChatGPT can infer personal attributes from seemingly innocuous text, raising privacy risks beyond memorized data leakage. While prior work has demonstrated these risks, little is known about how users estimate and respond. We conducted a survey with 240 U.S. participants who judged text snippets for inference risks, reported concern levels, and attempted rewrites to block inference. We compared their rewrites with those generated by ChatGPT and Rescriber, a state-of-the-art sanitization tool. Results show that participants struggled to anticipate inference, performing a little better than chance. User rewrites were effective in just 28% of cases - better than Rescriber but worse than ChatGPT. We examined our participants’ rewriting strategies, and observed that while paraphrasing was the most common strategy it is also the least effective; instead abstraction and adding ambiguity were more successful. Our work highlights the importance of inference-aware design in LLM interactions. View details
The Perfection Paradox: From Architect to Curator in AI-Assisted API Design
JJ Geewax
David R Karger
Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26), ACM, Barcelona, Spain, TBD
Preview abstract Enterprise API design is often bottlenecked by the tension between rapid feature delivery and the rigorous maintenance of usability standards. We present an industrial case study evaluating an AI-assisted design workflow trained on API Improvement Proposals(AIPs). Through a controlled study with 16 industry experts, we compared AI-generated API specifications against human-authored ones. While quantitative results indicated AI superiority in 10 of 11 usability dimensions and an 87% reduction in authoring time, qualitative analysis revealed a paradox: experts frequently misidentified AI work as human (19% accuracy) yet described the designs as unsettlingly “perfect.” We characterize this as a “Perfection Paradox”—where hyper-consistency signals a lack of pragmatic human judgment. We discuss the implications of this perfection paradox, proposing a shift in the human designer’s role from the “drafter” of specifications to the “curator” of AI-generated patterns. View details
Preview abstract The emergence of Agentic AI—autonomous systems capable of reasoning, decision-making, and multi-step execution—represents a paradigm shift in enterprise technology. Moving beyond simple generative tasks, these agents offer the potential to solve long-standing industry pain points, with over 90% of enterprises planning integration within the next three years. However, the transition from successful proof-of-concept (PoC) to a resilient, production-grade system presents significant hurdles. This article categorizes these challenges into three primary domains: Technical and Engineering Hurdles: Issues such as "entangled workflows" that complicate debugging, the struggle to maintain output quality and mitigate hallucinations, and the unpredictability caused by shifting underlying models or data sources. People, Process, and Ecosystem Hurdles: The high operational costs and unclear ROI of large models, the necessity of a new "Agent Ops" skillset, the complexity of integrating agents with disparate enterprise systems, and a rapidly evolving regulatory landscape. The Pace of Change and Security risks: The technical debt incurred by shifting software frameworks and the expanded attack surface created by autonomous agents. The article concludes that successful deployment requires a shift from informal "vibe-testing" to rigorous engineering discipline. By adopting code-first frameworks, establishing robust evaluation metrics (KPIs), and prioritizing functional deployment over theoretical optimization, organizations can effectively manage the lifecycle of Agentic AI and realize its transformative business value. View details
Preview abstract Mid-air gestures in Extended Reality (XR) often lead to fatigue, discomfort and imprecision, limiting their suitability for extended use. Surface-based interactions offer a compelling alternative, providing improved accuracy, speed, and comfort. However, current egocentric vision-based methods struggle with reliable surface inputs due to challenges in hand tracking and surface-plane estimation from oblique and occluded viewing angles. To this extent, we introduce SurfaceXR, a novel sensor fusion approach that combines headset based hand tracking with micro-vibration data sampled from commodity smartwatch IMUs to enable precise and robust inputs on arbitrary surfaces. Our system is designed with flexibility in mind - it can function using only hand tracking, only IMU sensing, or optimally with both modalities combined. Our user study across 12 participants validates SurfaceXR's effectiveness in augmenting surface touch tracking and 8 class hand-surface gesture recognition, demonstrating significant improvements over single-modality approaches. Enabled by SurfaceXR, we demonstrate a series of interactive apps for both AR and VR, ranging from on-surface sketching, text entry and gesture based navigation. View details
Performance analysis of updated Sleep Tracking algorithms across Google and Fitbit wearable devices
Arno Charton
Linda Lei
Siddhant Swaroop
Marius Guerard
Michael Dixon
Logan Niehaus
Shao-Po Ma
Logan Schneider
Ross Wilkinson
Ryan Gillard
Conor Heneghan
Pramod Rudrapatna
Mark Malhotra
Shwetak Patel
Google, Google, 1600 Amphitheatre Parkway Mountain View, CA 94043 (2026) (to appear)
Preview abstract Background: The general public has increasingly adopted consumer wearables for sleep tracking over the past 15 years, but reports on performance versus gold standards such as polysomnogram (PSG), high quality sleep diaries and at-home portable EEG systems still show potential for improved performance. Two aspects in particular are worthy of consideration: (a) improved recognition of sleep sessions (times when a person is in bed and has attempted to sleep), and (b) improved accuracy on recognizing sleep stages relative to an accepted standard such as PSG. Aims: This study aimed to: 1) provide an update on the methodology and performance of a system for correctly recognizing valid sleep sessions, and 2) detail an updated description of how sleep stages are calculated using accelerometer and inter-beat intervals Methods: Novel machine learning algorithms were developed to recognize sleep sessions and sleep stages using accelerometer sensors and inter-beat intervals derived from the watch or tracker photoplethysmogram. Algorithms were developed on over 3000 nights of human-scored free-living sleep sessions from a representative population of 122 subjects, and then tested on an independent validation set of 47 users. Within sleep sessions, an algorithm was developed to recognize periods when the user was attempting to sleep (Time-Attempting-To-Sleep = TATS). For sleep stage estimation, an algorithm was trained on human expert-scored polysomnograms, and then tested on 50 withheld subject nights for its ability to recognize Wake, Light (N1/N2), Deep (N3) and REM sleep relative to expert scored labels. Results: For sleep session estimation, the algorithm had at least 95% overlap on TATS with human consensus scoring for 94% of nights from healthy sleepers. For sleep stage estimation, comparing with the current Fitbit algorithm, Cohen’s kappa for four-class determination of sleep stage increased from an average of 0.56 (std 0.13) to 0.63 (std 0.12), and average accuracy increased from 71% (std 0.10) to 77% (std 0.078) Conclusion: A set of new algorithms has been developed and tested on Fitbit and Pixel Watches and is capable of providing robust and accurate measurement of sleep in free-living environments. View details
Preview abstract Despite advances in high performance computing, accurate numerical simulations of global atmospheric dynamics remain a challenge. The resolution required to fully resolve the vast range scales as well as the strong coupling with—often not fully-understood—physics renders such simulations computationally infeasible over time horizons relevant for long-term climate risk assessment. While data-driven parameterizations have shown some promise of alleviating these obstacles, the scarcity of high-quality training data and their lack of long-term stability typically hinders their ability to capture the risk of rare extreme events. In this work we present a general strategy for training variational (probabilistic) neural network models to non-intrusively correct under-resolved long-time simulations of turbulent climate systems. The approach is based on the paradigm introduced by Barthel Sorensen et al. (2024, https://doi.org/10.1029/2023ms004122) which involves training a post-processing correction operator on under-resolved simulations nudged toward a high-fidelity reference. Our variational framework enables us to learn the dynamics of the underlying system from very little training data and thus drastically improve the extrapolation capabilities of the previous deterministic state-of-the art—even when the statistics of that training data are far from converged. We investigate and compare three recently introduced variational network architectures and illustrate the benefits of our approach on an anisotropic quasi-geostrophic flow. For this prototype model our approach is able to not only accurately capture global statistics, but also the anistropic regional variation and the statistics of multiple extreme event metrics—demonstrating significant improvement over previously introduced deterministic architectures. View details
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
Yu Jiang
Hanwen Jiang
Vincent Chu
Brandon Y. Feng
Zhangyang Wang
Qixing Huang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (2026)
Preview abstract With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach. View details
Preview abstract Enterprise service centers, particularly in domains like People Operations, are critical hubs of organizational knowledge work. They face a persistent difficulty in disseminating the tacit, case-specific expertise of senior agents, which can lead to inconsistent service and slower onboarding for new hires. While existing Knowledge Management (KM) and Case-Based Reasoning (CBR) systems have improved the retrieval of historically similar cases, they inadvertently shift the cognitive burden of synthesizing this information to the time-constrained agent. This paper introduces the Dynamic Case Precedent (DCP) architecture, a novel socio-technical framework designed to address this gap. The DCP architecture moves beyond simple precedent recommendation to automated precedent synthesis. It achieves this by integrating a semantic retrieval model with the large-context reasoning capabilities of a generative Large Language Model (LLM). We propose a three-pillar framework—(1) Contextual Similarity Indexing, (2) Generative Insight Synthesis, and (3) Human-in-the-Loop Refinement. By analyzing multiple relevant historical cases to generate a concise summary of resolution patterns, the DCP architecture aims to reduce agent cognitive load, accelerate proficiency, and improve service consistency. This conceptual framework offers a new model for human-AI collaboration, framing the AI not as a mere information tool, but as an active partner in sensemaking. View details
Preview abstract Systems for escalating interactions from automated agents to human agents can create inefficiencies, for example, by transferring unstructured transcripts. An intermediary system can employ a generative artificial intelligence synthesis engine to process the context of an automated interaction upon an escalation trigger. The engine may analyze the dialogue transcript, user metadata, and the automated agent's internal state to perform semantic abstraction, diagnose potential failure points, and infer a possible resolution. The system can then generate a structured briefing for the human agent, which could include a concise summary, a failure diagnosis, or a recommended next action presented as an interactive element. This process may facilitate a more efficient handoff and contribute to an improved escalation workflow by providing the human agent with synthesized, contextual information. View details
Preview abstract The field of Human-Computer Interaction is approaching a critical inflection point, moving beyond the era of static, deterministic systems into a new age of self-evolving systems. We introduce the concept of Adaptive generative interfaces that move beyond static artifacts to autonomously expand their own feature sets at runtime. Rather than relying on fixed layouts, these systems utilize generative methods to morph and grow in real-time based on a user’s immediate intent. The system operates through three core mechanisms: Directed synthesis (generating new features from direct commands), Inferred synthesis (generating new features for unmet needs via inferred commands), and Real-time adaptation (dynamically restructuring the interface's visual and functional properties at runtime). To empirically validate this paradigm, we executed a within-subject (repeated measures) comparative study (N=72) utilizing 'Penny,' a digital banking prototype. The experimental design employed a counterbalanced Latin Square approach to mitigate order effects, such as learning bias and fatigue, while comparing Deterministic interfaces baseline against an Adaptive generative interfaces. Participant performance was verified through objective screen-capture evidence, with perceived usability quantified using the industry-standard System Usability Scale (SUS). The results demonstrated a profound shift in user experience: the Adaptive generative version achieved a System Usability Scale (SUS) score of 84.38 ('Excellent'), significantly outperforming the Deterministic version’s score of 53.96 ('Poor'). With a statistically significant mean difference of 30.42 points (p < 0.0001) and a large effect size (d=1.04), these findings confirm that reducing 'navigation tax' through adaptive generative interfaces directly correlates with a substantial increase in perceived usability. We conclude that deterministic interfaces are no longer sufficient to manage the complexity of modern workflows. The future of software lies not in a fixed set of pre-shipped features, but in dynamic capability sets that grow, adapt, and restructure themselves in real-time to meet the specific intent of the user. This paradigm shift necessitates a fundamental transformation in product development, requiring designers to transcend traditional, linear workflows and evolve into 'System Builders'—architects of the design principles and rules that facilitate this new age of self-evolving software. View details
Preview abstract We introduce a new context-enriched time series forecasting benchmark TimesX. TimesX contains a wide selection of high-quality real-world time series and diverse textual contexts from an automated generating pipeline, which helps address three main issues of existing benchmarks: (1) poor generalization due to low data volume and data being synthetic, (2) restricted forms of context, and (3) an inability to mitigate data leakage. We conduct a thorough empirical study of current multimodal solutions on TimesX. Our results suggest that most multimodal solutions that work well on existing benchmarks may fail on TimesX. In contrast, simple ensemble methods that leverage the rich textual context can outperform strong unimodal baselines and other multimodal baselines. ** Below this is what was submitted to ITP. ** We create a real world multimodal time-series forecasting benchmark that encompasses diverse domains and regions. Each time-series is annotated by various kinds of contexts like metadata, date and holiday information, dynamic events related to the time-series. This is sufficiently more advanced than other available benchmarks which rely wither on static metadata alone or synthetic examples. This forms a test bed for multimodal forecasting. We also present some baseline results showing that ensembles of publicly available LLMs and time-series foundation models can demonstrate non-trivial performance on this bechmark. View details
Preview abstract Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model’s advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide. View details
Bi-level Hierarchical Neural Contextual Bandits for Online Recommendation
Yunzhe Qi
Yikun Ban
Allan Stewart
Chuanwei Ruan
Jiachuan He
Shishir Kumar Prasad
Haixun Wang
Jingrui He
Transactions on Machine Learning Research (2026)
Preview abstract Contextual bandit algorithms aim to identify the optimal choice among a set of candidate arms, based on their contextual information. Among others, the neural contextual bandit algorithms have demonstrated generally superior performance compared to traditional linear and kernel-based methods. Nevertheless, neural methods are not inherently suitable to handle a large number of candidate arms due to their high computational cost when performing neural exploration. Motivated by the widespread availability of arm category information (e.g., movie genres, retailer types), we formulate contextual bandits into a bi-level recommendation problem based on the accessible arm category information, and propose a novel neural bandit framework, named H2N-Bandit, which utilizes a bi-level hierarchical neural structure to mitigate the substantial computational cost found in conventional neural bandit methods. To demonstrate its effectiveness, we provide the regret bound for H2N-Bandit under the over-parameterized neural bandit settings. Furthermore, to illustrate its efficiency, we conduct extensive experiments on multiple real-world public data sets with various specifications, showing that H2N-Bandit can significantly reduce the computational cost over existing non-linear methods while achieving better or comparable performances against state-of-the-art baselines. View details
Preview abstract This talk addresses the challenges of operating Google's monitoring systems at scale, handling terabytes of telemetry data and preventing overload from diverse workloads. We'll explore how Google's internal client library and Monarch, its planet-scale time-series database, work together for cost-effective data collection. Key principles include a distributed push model, dynamic client-side data reduction, centralized retention, and periodic metric analysis. The session will then bridge these concepts to the open-source world, discussing our work with OpenTelemetry's OpAMP protocol to achieve similar scalable and efficient telemetry collection. Attendees will gain insights into adapting these principles for cost savings and learn about our collaboration with the OpAMP SIG to benefit the broader community. View details
×