Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 11319 publications
    The rapid adoption of agentic systems powered by large language models (LLMs) introduces significant security challenges distinct from plain conversational models, particularly concerning prompt injection and tool misuse due to their dynamic personas and real-world tool interactions. This paper investigates the effectiveness of hardened security prompting in a task-oriented multi-agent framework, using a coding assistant as a representative case study. We compare a baseline “unhardened” agent against a “hardened” version equipped with explicit security guidelines applied across all sub-agents. Our evaluation across 150+ single-turn and 32 multi-turn attack scenarios demonstrates that prompt hardening dramatically improves resilience. With a simple, approximately 500-token security hardener, single-turn failure rates dropped from 19.48% to 2.60%, while multi-turn failure rates decreased from 75.00% to 46.88%. Furthermore, we show that successfully bypassing the hardened agent requires significantly more adversarial effort and a greater number of chat turns. However, the analysis also reveals a critical shift in vulnerability taxonomy: as direct attacks fail, adversaries exploit the agent’s core functionality via “Functional Wrappers” (Intent Obfuscation), highlighting a residual risk that necessitates a shift in the defensive paradigm from static filters to dynamic runtime state and intent analysis.
    Being able to understand the security and privacy (S&P) concerns of IoT users brings benefits to both developers and users. To learn about users' views, we examine IoT reviews on Amazon, one of the biggest IoT markets. This work presents a state-of-the-art methodology to identify and categorize reviews in which users express S&P concerns. We developed an automated pipeline by fine-tuning GPT-3.5-Turbo to build two models: the Classifier-Rationalizer-Categorizer and the Thematic Mapper. By leveraging dynamic few-shot prompting and the model's large context size, our pipeline achieved over 97% precision and recall, significantly outperforming keyword-based and classical ML methods. We applied our pipeline to 91K Amazon reviews about fitness trackers, smart speakers and cameras, over multiple years. We found that on average 5% contained S&P concerns, while security cameras exhibited the highest prevalence at 10%. Our method detected significantly more S&P-relevant reviews than prior works: 15x more for fitness trackers, 29% more for smart speakers, and 70% more for cameras. Our longitudinal analysis reveals that concerns like surveillance and data control have persisted for years, suggesting limited industry progress. We demonstrate that across all device types, users consistently demand more precise control over what data is collected and shared. We uncover challenges in multi-user and multi-device interactions, identifying two previously unreported themes concerning inadequate controls for account separation and data access. These findings, ranging from broad persistent trends to specific instances of customer loss, offer actionable insights for developers to improve user satisfaction and trust.
    Performance analysis of updated Sleep Tracking algorithms across Google and Fitbit wearable devices
    Arno Charton
    Linda Lei
    Siddhant Swaroop
    Marius Guerard
    Michael Dixon
    Logan Niehaus
    Shao-Po Ma
    Logan Schneider
    Ross Wilkinson
    Ryan Gillard
    Conor Heneghan
    Pramod Rudrapatna
    Mark Malhotra
    Shwetak Patel
    Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043 (2026) (to appear)
    Background: The general public has increasingly adopted consumer wearables for sleep tracking over the past 15 years, but reports on performance versus gold standards such as polysomnogram (PSG), high-quality sleep diaries and at-home portable EEG systems still show potential for improved performance. Two aspects in particular are worthy of consideration: (a) improved recognition of sleep sessions (times when a person is in bed and has attempted to sleep), and (b) improved accuracy on recognizing sleep stages relative to an accepted standard such as PSG. Aims: This study aimed to: 1) provide an update on the methodology and performance of a system for correctly recognizing valid sleep sessions, and 2) detail an updated description of how sleep stages are calculated using accelerometer and inter-beat intervals. Methods: Novel machine learning algorithms were developed to recognize sleep sessions and sleep stages using accelerometer sensors and inter-beat intervals derived from the watch or tracker photoplethysmogram. Algorithms were developed on over 3000 nights of human-scored free-living sleep sessions from a representative population of 122 subjects, and then tested on an independent validation set of 47 users. Within sleep sessions, an algorithm was developed to recognize periods when the user was attempting to sleep (Time-Attempting-To-Sleep = TATS). For sleep stage estimation, an algorithm was trained on human expert-scored polysomnograms, and then tested on 50 withheld subject nights for its ability to recognize Wake, Light (N1/N2), Deep (N3) and REM sleep relative to expert-scored labels. Results: For sleep session estimation, the algorithm had at least 95% overlap on TATS with human consensus scoring for 94% of nights from healthy sleepers. For sleep stage estimation, compared with the current Fitbit algorithm, Cohen’s kappa for four-class determination of sleep stage increased from an average of 0.56 (std 0.13) to 0.63 (std 0.12), and average accuracy increased from 71% (std 0.10) to 77% (std 0.078). Conclusion: A set of new algorithms has been developed and tested on Fitbit and Pixel Watches and is capable of providing robust and accurate measurement of sleep in free-living environments.
    Agentic Coding Needs Proactivity, Not Just Autonomy
    Georgios Evangelopoulos
    (2026) (to appear)
    Coding agents are rapidly changing the landscape of software development, moving from inline completion to autonomous systems that edit repositories, open pull requests, respond to issues, and run scheduled or webhook-triggered routines across the development life cycle. The next generation is increasingly described as proactive and long-horizon: agents should notice relevant changes before the developer asks, connect signals across tools, decide when to interrupt, and carry preferences across sessions. Yet the field lacks a precise account of what proactivity means for software development, how it differs from autonomy, what acceptance criteria proactive long-horizon tasks should satisfy, and which metrics determine whether unsolicited agent behavior is useful rather than merely active. We argue that proactive coding agents should be evaluated by the quality and improvement of their insight policy: the policy that decides what matters next, what evidence supports it, whether to surface it, and how to adapt after feedback. We re-anchor this view in mixed-initiative interaction, introduce a three-level taxonomy (Reactive, Scheduled, and Situation Aware), compare contemporary coding agents against five operational criteria, and sketch an active user simulation protocol with three evaluation targets: Insight Decision Quality (IDQ), Context Grounding Score (CGS), and Learning Lift (LL).
    In a prior column, we wrote about how measuring productivity can be viewed as a form of modeling and that all models are wrong, but some are useful. That discussion centered on the idea of ensuring that a productivity model was inclusive of multiple metrics and that those metrics covered the various facets of productivity and covered each facet reasonably well. In that article, we set aside the question of what makes a good individual productivity metric that can be combined with others into a (hopefully) useful model of productivity. In this article, we’ll share some things we consider when building an individual metric, including an example of a novel metric we built in the aftermath of the COVID pandemic.
    Reasoning-Driven Synthetic Data Generation and Evaluation
    Tim R. Davidson
    Benoit Seguin
    Transactions on Machine Learning Research (2026)
    Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution, limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.
    Contrail microphysical simulations and climate simulations have indicated that contrail cirrus cause a substantial fraction of aviation’s climate impact. While the approximations and parameter selections in these simulations have been well-validated over the past two decades, the heat trapping of contrails has not been observed using satellite data beyond a few hours. This is because contrails lose their linear shape after a few hours, making them difficult to distinguish from natural cirrus clouds. Here we provide satellite-driven analysis of long-lived heat trapping by contrails over North and South America. We aggregate a dataset of GOES-16 estimated outgoing longwave radiation and advected trace density of flight paths, and apply causal inference to discern the effect of contrails while controlling for radiative and cloud confounders. As a means of validation, we also generate synthetic datasets with known ground truth, and confirm that applying the causal inference method is able to recover the synthetic ground truth. Since this method yields an estimate which has some differences from both “instantaneous radiative forcing” (iRF) and “effective radiative forcing” (ERF) estimates which have been reported in the literature so far, we introduce the new term “observational radiative forcing, 12 hours” (oRF12). Our analysis estimates the longwave oRF12 from contrails over the Americas averaged 47.9 gigajoules per flight kilometer (95% CI: 31 to 52 GJ/km) during April 2019 to April 2020.
    SNPeek: Side-Channel Analysis for Privacy Applications on Confidential VMs
    Ruiyi Zhang
    Albert Cheu
    Adria Gascon
    Michael Schwarz
    Octavian Suciu
    Network and Distributed System Security (NDSS) (2026)
    Confidential virtual machines (CVMs) based on trusted execution environments (TEEs) enable new privacy-preserving solutions. But CVMs are not a privacy panacea, as they are vulnerable to side-channel attacks that may compromise the confidentiality of workloads. In this work, we develop the FARFETCH’D framework to help developers evaluate side-channel assisted privacy attacks that are broadly applicable to CVMs. The privacy reduction due to these attacks heavily depends on the execution environment and the workload, which vary vastly: What are the available attack primitives? How does the particular privacy workload behave? This makes manual investigation and efficient mitigation of software-based side channels a cumbersome and impossible task. FARFETCH’D solves this challenge by providing a set of configurable attack primitives that can execute on real CVM hardware and automated ML-based analysis pipelines. We evaluate the effectiveness of FARFETCH’D on privacy-preserving workloads. Our results show that our approach is effective at pinpointing the vulnerability of privacy apps against side channels and helps evaluate mitigations based on oblivious memory and differential privacy.
    In "Elephants, Goldfish and the New Golden Age of Software Engineering," the author discusses how AI is changing knowledge work, especially software development. Written from the perspective of April 2026, the article points out that while AI speeds up coding, it can also quickly generate a lot of mistakes and messy code if it isn't carefully managed by human oversight and clear processes. The paper outlines a practical approach to working with AI, broken down into three main sections:
    * **Using AI as a Tool, Not a Toy:** The author notes that people often get poor results by asking AI to do everything in a single prompt. Instead, users should have back-and-forth conversations with AI to question assumptions, set clear grading rules, and guide the research. The main point is that humans must still provide the final judgment; AI is simply a way to speed up and record that thinking.
    * **The Elephant-Goldfish Model:** As AI creates more code than humans can easily read, written design documents become more important than the code itself. To keep AI on track, the author suggests a two-part method:
      * **The Elephant:** A long chat session where the human and AI discuss ideas and write a detailed design document *before* any code is written. This session holds all of the project's background information and decisions.
      * **The Goldfish:** A brand-new AI chat session with no memory. The human asks this "goldfish" to read the design document. If the goldfish cannot understand the plan based only on that document, the document needs more details.
      * Only after the design document is clear enough for the goldfish to understand does the human ask the AI to write the code based on those strict instructions.
    * **Managing AI and the Future of Work:** The author expects that regular employees will soon act like managers, overseeing multiple AI helpers. Because of this, workers need to learn basic management skills, like how to delegate tasks and set clear boundaries. Also, since AI will handle routine chores, humans will need to practice focusing for longer periods to do deeper, harder thinking. Ultimately, a worker's value will come from their planning and decision-making skills, rather than their ability to type code.
    Expert evaluation of LLM world models: A high-Tc superconductivity case study
    Haoyu Guo
    Maria Tikhanovskaya
    Paul Raccuglia
    Alexey Vlaskin
    Chris Co
    Scott Ellsworth
    Matthew Abraham
    Lizzie Dorfman
    Peter Armitage
    Chunhan Feng
    Antoine Georges
    Olivier Gingras
    Dominik Kiese
    Steve Kivelson
    Vadim Oganesyan
    Brad Ramshaw
    Subir Sachdev
    Senthil Todadri
    John Tranquada
    Eun-Ah Kim
    Proceedings of the National Academy of Sciences (2026)
    Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. This work evaluates the performance of six different LLM-based systems for answering scientific literature questions, including commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. We conduct a rigorous expert evaluation of the systems in the domain of high-temperature cuprate superconductors, a research area that involves material science, experimental physics, computation, and theoretical physics. We use an expert-curated database of 1726 scientific papers and a set of 67 expert-formulated questions. The evaluation employs a multi-faceted rubric assessing balanced perspectives, factual comprehensiveness, succinctness, evidentiary support, and image relevance. Our results demonstrate that RAG-based systems, powered by curated data and multimodal retrieval, outperform existing closed models across key metrics, particularly in providing comprehensive and well-supported answers, and in retrieving relevant visual information. This study provides valuable insights into designing and evaluating specialized scientific literature understanding systems, particularly with expert involvement, while also highlighting the importance of rich, domain-specific data in such systems.
    Mull-Tokens: Modality-Agnostic Latent Thinking
    Arijit Ray
    Chengzhi Mao
    Bryan A. Plummer
    Kate Saenko
    Ranjay Krishna
    Leonidas Guibas
    Vincent Chu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (Findings) (2026) (to appear)
    Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
    Optimizing large-language model (LLM) training and serving on large-scale distributed systems with hundreds and thousands of accelerators is always a challenging task due to the fast-evolving LLMs, strong domain expertise required, and various optimization goals from different workloads. Existing methods rely on either handcrafted optimization performed by human experts, which is tedious and time-consuming, or resource-intensive black-box searches, which lack the extensibility to keep pace with evolving models and hardware. To address this, we introduce PROMPTS, a novel multi-agent framework that complements traditional search methods with expert-informed reasoning. It automates the diagnosis of performance bottlenecks by synthesizing profiler data and leverages a knowledge base to propose optimized sharding configurations with detailed justifications. Across eight real-world production workloads, PROMPTS demonstrated remarkable efficiency and accuracy, delivering performance improvements of up to 434%. These workloads spanned diverse model architectures, hardware platforms, computational scales, and various stages of the machine learning lifecycle (pre-training, serving, and post-training). In every case, the configuration adopted by human engineers was identified within the agent's top three proposals from a single invocation. Furthermore, the agent's top-ranked recommendation was the one ultimately adopted in 87.5% of cases, showcasing its ability to not only find optimized solutions, but also to correctly prioritize them. Our work establishes PROMPTS as a scalable, extensible, and explainable methodology for AI-assisted performance engineering in large-scale ML systems.
    Towards AI as a Collaborative Partner: A Taxonomy of AI Agent Behavior in Software Engineering
    Sherry Y. Shi
    Proceedings of the 3rd ACM International Conference on AI-Powered Software (AIware '26), ACM, Montreal, QC, Canada (2026) (to appear)
    The ongoing transition of Large Language Models (LLMs) in software engineering from one-shot code generators into agentic partners requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an interactive software engineering (SWE) agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable SWE agent behaviors, synthesized from 91 sets of developer-defined rules for SWE agents and validated through interviewing 15 experienced professional developers. In this taxonomy, we identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the Developer. These findings offer a concrete vocabulary for aligning SWE agent behavior with developer preferences, enabling researchers and practitioners to move beyond correctness-only benchmarks and start designing evaluations that reflect the socio-technical nature of professional software development in enterprises.
    From Correctness to Collaboration: A Human-Centered Taxonomy of AI Agent Behavior in Software Engineering
    Sherry Y. Shi
    Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA ’26), ACM, New York, NY, USA (2026)
    The ongoing transition of Large Language Models in software engineering from code generators into autonomous agents requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable agent behaviors, synthesized from 91 sets of user-defined rules for coding agents. We identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the User. These findings offer a concrete vocabulary for agent behavior, enabling researchers to move beyond correctness-only benchmarks and design evaluations that reflect the realities of professional software development in large enterprises.
    This disclosure describes systems and methods for a multi-agent framework that can automate and scale cognitive work. The framework can, for example, use a cognitive assembly line of specialized computational agents to perform tasks such as research and drafting. A beneficial component could be an adversarial review panel (ARP), which is a multi-agent review system where distinct agent personas critique a generated draft from varied perspectives. The structured feedback from the ARP can be used to automatically iterate on and refine the work product. This approach can improve the intellectual rigor of generated content and reduce the time required for production, which may allow human operators to focus on activities such as strategic oversight and final validation.