Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11243 publications
    Phoenix: Rowhammer Attacks on DDR5 with Self-Correcting Synchronization
    Michele Marazzi
    Kaveh Razavi
    Salman Qazi
    Diego Meyer
    Patrick Jattke
    IEEE Security & Privacy (S&P) (2026)
    Preview
    Preview abstract This writeup defines the Hydration Proxy Pattern, a framework for building stateful conversational data systems over stateless LLM APIs. It describes a platform-agnostic approach to decoupling persistence from the AI provider through secure server-side intermediation and hybrid storage tiers. The abstract provides a blueprint for managing the "Persistence Gap" in enterprise AI integrations, detailing high-level strategies for session history management, streaming, and multi-stage semantic grounding without disclosing specific internal implementation details. View details
    Preview abstract A growing body of qualitative research has identified contextual risk factors that elevate people’s chances of experiencing digital-safety attacks. However, the lack of quantitative data on the population level distribution of these risk factors prevents policymakers and tech companies from developing targeted, evidence-based interventions to improve digital safety. To address this gap, we surveyed 5,001 adults in the United States to analyze: (1) the frequency of and relationship between digital-safety attacks (e.g., scams, harassment, account hacking), and (2) how these attacks align with 10 contextual risk factors. Nearly half of our respondents identify as resource constrained, which significantly correlates with higher likelihood of experiencing four common attacks. We also present qualitative insights to expand our understanding of the factors beyond the existing literature (e.g., “prominence” included high-visibility roles in local communities). This study provides the first large-scale quantitative analysis correlating digital-safety attacks with contextual risk factors and demographics. View details
    Preview abstract Online financial scams represent a long-standing and serious threat for which people seek help. We present a study to understand people’s in situ motivations for engaging with scams and the help needs they express before, during, and after encountering a scam. We identify the main emotions scammers exploited (e.g., fear, hope) and characterize how they did so. We examine factors—such as financial insecurity and legal precarity—which elevate people’s risk of engaging with specific scams and experiencing harm. We indicate when people sought help and describe their help-seeking needs and emotions at different stages of the scam. We discuss how these needs could be met through the design of contextually-specific prevention, diagnostic, mitigation, and recovery interventions. View details
    Preview abstract Validating conversational artificial intelligence (AI) for regulated medical software applications may present challenges, as static test datasets and manual review may be limited in identifying emergent, conversational anomalies. A multi-agent AI system may be configured in a closed-loop for automated validation. The system can, for example, utilize an end user persona simulator agent to generate prompts for a target model and a domain /regulatory expert adjudicator agent to evaluate the target model’s responses against a configurable rubric. A meta-analysis agent can analyze anomalies to identify underlying vulnerabilities, which may then be used to programmatically synthesize new adversarial personas. This adaptive process can generate evidence to support regulatory compliance and continuous performance monitoring for medical software algorithms systems. View details
    From Correctness to Collaboration: A Human-Centered Taxonomy of AI Agent Behavior in Software Engineering
    Sherry Y. Shi
    Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA ’26), ACM, New York, NY, USA (2026)
    Preview abstract The ongoing transition of Large Language Models in software engineering from code generators into autonomous agents requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable agent behaviors, synthesized from 91 sets of user-defined rules for coding agents. We identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the User. These findings offer a concrete vocabulary for agent behavior, enabling researchers to move beyond correctness-only benchmarks and design evaluations that reflect the realities of professional software development in large enterprises. View details
    Preview abstract Multimodal large language models (LLMs) integrate and process information from multiple modalities such as text, images, audio, and video, enabling complex tasks such as audio translation and visual question answering. While powerful, this complexity introduces novel vulnerabilities to sophisticated adversarial attacks. This survey paper provides a comprehensive overview of this rapidly expanding field, systematically categorizing attacks that range from manipulations of single modalities (e.g., perturbed images or audio) to those exploiting cross-modal interactions. We overview how these attacks exploit weaknesses in model fusion, attention mechanisms, and representation learning and provided analyses on their potential for real-world consequences. View details
    Usability Hasn’t Peaked: Exploring How Expressive Design Overcomes the Usability Plateau
    Alyssa Sheehan
    Bianca Gallardo
    Ying Wang
    Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26), April 13–17, 2026, Barcelona, Spain (2026)
    Preview abstract Critics have argued that mobile usability has largely been optimized, and that only incremental gains are possible. We set out to explore if the newest generation of design systems, which promote greater flexibility and a return to design basics, could produce substantially more usable designs while maintaining or increasing aesthetic judgments. Through a study with 48 diverse participants completing tasks in 10 different applications, we found that in designs created following Material 3 Expressive guidelines, users fixated on the correct screen element for a task 33% faster, completed tasks 20% faster, and rated experiences more positively compared to versions designed using the previous Material design system. These improvements in performance and aesthetic ratings challenge the premise of a usability plateau and show that mobile usability has not peaked. We illustrate specific opportunities to make mobile experiences more usable by returning to design fundamentals while highlighting risks of added flexibility. View details
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv for now (2026) (to appear)
    Preview abstract For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using ``quantum read-only memory.'' We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order or magnitude or more for a variety reasonably-sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step. View details
    Preview abstract Deep-learning methods have boosted the analytical power of Raman spectroscopy, yet they still require large, task-specific, labeled datasets and often fail to transfer across application domains. The study explores pre-trained encoders as a solution. Pre-trained encoders have significantly impacted Natural Language Processing and Computer Vision with their ability to learn transferable representations that can be applied to a variety of datasets, significantly reducing the amount of time and data required to create capable models. The following work puts forward a new approach that applies these benefits to Raman Spectroscopy. The proposed approach, RSPTE (Raman Spectroscopy Pre-Trained Encoder), is designed to learn generalizable spectral representations without labels. RSPTE employs a novel domain adaptation strategy using unsupervised Barlow Twins decorrelation objectives to learn fundamental spectral patterns from multi-domain Raman Spectroscopy datasets containing samples from medicine, biology, and mineralogy. Transferability is demonstrated through evaluation on several models created by fine-tuning RSPTE for different application domains: Medicine (detection of Melanoma and COVID), Biology (Pathogen Identification), and Agriculture. As an example, using only 20% of the dataset, models trained with RSPTE achieve accuracies ranging 50%–86% (depending on the dataset used) while without RSPTE the range is 9%–57%. Using the full dataset, accuracies with RSPTE range 81%–97%, and without pretraining 51%–97%. Current methods and state-of-the-art models in Raman Spectroscopy are compared to RSPTE for context, and RSPTE exhibits competitive results, especially with less data as well. These results provide evidence that the proposed RSPTE model can effectively learn and transfer generalizable spectral features across different domains, achieving accurate results with less data in less time (both data collection time and training time). View details
    ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
    Jihwan Jeong
    Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026), pp. 5270-5304
    Preview abstract LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development. View details
    Preview abstract We introduce AASE (Activation-based AI Safety Enforcement), a framework for post-perception safety monitoring in large language models. Unlike pre-perception approaches that analyze input or output text, AASE monitors the model's internal activation patterns—what the model "understands" rather than what text it processes or generates—enabling detection of safety-relevant states before harmful outputs are produced. The framework comprises three techniques: Activation Fingerprinting (AF) for harmful content detection, Agent Action Gating (AAG) for prompt injection defense, and Activation Policy Compliance (APC) for enterprise policy enforcement. We introduce paired contrastive training to isolate safety-relevant signals from confounding factors such as topic and style, addressing signal entanglement in polysemantic activations. Validation across 7 models from 3 architecture families shows strong class separation: Gemma-2-9B achieves AUC 1.00 with 7.2σ separation across all probes; AAG achieves AUC ≥0.88 across all models on the InjecAgent benchmark; APC achieves 0.97-1.00 AUC across three enterprise policies. Model size correlates with probe quality—Gemma-2-9B (7.2σ separation) outperforms Gemma-2-2B (4.3σ). All techniques survive INT4 quantization with minimal separation degradation. AASE is 9× faster than Llama Guard 3 (33ms vs 306ms) with higher TPR (88% vs 50%) at a tunable threshold that trades FPR for detection sensitivity, adding only 0.002ms probe overhead to existing inference. View details
    Preview abstract The management of a hybrid workforce comprising human and autonomous computational agents may be challenged by the use of separate systems for human capital and software assets, which can create a governance gap. A system can provide a unified framework for managing a hybrid workforce. For example, the system may utilize a labor service mesh to analyze and route tasks to either a human intent tier or an agentic execution tier. A potential principle of the system is structural symmetry, where computational agents can be assigned digital identities and managed through a lifecycle process that may parallel human resource functions, such as onboarding, performance evaluation, and structured offboarding. This integrated approach can facilitate a unified system of record and governance model for an organization's intelligence capacity. View details
    Expert evaluation of LLM world models: A high-Tc superconductivity case study
    Haoyu Guo
    Maria Tikhanovskaya
    Paul Raccuglia
    Alexey Vlaskin
    Chris Co
    Scott Ellsworth
    Matthew Abraham
    Lizzie Dorfman
    Peter Armitage
    Chunhan Feng
    Antoine Georges
    Olivier Gingras
    Dominik Kiese
    Steve Kivelson
    Vadim Oganesyan
    Brad Ramshaw
    Subir Sachdev
    Senthil Todadri
    John Tranquada
    Eun-Ah Kim
    Proceedings of the National Academy of Sciences (2026)
    Preview abstract Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. This work evaluates the performance of six different LLM-based systems for answering scientific literature questions, including commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. We conduct a rigorous expert evaluation of the systems in the domain of high-temperature cuprate superconductors, a research area that involves material science, experimental physics, computation, and theoretical physics. We use an expert-curated database of 1726 scientific papers and a set of 67 expert-formulated questions. The evaluation employs a multi-faceted rubric assessing balanced perspectives, factual comprehensiveness, succinctness, evidentiary support, and image relevance. Our results demonstrate that RAG-based systems, powered by curated data and multimodal retrieval, outperform existing closed models across key metrics, particularly in providing comprehensive and well-supported answers, and in retrieving relevant visual information. This study provides valuable insights into designing and evaluating specialized scientific literature understanding systems, particularly with expert involvement, while also highlighting the importance of rich, domain-specific data in such systems. View details
    Compact Conformal Subgraphs
    Kamesh Munagala
    Aravindan Vijayaraghavan
    ICML (2026)
    Preview abstract Conformal prediction provides rigorous uncertainty guarantees for model outputs but can produce prohibitively large prediction sets in structured domains such as routing, planning, or sequential recommendation. We introduce \emph{graph-based conformal compression}, a framework for constructing compact subgraphs that preserve the statistical validity of conformal prediction while reducing structural complexity. We study a formulation that selects a smallest subgraph capturing a prescribed fraction of conditional probability mass, and reduce to a weighted version of densest $k$-subgraphs in hypergraphs, in the regime where the subgraph has a large fraction of edges. We design efficient approximation algorithms that achieve constant factor coverage and size trade-offs. Our results highlight an algorithmic regime, distinct from classical densest-$k$-subgraph hardness settings, where the problem can be approximated efficiently, bridging conformal prediction with combinatorial graph compression. We finally validate our algorithmic approach on synthetic and real-world instances of trip planning and navigation, showing in each case that our approach handily beats natural baselines. View details
    ×