Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11252 publications
    Preview abstract How many T gates are needed to approximate an arbitrary n-qubit quantum state to within a given precision ϵ? Improving prior work of Low, Kliuchnikov and Schaeffer, we show that the optimal asymptotic scaling is Θ(sqrt{2^n log(1/ε)} + log(1/ε)) if we allow an unlimited number of ancilla qubits. We also show that this is the optimal T-count for implementing an arbitrary diagonal n-qubit unitary to within error ϵ. We describe an application to batched synthesis of single-qubit unitaries: we can approximate a tensor product of m = O(log log(1/ϵ)) arbitrary single-qubit unitaries to within error ϵ with the same asymptotic T-count as is required to approximate just one single-qubit unitary. View details
    Preview abstract When managing complex, unpredictable (non-deterministic) AI agents using simple, fixed control systems (like finite state machines), operational failures and accountability issues often arise. This document introduces a probabilistic governance and telemetry framework to resolve these problems. Instead of following a rigid sequence of steps, this framework defines a multi-dimensional operational boundary, a 'behavioral volume', and assigns the agent a goal. This allows the agent to use its own reasoning to achieve the goal while remaining within the defined boundaries. A separate telemetry layer monitors the agent's actions by calculating metrics, such as alignment scores and drift velocity, to measure how much the agent deviates from its intended behavior. This system provides a method for guiding, monitoring, and securing autonomous agents, effectively managing the performance and security of an unpredictable AI workforce in complex environments. View details
    Preview abstract We introduce KVCIS (KV-Cache Importance Scoring), a novel approach to KV-cache compression that predicts token importance from intermediate-layer activations before attention is computed. Unlike existing methods (H2O, StreamingLLM, Scissorhands) that make compression decisions based on attention scores computed during generation, KVCIS enables proactive compression at cache insertion time—determining how to store each token before paying the computational cost of attention. We discover a two-level importance structure in decoder-only transformers: the beginning-of-sequence (BOS) token acts as an "attention sink" receiving ~76% of attention, while the remaining ~24% is distributed across content tokens with 10-11× importance spread. A simple linear probe achieves R² = 0.998 overall and R² = 0.68–0.79 for discriminating among content tokens. Extensive validation across 3 model families (Llama, Mistral, Gemma), 8 layer depths, context lengths from 256 to 2048 tokens, and multiple downstream tasks demonstrates: 50% memory reduction with zero degradation on NarrativeQA (F1 = 0.064 matching baseline exactly), while uniform quantization degrades by 7.8% at the same compression ratio. KVCIS consistently achieves 5–8× better quality preservation than uniform quantization across all tested context lengths. The memory savings enable increased batch sizes and longer context support; the probe itself adds minimal overhead (~16KB direction vector, 0.06ms per token). This work extends activation-based probing from safety classification to inference optimization, demonstrating that intermediate-layer activations encode predictive signals about token importance for generation. View details
    Preview abstract LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development. View details
    Preview abstract In some multi-stage software build pipelines, downstream compiler errors may be reported against ephemeral, machine-generated intermediate artifacts rather than original, human-written source code, which can make remediation challenging. A system and method may address this by intercepting a downstream error, mapping its location back to the original source file, and programmatically injecting a dormant suppression tag into the original source code. During a subsequent build, an intermediate transpiler can propagate this tag into a newly generated intermediate artifact. In the intermediate file, the tag may become active and be recognized by the downstream compiler as a directive to suppress the specific error. This approach can facilitate an automated remediation process for certain build failures that avoids direct modification of ephemeral files and uses the original source code as a record for suppression. View details
    CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors
    Juro Gottweis
    CJ Park
    Salman Rahman
    Ahmed Metwally
    Hong Yu
    Ivor Rendulic
    Yuzhe Yang
    Petar Sirkovic
    Daniel McDuff
    Shwetak Patel
    Nicolas Stroppa
    Yubin Kim
    Mark Malhotra
    Orson Xu
    Sam Schmidgall
    Tim Althoff
    Elahe Vedadi
    Cynthia Breazeal
    Hae Won Park
    (2026)
    Preview
    Preview abstract The emergence of Agentic AI—autonomous systems capable of reasoning, decision-making, and multi-step execution—represents a paradigm shift in enterprise technology. Moving beyond simple generative tasks, these agents offer the potential to solve long-standing industry pain points, with over 90% of enterprises planning integration within the next three years. However, the transition from successful proof-of-concept (PoC) to a resilient, production-grade system presents significant hurdles. This article categorizes these challenges into three primary domains: Technical and Engineering Hurdles: Issues such as "entangled workflows" that complicate debugging, the struggle to maintain output quality and mitigate hallucinations, and the unpredictability caused by shifting underlying models or data sources. People, Process, and Ecosystem Hurdles: The high operational costs and unclear ROI of large models, the necessity of a new "Agent Ops" skillset, the complexity of integrating agents with disparate enterprise systems, and a rapidly evolving regulatory landscape. The Pace of Change and Security risks: The technical debt incurred by shifting software frameworks and the expanded attack surface created by autonomous agents. The article concludes that successful deployment requires a shift from informal "vibe-testing" to rigorous engineering discipline. By adopting code-first frameworks, establishing robust evaluation metrics (KPIs), and prioritizing functional deployment over theoretical optimization, organizations can effectively manage the lifecycle of Agentic AI and realize its transformative business value. View details
    Preview abstract We introduce AMS (Activation-based Model Scanner), a tool for verifying whether a language model is safe to deploy by analyzing its internal activation patterns. While "uncensored" and maliciously fine-tuned models pose increasing risks, current detection methods rely on behavioral testing that is slow, incomplete, and easily evaded. AMS takes a fundamentally different approach: measuring the geometric structure of safety-relevant concepts in the model's activation space. Safe models exhibit strong class separation (4-8σ) between harmful and benign content; models with removed or degraded safety training show collapsed separation (<2σ). Using contrastive prompt pairs and direction vector analysis, AMS performs model-level verification rather than prompt-level classification. We validate AMS across 14 model configurations spanning 3 architecture families (Llama, Gemma, Qwen), 3 quantization levels (FP16, INT8, INT4), and multiple model categories (instruction-tuned, base, abliterated, uncensored). In our validation set: (1) all four instruction-tuned models pass with 3.8-8.4σ separation; (2) three tested uncensored models (Dolphin, Lexi, LLama-3-8b-Uncensored) flagged as CRITICAL with 1.1-1.3σ on harmful content; (3) an abliterated Llama variant flagged as WARNING (3.33σ); (4) Llama base model shows 0.69σ, confirming absence of safety training; (5) quantization has minimal impact (<5% drift). One model labeled "uncensored" (DarkIdol) unexpectedly passed, suggesting either mislabeling or a technique that preserves activation geometry. AMS also provides identity verification via direction vector comparison. Scanning completes in 10-40 seconds per model on GPU hardware. We discuss threshold calibration, limitations of our validation scope, and directions for broader evaluation. View details
    ARM MTE Performance in Practice
    Taehyun Noh
    Yingchen Wang
    Tal Garfinkel
    Mahesh Madhav
    Mattan Erez
    Shravan Narayan
    Usenix Security (2026)
    Preview
    Preview abstract Source-to-source compilers may perform inefficiently by executing transpilation passes on scripts that do not contain the specific language features a pass is designed to transform, potentially leading to redundant processing. A compiler can analyze a script to generate a per-script feature map, for example, by identifying language features in its abstract syntax tree (AST). Before executing a transpilation pass, the compiler can check this map and may bypass the pass for that script if the specific feature targeted by the pass is not present. This feature map can also be dynamically updated throughout the compilation process as other passes transform the code. This method of conditional pass execution based on content-aware analysis may reduce redundant AST traversals, which could decrease overall compilation time and computational resource consumption. View details
    Preview abstract As the ECMAScript specification evolves, industrial-scale JavaScript compilers face the challenge of supporting modern language syntax while maintaining compatibility for diverse execution environments. Traditionally, compilers solve this by running transpilation passes in a monolithic pipeline, where the transpilation passes are chosen to execute strictly based on a target language level. This results in significant computational waste, as compilers perform expensive Abstract Syntax Tree (AST) traversals to lower features that may not exist in the actual input source code. We present a static analysis improvement that conditionally executes transpiler passes based on accurately tracking and dynamically maintaining the exact set of language features seen in the compilation unit throughout the transpilation process. It is implemented in the production Google Closure Compiler. By populating and maintaining a FeatureSet at every JavaScript script-level, it dynamically skips running the unnecessary lowering passes. We detail the architectural safeguards - including strategic pass ordering and dynamic validation of the transpiled code for feature-correctness. Evaluation of this improvement on large-scale production applications produced a considerable reduction in compilation time and saved compute and memory usage. View details
    MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
    Sieun Kim
    Qianhui Zheng
    Ruoyu Xu
    Ravi Tejasvi
    Anuva Kulkarni
    Junyi Zhu
    2026
    Preview abstract In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g. faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to five concurrent sources (e.g. two voices + three instruments) with ca. 2 second processing latency. We validate MoXaRt through a technical evaluation on a new, complex dataset we collected, and a 22-participant user study. Our results demonstrate that MoXaRt significantly improves communication clarity—boosting listening comprehension in noisy conditions by 33.2% (p=0.0058)—and significantly reduces cognitive load (M=7.50 vs. M=3.36, p<0.001), paving the way for more perceptive and socially adept XR experiences. View details
    Preview abstract Validating conversational artificial intelligence (AI) for regulated medical software applications may present challenges, as static test datasets and manual review may be limited in identifying emergent, conversational anomalies. A multi-agent AI system may be configured in a closed-loop for automated validation. The system can, for example, utilize an end user persona simulator agent to generate prompts for a target model and a domain /regulatory expert adjudicator agent to evaluate the target model’s responses against a configurable rubric. A meta-analysis agent can analyze anomalies to identify underlying vulnerabilities, which may then be used to programmatically synthesize new adversarial personas. This adaptive process can generate evidence to support regulatory compliance and continuous performance monitoring for medical software algorithms systems. View details
    Improved Differentially Private Algorithms for Rank Aggregation
    Phanu Vajanopath
    Quentin Hillebrand
    Vorapong Suppakitpaisarn
    AAAI (2026)
    Preview abstract Rank aggregation is a task of combining the rankings of items from multiple users into a single ranking that best represents the users' rankings. Alabi et al. (AAAI'22) presents differentially-private (DP) polynomial-time approximation schemes (PTASes) and 5-approximation algorithms with certain additive errors for the Kemeny rank aggregation problem in both central and local models. In this paper, we present improved DP PTASes with smaller additive error in the central model. Furthermore, we are first to study the footrule rank aggregation problem under DP. We give a near-optimal algorithm for this problem; as a corollary, this leads to 2-approximation algorithms with the same additive error as the 5-approximation algorithms of Alabi et al. for the Kemeny rank aggregation problem in both central and local models. View details
    An experimental evaluation of an AI-powered interactive learning platform
    Nicole Miller
    Yael Haramaty
    Lidan Hackmon
    Lior Belinsky
    Abraham Oritz Tapia
    Lucy Tootill
    Scott Siebert
    Frontiers in Artificial Intelligence (2026) (to appear)
    Preview abstract Generative AI, which is capable of transforming static content into dynamic learning experiences, holds the potential to revolutionize student engagement in educational contexts. However, questions still remain around whether or not these tools are effective at facilitating student learning. In this research, we test the effectiveness of an AI-powered platform incorporating multiple representations and assessment through Learn Your Way, an experimental research platform that transforms textbook chapters into dynamic visual and audio representations. Through a between-subjects, mixed methods experiment with 60 US-based students, we demonstrate that students who used Learn Your Way had a more positive learning experience and had better learning outcomes compared to students learning the same content through a digital textbook. These findings indicate that AI-driven tools, capable of providing choice among interactive representations of content, constitute an effective and promising method for enhancing student learning. View details
    ×