Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11342 publications
Preview abstract When managing complex, unpredictable (non-deterministic) AI agents using simple, fixed control systems (like finite state machines), operational failures and accountability issues often arise. This document introduces a probabilistic governance and telemetry framework to resolve these problems. Instead of following a rigid sequence of steps, this framework defines a multi-dimensional operational boundary, a 'behavioral volume', and assigns the agent a goal. This allows the agent to use its own reasoning to achieve the goal while remaining within the defined boundaries. A separate telemetry layer monitors the agent's actions by calculating metrics, such as alignment scores and drift velocity, to measure how much the agent deviates from its intended behavior. This system provides a method for guiding, monitoring, and securing autonomous agents, effectively managing the performance and security of an unpredictable AI workforce in complex environments. View details
Preview abstract This writeup defines the Hydration Proxy Pattern, a framework for building stateful conversational data systems over stateless LLM APIs. It describes a platform-agnostic approach to decoupling persistence from the AI provider through secure server-side intermediation and hybrid storage tiers. The abstract provides a blueprint for managing the "Persistence Gap" in enterprise AI integrations, detailing high-level strategies for session history management, streaming, and multi-stage semantic grounding without disclosing specific internal implementation details. View details
Preview abstract Enterprise service delivery platforms, while vital for HR operations, create significant challenges in managing the risks of Personally Identifiable Information (PII) exposure. The integration of Generative AI offers new efficiencies but also amplifies these risks. Existing solutions—ranging from manual redaction and rule-based Data Loss Prevention (DLP) to inflexible data masking—fail to provide a nuanced, integrated approach. This paper introduces the Dual-Mode Privacy Guard (DMPG), a conceptual framework that establishes a model for Augmented Compliance. The framework provides a "defense-in-depth" strategy built on three pillars: (1) a Zero-Trust AI Foundation leveraging a verifiable, non-retention API gateway to ensure data privacy; (2) a proactive "Guardrail" that uses AI to detect and flag potential PII for human-in-the-loop review; and (3) an on-demand "Tool" that allows users to create securely anonymized data assets. By differentiating between proactive monitoring and reactive utility, the DMPG shifts the compliance paradigm from a manual burden to an AI-assisted process that enhances, rather than replaces, human oversight. This paper details the framework’s platform-agnostic architecture, using Salesforce as a reference implementation, and argues for its novelty as a model for operationalizing privacy principles within modern enterprise systems. View details
Compact Conformal Subgraphs
Kamesh Munagala
Aravindan Vijayaraghavan
ICML (2026)
Preview abstract Conformal prediction provides rigorous uncertainty guarantees for model outputs but can produce prohibitively large prediction sets in structured domains such as routing, planning, or sequential recommendation. We introduce \emph{graph-based conformal compression}, a framework for constructing compact subgraphs that preserve the statistical validity of conformal prediction while reducing structural complexity. We study a formulation that selects a smallest subgraph capturing a prescribed fraction of conditional probability mass, and reduce to a weighted version of densest $k$-subgraphs in hypergraphs, in the regime where the subgraph has a large fraction of edges. We design efficient approximation algorithms that achieve constant factor coverage and size trade-offs. Our results highlight an algorithmic regime, distinct from classical densest-$k$-subgraph hardness settings, where the problem can be approximated efficiently, bridging conformal prediction with combinatorial graph compression. We finally validate our algorithmic approach on synthetic and real-world instances of trip planning and navigation, showing in each case that our approach handily beats natural baselines. View details
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Guy Tennenholtz
Jihwan Jeong
The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026)
Preview abstract LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development. View details
Preview abstract Source-to-source compilers may perform inefficiently by executing transpilation passes on scripts that do not contain the specific language features a pass is designed to transform, potentially leading to redundant processing. A compiler can analyze a script to generate a per-script feature map, for example, by identifying language features in its abstract syntax tree (AST). Before executing a transpilation pass, the compiler can check this map and may bypass the pass for that script if the specific feature targeted by the pass is not present. This feature map can also be dynamically updated throughout the compilation process as other passes transform the code. This method of conditional pass execution based on content-aware analysis may reduce redundant AST traversals, which could decrease overall compilation time and computational resource consumption. View details
Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
Han Zhou
Shariq Iqbal
Ivan Vulić
Anna Korhonen
International Conference on Learning Representations (ICLR) (2026)
Preview abstract Large language models (LLMs), employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with {prompts} that declare their functionality, along with the {workflows} that orchestrate interactions within a structured flow. Designing prompts and workflows for multi-agent systems is inherently complex, especially when addressing a new task. It often demands expert-level knowledge and involves significant trial and error. Gaining a deep understanding of the factors that contribute to effective multi-agent systems is essential for automating the entire process. Motivated by this, we first conduct an in-depth analysis of the design spaces for multi-agent systems, focusing on the impact of prompts, scaling the number of agents, and common types of agentic modules. Our findings reveal that top-performing systems often emerge from simpler design spaces, where prompts play a critical role in enhancing agent functionality and enabling more effective scaling. Based on the insights, we propose Multi-Agent System Search (MASS), a multi-stage optimization framework that performs the optimization in a pruned design space, with prompts and an influential subset of modules. We show that MASS-optimized multi-agent systems outperform existing alterntives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems. View details
CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors
Juro Gottweis
CJ Park
Salman Rahman
Ahmed Metwally
Hong Yu
Ivor Rendulic
Yuzhe Yang
Petar Sirkovic
Daniel McDuff
Shwetak Patel
Nicolas Stroppa
Yubin Kim
Mark Malhotra
Orson Xu
Sam Schmidgall
Tim Althoff
Elahe Vedadi
Cynthia Breazeal
Hae Won Park
(2026)
Preview abstract The emergence of Agentic AI—autonomous systems capable of reasoning, decision-making, and multi-step execution—represents a paradigm shift in enterprise technology. Moving beyond simple generative tasks, these agents offer the potential to solve long-standing industry pain points, with over 90% of enterprises planning integration within the next three years. However, the transition from successful proof-of-concept (PoC) to a resilient, production-grade system presents significant hurdles. This article categorizes these challenges into three primary domains: Technical and Engineering Hurdles: Issues such as "entangled workflows" that complicate debugging, the struggle to maintain output quality and mitigate hallucinations, and the unpredictability caused by shifting underlying models or data sources. People, Process, and Ecosystem Hurdles: The high operational costs and unclear ROI of large models, the necessity of a new "Agent Ops" skillset, the complexity of integrating agents with disparate enterprise systems, and a rapidly evolving regulatory landscape. The Pace of Change and Security risks: The technical debt incurred by shifting software frameworks and the expanded attack surface created by autonomous agents. The article concludes that successful deployment requires a shift from informal "vibe-testing" to rigorous engineering discipline. By adopting code-first frameworks, establishing robust evaluation metrics (KPIs), and prioritizing functional deployment over theoretical optimization, organizations can effectively manage the lifecycle of Agentic AI and realize its transformative business value. View details
Usability Hasn’t Peaked: Exploring How Expressive Design Overcomes the Usability Plateau
Alyssa Sheehan
Bianca Gallardo
Ying Wang
Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26), April 13–17, 2026, Barcelona, Spain (2026)
Preview abstract Critics have argued that mobile usability has largely been optimized, and that only incremental gains are possible. We set out to explore if the newest generation of design systems, which promote greater flexibility and a return to design basics, could produce substantially more usable designs while maintaining or increasing aesthetic judgments. Through a study with 48 diverse participants completing tasks in 10 different applications, we found that in designs created following Material 3 Expressive guidelines, users fixated on the correct screen element for a task 33% faster, completed tasks 20% faster, and rated experiences more positively compared to versions designed using the previous Material design system. These improvements in performance and aesthetic ratings challenge the premise of a usability plateau and show that mobile usability has not peaked. We illustrate specific opportunities to make mobile experiences more usable by returning to design fundamentals while highlighting risks of added flexibility. View details
A Dynamic Numerical Model for Real-Time Estimation of Latent Cognitive States Using Oculomotor Metrics
Diako Mardanbegi
ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 22087-22091
Preview abstract Estimating internal cognitive states from oculomotor data is fundamentally challenging due to their context-dependency and the complex relationship between various metrics. This paper proposes a dynamic numerical framework to model a task-specific behavioral signature and monitor deviations from it in real-time. The model integrates key oculomotor and motion metrics into a differential equation, yielding a continuous score that serves as an estimate of a latent cognitive state. By using tunable, heuristic parameters, our approach offers a transparent and adaptable alternative to opaque machine learning models. The framework’s strength lies in its ability to pinpoint objective changes in behavior, providing a potential tool for interpreting events like the onset of fatigue, distraction, or cognitive load. View details
Preview abstract Biological neurons come in many shapes. High-fidelity generative modeling of their varied morphologies is challenging yet underexplored in neuroscience, and crucial for the subfield of connectomics. We introduce MoGen (Neuronal Morphology Generation), a flow matching model to generate high-resolution 3D point clouds of mouse cortex axon and dendrite fragments. This is enabled by an adaptation that injects local geometric context into a scalable latent transformer backbone, allowing for the generation of high-fidelity, realistic samples. To assess MoGen's generation quality, we propose a dedicated evaluation suite with interpretable geometric and topological features tailored to neuronal structures that we validate in a user study. MoGen's practical utility is showcased through controllable generation for visualization via smooth interpolation and a direct downstream application: we augment the training set of a shape plausibility classifier from a production connectomics neuron reconstruction pipeline with millions of generated samples, thereby improving classifier accuracy and reducing the number of remaining split and merge errors by 4.4%. We estimate this can reduce manual proofreading labor by over 157 person-years for reconstruction of a full mouse brain. View details
Improved Differentially Private Algorithms for Rank Aggregation
Phanu Vajanopath
Quentin Hillebrand
Vorapong Suppakitpaisarn
AAAI (2026)
Preview abstract Rank aggregation is a task of combining the rankings of items from multiple users into a single ranking that best represents the users' rankings. Alabi et al. (AAAI'22) presents differentially-private (DP) polynomial-time approximation schemes (PTASes) and 5-approximation algorithms with certain additive errors for the Kemeny rank aggregation problem in both central and local models. In this paper, we present improved DP PTASes with smaller additive error in the central model. Furthermore, we are first to study the footrule rank aggregation problem under DP. We give a near-optimal algorithm for this problem; as a corollary, this leads to 2-approximation algorithms with the same additive error as the 5-approximation algorithms of Alabi et al. for the Kemeny rank aggregation problem in both central and local models. View details
Preview abstract Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths. View details
Understanding U.S. Users' Security and Privacy Transparency Needs for Consumer-Facing Generative AI
Jiaxun Cao
Yu Dong
Chunxi Zhan
Rithvik Neti
Pardis Emami-Naeini
USENIX Symposium on Usable Privacy and Security (SOUPS) (2026)
Preview abstract Users increasingly rely on consumer-facing generative AI (GenAI) for tasks ranging from everyday needs to sensitive use cases. Yet, it remains unclear whether and how existing security and privacy (S&P) communications in GenAI tools shape users’ adoption decisions and experiences. Understanding how users seek, interpret, and evaluate S&P information is critical for designing usable transparency that users can trust and act on. We conducted semi-structured interviews and design sessions with 21 U.S. GenAI users. Our findings suggest that available S&P information rarely drove initial adoption in practice, as participants often perceived it as incomplete, ineffective, or not credible. Instead, they relied on rough proxies (e.g., popularity) to infer S&P practices. After adoption, S&P uncertainty constrained participants’ willingness to use GenAI tools, especially for high-stakes purposes, and, in some cases, contributed to discontinued use. Participants therefore called for transparency that supports decisions and actions through trustworthy information (e.g., independent evaluations) and usable interfaces (e.g., on-demand disclosure). We categorize participants’ desired design practices into five dimensions to facilitate systematic future investigation into best practices. We conclude with recommendations for researchers, designers, and policymakers to improve S&P transparency in consumer-facing GenAI. View details
×