Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11267 publications
    See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
    Ding Xia
    Xinyue Gui
    Mark Colley
    Fan Gao
    Dongyuan Li
    Renhe Jiang
    Takeo Igarashi
    ACL 26 (2026)
    Preview abstract Automated vehicles lack natural communication channels with other road users, making external Human-Machine Interfaces (eHMIs) essential for conveying intent and maintaining trust in shared environments. However, most eHMI studies rely on developer-crafted message-action pairs, which are difficult to adapt to diverse and dynamic traffic contexts. A promising alternative is to use Large Language Models (LLMs) as action designers that generate context-conditioned eHMI actions, yet such designers lack perceptual verification and typically depend on fixed prompts or costly human-annotated feedback for improvement. We present See2Refine, a human-free, closed-loop framework that uses vision-language models (VLMs) for perceptual evaluation as automated visual feedback to improve an LLM-based eHMI action designer. Given a driving context and a candidate eHMI action, the VLM evaluates the perceived appropriateness of the action, and this feedback is used to iteratively revise the designer's outputs, enabling systematic refinement without human supervision. We evaluate our framework across three eHMI modalities (lightbar, eyes, and arm) and multiple LLM model sizes. Across settings, our framework consistently outperforms prompt-only LLM designers and manually specified baselines in both VLM-based metrics and human-subject evaluations. Results further indicate that the improvements generalize across modalities and that VLM evaluations are well aligned with human preferences, supporting the robustness and effectiveness of \systemName for scalable action design. View details
    Preview abstract Artificial intelligence is rapidly evolving, marked by the emergence of Large Language Model (LLM) agents – systems capable of complex reasoning, planning, and interaction with digital and physical environments. These agents, powered by advancements in LLMs, demonstrate remarkable capabilities across diverse domains, including finance, healthcare, web navigation, software development, and daily task assistance. Unlike traditional AI systems, LLM agents can perceive their surroundings, formulate multi-step plans, utilize external tools and APIs, access memory or knowledge bases, and execute actions to achieve specified goals. This ability to act upon the world, however, introduces significant safety and security challenges. The safety paradigms developed for traditional LLMs, primarily focused on mitigating harmful textual outputs (e.g., toxicity, bias), are insufficient for safeguarding LLM agents. Agents interacting with dynamic environments and executing actions present a broader attack surface and new categories of risk. These include performing unsafe operations, violating privacy constraints through improper data handling or access control failures, deviating from user objectives (task misalignment), and susceptibility to novel manipulation techniques like indirect prompt injection and memory poisoning. Ensuring the trustworthy operation of these powerful agents is paramount, especially as they are integrated into high-stakes applications. To address this critical challenge, we introduce VeriGuard, a novel framework designed to enhance the safety and reliability of LLM agents by interactively verifying their policies and the actions. VeriGuard integrates a verification module that intercepts code-based actions proposed by the agent. In the first step, VeriGuard will generates and verifies the policies. The policies are rigorously checked against a set of predefined safety and security specifications Then each action will be verified to make sure it will align with the agent specification. This interactive verification loop ensures that the agent's behavior remains within safe operational bounds, effectively preventing the execution of harmful or unintended operations. By verifying each step, VeriGuard provides a robust safeguard, substantially improving the trustworthiness of LLM agents in complex, real-world environments. View details
    Preview abstract Some artificial intelligence provisioning models that function as tools for human users or rely on labor arbitrage can present challenges for organizations, such as managing personnel rather than task outcomes and introducing data security risks. An architecture is described for an outcome-based synthetic labor market in which autonomous computational agents can be compensated based on verified task completion. The framework can leverage trusted execution environments to create secure hardware enclaves for processing sensitive data, which can render the data cryptographically inaccessible to a host system or agent provider. This approach can facilitate a secure, transactional market for autonomous professional execution, which may enable a shift from managing labor resources to procuring verified outcomes from a pool of specialized agents. View details
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Victor May
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Preview abstract AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization. View details
    Preview abstract Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths. View details
    Preview abstract How many T gates are needed to approximate an arbitrary n-qubit quantum state to within a given precision ϵ? Improving prior work of Low, Kliuchnikov and Schaeffer, we show that the optimal asymptotic scaling is Θ(sqrt{2^n log(1/ε)} + log(1/ε)) if we allow an unlimited number of ancilla qubits. We also show that this is the optimal T-count for implementing an arbitrary diagonal n-qubit unitary to within error ϵ. We describe an application to batched synthesis of single-qubit unitaries: we can approximate a tensor product of m = O(log log(1/ϵ)) arbitrary single-qubit unitaries to within error ϵ with the same asymptotic T-count as is required to approximate just one single-qubit unitary. View details
    Preview abstract High-volume enterprise service organizations face a persistent challenge in transitioning from reactive support models to proactive, preventative ones. This paper introduces the Agentic Trend-to-Knowledge (ATK) methodology, a novel, autonomous framework designed to address this gap. The ATK methodology employs an AI agent that operates in a recurring, closed loop. It first uses a two-stage process for the autonomous thematic analysis of recent support cases to identify the most significant recurring issue. It then leverages Retrieval-Augmented Generation (RAG) to source relevant institutional knowledge. A key innovation is the agent's adaptive, bimodal response: if relevant knowledge is found, it drafts a proactive communication for human review; if a knowledge gap is detected, it autonomously creates a content creation task for the appropriate team. This transforms the agent from an automation tool into a proactive process owner that creates a virtuous cycle of continuous improvement for both case deflection and knowledge base quality. By automating the entire workflow from insight to action, the ATK framework provides a concrete methodology for shifting from a "human-in-the-loop" to a more strategic "human-on-the-loop" operational paradigm. View details
    CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors
    Juro Gottweis
    CJ Park
    Salman Rahman
    Ahmed Metwally
    Hong Yu
    Ivor Rendulic
    Yuzhe Yang
    Petar Sirkovic
    Daniel McDuff
    Shwetak Patel
    Nicolas Stroppa
    Yubin Kim
    Mark Malhotra
    Orson Xu
    Sam Schmidgall
    Tim Althoff
    Elahe Vedadi
    Cynthia Breazeal
    Hae Won Park
    (2026)
    Preview abstract Object-Counting for remote-sensing (RS) imagery is raising increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - an adaptation of existing work for Open Vocabulary Counting (OVC) approach from general computer vision to the RS domain. We show that our model is capable of accurate counting of novel object classes, that are unseen during training, based solely on textual and/or visual conditioning. View details
    Preview abstract The rapid expansion of the Internet of Things (IoT) and smart home ecosystems has led to a fragmented landscape of user data management across consumer electronics (CE) such as Smart TVs, gaming consoles, and set-top boxes. Current onboarding processes on these devices are characterized by high friction due to manual data entry and opaque data-sharing practices. This paper introduces the User Data Sharing System (UDSS), a platform-agnostic framework designed to facilitate secure, privacy-first PII (Personally Identifiable Information) exchange between device platforms and third-party applications. Our system implements a Contextual Scope Enforcement (CSE) mechanism that programmatically restricts data exposure based on user intent—specifically distinguishing between Sign-In and Sign-Up workflows. Unlike cloud-anchored identity standards such as FIDO2/WebAuthn, UDSS is designed for shared, device-centric CE environments where persistent user-to-device bind-ing cannot be assumed. We further propose a tiered access model that balances developer needs with regulatory compliance (GDPR/CCPA). A proof-of-concept implementation on a reference ARMv8 Linux-based middleware demonstrates that UDSS reduces user onboarding latency by 65% and measurably reduces PII over-exposure risk through protocol-enforced data minimization. This framework provides a standardized approach to identity management in the heterogeneous CE market. View details
    Preview abstract Managing compiler build errors that can arise during infrastructure upgrades in large, polyglot codebases may be challenging, as manual remediation can be slow and some automated tools may not support modern language syntax. A system can provide automated error remediation by ingesting compiler diagnostics and analyzing source code using an Abstract Syntax Tree (AST). A recursive scope resolution algorithm, for example, can traverse the AST to identify a specific and narrowly-scoped code block at which to apply an error suppression. Conversely, this algorithmic complexity can be bypassed when lexical scope resolution is not required, and the system can identify the specific location of error suppressions directly from the error's exact coordinates. The system may then generate and apply language-specific patches, such as structured comments for JavaScript source files or line-scoped comments for TypeScript source files, for example, by using a transactional rewrite engine. This approach can provide a scalable method for managing automated code remediation, which may facilitate infrastructure upgrades by reducing the need for manual intervention. View details
    Preview abstract Enterprise service delivery platforms, while vital for HR operations, create significant challenges in managing the risks of Personally Identifiable Information (PII) exposure. The integration of Generative AI offers new efficiencies but also amplifies these risks. Existing solutions—ranging from manual redaction and rule-based Data Loss Prevention (DLP) to inflexible data masking—fail to provide a nuanced, integrated approach. This paper introduces the Dual-Mode Privacy Guard (DMPG), a conceptual framework that establishes a model for Augmented Compliance. The framework provides a "defense-in-depth" strategy built on three pillars: (1) a Zero-Trust AI Foundation leveraging a verifiable, non-retention API gateway to ensure data privacy; (2) a proactive "Guardrail" that uses AI to detect and flag potential PII for human-in-the-loop review; and (3) an on-demand "Tool" that allows users to create securely anonymized data assets. By differentiating between proactive monitoring and reactive utility, the DMPG shifts the compliance paradigm from a manual burden to an AI-assisted process that enhances, rather than replaces, human oversight. This paper details the framework’s platform-agnostic architecture, using Salesforce as a reference implementation, and argues for its novelty as a model for operationalizing privacy principles within modern enterprise systems. View details
    Preview abstract We introduce ALPS (Activation-based Length Prediction for Scheduling), a method for predicting LLM generation length from prefill activations before any tokens are generated. Unlike existing approaches that require model fine-tuning or complex entropy-weighted pooling, ALPS uses a simple linear probe on the last-token activation at intermediate layers. We discover that generation length is encoded in prefill representations: a ridge regression probe achieves R-squared > 0.85 across three model families. Validation across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B demonstrates: (1) intermediate layers generally perform well, with some architectural variation; (2) simple last-token extraction outperforms complex pooling strategies; (3) activations improve substantially over surface-feature baselines (24 percentage points over input length plus lexical features). The best models achieve R-squared = 0.943 (Gemma), R-squared = 0.880 (Llama), and R-squared = 0.857 (Qwen) with MAE of 38-80 tokens. All test prompts terminated naturally (100% EOS), eliminating truncation confounds. While our evaluation uses 200 curated prompts—sufficient for demonstrating the phenomenon but requiring broader validation—cross-validation confirms generalization beyond training data. ALPS enables practical applications including budget-constrained inference, request scheduling, and resource allocation. The probe adds negligible overhead (~16KB direction vector, single dot product), making ALPS practical for production deployment. View details
    MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
    Sieun Kim
    Qianhui Zheng
    Ruoyu Xu
    Ravi Tejasvi
    Anuva Kulkarni
    Junyi Zhu
    2026
    Preview abstract In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g. faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to five concurrent sources (e.g. two voices + three instruments) with ca. 2 second processing latency. We validate MoXaRt through a technical evaluation on a new, complex dataset we collected, and a 22-participant user study. Our results demonstrate that MoXaRt significantly improves communication clarity—boosting listening comprehension in noisy conditions by 33.2% (p=0.0058)—and significantly reduces cognitive load (M=7.50 vs. M=3.36, p<0.001), paving the way for more perceptive and socially adept XR experiences. View details
    Preview abstract This study examines the psychological and ethical implications of generative-AI chatbot use among youth, introducing the CTRL framework (Cognitive Trust, Reliance, and Learning Diminution) to explain how repeated use fosters cognitive offloading and reduced verification behavior. Survey data from 420 participants analyzed through factor analysis and structural equation modeling reveal that higher trust predicts greater reliance and diminished critical evaluation, alongside elevated concerns around privacy and academic integrity. Findings highlight the need for AI literacy and responsible design to mitigate unintended cognitive impacts. View details
    ×