Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 11255 publications
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Victor May
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Preview abstract AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization. View details
    Preview abstract Global shared service centers are critical to modern enterprise operations but struggle to provide consistent, timely support across linguistic boundaries. This paper introduces the Glossary-Grounded Universal Queue (GGUQ), a socio-technical framework designed to bridge the gap between the operational goal of a unified global service queue and the reality of a multilingual workforce. The GGUQ is a real-time, workflow-embedded communication architecture that leverages Large Language Models (LLMs) to provide high-fidelity, two-way translation directly within an agent's enterprise platform. The framework's key innovation is a "glossary-grounded" approach, where translation prompts are programmatically injected with a curated repository of enterprise-specific terminology. This ensures a level of contextual and terminological integrity unachievable by generic machine translation tools. By detailing the GGUQ's three-pillar architecture—Dynamic Translation, Glossary-Grounded Integrity, and Resilient Operations—we propose a new model for computer-mediated communication in global enterprises. This framework aims to move beyond federated, language-siloed support models to enable a true "follow-the-sun" operational capability, promoting both organizational efficiency and a more inclusive employee experience. View details
    Preview abstract This framework manages AI agents by establishing behavioral boundaries and a persistent identity. It uses a multi-layered stack, combining safety rules with brand guidelines, to shape an agent's reasoning. Features include authority decay to limit power if confidence drops and memory segmentation to prevent data tampering. Centralized oversight ensures these digital representatives remain aligned with company policies through continuous monitoring and testing. View details
    Identifying Hearing Difficulty Moments in Conversational Audio
    Jack Collins
    Adrian Buzea
    Chris Collier
    Alejandro Ballesta Rosen
    Julian Maclaren
    Kelly Miles
    Simon Carlile
    Trends in Hearing (2026)
    Preview abstract Individuals regularly experience Hearing Difficulty Moments in everyday conversation. Identifying Hearing Difficulty Moments has particular significance in the field of hearing assistive technology where timely interventions are key for real-time hearing assistance. In this article, we propose and compare machine learning solutions for the temporal detection of segments containing Hearing Difficulty Moments in conversational audio. We show that audio language models, through their multimodal reasoning capabilities, can achieve state-of-the-art results for this task, significantly outperforming a simple automatic speech recognition (ASR) hotword heuristic and a more conventional fine-tuning approach with Wav2Vec, an audio-only input architecture that is state-of-the-art for ASR. View details
    Preview abstract Audio Description ( AD) provides essential access to visual media for blind and low vision ( BLV) audiences. Yet current AD production tools remain largely inaccessible to BLV video creators, who possess valuable expertise but face barriers due to visually- driven interfaces. We present ADCanvas, a multimodal authoring system that supports non- visual control over audio description ( AD) creation. ADCanvas combines conversational interaction with keyboard- based playback control and a plain- text, screen reader– accessible editor to support end- to- end AD authoring and visual question answering ( VQA). Combining screen- reader- friendly controls with a multimodal LLM agent, ADCanvas supports live VQA, script generation, and AD modification. Through a user study with 12 BLV video creators, we find that users adopt the conversational agent as an informational aide and drafting assistant, while maintaining agency through verification and editing. For example, participants saw themselves as curators who received information from the model and filtered it down for their audience. Our findings offer design implications for accessible media tools, including precise editing controls, accessibility support for creative ideation, and configurable rules for human- AI collaboration. View details
    An experimental evaluation of an AI-powered interactive learning platform
    Nicole Miller
    Yael Haramaty
    Lidan Hackmon
    Lior Belinsky
    Abraham Oritz Tapia
    Lucy Tootill
    Scott Siebert
    Frontiers in Artificial Intelligence (2026) (to appear)
    Preview abstract Generative AI, which is capable of transforming static content into dynamic learning experiences, holds the potential to revolutionize student engagement in educational contexts. However, questions still remain around whether or not these tools are effective at facilitating student learning. In this research, we test the effectiveness of an AI-powered platform incorporating multiple representations and assessment through Learn Your Way, an experimental research platform that transforms textbook chapters into dynamic visual and audio representations. Through a between-subjects, mixed methods experiment with 60 US-based students, we demonstrate that students who used Learn Your Way had a more positive learning experience and had better learning outcomes compared to students learning the same content through a digital textbook. These findings indicate that AI-driven tools, capable of providing choice among interactive representations of content, constitute an effective and promising method for enhancing student learning. View details
    SNPeek: Side-Channel Analysis for Privacy Applications on Confidential VMs
    Ruiyi Zhang
    Albert Cheu
    Adria Gascon
    Michael Schwarz
    Octavian Suciu
    Network and Distributed System Security (NDSS) (2026)
    Preview abstract Confidential virtual machines (CVMs) based on trusted execution environments (TEEs) enable new privacy-preserving solutions. But CVMs are not a privacy panacea, as they are vulnerable to side-channel attacks that may compromise confidentially of workloads. In this work, we develop the FARFETCH’D framework to help developers evaluate side-channel assisted privacy attacks that are broadly applicable to CVMs. The privacy reduction due to these attacks heavily depend on the execution environment and the workload, which varies vastly:What are avail-able attack primitives? How does the particular privacy work-load behave?This makes manual investigation and efficiently mitigating software-based side channels a cumbersome and impossible task. FARFETCH’D solves this challenge by providing a set of configurable attack primitives that can execute on real CVM hardware and automated ML-based analysis pipelines. We evaluate the effectiveness of FARFETCH’D on privacy-preserving workloads. Our results show that our approach is effective at pinpointing the vulnerability of privacy apps against side channels and help evaluating mitigation based on oblivious memory and differential privacy. View details
    Preview abstract When managing complex, unpredictable (non-deterministic) AI agents using simple, fixed control systems (like finite state machines), operational failures and accountability issues often arise. This document introduces a probabilistic governance and telemetry framework to resolve these problems. Instead of following a rigid sequence of steps, this framework defines a multi-dimensional operational boundary, a 'behavioral volume', and assigns the agent a goal. This allows the agent to use its own reasoning to achieve the goal while remaining within the defined boundaries. A separate telemetry layer monitors the agent's actions by calculating metrics, such as alignment scores and drift velocity, to measure how much the agent deviates from its intended behavior. This system provides a method for guiding, monitoring, and securing autonomous agents, effectively managing the performance and security of an unpredictable AI workforce in complex environments. View details
    Bi-level Hierarchical Neural Contextual Bandits for Online Recommendation
    Yunzhe Qi
    Yikun Ban
    Allan Stewart
    Chuanwei Ruan
    Jiachuan He
    Shishir Kumar Prasad
    Haixun Wang
    Jingrui He
    Transactions on Machine Learning Research (2026)
    Preview abstract Contextual bandit algorithms aim to identify the optimal choice among a set of candidate arms, based on their contextual information. Among others, the neural contextual bandit algorithms have demonstrated generally superior performance compared to traditional linear and kernel-based methods. Nevertheless, neural methods are not inherently suitable to handle a large number of candidate arms due to their high computational cost when performing neural exploration. Motivated by the widespread availability of arm category information (e.g., movie genres, retailer types), we formulate contextual bandits into a bi-level recommendation problem based on the accessible arm category information, and propose a novel neural bandit framework, named H2N-Bandit, which utilizes a bi-level hierarchical neural structure to mitigate the substantial computational cost found in conventional neural bandit methods. To demonstrate its effectiveness, we provide the regret bound for H2N-Bandit under the over-parameterized neural bandit settings. Furthermore, to illustrate its efficiency, we conduct extensive experiments on multiple real-world public data sets with various specifications, showing that H2N-Bandit can significantly reduce the computational cost over existing non-linear methods while achieving better or comparable performances against state-of-the-art baselines. View details
    The Perfection Paradox: From Architect to Curator in AI-Assisted API Design
    JJ Geewax
    David R Karger
    Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26), ACM, Barcelona, Spain, TBD
    Preview abstract Enterprise API design is often bottlenecked by the tension between rapid feature delivery and the rigorous maintenance of usability standards. We present an industrial case study evaluating an AI-assisted design workflow trained on API Improvement Proposals(AIPs). Through a controlled study with 16 industry experts, we compared AI-generated API specifications against human-authored ones. While quantitative results indicated AI superiority in 10 of 11 usability dimensions and an 87% reduction in authoring time, qualitative analysis revealed a paradox: experts frequently misidentified AI work as human (19% accuracy) yet described the designs as unsettlingly “perfect.” We characterize this as a “Perfection Paradox”—where hyper-consistency signals a lack of pragmatic human judgment. We discuss the implications of this perfection paradox, proposing a shift in the human designer’s role from the “drafter” of specifications to the “curator” of AI-generated patterns. View details
    Preview abstract Despite advances in high performance computing, accurate numerical simulations of global atmospheric dynamics remain a challenge. The resolution required to fully resolve the vast range scales as well as the strong coupling with—often not fully-understood—physics renders such simulations computationally infeasible over time horizons relevant for long-term climate risk assessment. While data-driven parameterizations have shown some promise of alleviating these obstacles, the scarcity of high-quality training data and their lack of long-term stability typically hinders their ability to capture the risk of rare extreme events. In this work we present a general strategy for training variational (probabilistic) neural network models to non-intrusively correct under-resolved long-time simulations of turbulent climate systems. The approach is based on the paradigm introduced by Barthel Sorensen et al. (2024, https://doi.org/10.1029/2023ms004122) which involves training a post-processing correction operator on under-resolved simulations nudged toward a high-fidelity reference. Our variational framework enables us to learn the dynamics of the underlying system from very little training data and thus drastically improve the extrapolation capabilities of the previous deterministic state-of-the art—even when the statistics of that training data are far from converged. We investigate and compare three recently introduced variational network architectures and illustrate the benefits of our approach on an anisotropic quasi-geostrophic flow. For this prototype model our approach is able to not only accurately capture global statistics, but also the anistropic regional variation and the statistics of multiple extreme event metrics—demonstrating significant improvement over previously introduced deterministic architectures. View details
    A Framework for Interactive Machine Learning and Enhanced Conversational Systems
    Jerry Young
    Richard Abisla
    Sanjay Batra
    Mikki Phan
    Nature, Springer-Verlag (2026)
    Preview abstract Conversational systems are increasingly prevalent, yet current versions often fail to support the full range of human speech, including variations in speed, rhythm, syntax, grammar, articulation, and resonance. This reduces their utility for individuals with dysarthria, apraxia, dysphonia, and other language and speech-related disabilities. Building on research that emphasizes the need for specialized datasets and model training tools, our study uses a scaffolded approach to understand the ideal model training and voice recording process. Our findings highlight two distinct user flows for improving model training and provide six guidelines for future conversational system-related co-design frameworks. This study offers important insights on creating more effective conversational systems by emphasizing the need to integrate interactive machine learning into training strategies. View details
    VISTA: A Test-Time Self-Improving Video Generation Agent
    Xuan Long Do
    Hootan Nakhost
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (to appear) (2026)
    Preview abstract Despite rapid advances in text-to-video (T2V) synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. To address this, we introduce VISTA, a novel multi-agent system that autonomously refines prompts to improve video generation. VISTA operates in an iterative loop, first decomposing a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. To rigorously evaluate our proposed approach, we introduce MovieGen-Bench, a new benchmark of diverse single- and multi-scene video generation tasks. Experiments show that while prior methods yield inconsistent gains, VISTA consistently improves video quality, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 68% of comparisons. View details
    Preview abstract Voice activity detection (VAD) plays a vital role in enabling applications such as speech recognition. We analyze the impact of window size on the accuracy of three VAD algorithms: Silero, WebRTC, and Root Mean Square (RMS) across a set of diverse real-world digital audio streams. We additionally explore the use of hysteresis on top of each VAD output. Our results offer practical references for optimizing VAD systems. Silero significantly outperforms WebRTC and RMS, and hysteresis provides a benefit for WebRTC. View details
    Preview abstract Generative AI is reshaping software development, yet its psychological impact remains under-researched. During May and August 2025 we conducted reflexive thematic analysis of interviews with 12 senior engineers (≥5 years experience) recruited from Western technology hubs to explore shifts in professional identity. We identify a central transition from "coder to conductor," where AI acts as a cognitive partner. Key findings include: (1) a re-architecting of focus from implementation to strategy; (2) a shift in productivity metrics from output to impact; and (3) a dual-impact on agency, where AI empowers autonomy but threatens competence through de-skilling anxieties. These findings suggest that as implementation becomes commoditised, organisational training and career progression must prioritise architectural mastery and metacognitive oversight to ensure sustained developer motivation and system integrity. View details

    Follow us

    ×