Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 10733 publications
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Preview abstract AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization. View details
    DialogLab: Authoring, Simulating, and Testing Dynamic Group Conversations in Hybrid Human-AI Conversations
    Erzhen Hu
    Mingyi Li
    Alex Olwal
    Seongkook Heo
    UIST '25: Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, ACM (2025), 210:1-20
    Preview abstract Designing compelling multi-party conversations involving both humans and AI agents presents significant challenges, particularly in balancing scripted structure with emergent, human-like interactions. We introduce DialogLab, a prototyping toolkit for authoring, simulating, and testing hybrid human-AI dialogues. DialogLab provides a unified interface to configure conversational scenes, define agent personas, manage group structures, specify turn-taking rules, and orchestrate transitions between scripted narratives and improvisation. Crucially, DialogLab allows designers to introduce controlled deviations from the script—through configurable agents that emulate human unpredictability—to systematically probe how conversations adapt and recover. DialogLab facilitates rapid iteration and evaluation of complex, dynamic multi-party human-AI dialogues. An evaluation with both end users and domain experts demonstrates that DialogLab supports efficient iteration and structured verification, with applications in training, rehearsal, and research on social dynamics. Our findings show the value of integrating real-time, human-in-the-loop improvisation with structured scripting to support more realistic and adaptable multi-party conversation design. View details
    Preview abstract Tesseract is a Most-Likely-Error decoder designed for quantum error-correcting codes. Tesseract conducts a search through a graph on the set of all subsets of errors to find the lowest-cost subset of errors consistent with the input syndrome. Although this set is exponentially large, the search can be made efficient in practice for random errors using A* along with a variety of pruning heuristics. We show through benchmark circuits for surface, color, and bivariate-bicycle codes that Tesseract is competitive with integer programming-based decoders at moderate physical error rates. Finally, we compare surface and bivariate bicycle codes using most-likely error decoding. View details
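The search described in this abstract can be illustrated with a toy sketch. The code below is not Tesseract's implementation: it is a plain best-first search (A* with a zero heuristic and no pruning) over subsets of errors, and the error model, costs, and the function name `most_likely_error` are all assumptions made for illustration. Each hypothetical error has a non-negative cost and a set of detectors it flips; the goal is the cheapest subset whose combined flips match the syndrome.

```python
import heapq
from itertools import count


def most_likely_error(errors, syndrome):
    """Return (cost, error_indices) for the cheapest subset of errors that
    reproduces the syndrome, or None if no subset does.

    errors:   list of (cost, detectors) pairs; cost must be non-negative and
              detectors is an iterable of detector ids flipped by that error.
    syndrome: iterable of detector ids that fired.
    """
    tie = count()  # tie-breaker so the heap never compares tuples or sets
    # State: (cost so far, tie, next error index, chosen errors, unexplained detectors)
    heap = [(0.0, next(tie), 0, (), frozenset(syndrome))]
    while heap:
        cost, _, nxt, chosen, residual = heapq.heappop(heap)
        if not residual:
            # With non-negative costs, the first fully explained state popped is optimal.
            return cost, list(chosen)
        # Only add errors with a larger index, so each subset is visited once.
        for i in range(nxt, len(errors)):
            e_cost, detectors = errors[i]
            heapq.heappush(
                heap,
                (cost + e_cost, next(tie), i + 1, chosen + (i,),
                 residual ^ frozenset(detectors)),
            )
    return None


# Hypothetical toy model: five errors acting on three detectors.
toy_errors = [
    (1.0, {0}),
    (1.5, {0, 1}),
    (1.5, {1, 2}),
    (1.0, {2}),
    (2.5, {0, 2}),
]
result = most_likely_error(toy_errors, {0, 2})
```

Without a heuristic or pruning this search is exponential in the number of errors, which is the cost the abstract says A* and pruning heuristics are meant to tame in practice.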
    Gemini & Physical World: Large Language Models Can Estimate the Intensity of Earthquake Shaking from Multi-Modal Social Media Posts
    Marc Stogaitis
    Tajinder Gadh
    Richard Allen
    Alexei Barski
    Robert Bosch
    Patrick Robertson
    Youngmin Cho
    Nivetha Thiruverahan
    Aman Raj
    Geophysical Journal International (2025), ggae436
    Preview abstract This paper presents a novel approach for estimating the ground shaking intensity using real-time social media data and CCTV footage. Employing the Gemini 1.5 Pro model (Reid et al. 2024), a multi-modal language model, we demonstrate the ability to extract relevant information from unstructured data utilizing generative AI and natural language processing. The model’s output, in the form of Modified Mercalli Intensity (MMI) values, aligns well with independent observational data. Furthermore, our results suggest that beyond its advanced visual and auditory understanding abilities, Gemini appears to utilize additional sources of knowledge, including a simplified understanding of the general relationship between earthquake magnitude, distance, and MMI, which it presumably acquired during its training, in its reasoning and decision-making processes. These findings raise intriguing questions about the extent of Gemini's general understanding of the physical world and its phenomena. Gemini’s ability to generate results consistent with established scientific knowledge highlights the potential of LLMs like Gemini in augmenting our understanding of complex physical phenomena such as earthquakes. More specifically, the results of this study highlight the potential of LLMs like Gemini to revolutionize citizen seismology by enabling rapid, effective, and flexible analysis of crowdsourced data from eyewitness accounts for assessing earthquake impact and providing crisis situational awareness. This approach holds great promise for improving early warning systems, disaster response, and overall resilience in earthquake-prone regions. This study provides a significant step toward harnessing the power of social media and AI for earthquake disaster mitigation. View details
    Governing Innovation: Google's SOX Controls for AI/ML in Financial Systems
    Eshan Bhatt
    Ivey Publishing, Ivey Business School, Western University, London, Ontario, Canada (2025)
    Preview abstract The integration of Artificial Intelligence (AI) and Machine Learning (ML) in financial systems is transforming risk modeling, forecasting, and operational efficiency. However, the adoption of these technologies introduces new risks to financial reporting. This business case outlines how organizations can design and implement SOX-compliant IT controls tailored to AI/ML use cases in Finance, aligning with Internal Control over Financial Reporting (ICFR) requirements and regulatory expectations. View details
    Tracing the Representation Geometry of Language Models from Pretraining to Post-training
    Guillaume Lajoie
    Arna Ghosh
    Kumar Krishna Agrawal
    Komal Kumar Teru
    Blake Richards
    Melody Zixuan Li
    Adam Santoro
    2025
    Preview abstract Complex representational changes in Large Language Models (LLMs) are critical for their capabilities, but are often obscured by standard metrics used to evaluate models during training, like loss or gradient norms. Here, we examine the representational changes that occur during LLM pretraining by analyzing their high-dimensional representation geometry using spectral methods (αReQ, RankMe). In two different families of models (OLMo and Pythia), hidden beneath the near monotonically-decreasing loss and gradient norm, we uncover non-monotonic learning phases in the geometry of the representations. These phases are curves. Specifically, we find that the pretraining stage consistently exhibits three distinct phases: (1) a ‘warm-up’ phase where the dimensionality of the representations drops drastically, (2) an ‘entropy-seeking’ phase that expands the effective dimensionality of the representations in all directions, and (3) a ‘compression-seeking’ phase that reduces the dimensionality by selectively expanding only along the dominant representational axes. This evolving representation geometry governs the trade-off between fitting the training distribution and generalizing beyond it: the models get better at reproducing specific short-context sequences from the data during the entropy-seeking phase, and at generalizing to novel long-context dependencies during the compression-seeking phase. Continued pretraining can lead to additional entropy-seeking and compression-seeking phases. Crucially, we also find that these different phases have implications for downstream fine-tuning. Optimal adaptability for Supervised Fine-Tuning (SFT) emerges significantly earlier than peak zero-shot performance on factual question answering tasks and aligns with the transition out of the first compression-seeking phase. Furthermore, we observe that SFT often induces an ‘entropy-seeking’ dynamic whereas Reinforcement Learning from Verifiable Rewards (RLVR) induces a ‘compression-seeking’ dynamic. We investigate the implications of these representational dynamics on downstream generalization of instruction-tuning, and exploration capabilities of RLVR-tuned models. Our results demonstrate that spectral methods for analyzing high-dimensional representations can provide new insights on the functionally relevant changes that occur in LLMs over pretraining. View details
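As a rough illustration of the two spectral quantities named in this abstract, the sketch below computes a RankMe-style effective rank (the exponential of the entropy of the normalized singular values) and an αReQ-style decay rate (the slope of a power-law fit to the eigenspectrum). It assumes the spectrum has already been computed from the representations; the function names and the plain least-squares fit are illustrative choices, not the authors' code.

```python
import math


def effective_rank(singular_values):
    """RankMe-style effective rank: exp of the Shannon entropy of the
    singular values after normalizing them to sum to 1.
    Zero singular values are skipped (they contribute nothing to the entropy)."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def power_law_decay(eigenvalues):
    """alpha-ReQ-style decay rate: fit eigenvalue_i ~ i^(-alpha) by ordinary
    least squares in log-log space and return alpha.
    Assumes eigenvalues are positive and sorted in decreasing order."""
    xs = [math.log(i) for i in range(1, len(eigenvalues) + 1)]
    ys = [math.log(v) for v in eigenvalues]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return -covariance / variance


# A flat spectrum uses every direction equally; a one-hot spectrum uses one.
flat_rank = effective_rank([1.0, 1.0, 1.0, 1.0])
collapsed_rank = effective_rank([5.0, 0.0, 0.0, 0.0])
# A synthetic spectrum that decays exactly as i^(-2).
alpha = power_law_decay([i ** -2.0 for i in range(1, 11)])
```

Under this reading, a phase that expands the representations in all directions would raise the effective rank, while a phase that concentrates variance along dominant axes would lower it and steepen the decay.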
    GOALIE (GOAL oriented IntErventions) Proactive Multimodal Agent to Assist Augmented Reality
    Saptarashmi Bandyopadhyay
    Vikas Bahirwani
    Lavisha Aggarwal
    Bhanu Guda
    Lin Li
    Qin Liu
    Tom Goldstein
    John Dickerson
    Andrea Colaco
    2025
    Preview abstract Multimodal AI Agents can assist and guide users in completing real-time tasks such as cooking, robotics, and manufacturing. An emerging form of multimodal communication is Augmented Reality (AR), where an AI Agent can enhance user experience with step-by-step guidance of tasks by observing the user's vision and language inputs. Current LLM or VLM based agents are reactive, waiting for a user query before responding. Proactive AI Agents in AR focus on detecting when the AI Agent should autonomously intervene to fix mistakes or follow up on any instruction. Our GOALIE (GOAL-oriented IntErvention) Agent is the first multimodal proactive AR agent which guides the user step-by-step on its own. We build an innovative Zero-Shot Prompting framework PSoS (Proactive Sequence of Steps) with the context of abstract past user actions, the agent's previous responses, and the user's granular goals and actions before it is detected that the AI Agent should intervene. We use PSoS for Supervised Finetuning (SFT), Direct Preference Optimization (DPO) and Group-Relative Policy Optimization (GRPO) finetuning of our AI agent to improve the quality of the agent's proactive intervention. We also propose a new algorithmic framework, Bagged group Relative Policy Optimization (BRPO), to reduce the variance in rewards of generation groups, to adapt the finetuning algorithm for multimodal proactive interventions by the AI Agent and to enable real-time finetuning of the AI model. We compare the step-by-step intervention quality and efficiency of the GOALIE Agent with Gemma-3 models along with other VLMs for task execution with human expert labels. We conduct human evaluation of the proactive interventions, demonstrating user satisfaction with the GOALIE Agent's proactive interventions. We will release the code, model and human evaluation data. View details
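For readers unfamiliar with the "group-relative" part of GRPO mentioned in this abstract, the sketch below shows the commonly described advantage computation: each sampled response's reward is standardized against the other responses generated for the same prompt. It covers only that step and says nothing about how BRPO works, which the abstract does not specify. The function name, the use of the population standard deviation, and the `eps` term are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    rewards: rewards for several responses to the same prompt.
    eps:     small constant (an assumption) that avoids division by zero
             when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two of four hypothetical responses were rewarded.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Identical rewards carry no learning signal: every advantage is zero.
uniform_group = group_relative_advantages([0.5, 0.5, 0.5])
```

Because the advantage depends on the spread of rewards inside each group, noisy groups yield noisy updates, which is one way to read the abstract's interest in reducing reward variance across generation groups.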
    Balancing AI and Human Insights in Scientific Discovery: Challenges and Guidelines
    Javier García-Martínez
    Pilar Manchon
    Ricardo Vinuesa
    Sergio Hoyas
    The Innovation (2025)
    Preview abstract Recent advancements in large language models (LLMs) have enabled AI systems to autonomously assist in scientific research, from hypothesis generation to laboratory experimentation, transforming how research proposals are written and experiments are designed. Tools like AI "co-scientists" promise to enhance scientific productivity but raise concerns about diminishing human intuition, reinforcing incremental research, and concentrating power among a few entities. As LLMs become increasingly integrated into research processes, there is a risk of reduced creativity, ethical misconduct, and overreliance on AI-driven evaluation systems. To address these challenges, in this article we propose ethical guidelines focusing on transparency, accountability, fairness, and safeguarding transformative research. Ultimately, AI should be used to augment, not replace, human insight in scientific discovery. View details
    Preview abstract We study the effect of a firm's new information disclosure on the information asymmetry between its informed and uninformed investors and its liquidity. To do this, we employ advanced natural language processing (NLP) methods to introduce a novel measure of firms' 10-K filing predictability that quantifies the amount of new information in these reports. Our findings show that more new information is associated with higher bid-ask spreads and lower trading volumes, indicating increased information asymmetry and reduced liquidity, respectively. Notably, institutional ownership moderates these effects, suggesting that sophisticated investors can mitigate the adverse consequences of disclosure unpredictability. An event study analysis further reveals that more new information triggers increased trading activity and abnormal returns immediately after disclosure, though these effects are short-lived. View details
    PROTECT: A Framework to Foster Digital Resilience for Youth Navigating Technology-Facilitated Abuse
    Diana Freed
    Natalie Bazarova
    Dan Cosley
    Patrick Gage Kelley
    Social Sciences Journal, 14(6) (2025)
    Preview abstract Youth are increasingly exposed to a broad range of technology-facilitated abuse that challenges their safety and well-being. Building on previous work that examined youth help-seeking behaviors, coping strategies, threats they encounter, and the social support systems around them, we articulate a framework called PROTECT (Problem recognition, Reaching out, Organizing support, Training, Engaging experts, Continuous support, and Tackling safety measures), which integrates existing models of support, help-seeking, and digital skills to offer a high-level, structured approach for adults who serve as a support system for youth navigating technology-facilitated abuse. The framework unpacks social and contextual dynamics that influence help-seeking behaviors, providing a foundation for educators, advocates, health professionals, developers and other adult stakeholders to design and develop trauma-informed, timely interventions to promote resilience. View details
    Passive Heart Rate Monitoring During Smartphone Use in Everyday Life
    Shun Liao
    Paolo Di Achille
    Jiang Wu
    Silviu Borac
    Jonathan Wang
    Eric Teasley
    Lawrence Cai
    Daniel McDuff
    Hao-Wei Su
    Brent Winslow
    Anupam Pathak
    Shwetak Patel
    Jim Taylor
    Jamie Rogers
    (2025)
    Preview abstract Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurements during ordinary smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions – the largest validation study of its kind. Compared to a reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) <10% for HR measurements across three skin tone groups of light, medium and dark pigmentation; MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error <5 bpm compared to a wearable HR tracker, and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring. View details
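The two error metrics quoted in this abstract have standard definitions, sketched below for paired heart-rate readings in beats per minute. The sample values are made up for illustration and are not data from the study; the function names are likewise hypothetical.

```python
def mape(predicted, reference):
    """Mean absolute percentage error, in percent.
    Each absolute error is divided by the matching reference value,
    so reference values must be non-zero."""
    ratios = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return 100.0 * sum(ratios) / len(ratios)


def mae(predicted, reference):
    """Mean absolute error, in the same unit as the inputs (here bpm)."""
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors)


# Made-up readings: estimates from one device against a reference device.
estimated_bpm = [55.0, 90.0]
reference_bpm = [50.0, 100.0]
hr_mape = mape(estimated_bpm, reference_bpm)
hr_mae = mae(estimated_bpm, reference_bpm)
```

MAPE scales each error by the reference value, so the same 5 bpm miss counts for more at a low heart rate than at a high one, whereas MAE treats all misses equally.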
    Preview abstract Perch is a performant pre-trained model for bioacoustics. It was trained in a supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species as well as strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics. View details
    Preview abstract Recently, decomposing complex problems into simple subtasks, a crucial part of human-like natural planning, has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce Plan-Tuning, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average of ~7%. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average performance improvements of ~10% and ~12% on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that Plan-Tuning is an effective strategy for improving task-specific performance of smaller LLMs. View details
    VIDEOPHY-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation
    Kai-Wei Chang
    Hritik Bansal
    Aditya Grover
    Roman Goldenberg
    Clark Peng
    (2025)
    Preview abstract Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions (e.g., playing tennis, backflip) remains unclear. Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. The data and code are available at https://videophy2.github.io/ View details
    Neural Speech and Audio Coding
    Minje Kim
    IEEE Signal Processing Magazine, 41 (2025), pp. 85-93
    Preview abstract This paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It highlights the challenges posed by the subjective evaluation processes of speech and audio codecs and discusses the limitations of purely data-driven approaches, which often require inefficiently large architectures to match the performance of model-based methods. The study presents hybrid systems as a viable solution, offering significant improvements to the performance of conventional codecs through meticulously chosen design enhancements. Specifically, it introduces a neural network-based signal enhancer designed to post-process existing codecs’ output, along with the autoencoder-based end-to-end models and LPCNet—hybrid systems that combine linear predictive coding (LPC) with neural networks. Furthermore, the paper delves into predictive models operating within custom feature spaces (TF-Codec) or predefined transform domains (MDCTNet) and examines the use of psychoacoustically calibrated loss functions to train end-to-end neural audio codecs. Through these investigations, the paper demonstrates the potential of hybrid systems to advance the field of speech and audio coding by bridging the gap between traditional model-based approaches and modern data-driven techniques. View details