Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10822 publications
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Preview abstract AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization. View details
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv for now (2026) (to appear)
    Preview abstract For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using ``quantum read-only memory.'' We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order or magnitude or more for a variety reasonably-sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step. View details
    Oculomics: Current Concepts and Evidence
    Zhuoting Zhu
    Yueye Wang
    Ziyi Qi
    Wenyi Hu
    Xiayin Zhang
    Siegfried Wagner
    Yujie Wang
    An Ran Ran
    Joshua Ong
    Ethan Waisberg
    Mouayad Masalkhi
    Alex Suh
    Yih Chung Tham
    Carol Y. Cheung
    Xiaohong Yang
    Honghua Yu
    Zongyuan Ge
    Wei Wang
    Bin Sheng
    Andrew G. Lee
    Alastair Denniston
    Peter van Wijngaarden
    Pearse Keane
    Ching-Yu Cheng
    Mingguang He
    Tien Yin Wong
    Progress in Retinal and Eye Research (2025)
    Preview abstract The eye provides novel insights into general health, as well as pathogenesis and development of systemic diseases. In the past decade, growing evidence has demonstrated that the eye's structure and function mirror multiple systemic health conditions, especially in cardiovascular diseases, neurodegenerative disorders, and kidney impairments. This has given rise to the field of oculomics- the application of ophthalmic biomarkers to understand mechanisms, detect and predict disease. The development of this field has been accelerated by three major advances: 1) the availability and widespread clinical adoption of high-resolution and non-invasive ophthalmic imaging (“hardware”); 2) the availability of large studies to interrogate associations (“big data”); 3) the development of novel analytical methods, including artificial intelligence (AI) (“software”). Oculomics offers an opportunity to enhance our understanding of the interplay between the eye and the body, while supporting development of innovative diagnostic, prognostic, and therapeutic tools. These advances have been further accelerated by developments in AI, coupled with large-scale linkage datasets linking ocular imaging data with systemic health data. Oculomics also enables the detection, screening, diagnosis, and monitoring of many systemic health conditions. Furthermore, oculomics with AI allows prediction of the risk of systemic diseases, enabling risk stratification, opening up new avenues for prevention or individualized risk prediction and prevention, facilitating personalized medicine. In this review, we summarise current concepts and evidence in the field of oculomics, highlighting the progress that has been made, remaining challenges, and the opportunities for future research. View details
    Balancing AI and Human Insights in Scientific Discovery: Challenges and Guidelines
    Javier García-Martínez
    Pilar Manchon
    Ricardo Vinuesa
    Sergio Hoyas
    The Innovation (2025)
    Preview abstract Recent advancements in large language models (LLMs) have enabled AI systems to autonomously assist in scientific research, from hypothesis generation to laboratory experimentation, transforming how research proposals are written and experiments are designed. Tools like AI "co-scientists" promise to enhance scientific productivity but raise concerns about diminishing human intuition, reinforcing incremental research, and concentrating power among a few entities. As LLMs become increasingly integrated into research processes, there is a risk of reduced creativity, ethical misconduct, and overreliance on AI-driven evaluation systems. To address these challenges, in this article we propose ethical guidelines focusing on transparency, accountability, fairness, and safeguarding transformative research. Ultimately, AI should be used to augment—not replace—human insight in scientific discovery.n View details
    Preview abstract Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. With the rise of foundation models, a number of new synthetic data algorithms privately finetune the weights of foundation models to improve over existing approaches to generating private synthetic data. In this work, we propose two algorithms for using API access only to generate DP tabular synthetic data. We extend the Private Evolution algorithm \citep{lin2023differentially, xie2024differentially} to the tabular data domain, define a workload-based distance measure, and propose a family of algorithms that use one-shot API access to LLMs. View details
    Improving Informally Romanized Language Identification
    Adrian Benton
    Christo Kirov
    Proceedings of EMNLP (2025) (to appear)
    Preview abstract The Latin script is often used informally to write languages with non-Latin native scripts. In many cases (e.g., most languages in India), there is no orthography, meaning that there is no conventional spelling of words in the Latin script, hence there will be high spelling variability in written text. Such romanization can render languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu. In this work, we present methods to improve language identification of romanized text by improving methods to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher language identification system accuracy than including available naturally occurring examples in the training set or even training higher capacity models. We demonstrate new state-of-the-art language identification performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text. View details
    Efficient Single-Step Item ID Generation for Large-ScaleLLM-based Recommendation
    Anushya Subbiah
    Vikram Aggarwal
    James Pine
    Krishna Sayana
    Kun Su
    2025
    Preview abstract Integrating product catalogs and user behavior into LLMs can enhance recommendations with broad world knowledge, but the scale of real-world item catalogs, often containing millions of discrete item identifiers (Item IDs), poses a significant challenge. This contrasts with the smaller, tokenized text vocabularies typically used in LLMs. The predominant view within the LLM-based recommendation literature is that it is infeasible to treat item ids as a first class citizen in the LLM and instead some sort of tokenization of an item into multiple tokens is required. However, this creates a key practical bottleneck in serving these models for real-time low-latency applications. Our paper challenges this predominant practice and integrates item ids as first class citizens into the LLM. We provide simple, yet highly effective, novel training and inference modifications that enable single-token representations of items and single-step decoding. Our method shows improvements in recommendation quality (Recall and NDCG) over existing techniques on the Amazon shopping datasets while significantly improving inference efficiency by 5x-14x. Our work offers an efficiency perspective distinct from that of other popular approaches within LLM-based recommendation, potentially inspiring further research and opening up a new direction for integrating IDs into LLMs. Our code is available here https://drive.google.com/file/d/1cUMj37rV0Z1bCWMdhQ6i4q4eTRQLURtC/edit View details
    Preview abstract Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing LLMs with relevant and up-to-date information. However, the retrieved sources can often bring conflicting information and it is not clear how models address such discrepancies. In this work, we first point out that knowledge conflicts stem from various reasons and thus require tailored solutions in order to better align model responses to human preferences. To that end, we introduce a novel taxonomy of knowledge conflicts in RAG and define the desired model’s behavior for each category. Additionally, we construct a high-quality benchmark by asking two expert annotators to identify the conflict type within realistic RAG instances, each comprising a query and its associated search results. Finally, we conduct extensive experiments and show that explicitly informing LLMs about the potential conflict category significantly improves the quality and appropriateness of the responses. Yet, there is still a vast room for improvement. Taken together, our work highlights the importance of evaluating RAG systems not only on factual accuracy but also on their ability to manage and resolve knowledge conflicts effectively. View details
    Preview abstract As large language models (LLMs) improve in their capacity to serve as personal AI assistants, their ability to output uniquely tailored, personalized responses that align with the soft preferences of their users is imperative for maximizing user satisfaction and retention. However, lay users are notoriously bad at prompt specification and often struggle with conveying their latent preferences to AI assistants. To resolve this, we demonstrate that activation steering, an inference-time method, can effectively control the response of the LLMs towards expressing different preferences. In contrast to memory-based personalization methods that require long user history, steering is extremely lightweight and easily-controllable via an interpretable linear strength factor. We further conduct a within-subjects user study (n=14) to investigate how end users personalize their conversations through three different steerable chatbot interfaces. The results demonstrate the effectiveness of preference-based steering for aligning real-world conversations with user preferences, and we discuss qualitative findings on how diverse values around control, transparency, and usability of personalization lead users to prefer different interfaces. View details
    Beyond Digital Literacy: Building Youth Digital Resilience Through Existing “Information Sensibility” Practices
    Mia Hassoun
    Ian Beacock
    Todd Carmody
    Patrick Gage Kelley
    Beth Goldberg
    Devika Kumar
    Laura Murray
    Rebekah Park
    Behzad Sarmadi
    Social Sciences Journal, 14(4) (2025)
    Preview abstract Youth media consumption and disordered eating practices have historically been subjects of moral panics, often resulting in protective, deficit-based interventions like content removal. We argue for interventions which instead equip youth to evaluate and manage risks in their online environments, building upon their existing “information sensibility” practices. Drawing upon ethnographic research and intervention testing with 77 participants in the US and India, we analyze how youth (aged 13–26), including those with diverse political perspectives and those recovering from disordered eating (DE), engage with online news and health information. Participants generally algorithmically encountered (rather than searched for) information online, and their engagement was shaped more by social motivations—like belonging—than truth seeking. Participants interpreted online information collaboratively, relying on social cues and peer validation within their online communities. They demonstrated preference for personal testimonies and relatable sources, particularly those with similar social identities. We propose resilience-building interventions that build upon these youth online information practices by: (1) leveraging peer networks, promoting critical information engagement through collaborative learning and peer-to-peer support within online communities; (2) developing social media sensibility, equipping youth to critically evaluate information sources in situ; (3) providing pathways offline, connecting youth to desired in-person communities; and (4) encouraging probabilistic thinking. View details
    Enhanced $H$-Consistency Bounds
    Anqi Mao
    Proceedings of the 36th International Conference on Algorithmic Learning Theory (ALT 2025)
    Preview abstract Recent research has introduced a key notion of $H$-consistency bounds for surrogate losses. These bounds offer finite-sample guarantees, quantifying the relationship between the zero-one estimation error (or other target loss) and the surrogate loss estimation error for a specific hypothesis set. However, previous bounds were derived under the condition that a lower bound of the surrogate loss conditional regret is given as a convex function of the target conditional regret, without non-constant factors depending on the predictor or input instance. Can we derive finer and more favorable $H$-consistency bounds? In this work, we relax this condition and present a general framework for establishing *enhanced $H$-consistency bounds* based on more general inequalities relating conditional regrets. Our theorems not only subsume existing results as special cases but also enable the derivation of more favorable bounds in various scenarios. These include standard multi-class classification, binary and multi-class classification under Tsybakov noise conditions, and bipartite ranking. View details
    Beyond Retrieval: Generating Narratives in Conversational Recommender Systems
    Krishna Sayana
    Raghavendra Vasudeva
    Yuri Vasilevski
    Kun Su
    Liam Hebert
    James Pine
    Hubert Pham
    Ambarish Jash
    Sukhdeep Sodhi
    (2025)
    Preview abstract Large Language Models (LLMs) have shown remarkable progress in generating human-quality text and engaging in complex reasoning. This presents a unique opportunity to revolutionize conversational recommender systems by enabling them to generate rich, engaging and personalized narratives that go beyond recommendations. However, the lack of suitable datasets limits research in this area. This paper addresses this challenge by making two key contributions. First, we introduce REGEN Reviews Enhanced with GEnerative Narratives, a new dataset extending the Amazon Product Reviews with rich user narratives. Furthermore, we perform an extensive automated evaluation of the dataset using a rater LLM. Second, the paper introduces a fusion architecture (CF model with an LLM) which serves as a baseline for REGEN. To the best of our knowledge, this represents the first attempt to analyze the capabilities of LLMs in understanding recommender signals and generating rich narratives. We demonstrate that LLMs can effectively learn from simple fusion architectures utilizing interaction-based CF embeddings, and this can be further enhanced using the metadata and personalization data associated with items. Our experiments show that combining CF and content embeddings leads to improvements of 4-12% in key language metrics compared to using either type of embedding individually. We also provide an analysis to interpret their contributions to this new generative task. View details
    Data Selection for ERMs
    Alexander Shlimovich
    Steve Hanneke
    Amir Yehudayoff
    Shay Moran
    2025
    Preview abstract Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule \(\mathcal{A}\) and a data selection budget \(n\), how well can \(\mathcal{A}\) perform when trained on at most \(n\) data points selected from a population of \(N\) points? We investigate when it is possible to select \(n \ll N\) points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research. View details
    Preview abstract Responsible AI advocates for user evaluations, particularly when concerning people with disabilities, health conditions, and accessibility needs ( DHA)–wide- ranging but umbrellaed sociodemograph- ics. However, community- centered text- to- image AI’s ( T2I) evaluations are often researcher- led, situating evaluators as consumers. We instead recruited 21 people with diverse DHA to evaluate T2I by writing and editing their own T2I prompts with their preferred language and topics, in a method mirroring everyday use. We contribute user- generated terminology categories which inform future research and data collections, necessary for developing authentic scaled evaluations. We additionally surface yet- discussed DHA AI harms intersecting race and class, and participants shared harm impacts they experienced as image- creator evaluators. To this end, we demonstrate that prompt engineering– proposed as a misrepresentation mitigation– was largely ineffective at improving DHA representations. We discuss the importance of evaluator agency to increase ecological validity in community- centered evaluations, and opportunities to research iterative prompting as an evaluation technique. View details
    On Design Principles for Private Adaptive Optimizers
    Abhradeep Guha Thakurta
    Arun Ganesh
    Privacy-Preserving Machine Learning Workshop 2025 (2025) (to appear)
    Preview abstract The spherical noise added to gradients in differentially private (DP) training undermines the performance of adaptive optimizers like AdaGrad and Adam, and hence many recent works have proposed algorithms to address this challenge. However, the empirical results in these works focus on simple tasks and models and the conclusions may not generalize to model training in practice. In this paper we survey several of these variants, and develop better theoretical intuition for them as well as perform empirical studies comparing them. We find that a common intuition of aiming for unbiased estimates of second moments of gradients in adaptive optimizers is misguided, and instead that a simple technique called scale-then-privatize (which does not achieve unbiased second moments) has more desirable theoretical behaviors and outperforms all other variants we study on a small-scale language model training task. We additionally argue that scale-then-privatize causes the noise addition to better match the application of correlated noise mechanisms which are more desirable to use in practice. View details
    ×