Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10793 publications
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Preview abstract AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization. View details
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv for now (2026) (to appear)
    Preview abstract For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using ``quantum read-only memory.'' We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order or magnitude or more for a variety reasonably-sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step. View details
    Preview abstract Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal, explaining the historical necessity of normalization layers in training deep models. This insight offers a first-principles explanation for the historical dependence on normalization layers and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network’s inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity controls that also implicitly regularizes the model’s complexity. View details
    Fast Tensor Completion via Approximate Richardson Iteration
    Mehrdad Ghadiri
    Yunbum Kook
    Ali Jadbabaie
    Proceedings of the 42nd International Conference on Machine Learning (2025)
    Preview abstract We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods, which solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is lost in TC regression problems, making direct extensions unclear. To address this, we propose a lifting approach that approximately solves TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We theoretically analyze the convergence rate of our approximate Richardson iteration based algorithm, and we demonstrate on real-world tensors that its running time can be 100x faster than direct methods for CP completion. View details
    Understanding challenges to the validity of disaggregated evaluations for algorithmic fairness
    Chirag Nagpal
    David Madras
    Vishwali Mhasawade
    Olawale Salaudeen
    Shannon Sequeira
    Santiago Arciniegas
    Lillian Sung
    Nnamdi Ezeanochie
    Heather Cole-Lewis
    Sanmi Koyejo
    Proceedings of the 2025 Conference on Neural Information Processing Systems (NeurIPS) (2025)
    Preview abstract Disaggregated evaluation across subgroups is critical for assessing the fairness of machine learning models, but its uncritical use can mislead practitioners. We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of the relevant populations but reflective of real-world disparities. Furthermore, when data are not representative due to selection bias, both disaggregated evaluation and alternative approaches based on conditional independence testing may be invalid without explicit assumptions regarding the bias mechanism. We use causal graphical models to characterize fairness properties and metric stability across subgroups under different data generating processes. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift, including conditional independence testing and weighted performance estimation. These findings have broad implications for how practitioners design and interpret model assessments given the ubiquity of disaggregated evaluation. View details
    Preview abstract Judging an action’s safety requires knowledge of the context in which the action takes place. To human agents who act in various contexts, this may seem obvious: performing an action such as email deletion may or may not be appropriate depending on the email’s content, the goal (e.g., to erase sensitive emails or to clean up trash), and the type of email address (e.g., work or personal). Unlike people, computational systems have often had only limited agency in limited contexts. Thus, manually crafted policies and user confirmation (e.g., smartphone app permissions or network access control lists), while imperfect, have sufficed to restrict harmful actions. However, with the upcoming deployment of generalist agents that support a multitude of tasks (e.g., an automated personal assistant), we argue that we must rethink security designs to adapt to the scale of contexts and capabilities of these systems. As a first step, this paper explores contextual security in the domain of agents and proposes contextual agent security (Conseca), a framework to generate just-in-time, contextual, and human-verifiable security policies. View details
    Preview abstract Measuring software development can help drive impactful change. However, it’s a complex task, and getting started can be daunting as it involves understanding what you should measure, and determining what you can measure. This article provides a guide to selecting a framework that aligns with organizational measurement strategy. View details
    Bridging Sign and Spoken Languages: Pseudo GlossGeneration for Sign Language Translation
    Trevor Cohn
    Jianyuan Guo
    Advances in Neural Information Processing Systems (NeurIPS) (2025)
    Preview abstract Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach leverages gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm relies on expert-annotated gloss labels, which are costly and increasingly unavailable in many datasets, limiting scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with example text-gloss pairs to extract potential sign-related gloss words from the text by leveraging its in-context learning capability. To mitigate the inherent misalignment between generated pseudo glosses and sign sequences in the video, we further refine their order by formulating the alignment as a weakly supervised learning problem. With the reordered pseudo-glosses, additional alignment losses such as CTC can be incorporated to enhance supervision. We train our SLT model—comprising a vision encoder and a translator—under a three-stage pipeline, effectively bridging the gap between sign and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks across three SLT benchmarks and achieves competitive results with gloss-based methods. View details
    Integer Programming for Generalized Causal Bootstrap Designs
    Adel Javanmard
    Nick Doudchenko
    Proceedings of the 42 nd International Conference on Machine Learning (2025)
    Preview abstract In experimental causal inference, we distinguish between two sources of uncertainty: design uncertainty, due to the treatment assignment mechanism, and sampling uncertainty, when the sample is drawn from a super-population. This distinction matters in settings with small fixed samples and heterogeneous treatment effects, as in geographical experiments. The standard bootstrap procedure most often used by practitioners primarily estimates sampling uncertainty, and the causal bootstrap procedure, which accounts for design uncertainty, was developed for the completely randomized design and the difference-in-means estimator, whereas non-standard designs and estimators are often used in these low-power regimes. We address this gap by proposing an integer program which computes numerically the worst-case copula used as an input to the causal bootstrap method, in a wide range of settings. Specifically, we prove the asymptotic validity of our approach for unconfounded, conditionally unconfounded, and individualistic with bounded confoundedness assignments, as well as generalizing to any linear-in-treatment and quadratic-in-treatment estimators. We demonstrate the refined confidence intervals achieved through simulations of small geographical experiments. View details
    Preview abstract Google has a long tradition of open-source software, which encompasses the field of operations research with OR-Tools. In development since 2008, it offers several solvers useful to many OR practitioners: - PDLP, a revolutionary first-order linear solver that is reshaping the landscape of linear optimisation; - CP-SAT, an award-winning constraint-programming solver; - Glop, an accurate linear solver; - Routing, a vehicle routing solver underpinning Google Maps Platform Route Optimization. OR-Tools has long had its features accessible from other languages: the core algorithms are implemented in C++ for performance, but users can tap into them in Python, Java, C#, or Go. It is recently available in Julia too, with a current focus on the linear and constraint solvers, either locally or remotely. We provide a wrapper for our solvers that brings them to JuMP.jl through MathOptInterface.jl. This tutorial will walk you through the features of OR-Tools and its solvers, then show examples of using OR-Tools from within Julia, either through JuMP or a lower-level interface. We will also share our experience of C++-Julia interop. View details
    Preview abstract Generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), have demonstrated significant potential in clinical reasoning skills such as history-taking and differential diagnosis generation—critical aspects of medical education. This work explores how LLMs can augment medical curricula through interactive learning. We conducted a participatory design process with medical students, residents and medical education experts to co-create an AI-powered tutor prototype for clinical reasoning. As part of the co-design process, we conducted a qualitative user study, investigating learning needs and practices via interviews, and conducting concept evaluations through interactions with the prototype. Findings highlight the challenges learners face in transitioning from theoretical knowledge to practical application, and how an AI tutor can provide personalized practice and feedback. We conclude with design considerations, emphasizing the importance of context-specific knowledge and emulating positive preceptor traits, to guide the development of AI tools for medical education. View details
    Preview abstract Decoder-based large language models (LLMs) have proven highly versatile, with remarkable successes even on problems ostensibly removed from traditional language generation. One such example is solving regression problems, where the targets are real numbers rather than textual tokens. A common approach to use LLMs on such problems is to perform fine-tuning based on the cross-entropy loss, and use autoregressive sampling at inference time. Another approach relies on fine-tuning a separate predictive head with a suitable loss such as squared error. While each approach has had success, there has been limited study on principled ways of using decoder LLMs for regression. In this work, we compare different prior works under a unified view, and introduce regression-aware fine-tuning(RAFT), a novel approach based on the Bayes-optimal decision rule. We demonstrate how RAFT improves over established baselines on several benchmarks and model families. View details
    Faster electronic structure quantum simulation by spectrum amplification
    Guang Hao Low
    Robbie King
    Alec White
    Rolando Somma
    Dominic Berry
    Qiushi Han
    Albert Eugene DePrince III
    arXiv (2025) (to appear)
    Preview abstract We discover that many interesting electronic structure Hamiltonians have a compact and close-to-frustration-free sum-of-squares representation with a small energy gap. We show that this gap enables spectrum amplification in estimating ground state energies, which improves the cost scaling of previous approaches from the block-encoding normalization factor $\lambda$ to just $\sqrt{\lambda E_{\text{gap}}}$. For any constant-degree polynomial basis of fermionic operators, a sum-of-squares representation with optimal gap can be efficiently computed using semi-definite programming. Although the gap can be made arbitrarily small with an exponential-size basis, we find that the degree-$2$ spin-free basis in combination with approximating two-body interactions by a new Double-Factorized (DF) generalization of Tensor-Hyper-Contraction (THC) gives an excellent balance of gap, $\lambda$, and block-encoding costs. For classically-hard FeMoco complexes -- candidate applications for first useful quantum advantage -- this combination improves the Toffoli gates cost of the first estimates with DF [Phys. Rev. Research 3, 033055] or THC [PRX Quantum 2, 030305] by over two orders of magnitude. https://drive.google.com/file/d/1hw4zFv_X0GeMpE4et6SS9gAUM9My98iJ/view?usp=sharing View details
    Preview abstract Large Language Models (LLMs) have demonstrated impressive capabilities across a range of natural language processing tasks. In particular, improvements in reasoning abilities and the expansion of context windows have opened new avenues for leveraging these powerful models. NL2SQL is challenging in that the natural language question is inherently ambiguous, while the SQL generation requires a precise understanding of complex data schema and semantics. One approach to this semantic ambiguous problem is to provide more and sufficient contextual information. In this work, we explore the performance and the latency trade-offs of the extended context window (a.k.a., long context) offered by Google's state-of-the-art LLM (\textit{gemini-1.5-pro}). We study the impact of various contextual information, including column example values, question and SQL query pairs, user-provided hints, SQL documentation, and schema. To the best of our knowledge, this is the first work to study how the extended context window and extra contextual information can help NL2SQL generation with respect to both accuracy and latency cost. We show that long context LLMs are robust and do not get lost in the extended contextual information. Additionally, our long-context NL2SQL pipeline based on Google's \textit{gemini-pro-1.5} achieve a strong performance with 67.41\% on BIRD benchmark (dev) without finetuning and expensive self-consistency based techniques. View details
    Development and Evaluation of ML Models for Cardiotocography Interpretation
    Nicole Chiou
    Nichole Young-Lin
    Abdoulaye Diack
    Christopher Kelly
    Sanmi Koyejo
    NPJ Women's Health (2025)
    Preview abstract The inherent variability in the visual interpretation of cardiotocograms (CTGs) by obstetric clinical experts, both intra- and inter-observer, presents a substantial challenge in obstetric care. In response, we investigate automated CTG interpretation as a potential solution to enhance the early detection of fetal hypoxia during labor, thereby reducing unnecessary operative interventions and improving overall maternal and neonatal care. This study employs deep learning techniques to reduce the subjectivity associated with visual CTG interpretation. Our results demonstrate that employing objective cord blood pH measurements, rather than clinician-defined Apgar scores, yields more consistent and robust model performance. Additionally, through a series of ablation studies, we investigate the impact of temporal distribution shifts on the performance of these deep learning models. We examine tradeoffs between performance and fairness, specifically evaluating performance across demographic and clinical subgroups. Finally, we discuss the practical implications of our findings for the real-world deployment of such systems, emphasizing their potential utility in medical settings with limited resources. View details