Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.


Showing 15 of 10,795 publications
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals the critical strengths and limitations of current agentic approaches, offering actionable guidance on their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv (2026), to appear
    For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost comparable to that of a single execution [1-3]. We combine existing mass-production results with modern approaches for loading classical data using "quantum read-only memory." We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a cost reduction of an order of magnitude or more for a variety of reasonably sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
    The need for characterizing global variability of atmospheric carbon dioxide (CO2) is quickly increasing, with a growing urgency for tracking greenhouse gases with sufficient resolution, precision and accuracy so as to support independent verification of CO2 fluxes at local to global scales. The current generation of space-based sensors, however, can only provide sparse observations in space and/or in time, by design. While upcoming missions may address some of these challenges, most are still years away from launch. This challenge has fueled interest in the potential use of data from existing missions originally developed for other applications for inferring global greenhouse gas variability. The Advanced Baseline Imager (ABI) onboard the Geostationary Operational Environmental Satellite (GOES-East), operational since 2017, provides full coverage of much of the western hemisphere at 10-minute intervals from geostationary orbit at 16 wavelengths. We leverage this high temporal resolution by developing a single-pixel, fully-connected neural network to estimate dry-air column CO2 mole fractions (XCO2). The model employs a time series of GOES-East's 16 spectral bands, which aids in disentangling atmospheric CO2 from surface reflectance, alongside ECMWF ERA5 lower tropospheric meteorology, solar angles, and day of year. Training used collocated GOES-East and OCO-2/OCO-3 observations (2017-2020, within 5 km and 10 minutes), with validation and testing performed on 2021 data. The model successfully captures monthly latitudinal XCO2 gradients and shows reasonable agreement with ground-based TCCON measurements. Furthermore, we demonstrate the model's ability to detect elevated XCO2 signals from high-emitting power plants, particularly over low-reflectance surfaces. We also confirm that removing bands 5 (1.6 µm) and 16 (13.3 µm) substantially decreases performance, indicating that the model is able to extract useful information from these bands.
    Although GOES-East derived XCO2 precision may not rival dedicated instruments, its unprecedented combination of contiguous geographic coverage, 10-minute temporal frequency, and multi-year record offers the potential to observe aspects of atmospheric CO2 variability currently unseen from space, with further potential through spatio-temporal aggregation.
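The single-pixel model described above can be sketched at the shape level: a flat feature vector (band time series, meteorology, solar geometry, day of year) fed through a small fully connected network. Every size, encoding, and weight below is illustrative, not the paper's actual configuration.

```python
import math

# Illustrative feature assembly for one pixel (sizes are assumptions,
# not the paper's): 16 bands x 6 time steps, 4 meteorology fields,
# 2 solar angles, and a day-of-year encoding.
N_BANDS, N_TIMES, N_MET = 16, 6, 4
features = ([0.1] * (N_BANDS * N_TIMES)              # band time series
            + [0.5] * N_MET                          # ERA5 stand-ins
            + [0.3, 0.2]                             # solar angles (stand-ins)
            + [math.sin(2 * math.pi * 120 / 365)])   # day-of-year encoding

def dense(x, w, b):
    """One fully connected layer: w is a list of weight rows, b the biases."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

# Tiny two-layer network with constant toy weights, ReLU hidden layer.
hidden = [max(0.0, h)
          for h in dense(features, [[0.01] * len(features)] * 8, [0.0] * 8)]
xco2 = dense(hidden, [[0.1] * 8], [410.0])[0]   # ppm-scale output
print(len(features), round(xco2, 2))
```

The point is only the data flow: one pixel's multi-band time series collapses to a single XCO2 estimate, with no spatial context used.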
    Spherical dimension
    Bogdan Chornomaz
    Shay Moran
    Tom Waknine
    2025
    We introduce and study the spherical dimension, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size d (in the VC sense) is completed into a continuous object, specifically a d-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space; thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate its utility in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together.
    Recent work suggested utilizing inference compute, showing that scaling the number of samples consistently improves the fraction of problems solved by at least one attempt, i.e., the coverage. In this work, we argue that inference scaling gains should be compared with proper baselines, as some datasets become degenerate when a large number of attempts is allowed. We focus on two domains, mathematical reasoning and factual knowledge, and show that for the MATH and Entity Questions datasets, informed answer enumeration obtains similar or even better results than repeated model sampling, with a much lower sample budget. While we believe that inference scaling is a promising approach for unlocking the potential of language models, we recommend carefully selecting models and datasets when applying this method; otherwise, the results of inference scaling should be interpreted with caution.
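The coverage metric and the enumeration baseline contrasted in this abstract are easy to state in code. The data below is made up for illustration; it is not from MATH or Entity Questions.

```python
def coverage(attempt_results):
    """attempt_results: one list of per-sample booleans per problem.
    A problem counts as covered if any attempt solved it."""
    solved = sum(1 for attempts in attempt_results if any(attempts))
    return solved / len(attempt_results)

def enumeration_coverage(gold_answers, candidate_pool, k):
    """Informed enumeration baseline: answer every problem with the
    k most plausible candidates instead of sampling a model k times."""
    top_k = set(candidate_pool[:k])
    return sum(1 for g in gold_answers if g in top_k) / len(gold_answers)

# Repeated sampling: 3 problems, 4 sampled attempts each (toy outcomes).
sampling = [[False, False, True, False],
            [False, False, False, False],
            [True, True, False, True]]
print(coverage(sampling))    # 2 of 3 problems covered by some attempt

# Enumeration can match that coverage with a budget of only k = 2 guesses.
print(enumeration_coverage(["a", "b", "c"], ["a", "c", "d"], k=2))
```

This is the degenerate-dataset concern in miniature: when the answer space is small, a fixed guess list inflates coverage without any model capability.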
    RADAR: Benchmarking Language Models on Imperfect Tabular Data
    Ken Gu
    Kumar Ayush
    Hong Yu
    Zhihan Zhang
    Yuzhe Yang
    Shwetak Patel
    Max Xu
    Mark Malhotra
    Orson Xu
    Evelyn Zhang
    Tim Althoff
    2025
    Language models (LMs) are increasingly being deployed to perform autonomous data analyses, yet their data awareness (the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies) remains under-explored. These artifacts are common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data awareness on tabular data. RADAR introduces programmatic perturbations for each unique query-table pair, enabling targeted evaluation of model behavior. RADAR comprises 2,500 queries for data analysis across 55 datasets spanning 20 domains and 5 data awareness dimensions. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance scales with input length. In our evaluation, we identify fundamental gaps in models' ability to perform reliable, data-aware analyses. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
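Programmatic perturbations of the kind the abstract mentions can be sketched in a few lines. This is a hypothetical illustration in the spirit of the benchmark, not RADAR's actual perturbation code; column names and magnitudes are invented.

```python
import copy
import random

def perturb_missing(rows, column, fraction, rng):
    """Null out `fraction` of a column to simulate missing values."""
    out = copy.deepcopy(rows)
    k = max(1, int(fraction * len(out)))
    for i in rng.sample(range(len(out)), k):
        out[i][column] = None
    return out

def perturb_outlier(rows, column, scale=100.0):
    """Inject an implausibly large value into the first row."""
    out = copy.deepcopy(rows)
    out[0][column] = out[0][column] * scale
    return out

table = [{"patient": i, "heart_rate": 70 + i} for i in range(10)]
rng = random.Random(0)
missing = perturb_missing(table, "heart_rate", 0.2, rng)
outlier = perturb_outlier(table, "heart_rate")
print(sum(r["heart_rate"] is None for r in missing))   # 2 rows nulled
```

Because the clean table and the perturbed table answer the same query differently, a model's response reveals whether it noticed and handled the artifact.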
    As one of the world's most populous countries, with 700 languages spoken, Indonesia lags behind in NLP progress. We introduce Lorax, a benchmark that focuses on low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, language inference, causal reasoning, translation, and cultural QA. We cover 20 languages, with the addition of 2 politeness registers for 3 of the languages. Since benchmarks are essential to driving such progress, this data should be a useful contribution to the community. We benchmark a diverse set of multilingual and region-focused LLMs and find that the benchmark is challenging. We note a visible discrepancy between performance on Indonesian and on other languages, especially the low-resource ones, and no clear lead for region-specific models over general multilingual ones. Lastly, we show that a change in register affects model performance, especially for registers not commonly found in social media, such as the high-politeness 'Krama' register of Javanese.
    RemapRoute: Local Remapping of Internet Path Changes
    Renata Cruz Teixeira
    Italo Cunha
    Elverton Fazzion
    Darryl Veitch
    2025
    Several systems rely on traceroute to track a large number of Internet paths as they change over time. Monitoring systems perform this task by remapping paths periodically or whenever a change is detected. This paper shows that such complete remapping is inefficient, because most path changes are localized to a few hops of a path. We develop RemapRoute, a tool to remap a path locally given the previously known path and a change point. RemapRoute sends targeted probes to locate and remap the often few hops that have changed. Our evaluation with trace-driven simulations and in a real deployment shows that local remapping reduces the average number of probes issued during remapping by 63% and 79%, respectively, when compared with complete remapping. At the same time, our results show that local remapping has little impact on the accuracy of inferred paths.
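The core idea, re-probing only a window of hops around the change point and splicing the result into the previously known path, can be shown schematically. This is not the authors' implementation; the probe interface and window policy are invented for illustration.

```python
def local_remap(old_path, change_point, probe, window=1):
    """old_path: hop list; probe(ttl) returns the hop now seen at that TTL.
    Re-probes only a window around the change point and splices it in."""
    start = max(0, change_point - window)
    end = min(len(old_path), change_point + window + 1)
    relocated = [probe(ttl) for ttl in range(start, end)]  # targeted probes
    # (A real tool would widen the window until its edges match old_path.)
    return old_path[:start] + relocated + old_path[end:]

old = ["a", "b", "c", "d", "e", "f"]
new = ["a", "b", "x", "y", "e", "f"]      # hops at TTL 2-3 changed
probes_sent = []

def probe(ttl):
    probes_sent.append(ttl)
    return new[ttl]

remapped = local_remap(old, change_point=2, probe=probe)
print(remapped, "using", len(probes_sent), "probes instead of", len(old))
```

Even in this toy case, only 3 of 6 hops are probed; the paper's 63-79% probe reduction comes from the same locality on real paths.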
    Generative AI's potential for hallucinations and inaccuracies is by far the most discussed limitation in AI-assisted software development. But whether developers have other concerns about using generative AI in their coding practice has not been thoroughly explored. This article describes the results of in-depth interviews with developers about their concerns with generative AI in coding beyond the tools' accuracy, and discusses related policy implications for organizations developing software.
    Quantum algorithm for linear matrix equations
    Rolando Somma
    Guang Hao Low
    Dominic Berry
    arXiv (2025)
    We describe an efficient quantum algorithm for solving the linear matrix equation AX+XB=C, where A, B, and C are given complex matrices and X is unknown. This is known as the Sylvester equation, a fundamental equation with applications in control theory and physics. Our approach constructs the solution matrix X/x in a block-encoding, where x is a rescaling factor needed for normalization. This allows us to obtain certain properties of the entries of X exponentially faster than would be possible from preparing X as a quantum state. The query and gate complexities of the quantum circuit that implements this block-encoding are almost linear in a condition number that depends on A and B, and depend logarithmically on the dimension and inverse error. We show how our quantum circuits can solve BQP-complete problems efficiently, discuss potential applications and extensions of our approach and its connection to the Riccati equation, and comment on open problems.
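For concreteness, here is the classical equation the algorithm targets, checked on a tiny instance in pure Python. The diagonal choice of A and B is for illustration only (it makes the equation decouple entrywise); the paper's contribution is a quantum block-encoding of X, not a classical solver.

```python
def matmul(P, Q):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def madd(P, Q):
    return [[P[i][j] + Q[i][j] for j in range(len(P[0]))]
            for i in range(len(P))]

A = [[2.0, 0.0], [0.0, 3.0]]   # diagonal, for simplicity
B = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 2.0], [3.0, 4.0]]
C = madd(matmul(A, X), matmul(X, B))    # C = AX + XB by construction

# For diagonal A and B the Sylvester equation decouples entrywise:
# X[i][j] = C[i][j] / (A[i][i] + B[j][j]).
X_rec = [[C[i][j] / (A[i][i] + B[j][j]) for j in range(2)] for i in range(2)]
print(C)            # [[3.0, 6.0], [12.0, 16.0]]
print(X_rec == X)   # True
```

The decoupled formula also shows where the conditioning enters: entries with small A[i][i] + B[j][j] are the ill-conditioned ones.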
    V𝜖rity: Verifiable Local Differential Privacy
    Amrita Roy Chowdhury
    Baiyu Li
    Adria Gascon
    James Bell-Clark
    2025
    Local differential privacy (LDP) enables individuals to report sensitive data while preserving privacy. Unfortunately, LDP mechanisms are vulnerable to poisoning attacks, in which adversaries controlling a fraction of the reporting users can distort the aggregate output far more than in a non-private solution where inputs are reported directly. In this paper, we present two novel solutions that prevent poisoning attacks under LDP while preserving its privacy guarantees. Our first solution, Vϵrity-Auth, addresses scenarios where users report inputs with a ground truth available to a third party. The second solution, Vϵrity, tackles the more challenging case in which users locally generate their input and there is no ground truth that can be used to bootstrap verifiable randomness generation.
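The amplification of poisoning under LDP can be seen in classical single-bit randomized response, sketched below. This illustrates the attack surface that verifiable mechanisms like these defend; it is not the paper's protocol.

```python
import math

def debiased_mean(reported_ones, n, eps):
    """Debiased aggregate for epsilon-LDP randomized response on one bit:
    each user reports their true bit with probability p = e^eps/(1+e^eps)."""
    p = math.exp(eps) / (1 + math.exp(eps))
    observed = reported_ones / n
    return (observed - (1 - p)) / (2 * p - 1)

# 1000 honest users, true mean 0.5, eps = 1: with perfectly mixed noise the
# expected fraction of reported 1s is 0.5, so debiasing recovers 0.5.
print(debiased_mean(500, 1000, 1.0))

# If 50 poisoned users always report 1, the 0.05 shift in the observed
# fraction is amplified by the 1/(2p - 1) debiasing factor.
print(debiased_mean(550, 1000, 1.0))
```

With eps = 1 the amplification factor 1/(2p - 1) is roughly 2.16, so a 5% poisoned cohort moves the estimate by about 0.11 rather than 0.05, exactly the gap between LDP and direct reporting that the abstract highlights.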
    Probing non-equilibrium topological order on a quantum processor
    Melissa Will
    Tyler Cochran
    Bernhard Jobst
    Norhan Eassa
    Michael Knap
    Adam Gammon-Smith
    Frank Pollmann
    Nature, 645 (2025), 348–353
    Out-of-equilibrium phases in many-body systems constitute a new paradigm in quantum matter—they exhibit dynamical properties that may otherwise be forbidden by equilibrium thermodynamics. Among these non-equilibrium phases are periodically driven (Floquet) systems, which are generically difficult to simulate classically because of their high entanglement. Here we realize a Floquet topologically ordered state on an array of superconducting qubits. We image the characteristic dynamics of its chiral edge modes and characterize its emergent anyonic excitations. Devising an interferometric algorithm allows us to introduce and measure a bulk topological invariant to probe the dynamical transmutation of anyons for system sizes up to 58 qubits. Our work demonstrates that quantum processors can provide key insights into the thus-far largely unexplored landscape of highly entangled non-equilibrium phases of matter.
    Despite exceptional achievements, training neural networks remains computationally expensive and is often plagued by instabilities that can degrade convergence. While learning rate schedules can help mitigate these issues, finding optimal schedules is time-consuming and resource-intensive. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. Surprisingly, we show that the order of gradient updates affects stability and convergence in gradient-based optimizers. We illustrate this new line of thinking using backward-SGD, which processes batch gradient updates like SGD but in reverse order. Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point while the standard forward-SGD generally only converges to a distribution. This leads to improved stability and convergence, which we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights opportunities to exploit reverse training dynamics (or more generally alternate iteration orders) to improve training. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.
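That update order matters at all can be seen on a toy 1-D quadratic. The sketch below simply replays the same per-batch gradient steps in reverse order each epoch; this is a simplified reading of "reverse order", not the paper's precise backward-SGD construction.

```python
def sgd_epoch(w, targets, lr, reverse=False):
    """One epoch of SGD on losses 0.5 * (w - m)**2, one per batch target m."""
    order = list(reversed(targets)) if reverse else targets
    for m in order:
        w -= lr * (w - m)       # gradient step on 0.5 * (w - m)**2
    return w

targets = [1.0, 2.0, 3.0]
w_fwd = w_bwd = 0.0
for _ in range(100):
    w_fwd = sgd_epoch(w_fwd, targets, lr=0.5)
    w_bwd = sgd_epoch(w_bwd, targets, lr=0.5, reverse=True)
print(round(w_fwd, 3), round(w_bwd, 3))   # distinct limit points
```

Each deterministic cyclic order is a contraction here, so both runs converge to a point, but to different points (17/7 vs 11/7), showing that the sequence of gradient updates, not just their set, shapes where training lands.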
    Buffered Linear Toeplitz (BLT) matrices are a family of parameterized lower-triangular matrices that play an important role in streaming differential privacy with correlated noise. Our main result is a BLT inversion theorem: the inverse of a BLT matrix is itself a BLT matrix with different parameters. We also present an efficient and differentiable O(d^3) algorithm to compute the parameters of the inverse BLT matrix, where d is the degree of the original BLT (typically d < 10). Our characterization enables direct optimization of BLT parameters for privacy mechanisms through automatic differentiation.
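A flavor of such closure-under-inversion results can be checked directly in the plain lower-triangular Toeplitz case: the inverse is again lower-triangular Toeplitz, and its first column follows a simple recurrence. (BLT matrices carry extra parameterized structure; this is only the generic Toeplitz case, shown for intuition.)

```python
def toeplitz_inverse_column(t):
    """First column of the inverse of the lower-triangular Toeplitz matrix
    whose first column is t = [t0, t1, ...]; requires t0 != 0."""
    s = [1.0 / t[0]]
    for k in range(1, len(t)):
        # Row k of L @ Linv = I gives: sum_j t[k-j] * s[j] = 0 for k >= 1.
        s.append(-s[0] * sum(t[k - j] * s[j] for j in range(k)))
    return s

def lower_toeplitz(col):
    """Build the lower-triangular Toeplitz matrix with first column `col`."""
    n = len(col)
    return [[col[i - j] if i >= j else 0.0 for j in range(n)]
            for i in range(n)]

t = [2.0, 1.0, 0.5, 0.25]
L = lower_toeplitz(t)
Linv = lower_toeplitz(toeplitz_inverse_column(t))
prod = [[sum(L[i][k] * Linv[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)]
print(prod)   # identity, up to floating point
```

Since the inverse is fully determined by one column, inverting costs O(n^2) here; the paper's O(d^3) result is the analogous statement at the level of the d BLT parameters rather than the n x n matrix.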
    Recent knowledge distillation (KD) research has made significant progress on improving smaller student models to match larger teachers' performance. Two notable methods, supervised KD and on-policy KD, have emerged as the state-of-the-art approaches. However, supervised KD for auto-regressive models suffers from distribution mismatch between training over a fixed dataset and inference over student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples and the teacher's potential inaccuracies in assessing these samples. To address these limitations, we introduce Speculative Knowledge Distillation (SKD). Instead of solely training on teacher- or student-proposed samples, SKD leverages the student model to initially propose tokens following its own generation distribution; the teacher model is then employed to replace tokens that are deemed out-of-distribution. Compared with supervised KD, 1) the samples generated by SKD are more likely to align with the student's inference-time distribution, and 2) SKD can mitigate the generation of low-quality sequences by incorporating the teacher's feedback at each token. Furthermore, we demonstrate that SKD is a generic framework capable of implementing both supervised and on-policy knowledge distillation as specific instances. To validate SKD's effectiveness, we apply it to distill autoregressive large language models for various tasks, including translation, summarization, math, and instruction following. Our experiments consistently demonstrate SKD's superior performance compared to existing methods across different domains, tasks, data sizes, and model initialization strategies.
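The propose-then-replace loop at the heart of this scheme can be shown with toy next-token distributions standing in for real models. The vocabulary, threshold rule, and distributions below are all invented for illustration; they are not the paper's setup.

```python
import random

def skd_step(prefix, student_dist, teacher_dist, threshold, rng):
    """One speculative step: the student proposes a token from its own
    distribution; the teacher replaces it if it looks out-of-distribution."""
    token = rng.choices(list(student_dist),
                        weights=list(student_dist.values()))[0]
    if teacher_dist.get(token, 0.0) < threshold:
        # Replace with the teacher's most likely token (toy policy).
        token = max(teacher_dist, key=teacher_dist.get)
    return prefix + [token]

student = {"cat": 0.1, "dog": 0.9}    # student strongly favors "dog"
teacher = {"cat": 0.8, "dog": 0.05}   # teacher considers "dog" unlikely
rng = random.Random(0)
seq = skd_step([], student, teacher, threshold=0.1, rng=rng)
print(seq)   # ["cat"]: the teacher vetoes "dog" and substitutes "cat"
```

The training sequence stays close to what the student would actually generate at inference time, while each token still carries the teacher's per-token feedback, which is exactly the trade-off the abstract describes.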