Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 10795 publications
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Preview abstract AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization. View details
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv for now (2026) (to appear)
    Preview abstract For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution [1-3]. We combine existing mass-production results with modern approaches for loading classical data using "quantum read-only memory." We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order of magnitude or more for a variety of reasonably-sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step. View details
    Quantum simulation with sum-of-squares spectral amplification
    Robbie King
    Guang Hao Low
    Rolando Somma
    arXiv:2505.01528 (2025)
    Preview abstract We introduce sum-of-squares spectral amplification (SOSSA), a framework for improving quantum simulation algorithms relevant to low-energy problems. SOSSA first represents the Hamiltonian as a sum-of-squares and then applies spectral amplification to amplify the low-energy spectrum. The sum-of-squares representation can be obtained using semidefinite programming. We show that SOSSA can improve the efficiency of traditional methods in several simulation tasks involving low-energy states. Specifically, we provide fast quantum algorithms for energy and phase estimation that improve over the state-of-the-art in both query and gate complexities, complementing recent results on fast time evolution of low-energy states. To further illustrate the power of SOSSA, we apply it to the Sachdev-Ye-Kitaev model, a representative strongly correlated system, where we demonstrate asymptotic speedups by a factor of the square root of the system size. Notably, SOSSA was recently used in [G.H. Low et al., arXiv:2502.15882 (2025)] to achieve state-of-the-art costs for phase estimation of real-world quantum chemistry systems. View details
    Preview abstract Cardinality sketches are compact data structures that efficiently estimate the number of distinct elements across multiple queries while minimizing storage, communication, and computational costs. However, recent research has shown that these sketches can fail under adaptively chosen queries, breaking down after approximately $\tilde{O}(k^2)$ queries, where $k$ is the sketch size. In this work, we overcome this quadratic barrier by designing robust estimators with fine-grained guarantees. Specifically, our constructions can handle an exponential number of adaptive queries, provided that each element participates in at most $\tilde{O}(k^2)$ queries. This effectively shifts the quadratic barrier from the total number of queries to the number of queries sharing the same element, which can be significantly smaller. Beyond cardinality sketches, our approach expands the toolkit for robust algorithm design. View details
    Improved Balanced Classification with Theoretically Grounded Loss Functions
    The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
    Preview abstract The *balanced loss* is a widely adopted objective for multi-class classification under class imbalance. By assigning equal importance to all classes, regardless of their frequency, it promotes fairness and ensures that minority classes are not overlooked. However, directly minimizing the balanced classification loss is typically intractable, which makes the design of effective surrogate losses a central question. This paper introduces and studies two advanced surrogate loss families: Generalized Logit-Adjusted (GLA) loss functions and Generalized Class-Aware weighted (GCA) losses. GLA losses generalize Logit-Adjusted losses, which shift logits based on class priors, to the broader general cross-entropy loss family. GCA loss functions extend the standard class-weighted losses, which scale losses inversely by class frequency, by incorporating class-dependent confidence margins and extending them to the general cross-entropy family. We present a comprehensive theoretical analysis of consistency for both loss families. We show that GLA losses are Bayes-consistent, but only $H$-consistent for unbounded and complete hypothesis sets. Moreover, their $H$-consistency bounds depend inversely on the minimum class probability, scaling at least as $1/\mathsf{p}_{\min}$. In contrast, GCA losses are $H$-consistent for any hypothesis set that is bounded or complete, with $H$-consistency bounds that scale more favorably as $1/\sqrt{\mathsf{p}_{\min}}$, offering significantly stronger theoretical guarantees in imbalanced settings. We report the results of experiments demonstrating that, empirically, both the GCA losses with calibrated class-dependent confidence margins and GLA losses can greatly outperform straightforward class-weighted losses as well as the LA losses. GLA generally performs slightly better in common benchmarks, whereas GCA exhibits a slight edge in highly imbalanced settings. Thus, we advocate for both GLA and GCA losses as principled, theoretically sound, and state-of-the-art surrogates for balanced classification under class imbalance. View details
    Improving simulation-based origin-destination demand calibration using sample segment counts data
    Arwa Alanqary
    Yechen Li
    The 12th Triennial Symposium on Transportation Analysis conference (TRISTAN XII), Okinawa, Japan (2025)
    Preview abstract This paper introduces a novel approach to demand estimation that utilizes partial observations of segment-level track counts. Building on established simulation-based demand estimation methods, we present a modified formulation that integrates sample track counts as a regularization term. This approach effectively addresses the underdetermination challenge in demand estimation, moving beyond the conventional reliance on a prior OD matrix. The proposed formulation aims to preserve the distribution of the observed track counts while optimizing the demand to align with observed path-level travel times. We tested this approach on Seattle's highway network with various congestion levels. Our findings reveal significant enhancements in the solution quality, particularly in accurately recovering ground truth demand patterns at both the OD and segment levels. View details
    Preview abstract Due to the size and complexity of modern large language models (LLMs), it has proven challenging to uncover the underlying mechanisms that models use to solve reasoning problems. For instance, is their reasoning for a specific problem localized to certain parts of the network? Do they break down the reasoning problem into modular components that are then executed as sequential steps as we go deeper in the model? To better understand the reasoning capability of LLMs, we study a minimal propositional logic problem that requires combining multiple facts to arrive at a solution. By studying this problem on Mistral and Gemma models, up to 27B parameters, we illuminate the core components the models use to solve such logic problems. From a mechanistic interpretability point of view, we use causal mediation analysis to uncover the pathways and components of the LLMs' reasoning processes. Then, we offer fine-grained insights into the functions of attention heads in different layers. We not only find a sparse circuit that computes the answer, but we decompose it into sub-circuits that have four distinct and modular uses. Finally, we reveal that three distinct models -- Mistral-7B, Gemma-2-9B and Gemma-2-27B -- contain analogous but not identical mechanisms. View details
    Synthetic Text Generation for Training Large Language Models (LLMs) via Gradient Matching
    Dang Nguyen
    Zeman Li
    Meisam Razaviyayn
    Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML) (2025)
    Preview abstract Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves their privacy. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM. View details
    Pragmatic Fairness: Evaluating ML Fairness Within the Constraints of Industry
    Jessie Smith
    Michael Madaio
    Robin Burke
    Casey Fiesler
    2025
    Preview abstract Machine learning (ML) fairness evaluation in real-world, industry settings presents unique challenges due to business-driven constraints that influence decision-making processes. While prior research has proposed fairness frameworks and evaluation methodologies, these approaches often focus on idealized conditions and may lack consideration for the practical realities faced by industry practitioners. To understand these practical realities, we conducted a semi-structured interview study with 21 experts from academia and industry specializing in ML fairness. Through this study, we explore three constraints of ML fairness evaluation in industry— balancing competing interests, lacking power/access, and getting buy-in—and how these constraints lead to satisficing, seeking satisfactory rather than ideal outcomes. We define the path from these constraints to satisficing as pragmatic fairness. Using recommender systems as a case study, we explore how practitioners navigate these constraints and highlight actionable strategies to improve fairness evaluations within these business-minded boundaries. This paper provides practical insights to guide fairness evaluations in industry while also showcasing how the FAccT community can better align research goals with the operational realities of practitioners. View details
    Amplifying Trans and Nonbinary Voices: A Community-Centred Harm Taxonomy for LLMs
    Eddie Ungless
    Beka Gulotta
    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (2025)
    Preview abstract We explore large language model (LLM) responses that may negatively impact the transgender and nonbinary (TGNB) community and introduce the Transing Transformers Toolkit, T3, which provides resources for identifying such harmful response behaviors. The heart of T3 is a community-centred taxonomy of harms, developed in collaboration with the TGNB community, which we complement with, amongst other guidance, suggested heuristics for evaluation. To develop the taxonomy, we adopted a multi-method approach that included surveys and focus groups with community experts. The contribution highlights the importance of community-centred approaches in mitigating harm, and outlines pathways for LLM developers to improve how their models handle TGNB-related topics. View details
    Private List Learnability vs. Online List Learnability
    Hilla Schefler
    Steve Hanneke
    Iska Tsubari
    Shay Moran
    2025
    Preview abstract This work explores the connection between differential privacy (DP) and online learning in the context of PAC list learning. In this setting, a $k$-list learner outputs a list of $k$ potential predictions for an instance $x$ and incurs a loss if the true label of $x$ is not included in the list. A basic result in the multiclass PAC framework with a finite number of labels states that private learnability is equivalent to online learnability [AlonLMM19, BunLM20, JungKT20]. Perhaps surprisingly, we show that this equivalence does not hold in the context of list learning. Specifically, we prove that, unlike in the multiclass setting, a finite $k$-Littlestone dimension (a variant of the classical Littlestone dimension that characterizes online $k$-list learnability) is not a sufficient condition for DP $k$-list learnability. However, similar to the multiclass case, we prove that it remains a necessary condition. To demonstrate where the equivalence breaks down, we provide an example showing that the class of monotone functions with $k+1$ labels over $\mathbb{N}$ is online $k$-list learnable, but not DP $k$-list learnable. This leads us to introduce a new combinatorial dimension, the $k$-monotone dimension, which serves as a generalization of the threshold dimension. Unlike the multiclass setting, where the Littlestone and threshold dimensions are finite together, for $k>1$, the $k$-Littlestone and $k$-monotone dimensions do not exhibit this relationship. We prove that a finite $k$-monotone dimension is another necessary condition for DP $k$-list learnability, alongside finite $k$-Littlestone dimension. Whether the finiteness of both dimensions implies private $k$-list learnability remains an open question. View details
    Preview abstract Measuring software development can help drive impactful change. However, it’s a complex task, and getting started can be daunting as it involves understanding what you should measure, and determining what you can measure. This article provides a guide to selecting a framework that aligns with organizational measurement strategy. View details
    Correspondence: Wearing a Fur Coat in the Summertime: Should Digital Pathology Redefine Medical Imaging?
    Kenneth Philbrick
    Brian Napora
    John Groth
    Mustafa Yousuf
    Journal of Pathology Informatics (2025)
    Preview abstract In response to recent critiques, members of DICOM Working Group 26 assert that DICOM is the robust and essential standard for digital pathology, actively facilitating interoperability and communication of medical images far beyond simple pixel data. They highlight successful global deployments and collaborations (like the recent Connectathon) demonstrating DICOM's proven ability to integrate WSI scanners, archives, viewers, and AI tools. Despite concerns, DICOM offers flexible metadata encoding, robust security features, and strong industry and regulatory support, making it indispensable for patient care. The authors advocate for continued investment in and adoption of DICOM to advance efficiency, accuracy, and patient safety in integrated healthcare systems. View details
    Preview abstract This IEEE Spectrum article reflects on advocacy for U.S. technological leadership during my Congressional visit through IEEE-USA. I led an expert group of other distinguished IEEE members, and together we urged lawmakers to support critical initiatives. Key priorities included sustained funding for federal research institutions like NIST, NASA, and the NSF, reauthorizing the SBIR/STTR programs vital for small business innovation, and passing the CREATE AI Act to democratize AI resources by establishing the National AI Research Resource (NAIRR). We also emphasized strengthening the STEM talent pipeline through the CHIPS and Science Act and expanding high-skilled immigrant visas. We highlighted rapid AI advancements, such as autonomous vehicles and the surge in FDA-approved AI-based medical devices, as underscoring the need for these strategic investments and policy actions. The article conveys a sense of urgency, calling for concrete congressional action to ensure the U.S. maintains its technological edge while also sharing my personal experiences. View details
    Syntactic and Semantic Gender Biases in the Language on Children’s Television: Evidence from a Corpus of 98 Shows from 1960 to 2018
    Andrea Vial
    Ruyuan Zuo
    Shreya Havaldar
    Morteza Dehghani
    Andrei Cimpian
    Psychological Science (2025)
    Preview abstract Biased media content shapes children’s social concepts and identities. We examined gender bias in a large corpus of scripts from 98 children’s television programs spanning 1960 to 2018 (6,600 episodes, ~2.7 million sentences, ~16 million words). We focused on agency and communion, the fundamental psychological dimensions underlying gender stereotypes. At the syntactic level, words referring to men/boys (vs. women/girls) appear more often in the agent (vs. patient) role. This syntactic bias remained stable between 1960 and 2018. At the semantic level, words referring to men/boys (vs. women/girls) co-occurred more often with words denoting agency. Words denoting communion showed both stereotypical and counterstereotypical associations. Some semantic gender biases have remained unchanged or weakened over time; others have grown. These findings suggest gender stereotypes are built into the core of children’s stories. Whether we are closer to gender equality in children’s media depends on where one looks. View details