Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 11082 publications
A Computer Vision Problem in Flatland
Erin Connelly
Annalisa Crannell
Timothy Duff
Rekha R. Thomas
SIAM Journal on Applied Algebra and Geometry, 10 (2026), pp. 14-45
Preview abstract
When is it possible to project two sets of labeled points of equal cardinality lying in a pair of projective planes to the same image on a projective line? We give a complete answer to this question, obtaining the following results. We first show that such a pair of projections exist if and only if the two point sets are themselves images of a common point set in projective space. Moreover, we find that for generic pairs of point sets, a common projection exists if and only if their cardinality is at most seven. In these cases, we give an explicit description of the loci of projection centers that enable a common image.
View details
Preview abstract
Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL?
In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy.
We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data.
We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL.
Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL.
In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL.
View details
ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding
Sunny Rajagopalan
Alireza Golestaneh
Shubhra Chandra
Min Zhou
Jonathan Vronsky
Songbai Yan
2026
Preview abstract
We present ALF (Advertiser Large Foundation model), a multi-modal transformer architecture for understanding advertiser behavior and intent across text, image, video and structured data modalities. Through contrastive learning and multi-task optimization, ALF creates unified advertiser representations that capture both content and behavioral patterns. Our model achieves state-of-the-art performance on critical tasks including fraud detection, policy violation identification, and advertiser similarity matching. In production deployment, ALF reduces false positives by 90\% while maintaining 99.8\% precision on abuse detection tasks. The architecture's effectiveness stems from its novel combination of multi-modal transformations, intersample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.
View details
CrossCheck: Input Validation for WAN Control Systems
Rishabh Iyer
Isaac Keslassy
Sylvia Ratnasamy
Networked Systems Design and Implementation (NSDI) (2026) (to appear)
Preview abstract
We present CrossCheck, a system that validates inputs to the Software-Defined Networking (SDN) controller in a Wide Area Network (WAN). By detecting incorrect inputs—often stemming from bugs in the SDN control infrastructure—CrossCheck alerts operators before they trigger network outages.
Our analysis at a large-scale WAN operator identifies invalid inputs as a leading cause of major outages, and we show how CrossCheck would have prevented those incidents. We deployed CrossCheck as a shadow validation system for four weeks in a production WAN, during which it accurately detected the single incident of invalid inputs that occurred while sustaining a 0% false positive rate under normal operation, hence imposing little additional burden on operators. In addition, we show through simulation that CrossCheck reliably detects a wide range of invalid inputs (e.g., detecting demand perturbations as small as 5% with 100% accuracy) and maintains a near-zero false positive rate for realistic levels of noisy, missing, or buggy telemetry data (e.g., sustaining zero false positives with up to 30% of corrupted telemetry data).
View details
Preview abstract
AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
View details
Who Controls the Curriculum for AI? The Limits of Participatory Design for Educational AI
Michael Madaio
Learning Under Algorithmic Conditions, University of Minnesota Press (2026)
Preview abstract
Participatory design is a long-standing effort to shift control over technology design from technologists to users and communities impacted by technologies. For educational AI, this means involving students, families, teachers, and other stakeholders in shaping the design of AI systems. While promising, in this article, I situate the recent calls for participatory design of educational AI systems within a different historical tradition—that of contests over local control of educational curricula. I argue that approaches that attempt to steer the design and development of educational AI through participatory methods may inadvertently reproduce the history of political contestation of educational curricula, in ways that may privilege the most powerful communities, rather than those inequitably impacted. What might it look like to treat participatory AI design as a site for political contestation? How might these approaches avoid reproducing the same majoritarian tendencies that led to educational inequities in the first place?
View details
Preview abstract
How many T gates are needed to approximate an arbitrary n-qubit quantum state to within
a given precision ϵ? Improving prior work of Low, Kliuchnikov and Schaeffer, we show that the
optimal asymptotic scaling is Θ(sqrt{2^n log(1/ε)} + log(1/ε)) if we allow an unlimited number of ancilla qubits. We also show that this is the optimal T-count for implementing an arbitrary
diagonal n-qubit unitary to within error ϵ. We describe an application to batched synthesis of
single-qubit unitaries: we can approximate a tensor product of m = O(log log(1/ϵ)) arbitrary
single-qubit unitaries to within error ϵ with the same asymptotic T-count as is required to
approximate just one single-qubit unitary.
View details
Preview abstract
For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using ``quantum read-only memory.'' We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order or magnitude or more for a variety reasonably-sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
View details
Escaping Collapse: The Strength of Weak Data for Large Language Model Training
Sara Babakniya
Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
Preview abstract
Synthetically-generated data plays an increasingly larger role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance to plateau, or even "collapse", after many training iterations. In this paper, we formalize this question and develop a theoretical framework to investigate how much curation is needed in order to ensure that LLM performance continually improves. Our analysis is inspired by boosting, a classic machine
learning technique that leverages a very weak learning algorithm to produce an arbitrarily good classifier. The approach we analyze subsumes many recently proposed methods for training LLMs on synthetic data, and thus our analysis sheds light on why they are successful, and also suggests opportunities for future improvement. We present experiments that validate our theory, and show that dynamically focusing labeling resources on the most challenging examples --- in much the same way that boosting focuses the efforts of the weak learner --- leads to improved performance.
View details
Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li
Maryam Fazel
Samet Oymak
Davoud Ataee Tarzanagh
Conference on Language Modeling (COLM) 2025
Preview abstract
Linear attention methods provide a strong alternative to softmax attention as they allow for efficient recurrent decoding. Recent research has focused on enhancing standard linear attention by incorporating gating while retaining its computational benefits. Such Gated Linear Attention (GLA) architectures include highly competitive models such as Mamba and RWKV. In this work, we examine the in-context learning capabilities of the GLA model and make the following contributions. We show that a multilayer GLA can implement a general class of Weighted Projected Gradient Descent (WPGD) algorithms with data-dependent weights. These weights are induced by the gating and allows the model to control the contribution of individual tokens to prediction. To further understand the mechanics of weighting, we introduce a novel data model with multitask prompts and characterize the optimization landscape of the problem of learning a WPGD algorithm. We identify mild conditions under which there is a unique (global) minimum up to scaling invariance, and the associated WPGD algorithm is unique as well. Finally, we translate these findings to explore the optimization landscape of GLA and
shed light on how gating facilitates context-aware learning and when it is provably better than vanilla linear attention.
View details
Preview abstract
The main goal in distributed symmetry-breaking is to understand the locality of problems;
i.e., the radius of the neighborhood that a node needs to explore in order to arrive at its part
of a global solution. In this work, we study the locality of matching problems in the family
of regular graphs, which is one of the main benchmarks for establishing lower bounds on the
locality of symmetry-breaking problems, as well as for obtaining classification results. Our main
results are summarized as follows:
1. Approximate matching: We develop randomized algorithms to show that (1 + ϵ)-
approximate matching in regular graphs is truly local; i.e., the locality depends only on ϵ
and is independent of all other graph parameters. Furthermore, as long as the degree ∆ is
not very small (namely, as long as ∆ ≥ poly(1/ϵ)), this dependence is only logarithmic in
1/ϵ. This stands in sharp contrast to maximal matching in regular graphs which requires
some dependence on the number of nodes n or the degree ∆. We show matching lower
bounds for both results.
2. Maximal matching: Our techniques further allow us to establish a strong separation
between the node-averaged complexity and worst-case complexity of maximal matching in
regular graphs, by showing that the former is only O(1).
Central to our main technical contribution is a novel martingale-based analysis for the ≈ 40-
year-old algorithm by Luby. In particular, our analysis shows that applying one round of Luby’s
algorithm on the line graph of a ∆-regular graph results in an almost ∆/2-regular graph.
View details
Balls-and-Bins Sampling for DP-SGD
Lynn Chua
Charlie Harrison
Pritish Kamath
Ethan Leeman
Amer Sinha
Chiyuan Zhang
AISTATS (2025)
Preview abstract
We introduce the Balls-and-Bins sampling for differentially private (DP) optimization methods such as DP-SGD. While it has been common practice to use some form of shuffling in DP-SGD implementations, privacy accounting algorithms have typically assumed that Poisson subsampling is used instead. Recent work by Chua et al. (2024) however pointed out that shuffling based DP-SGD can have a much larger privacy cost in practical regime of parameters. We show that the Balls-and-Bins sampling achieves the “best-of-both” samplers, namely, the implementation of Balls-and-Bins sampling is similar to that of Shuffling and models trained with Balls-and-Bins based DP-SGD achieve utility comparable to those trained with Shuffle based DP-SGD at the same noise multiplier, and yet, Balls-and-Bins sampling enjoys similar-or-better privacy amplification as compared to Poisson subsampling.
View details
Mufu: Multilingual Fused Learning for Low- Resource Translation with LLM
Zheng Lim
Honglin Yu
Trevor Cohn
International Conference on Learning Representations (ICLR) 2025
Preview abstract
Multilingual large language models (LLMs) are great translators, but this is largely limited to high-resource languages. For many LLMs, translating in and out of low-resource languages remains a challenging task. To maximize data efficiency in this low-resource setting, we introduce Mufu, which includes a selection of automatically generated multilingual candidates and an instruction to correct inaccurate translations in the prompt. Mufu prompts turn a translation task into a postediting one, and seek to harness the LLM's reasoning capability with auxiliary translation candidates, from which the model is required to assess the input quality, align the semantics cross-lingually, copy from relevant inputs and override instances that are incorrect. Our experiments on En-XX translations over the Flores-200 dataset show LLMs finetuned against Mufu-style prompts are robust to poor quality auxiliary translation candidates, achieving performance superior to NLLB 1.3B distilled model in 64% of low- and very-low-resource language pairs. We then distill these models to reduce inference cost, while maintaining on average 3.1 chrF improvement over finetune-only baseline in low-resource translations.
View details
HueManity: Probing Fine-Grained Visual Perception in MLLMs
Rynaa Grover
Jayant Tamarapalli
Sahiti Yerramilli
Nilay Pande
arXiv preprint arXiv:2506.03194 (2025)
Preview abstract
Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved a 33.6% accuracy on the numeric "easy" task and a striking 3% on the alphanumeric "hard" task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We will open-source HueManity dataset and code to foster further research in improving perceptual robustness of MLLMs.
View details
TOKENFORMER: Rethinking Transformers Scaling with Tokenized Model Parameters
Haiyang Wang
Fan Yue
Jan Eric Lenssen
Liwei Wang
Bernt Schiele
2025
Preview abstract
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on fixed parameters within linear projections, especially when architectural modifications (e.g., channel dimensions) are introduced. Each scaling iteration typically requires retraining the entire model from the beginning, leading to suboptimal utilization of computational resources. To overcome this limitation, we introduce TokenFormer, a naturally scalable architecture that leverages the attention mechanism exclusively for computations among input tokens and interactions between input tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformer with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This innovative approach allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124 million to 1.4 billion parameters by incrementally adding new key-value parameters, achieving performance comparable to models trained from scratch while greatly reducing training costs. Code and models will be publicly available.
View details