Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 1410 publications
    Automatic Synthesis of Specialized Hash Function
    Renato B Hoffmann
    Leonardo G Fae
    Fernando Magno Quintao Pereira
    Dalvan Grieber
    2025
    Preview abstract Hashing is a fundamental operation in various computer sci- ence applications. Despite the prevalence of specific key formats like social security numbers, MAC addresses, plate numbers, and URLs, hashing libraries typically treat them as general byte sequences. This paper introduces a technique for synthesizing specialized hash functions tailored to par- ticular byte formats. The proposed code generation method leverages three prevalent patterns: (i) fixed-length keys, (ii) keys with common subsequences, and (iii) keys ranging on predetermined sequences of bytes. The code generation pro- cess involves two algorithms: one identifies relevant regular expressions within key examples, and the other generates specialized hash functions based on these expressions. This approach, straightforward to implement, showcases improve- ments over highly optimized hash function implementations. Comparative analysis demonstrates that our synthetic func- tions outperform counterparts in the C++ Standard Template Library and the Google Abseil Library, achieving speedups ranging from 2% to 11%, depending on the key format. View details
    Preview abstract We consider the problem of auto-bidding in online advertising from the perspective of a single advertiser. The goal of the advertiser is to maximize their value under the Return-on-Spend (RoS) constraint, with performance measured in terms of \emph{regret} against the optimal offline solution that knows all queries a priori. Importantly, the value of the item is \textit{unknown} to the bidder ahead of time. The goal of the bidder is to quickly identify the optimal bid, while simultaneously satisfying budget and RoS constraints. Using a simple UCB-style algorithm, we provide the first result which achieves optimal regret and constraint violation for this problem. View details
    Preview abstract We consider the Coalition Structure Learning (CSL) problem in multi-agent systems, motivated by the existence of coalitions in many real-world systems, e.g., trading platforms and auction systems. In this problem, there is a hidden coalition structure within a set of $n$ agents, which affects the behavior of the agents in games. Our goal is to actively design a sequence of games for the agents to play, such that observations in these games can be used to learn the hidden coalition structure. In particular, we consider the setting where in each round, we design and present a game together with a strategy profile to the agents, and receive a multiple-bit observation -- for each agent, we observe whether or not they would like to deviate from the specified strategy in this given game. Our contributions are three-fold: First, we show that we can learn the coalition structure in $O(\log n)$ rounds if we are allowed to choose any normal-form game in each round, matching the information-theoretical lower bound, and the result can be extended to congestion games. Second, in a more restricted setting where we can only choose a graphical game with degree limit $d$, we develop an algorithm to learn the coalition structure in $O(n/d+\log d)$ rounds. Third, when we can only learn the coalition structure through running second-price auctions with personalized reserve prices, we show that the coalition structure can be learned in $O(c\log n)$ rounds, where $c$ is the size of the largest coalition. View details
    Improving simulation-based origin-destination demand calibration using sample segment counts data
    Arwa Alanqary
    Yechen Li
    The 12th Triennial Symposium on Transportation Analysis conference (TRISTAN XII), Okinawa, Japan (2025)
    Preview abstract This paper introduces a novel approach to demand estimation that utilizes partial observations of segment-level track counts. Building on established simulation-based demand estimation methods, we present a modified formulation that integrates sample track counts as a regularization term. This approach effectively addresses the underdetermination challenge in demand estimation, moving beyond the conventional reliance on a prior OD matrix. The proposed formulation aims to preserve the distribution of the observed track counts while optimizing the demand to align with observed path-level travel times. We tested this approach on Seattle's highway network with various congestion levels. Our findings reveal significant enhancements in the solution quality, particularly in accurately recovering ground truth demand patterns at both the OD and segment levels. View details
    Preview abstract Users of routing services like Apple Maps, Google Maps, and Waze frequently wonder why a given route is proposed. This question particularly arises when dynamic conditions like traffic and road closures cause unusual routes to be proposed. While many such dynamic conditions may exist in a road network at any time, only a small fraction of those conditions are typically relevant to a given user's route. In this work, we give a simple algorithm that identifies a small set of traffic-laden road segments that answer the following question: Which traffic conditions cause a particular shortest traffic-aware route to differ from the shortest traffic-free route? We theoretically and experimentally show that our algorithm generates small and interpretable answers to this question. View details
    Fast Tensor Completion via Approximate Richardson Iteration
    Mehrdad Ghadiri
    Yunbum Kook
    Ali Jadbabaie
    Proceedings of the 42nd International Conference on Machine Learning (2025)
    Preview abstract We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods, which solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is lost in TC regression problems, making direct extensions unclear. To address this, we propose a lifting approach that approximately solves TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We theoretically analyze the convergence rate of our approximate Richardson iteration based algorithm, and we demonstrate on real-world tensors that its running time can be 100x faster than direct methods for CP completion. View details
    Preview abstract Estimating Origin-Destination (OD) travel demand is vital for effective urban planning and traffic management. Developing universally applicable OD estimation methodologies is significantly challenged by the pervasive scarcity of high-fidelity traffic data and the difficulty in obtaining city-specific prior OD estimates (or seed ODs), which are often prerequisite for traditional approaches. Our proposed method directly estimates OD travel demand by systematically leveraging aggregated, anonymized statistics from Google Maps Traffic Trends, obviating the need for conventional census or city-provided OD data. The OD demand is estimated by formulating a single-level, one-dimensional, continuous nonlinear optimization problem with nonlinear equality and bound constraints to replicate highway path travel times. The method achieves efficiency and scalability by employing a differentiable analytical macroscopic network model. This model by design is computationally lightweight, distinguished by its parsimonious parameterization that requires minimal calibration effort and its capacity for instantaneous evaluation. These attributes ensure the method's broad applicability and practical utility across diverse cities globally. Using segment sensor counts from Los Angeles and San Diego highway networks, we validate our proposed approach, demonstrating a two-thirds to three-quarters improvement in the fit to segment count data over a baseline. Beyond validation, we establish the method's scalability and robust performance in replicating path travel times across diverse highway networks, including Seattle, Orlando, Denver, Philadelphia, and Boston. In these expanded evaluations, our method not only aligns with simulation-based benchmarks but also achieves an average 13% improvement in it's ability to fit travel time data compared to the baseline during afternoon peak hours. View details
    Day-of-the-week Awareness in Time of Day Breakpoints for Traffic Light Plans
    Ori Rottenstreich
    Eliav Buchnik
    Shai Ferster
    Tom Kalvari
    Ron Tsibulsky
    Danny Veikherman
    Jack Haddad
    2025
    Preview abstract Time-of-day breakpoints (TODs) refer to the times over the day in which the plan of a traffic light is changed. Traditionally, TODs are selected jointly for all weekdays (Monday-Friday), typically with additional TODs dedicated to weekends. In this paper, we present an alternative approach motivated by traffic characteristics that can differ among the weekdays Monday-Friday and consider TODs which are day-of-the-week aware. The traffic-aware approach studies similarities among days and computes TODs that can be shared among days with similar characteristics but can also have other forms for weekdays with unique characteristics. Based on traffic properties derived from anonymized trajectories, we apply the new methodology to compute time-of-day breakpoints that are day-of-the-week aware in the city of Rio de Janeiro, Brazil and estimate the impact of the new methodology. View details
    Preview abstract This paper presents a novel framework for optimizing capacitor selection in electronic design using multi-objective linear and non-linear constrained optimization techniques. We demonstrate the effectiveness of this approach in minimizing cost and board area while meeting critical performance requirements. View details
    Preview abstract Julia's strength in mathematical computation and high performance makes it a popular choice across scientific fields, mostly due to its focus on mathematics in a broad sense and execution performance. It is a language of choice to implement new numerical algorithms, but it really shines in modelling for optimisation thanks to JuMP.jl and MathOptInterface.jl. These libraries are, first and foremost, made for mathematical optimisation (linear, mixed-integer, conic, etc.), yet they are now generic enough to support more paradigms, such as constraint programming. This talk will introduce the basic principles behind the current implementation of JuMP.jl and explain why and how they are very good matches for modelling using constraint programming… and solving using any kind of mixed-integer-programming solver. Constraint-programming solvers can also be implemented using linear programming, in a great collaboration between discrete and continuous optimisation. This talk will briefly explain the connection and its implementation in Google’s CP-SAT, a leading, award-winning constraint solver that uses linear programs in its solving process — a solver that will soon be available in Julia too. View details
    Preview abstract We study the problem of learning the optimal item pricing for a unit-demand buyer with independent item values, and we have query access to the underlying value distributions. We consider two common query model in the literature: the “sample complexity” model where we can obtain a sample of each item value, and the “pricing query complexity” model where we can set a price to each item and obtain a binary signal on whether the sampled value of the item is greater than our proposed price. In this work we give nearly tight sample complexity and pricing query complexity of the problem. View details
    Preview abstract We consider the question of learnability of distribution classes in the presence of adaptive adversaries – that is, adversaries capable of inspecting the whole sample requested by a learner and applying their manipulations before passing it on to the learner. We formulate a general notion of learnability with respect to adaptive adversaries, taking into account the budget of the adversary. We show that learnability with respect to additive adaptive adversaries is a strictly stronger condition than learnability with respect to additive oblivious adversaries. View details
    Synthetic Text Generation for Training Large Language Models (LLMs) via Gradient Matching
    Dang Nguyen
    Zeman Li
    Meisam Razaviyayn
    Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML) (2025)
    Preview abstract Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves their privacy. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM. View details
    Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs
    Bowen Tan
    Zheng Xu
    Eric Xing
    Zhiting Hu
    International Conference on Machine Learning (ICML) (2025)
    Preview abstract Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on the manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach. View details
    Preview abstract While prior works focused on the computational separation between Transformers and fully connected networks, here we focus on a purely statistical separation, and ask: What function class can Transformers learn with fewer samples compared to fully connected networks, even with infinite compute? We define a toy task---q-sparse token retrieval---and prove that the sample complexities of transformers that solve this task is smaller than those of feedforward and recurrent neural networks. View details