# Matthew Fahrbach

Matthew is a senior research scientist at Google on the Large-Scale Optimization team, which is part of the broader Algorithms and Optimization group.
He received his PhD in computer science from the Georgia Institute of Technology, where he was advised by Dana Randall. Before that, he studied computer science and mathematics at the University of Kentucky. He is the recipient of a FOCS 2020 Best Paper Award, an NSF Graduate Research Fellowship, and a Barry M. Goldwater Scholarship. His research interests broadly include algorithms, discrete mathematics, machine learning, and optimization.

## Authored Publications

### Practical Performance Guarantees for Pipelined DNN Inference

Kuikui Liu, Prakash Prabhu

Proceedings of the 41st International Conference on Machine Learning (2024), to appear

This work optimizes pipeline parallelism of machine learning model inference by partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage. We design practical algorithms for this NP-complete problem and prove they are nearly optimal in practice by comparing against lower bounds obtained from solving novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound techniques to production models to achieve substantial improvements in the approximation guarantees, compared to simple combinatorial lower bounds. For example, our new MIP formulations improve the lower bounds enough to drop the geometric mean approximation ratio from $2.175$ to $1.082$ across production data with $k=16$ pipeline stages. This work shows that while bottleneck partitioning is theoretically hard, in practice we have a handle on the algorithmic side of the problem, and much of the remaining challenge is in developing more accurate cost models to give to the partitioning algorithms.
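
The abstract above concerns bottleneck partitioning of general computation graphs. As a minimal illustration of the objective only, here is the classic special case of partitioning a *chain* of node costs into $k$ contiguous stages to minimize the bottleneck stage cost, via binary search on the answer (a sketch for intuition, not the paper's algorithm):

```python
def min_bottleneck_chain_partition(costs, k):
    """Split a chain of integer node costs into at most k contiguous stages,
    minimizing the maximum (bottleneck) stage cost, by binary searching on
    the bottleneck value with a greedy feasibility check."""
    def stages_needed(limit):
        count, current = 1, 0
        for c in costs:
            if current + c > limit:
                count += 1          # start a new stage
                current = c
            else:
                current += c
        return count

    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= k:
            hi = mid                # feasible: try a smaller bottleneck
        else:
            lo = mid + 1
    return lo

# Eight ops split into k=3 stages; the optimal bottleneck cost is 6.
best = min_bottleneck_chain_partition([2, 3, 1, 2, 4, 1, 1, 2], 3)
print(best)  # 6
```

Real inference graphs are DAGs rather than chains, which is what makes the problem in the paper NP-complete.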

### PriorBoost: An Adaptive Algorithm for Learning from Aggregate Responses

Adel Javanmard

Proceedings of the 41st International Conference on Machine Learning (2024), to appear

This work studies algorithms for learning from aggregate responses. We focus on the construction of aggregation sets (called *bags* in the literature) for event-level loss functions. We prove for linear regression and generalized linear models (GLMs) that the optimal bagging problem reduces to one-dimensional size-constrained $k$-means clustering. Further, we theoretically quantify the advantage of using curated bags over random bags. We propose the PriorBoost algorithm, which iteratively forms increasingly homogeneous bags with respect to (unseen) individual responses to improve model quality. We also explore label differential privacy for aggregate learning, and we provide extensive experiments that demonstrate that PriorBoost regularly achieves optimal quality, in contrast to non-adaptive algorithms for aggregate learning.
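
The reduction to one-dimensional size-constrained clustering suggests a simple way to form homogeneous bags: sort items by a one-dimensional (e.g., predicted) response and group consecutive items. This hypothetical sketch illustrates only that idea, not the paper's full adaptive PriorBoost loop:

```python
import numpy as np

def homogeneous_bags(predictions, bag_size):
    """Group items into bags that are homogeneous in predicted response:
    sort the one-dimensional predictions and cut the sorted order into
    consecutive, equal-size bags."""
    order = np.argsort(predictions, kind="stable")
    return [order[i:i + bag_size].tolist()
            for i in range(0, len(order), bag_size)]

# Items with similar predictions land in the same bag.
preds = np.array([0.9, 0.1, 0.8, 0.2, 0.85, 0.15])
bags = homogeneous_bags(preds, 3)
print(bags)  # [[1, 5, 3], [2, 4, 0]]
```

For equal-size bags of one-dimensional values, an optimal size-constrained clustering is always contiguous in sorted order, which is why sorting is the natural first step.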

### Approximately Optimal Core Shapes for Tensor Decompositions

Mehrdad Ghadiri

Proceedings of the 40th International Conference on Machine Learning (2023), pp. 11237-11254

This work studies the combinatorial optimization problem of finding an optimal core tensor shape, also called multilinear rank, for a size-constrained Tucker decomposition. We give an algorithm with provable approximation guarantees for its reconstruction error via connections to higher-order singular values. Specifically, we introduce a novel Tucker packing problem, which we prove is NP-hard, and give a polynomial-time approximation scheme based on a reduction to the 2-dimensional knapsack problem with a matroid constraint. We also generalize our techniques to tree tensor network decompositions. We implement our algorithm using an integer programming solver, and show that its solution quality is competitive with (and sometimes better than) the greedy algorithm that uses the true Tucker decomposition loss at each step, while also running up to 1000x faster.

### Learning Rate Schedules in the Presence of Distribution Shift

Adel Javanmard, Pratik Worah

Proceedings of the 40th International Conference on Machine Learning (2023), pp. 9523-9546

We design learning rate schedules that minimize regret for SGD-based online learning in the presence of a changing data distribution. We fully characterize the optimal learning rate schedule for online linear regression via a novel analysis with stochastic differential equations. For general convex loss functions, we propose new learning rate schedules that are robust to distribution shift, and we give upper and lower bounds for the regret that only differ by constants. For non-convex loss functions, we define a notion of regret based on the gradient norm of the estimated models and propose a learning rate schedule that minimizes an upper bound on the total expected regret. Intuitively, one expects changing loss landscapes to require more exploration, and we confirm that optimal learning rate schedules typically increase in the presence of distribution shift. Finally, we provide experiments for high-dimensional regression models and neural networks to illustrate these learning rate schedules and their cumulative regret.

### Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems

Ben Coleman, Ruoxi Wang, Lichan Hong

Advances in Neural Information Processing Systems (2023)

Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, which introduces hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, assume that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, where a single representation space is used for many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations give Pareto-optimal space-accuracy tradeoffs for three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified Embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.
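
As a toy sketch of the feature-multiplexing idea (one shared table for many categorical features), each feature can salt its token's hash so that different features map into a single embedding table. The names and hashing scheme below are illustrative assumptions, not the production design:

```python
import hashlib
import numpy as np

NUM_ROWS, DIM = 1024, 8  # hypothetical shared-table size

def bucket(feature_name, token):
    # Deterministic, feature-salted hash into the shared table.
    key = f"{feature_name}:{token}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % NUM_ROWS

def multiplexed_lookup(table, feature_name, token):
    """Look up a token for a given feature in one shared embedding table."""
    return table[bucket(feature_name, token)]

rng = np.random.default_rng(0)
shared_table = rng.normal(size=(NUM_ROWS, DIM))

# Two different categorical features reuse the same table.
v_country = multiplexed_lookup(shared_table, "country", "US")
v_language = multiplexed_lookup(shared_table, "language", "en")
```

Salting the hash with the feature name is what lets many features share one parameter space while keeping their lookups (mostly) disjoint.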

### Sequential Attention for Feature Selection

Taisuke Yasuda, Lin Chen

Proceedings of the 11th International Conference on Learning Representations (2023)

Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on L1 regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
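
The linear-regression analysis ties Sequential Attention to Orthogonal Matching Pursuit. For reference, a minimal OMP implementation: greedily add the feature most correlated with the current residual, then refit on the selected support.

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: select k features by repeatedly adding
    the column most correlated with the residual, refitting least squares
    on the support after each addition."""
    support, residual = [], y.copy()
    for _ in range(k):
        scores = np.abs(X.T @ residual)
        scores[support] = -np.inf          # never reselect a feature
        support.append(int(np.argmax(scores)))
        beta, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ beta
    return sorted(support)

# Tiny exact-recovery example with orthogonal columns.
X = np.eye(4)
y = np.array([0.0, 3.0, 0.0, -2.0])
support = omp(X, y, 2)
print(support)  # [1, 3]
```

Sequential Attention replaces the correlation scores above with learned attention weights, which is what makes the one-pass neural-network version practical.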

### Subquadratic Kronecker Regression with Applications to Tensor Decomposition

Mehrdad Ghadiri

Proceedings of the 36th Annual Conference on Neural Information Processing Systems (2022), pp. 28776-28789

Kronecker regression is a highly structured least squares problem $\min_{\mathbf{x}} \lVert \mathbf{K}\mathbf{x} - \mathbf{b} \rVert_{2}^2$, where the design matrix $\mathbf{K} = \mathbf{A}^{(1)} \otimes \cdots \otimes \mathbf{A}^{(N)}$ is a Kronecker product of factor matrices. This regression problem arises in each step of the widely used alternating least squares (ALS) algorithm for computing the Tucker decomposition of a tensor. We present the first *subquadratic-time* algorithm for solving Kronecker regression to a $(1+\varepsilon)$-approximation that avoids the exponential term $O(\varepsilon^{-N})$ in the running time. Our techniques combine leverage score sampling and iterative methods. By extending our approach to block-design matrices where one block is a Kronecker product, we also achieve subquadratic-time algorithms for (1) Kronecker ridge regression and (2) updating the factor matrices of a Tucker decomposition in ALS, which is not a pure Kronecker regression problem, thereby improving the running time of all steps of Tucker ALS. We demonstrate the speed and accuracy of this Kronecker regression algorithm on synthetic data and real-world image tensors.
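
For intuition, Kronecker structure already helps in the exact setting: the pseudoinverse factorizes as $(\mathbf{A}^{(1)} \otimes \mathbf{A}^{(2)})^{+} = \mathbf{A}^{(1)+} \otimes \mathbf{A}^{(2)+}$, so the least-squares solution can be computed without ever materializing $\mathbf{K}$. This sketch shows that structured (but still not subquadratic) computation; the paper's contribution is a faster sampling-based method:

```python
import numpy as np

def kron_lstsq(A1, A2, b):
    """Least-squares solve of min_x ||kron(A1, A2) x - b||_2 without forming
    kron(A1, A2), using the identities
        pinv(kron(A1, A2)) = kron(pinv(A1), pinv(A2))
        kron(P1, P2) @ vec(B) = vec(P1 @ B @ P2.T)   (row-major vec)."""
    m1, m2 = A1.shape[0], A2.shape[0]
    P1, P2 = np.linalg.pinv(A1), np.linalg.pinv(A2)
    return (P1 @ b.reshape(m1, m2) @ P2.T).ravel()

rng = np.random.default_rng(1)
A1, A2 = rng.normal(size=(6, 3)), rng.normal(size=(5, 2))
b = rng.normal(size=6 * 5)

x_fast = kron_lstsq(A1, A2, b)
x_direct, *_ = np.linalg.lstsq(np.kron(A1, A2), b, rcond=None)
agrees = np.allclose(x_fast, x_direct)
print(agrees)  # True
```

The structured solve touches only the small factors, whereas the direct solve must build the $30 \times 6$ Kronecker matrix explicitly.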

### Edge-Weighted Online Bipartite Matching

Runzhou Tao, Zhiyi Huang

Journal of the ACM, 69 (2022), 45:1-45:35

Online bipartite matching is one of the most fundamental problems in the online algorithms literature. Karp, Vazirani, and Vazirani (STOC 1990) introduced an elegant algorithm for the unweighted problem that achieves an optimal competitive ratio of 1 - 1/e. Aggarwal et al. (SODA 2011) later generalized their algorithm and analysis to the vertex-weighted case. Little is known, however, about the most general edge-weighted problem aside from the trivial 1/2-competitive greedy algorithm. In this paper, we present the first online algorithm that breaks the long-standing 1/2 barrier and achieves a competitive ratio of at least 0.5086. In light of the hardness result of Kapralov, Post, and Vondrák (SODA 2013) that restricts beating a 1/2 competitive ratio for the more general problem of monotone submodular welfare maximization, our result can be seen as strong evidence that edge-weighted bipartite matching is strictly easier than submodular welfare maximization in the online setting.

The main ingredient in our online matching algorithm is a novel subroutine called online correlated selection (OCS), which takes a sequence of pairs of vertices as input and selects one vertex from each pair. Instead of using a fresh random bit to choose a vertex from each pair, the OCS negatively correlates decisions across different pairs and provides a quantitative measure on the level of correlation. We believe our OCS technique is of independent interest and will find further applications in other online optimization problems.
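
As a toy illustration of negative correlation (not the paper's actual OCS construction), couple consecutive pairs and let the second pair of each couple reuse the flipped random bit of the first; a vertex appearing in the same slot of both coupled pairs is then selected exactly once, whichever way the shared bit lands:

```python
import random

def toy_ocs(pairs, rng=random):
    """Toy sketch of online correlated selection: pairs are coupled two at a
    time, and the second pair of each couple reuses the *flipped* random bit
    of the first, negatively correlating the two selections.
    (Simplified for illustration; not the OCS from the paper.)"""
    selections, bit = [], 0
    for i, pair in enumerate(pairs):
        if i % 2 == 0:
            bit = rng.randrange(2)            # fresh random bit
            selections.append(pair[bit])
        else:
            selections.append(pair[1 - bit])  # flipped bit: anti-correlated
    return selections

# "a" sits in slot 0 of both coupled pairs, so it is selected exactly once.
picks = toy_ocs([("a", "b"), ("a", "c")])
print(picks.count("a"))  # 1
```

With two independent fresh bits, "a" would be missed in both pairs with probability 1/4; the shared flipped bit drives that probability to 0, which is the kind of guarantee the real OCS quantifies.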

### A Fast Minimum Degree Algorithm and Matching Lower Bound

Robert Cummings, Animesh Fatehpuria

Proceedings of the 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (2021), pp. 724-734

The minimum degree algorithm is one of the most widely used heuristics for reducing the cost of solving large sparse systems of linear equations. It has been studied for nearly half a century and has a rich history of bridging techniques from data structures, graph algorithms, and scientific computing. We present a simple but novel combinatorial algorithm for computing an exact minimum degree elimination ordering in $O(nm)$ time. Our approach uses a careful amortized analysis, which also allows us to derive output-sensitive bounds for the running time of $O(\min\{m\sqrt{m^+}, \Delta m^+\} \log n)$, where $m^+$ is the number of unique fill edges and original edges encountered by the algorithm and $\Delta$ is the maximum degree of the input graph.

Furthermore, we show there cannot exist a minimum degree algorithm that runs in $O(nm^{1-\varepsilon})$ time, for any $\varepsilon > 0$, assuming the strong exponential time hypothesis. Our fine-grained reduction uses a new sparse, low-degree graph construction called *$U$-fillers*, which act as pathological inputs and cause any minimum degree algorithm to exhibit nearly worst-case performance.
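
For context, the textbook (non-fast) version of the minimum degree heuristic is easy to state: repeatedly eliminate a vertex of minimum degree and add fill edges turning its neighborhood into a clique. The paper computes the same kind of ordering in $O(nm)$ time; this naive sketch is only for illustration:

```python
def minimum_degree_ordering(adj):
    """Naive minimum-degree elimination: repeatedly eliminate a vertex of
    minimum degree and connect its remaining neighbors into a clique
    (the fill edges). Ties are broken by vertex label."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # defensive copy
    order = []
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))
        nbrs = adj.pop(v)
        order.append(v)
        for u in nbrs:
            adj[u].discard(v)
        for u in nbrs:                 # add fill edges among the neighbors
            for w in nbrs:
                if u != w:
                    adj[u].add(w)
    return order

# Star graph: the degree-1 leaves 1 and 2 are eliminated first.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(minimum_degree_ordering(star))  # [1, 2, 0, 3]
```

The repeated degree updates and fill-edge insertions are exactly where the naive version loses time, and where the paper's amortized analysis gains its bounds.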

### Faster Graph Embeddings via Coarsening

Gramoz Goranci, Richard Peng, Sushant Sachdeva, Chi Wang

Proceedings of the 37th International Conference on Machine Learning (2020), pp. 2953-2963

Graph embeddings are a ubiquitous tool for machine learning tasks on graph-structured data (e.g., node classification and link prediction). Computing embeddings for large-scale graphs, however, is often prohibitively inefficient, even if we are only interested in a small subset of relevant vertices. To address this, we present an efficient graph coarsening algorithm based on Schur complements that only computes the embeddings of the relevant vertices. We prove these embeddings are well approximated by the coarsened graph obtained via Gaussian elimination on the irrelevant vertices. As computing Schur complements can be expensive, we also give a nearly linear time algorithm to generate a coarsened graph on the relevant vertices that provably matches the Schur complement in expectation. In our experiments, we investigate various graph prediction tasks and demonstrate that computing embeddings of the coarsened graphs, rather than the entire graph, leads to significant time and space savings without sacrificing accuracy.
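
The coarsening operation at the heart of the abstract is Gaussian elimination on the graph Laplacian: eliminating the irrelevant vertices leaves the Schur complement onto the relevant ones. A dense sketch (the paper's point is approximating this in nearly linear time rather than computing it exactly):

```python
import numpy as np

def schur_complement(L, keep):
    """Schur complement of a graph Laplacian L onto the vertex set `keep`:
    the Laplacian of the coarsened graph after Gaussian elimination of the
    remaining (irrelevant) vertices."""
    n = L.shape[0]
    keep = list(keep)
    elim = [i for i in range(n) if i not in set(keep)]
    A = L[np.ix_(keep, keep)]
    B = L[np.ix_(keep, elim)]
    C = L[np.ix_(elim, elim)]
    return A - B @ np.linalg.solve(C, B.T)

# Path 0-1-2 with unit-weight edges: eliminating the middle vertex leaves a
# single edge of weight 1/2 (series resistances 1 + 1 = 2, conductance 1/2).
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
S = schur_complement(L, [0, 2])
coarsened_ok = np.allclose(S, [[0.5, -0.5], [-0.5, 0.5]])
```

The dense solve above costs cubic time in the number of eliminated vertices, which is exactly the expense the paper's sampling-based coarsening avoids.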

### Non-monotone Submodular Maximization with Nearly Optimal Adaptivity and Query Complexity

Proceedings of the 36th International Conference on Machine Learning (2019), pp. 1833-1842

Submodular maximization is a general optimization problem with a wide range of applications in machine learning (e.g., active learning, clustering, and feature selection). In large-scale optimization, the parallel running time of an algorithm is governed by its adaptivity, which measures the number of sequential rounds needed if the algorithm can execute polynomially many independent oracle queries in parallel. While low adaptivity is ideal, it is not sufficient for an algorithm to be efficient in practice: there are many applications of distributed submodular optimization where the number of function evaluations becomes prohibitively expensive. Motivated by these applications, we study the adaptivity and query complexity of submodular maximization. In this paper, we give the first constant-factor approximation algorithm for maximizing a non-monotone submodular function subject to a cardinality constraint k that runs in O(log(n)) adaptive rounds and makes O(n log(k)) oracle queries in expectation. In our empirical study, we use three real-world applications to compare our algorithm with several benchmarks for non-monotone submodular maximization. The results demonstrate that our algorithm finds competitive solutions using significantly fewer rounds and queries.

### Submodular Maximization with Nearly Optimal Approximation, Adaptivity and Query Complexity

Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (2019), pp. 255-273

Submodular optimization generalizes many classic problems in combinatorial optimization and has recently found a wide range of applications in machine learning (e.g., feature engineering and active learning). For many large-scale optimization problems, we are often concerned with the adaptivity complexity of an algorithm, which quantifies the number of sequential rounds where polynomially many independent function evaluations can be executed in parallel. While low adaptivity is ideal, it is not sufficient for a distributed algorithm to be efficient, since in many practical applications of submodular optimization the number of function evaluations becomes prohibitively expensive. Motivated by these applications, we study the adaptivity and query complexity of adaptive submodular optimization.

Our main result is a distributed algorithm for maximizing a monotone submodular function with cardinality constraint k that achieves a (1 − 1/e − ε)-approximation in expectation. This algorithm runs in O(log(n)) adaptive rounds and makes O(n) calls to the function evaluation oracle in expectation. The approximation guarantee and query complexity are optimal, and the adaptivity is nearly optimal. Moreover, the number of queries is substantially less than in previous works. We also extend our results to the submodular cover problem to demonstrate the generality of our algorithm and techniques.
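
For contrast with the low-adaptivity algorithm above, the classic sequential greedy (shown here for max coverage, a canonical monotone submodular objective) attains the optimal $(1 - 1/e)$ guarantee but needs $k$ fully adaptive rounds and $O(nk)$ oracle queries:

```python
def greedy_max_coverage(sets, k):
    """Classic sequential greedy for monotone submodular maximization with a
    cardinality constraint, instantiated for max coverage: in each of k
    rounds, pick the set with the largest marginal coverage gain."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_coverage(sets, 2)
print(chosen, covered)  # [2, 0] {1, 2, 3, 4, 5, 6, 7}
```

Each round here depends on the previous round's choice, so the adaptivity is exactly $k$; collapsing this dependence to $O(\log n)$ rounds without losing the approximation ratio is the paper's contribution.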

### Slow Mixing of Glauber Dynamics for the Six-Vertex Model in the Ordered Phases

Dana Randall

Proceedings of the 23rd International Conference on Randomization and Computation (2019), 37:1-37:20

### Nearly Tight Bounds for Sandpile Transience on the Grid

David Durfee, Yu Gao, Tao Xiao

Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (2018), pp. 605-624

### Analyzing Boltzmann Samplers for Bose–Einstein Condensates with Dirichlet Generating Functions

Megan Bernstein, Dana Randall

Proceedings of the 15th Meeting on Analytic Algorithmics and Combinatorics (2018), pp. 107-117

### Graph Sketching Against Adaptive Adversaries Applied to the Minimum Degree Algorithm

Gary L. Miller, Richard Peng, Saurabh Sawlani, Junxing Wang, Shen Chen Xu

Proceedings of the 59th Annual IEEE Symposium on Foundations of Computer Science (2018), pp. 101-112

### Approximately Sampling Elements with Fixed Rank in Graded Posets

Prateek Bhakta, Ben Cousins, Dana Randall

Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (2017), pp. 1828-1838

### Coefficients and Roots of Peak Polynomials