Matthew Fahrbach
Matthew is a Staff Research Scientist at Google in the Algorithms and Optimization group.
He received his PhD in computer science from the Georgia Institute of Technology, where he was advised by Dana Randall. Prior to that, he studied computer science and mathematics at the University of Kentucky. He is the recipient of a FOCS 2020 Best Paper Award, NSF Graduate Research Fellowship, and Barry Goldwater Scholarship. His research interests broadly include algorithms, discrete mathematics, machine learning, and optimization.
Authored Publications
Practical Performance Guarantees for Pipelined DNN Inference
Kuikui Liu
Proceedings of the 41st International Conference on Machine Learning (2024), pp. 1655-1671
This work optimizes pipeline parallelism of machine learning model inference by partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage. We design practical algorithms for this NP-complete problem and prove they are nearly optimal in practice by comparing against lower bounds obtained from solving novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound techniques to production models to achieve substantial improvements in the approximation guarantees, compared to simple combinatorial lower bounds. For example, our new MIP formulations improve the lower bounds enough to drop the geometric mean approximation ratio from $2.175$ to $1.082$ across production data with $k=16$ pipeline stages. This work shows that while bottleneck partitioning is theoretically hard, in practice we have a handle on the algorithmic side of the problem and much of the remaining challenge is in developing more accurate cost models to give to the partitioning algorithms.
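As a simplified illustration of bottleneck partitioning, the sketch below covers only the special case where the computation graph is a linear chain of operations with non-negative integer costs. This case can be solved exactly by binary search, unlike the general NP-complete problem, and it is not the algorithm from the paper. The function name and the example costs are hypothetical.

```python
def min_bottleneck_chain(costs, k):
    """Split a non-empty list `costs` into at most k contiguous stages and
    return the smallest achievable cost of the most expensive stage."""

    def stages_needed(limit):
        # Greedily fill a stage until the next operation would exceed `limit`.
        count, current = 1, 0
        for c in costs:
            if current + c > limit:
                count += 1
                current = c
            else:
                current += c
        return count

    # The answer lies between the largest single cost and the total cost.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= k:
            hi = mid  # `mid` is feasible, try a smaller bottleneck
        else:
            lo = mid + 1  # `mid` is infeasible, the bottleneck must grow
    return lo


print(min_bottleneck_chain([7, 2, 5, 10, 8], 2))
```

With two stages, the best split of these costs is `[7, 2, 5]` and `[10, 8]`, so the bottleneck stage costs 18.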
PriorBoost: An Adaptive Algorithm for Learning from Aggregate Responses
Adel Javanmard
Proceedings of the 41st International Conference on Machine Learning (2024), pp. 21410-21429
This work studies algorithms for learning from aggregate responses. We focus on the construction of aggregation sets (called \emph{bags} in the literature) for event-level loss functions. We prove for linear regression and generalized linear models (GLMs) that the optimal bagging problem reduces to one-dimensional size-constrained $k$-means clustering. Further, we theoretically quantify the advantage of using curated bags over random bags. We propose the \texttt{PriorBoost} algorithm, which iteratively forms increasingly homogeneous bags with respect to (unseen) individual responses to improve model quality. We also explore label differential privacy for aggregate learning, and provide extensive experiments that demonstrate that \texttt{PriorBoost} regularly achieves optimal quality, in contrast to non-adaptive algorithms for aggregate learning.
Sequential Attention for Feature Selection
Taisuke Yasuda
Lin Chen
Proceedings of the 11th International Conference on Learning Representations (2023)
Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on L1 regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
Approximately Optimal Core Shapes for Tensor Decompositions
Mehrdad Ghadiri
Proceedings of the 40th International Conference on Machine Learning (2023), pp. 11237-11254
This work studies the combinatorial optimization problem of finding an optimal core tensor shape, also called multilinear rank, for a size-constrained Tucker decomposition. We give an algorithm with provable approximation guarantees for its reconstruction error via connections to higher-order singular values. Specifically, we introduce a novel Tucker packing problem, which we prove is NP-hard, and give a polynomial-time approximation scheme based on a reduction to the 2-dimensional knapsack problem with a matroid constraint. We also generalize our techniques to tree tensor network decompositions. We implement our algorithm using an integer programming solver, and show that its solution quality is competitive with (and sometimes better than) the greedy algorithm that uses the true Tucker decomposition loss at each step, while also running up to 1000x faster.
Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems
Ben Coleman
Ruoxi Wang
Lichan Hong
Advances in Neural Information Processing Systems (2023), pp. 56234-56255
Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, which introduces hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, make the assumption that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, where one single representation space is used for many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations give Pareto-optimal space-accuracy tradeoffs for three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.
Learning Rate Schedules in the Presence of Distribution Shift
Adel Javanmard
Proceedings of the 40th International Conference on Machine Learning (2023), pp. 9523-9546
We design learning rate schedules that minimize regret for SGD-based online learning in the presence of a changing data distribution. We fully characterize the optimal learning rate schedule for online linear regression via a novel analysis with stochastic differential equations. For general convex loss functions, we propose new learning rate schedules that are robust to distribution shift, and we give upper and lower bounds for the regret that only differ by constants. For non-convex loss functions, we define a notion of regret based on the gradient norm of the estimated models and propose a learning rate schedule that minimizes an upper bound on the total expected regret. Intuitively, one expects changing loss landscapes to require more exploration, and we confirm that optimal learning rate schedules typically increase in the presence of distribution shift. Finally, we provide experiments for high-dimensional regression models and neural networks to illustrate these learning rate schedules and their cumulative regret.
Subquadratic Kronecker Regression with Applications to Tensor Decomposition
Mehrdad Ghadiri
Proceedings of the 36th Annual Conference on Neural Information Processing Systems (2022), pp. 28776-28789
Kronecker regression is a highly structured least squares problem $\min_{\mathbf{x}} \lVert \mathbf{K}\mathbf{x} - \mathbf{b} \rVert_{2}^2$, where the design matrix $\mathbf{K} = \mathbf{A}^{(1)} \otimes \cdots \otimes \mathbf{A}^{(N)}$ is a Kronecker product of factor matrices. This regression problem arises in each step of the widely used alternating least squares (ALS) algorithm for computing the Tucker decomposition of a tensor. We present the first \emph{subquadratic-time} algorithm for solving Kronecker regression to a $(1+\varepsilon)$-approximation that avoids the exponential term $O(\varepsilon^{-N})$ in the running time. Our techniques combine leverage score sampling and iterative methods. By extending our approach to block-design matrices where one block is a Kronecker product, we also achieve subquadratic-time algorithms for (1) Kronecker ridge regression and (2) updating the factor matrices of a Tucker decomposition in ALS, which is not a pure Kronecker regression problem, thereby improving the running time of all steps of Tucker ALS. We demonstrate the speed and accuracy of this Kronecker regression algorithm on synthetic data and real-world image tensors.
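For context, a standard linear algebra identity (not a contribution of the paper) shows how the Kronecker structure can be exploited: the pseudoinverse of a Kronecker product is the Kronecker product of the pseudoinverses. The minimum-norm least squares solution, written here as $\mathbf{x}^{\star}$ (a symbol introduced only for this note), is therefore

$$
\mathbf{x}^{\star} = \mathbf{K}^{+}\mathbf{b} = \left( (\mathbf{A}^{(1)})^{+} \otimes \cdots \otimes (\mathbf{A}^{(N)})^{+} \right) \mathbf{b}.
$$

Each factor pseudoinverse is small and cheap to form, but applying the product exactly still requires reading every entry of $\mathbf{b}$, whose length is the product of the row counts of the factor matrices.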
Edge-Weighted Online Bipartite Matching
Runzhou Tao
Zhiyi Huang
Journal of the ACM, 69 (2022), 45:1-45:35
Online bipartite matching is one of the most fundamental problems in the online algorithms literature. Karp, Vazirani, and Vazirani (STOC 1990) introduced an elegant algorithm for the unweighted problem that achieves an optimal competitive ratio of 1 - 1/e. Aggarwal et al. (SODA 2011) later generalized their algorithm and analysis to the vertex-weighted case. Little is known, however, about the most general edge-weighted problem aside from the trivial 1/2-competitive greedy algorithm. In this paper, we present the first online algorithm that breaks the long-standing 1/2 barrier and achieves a competitive ratio of at least 0.5086. In light of the hardness result of Kapralov, Post, and Vondrák (SODA 2013) that restricts beating a 1/2 competitive ratio for the more general problem of monotone submodular welfare maximization, our result can be seen as strong evidence that edge-weighted bipartite matching is strictly easier than submodular welfare maximization in the online setting.
The main ingredient in our online matching algorithm is a novel subroutine called online correlated selection (OCS), which takes a sequence of pairs of vertices as input and selects one vertex from each pair. Instead of using a fresh random bit to choose a vertex from each pair, the OCS negatively correlates decisions across different pairs and provides a quantitative measure on the level of correlation. We believe our OCS technique is of independent interest and will find further applications in other online optimization problems.
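For reference, the 1/2-competitive greedy baseline mentioned above can be sketched as follows. The sketch assumes the free disposal model, in which an offline vertex may be rematched and only its heaviest edge counts toward the objective; the abstract does not state this model, so treat it as an assumption. The function name and input layout are hypothetical, and this is the baseline, not the OCS-based algorithm.

```python
def greedy_free_disposal(num_offline, arrivals):
    """Greedy baseline for edge-weighted online matching with free disposal.

    `arrivals` is a list with one dict per online vertex, mapping an
    offline vertex index to the weight of the edge between them.
    Returns the total weight kept by the offline vertices.
    """
    best = [0.0] * num_offline  # heaviest edge currently kept by each offline vertex
    for edges in arrivals:
        # Match the arriving vertex where it increases the objective the most.
        gain, target = 0.0, None
        for u, w in edges.items():
            if w - best[u] > gain:
                gain, target = w - best[u], u
        if target is not None:
            best[target] = edges[target]  # the previous edge at `target` is disposed
    return sum(best)


# Greedy may match the first arrival to offline vertex 0, leaving nothing
# useful for the second arrival, while the optimal matching has weight 2.
print(greedy_free_disposal(2, [{0: 1.0, 1: 1.0}, {0: 1.0}]))
```

The example input is a small instance on which this deterministic greedy rule collects exactly half of the optimal weight.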
A Fast Minimum Degree Algorithm and Matching Lower Bound
Robert Cummings
Animesh Fatehpuria
Proceedings of the 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (2021), pp. 724-734
The minimum degree algorithm is one of the most widely used heuristics for reducing the cost of solving large sparse systems of linear equations. It has been studied for nearly half a century and has a rich history of bridging techniques from data structures, graph algorithms, and scientific computing. We present a simple but novel combinatorial algorithm for computing an exact minimum degree elimination ordering in $O(nm)$ time. Our approach uses a careful amortized analysis, which also allows us to derive output-sensitive bounds for the running time of $O(\min\{m\sqrt{m^+}, \Delta m^+\} \log n)$, where $m^+$ is the number of unique fill edges and original edges encountered by the algorithm and $\Delta$ is the maximum degree of the input graph.
Furthermore, we show there cannot exist a minimum degree algorithm that runs in $O(nm^{1-\varepsilon})$ time, for any $\varepsilon > 0$, assuming the strong exponential time hypothesis. Our fine-grained reduction uses a new sparse, low-degree graph construction called \emph{$U$-fillers}, which act as pathological inputs and cause any minimum degree algorithm to exhibit nearly worst-case performance.
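To make the problem concrete, here is a naive reference implementation of an exact minimum degree elimination ordering. It explicitly inserts fill edges and rescans all vertices at every step, so it does not achieve the $O(nm)$ bound described above and is not the paper's algorithm. The function name, the adjacency format, and the tie-breaking rule (smallest vertex label) are assumptions made for this sketch.

```python
def minimum_degree_ordering(adj):
    """Return an exact minimum degree elimination ordering.

    `adj` maps each vertex to the set of its neighbors in an undirected
    graph without self-loops. Vertex labels must be mutually comparable.
    """
    graph = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    order = []
    while graph:
        # Pick a vertex of minimum current degree, breaking ties by label.
        v = min(graph, key=lambda x: (len(graph[x]), x))
        nbrs = graph.pop(v)
        for u in nbrs:
            graph[u].discard(v)
            # Eliminating v turns its neighborhood into a clique (fill edges).
            graph[u] |= nbrs - {u}
        order.append(v)
    return order


# A star with center 0: the leaves are eliminated before the center
# while they have strictly smaller degree.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(minimum_degree_ordering(star))
```

On the star example, eliminating leaves first creates no fill edges, whereas eliminating the center first would connect all three leaves to each other.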
Faster Graph Embeddings via Coarsening
Gramoz Goranci
Richard Peng
Sushant Sachdeva
Chi Wang
Proceedings of the 37th International Conference on Machine Learning (2020), pp. 2953-2963
Graph embeddings are a ubiquitous tool for machine learning tasks on graph-structured data (e.g., node classification and link prediction). Computing embeddings for large-scale graphs, however, is often prohibitively inefficient, even if we are only interested in a small subset of relevant vertices. To address this, we present an efficient graph coarsening algorithm based on Schur complements that only computes the embeddings of the relevant vertices. We prove these embeddings are well approximated by the coarsened graph obtained via Gaussian elimination on the irrelevant vertices. As computing Schur complements can be expensive, we also give a nearly linear time algorithm to generate a coarsened graph on the relevant vertices that provably matches the Schur complement in expectation. In our experiments, we investigate various graph prediction tasks and demonstrate that computing embeddings of the coarsened graphs, rather than the entire graph, leads to significant time and space savings without sacrificing accuracy.
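The Gaussian elimination step mentioned above has a simple combinatorial form on weighted graphs, sketched below. Eliminating a vertex `v` with weighted degree `d` removes `v` and adds weight `w(a, v) * w(v, b) / d` between every pair of its neighbors `a` and `b`, which gives the exact Schur complement of the graph Laplacian onto the remaining vertices. The function name and the dictionary-based graph format are assumptions for illustration; the nearly linear time algorithm in the paper is a different, randomized procedure.

```python
def eliminate_vertex(weights, v):
    """Eliminate vertex `v` from a weighted undirected graph, in place.

    `weights` maps each vertex to a dict of neighbor -> positive edge
    weight and must be symmetric. Returns the same (mutated) dict, which
    then represents the exact Schur complement onto the other vertices.
    """
    nbrs = weights.pop(v)
    degree = sum(nbrs.values())  # weighted degree of v
    for u in nbrs:
        del weights[u][v]
    items = list(nbrs.items())
    for i, (a, wa) in enumerate(items):
        for b, wb in items[i + 1:]:
            # Each pair of former neighbors gains a (possibly new) edge.
            extra = wa * wb / degree
            weights[a][b] = weights[a].get(b, 0.0) + extra
            weights[b][a] = weights[b].get(a, 0.0) + extra
    return weights


# Path 0 - 1 - 2 with unit weights: removing the middle vertex leaves a
# single edge of weight 1/2, matching two unit resistors in series.
path = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0}}
print(eliminate_vertex(path, 1))
```

Repeating this step for every irrelevant vertex yields a coarsened graph on the relevant vertices only, at the cost of potentially many fill edges.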