MohammadHossein Bateni
MohammadHossein Bateni is a staff research scientist at Google, where he is a member of the NYC Algorithms and Optimization Team. He obtained his Ph.D. and M.A. in Computer Science from Princeton University in 2011 and 2008, respectively, after finishing his undergraduate studies with a B.Sc. in Computer Engineering at Sharif University of Technology in 2006.
Hossein is broadly interested in combinatorics and combinatorial optimization. His research focuses on approximation algorithms, distributed computing, and analysis of game-theoretic models.
Authored Publications
SubMix: Learning to Mix Graph Sampling Heuristics
Josh Dillon
Johannes Gasteiger
Sampling subgraphs for training Graph Neural Networks (GNNs) is receiving much attention from the GNN community. While a variety of methods have been proposed, each samples the graph according to its own heuristic, and there has been little work on mixing these heuristics in an end-to-end trainable manner. In this work, we design a generative framework for graph sampling. Our method, SubMix, parameterizes subgraph sampling as a convex combination of heuristics. We show that a continuous relaxation of the discrete sampling process allows us to efficiently obtain analytical gradients for training the sampling parameters. Our experimental results illustrate the usefulness of learning graph sampling in three scenarios: (1) robust training of GNNs by automatically learning to discard noisy edge sources; (2) improving model performance by trainable and online edge-subset selection; and (3) improving the performance of decoupled GNN models on standard benchmarks by integrating our framework into them.
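A minimal sketch of the core idea, parameterizing edge-sampling probabilities as a convex combination of heuristics; the toy heuristics, names, and shapes below are illustrative assumptions, not the paper's code.

```python
# Sketch: mix per-edge scores from several heuristics with softmax
# weights (a convex combination), then sample edges. SubMix relaxes the
# discrete draw continuously to get analytical gradients for theta.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mix_edge_probs(edge_scores, theta):
    """edge_scores: (H, E) matrix, one row of per-edge scores per heuristic.
    theta: (H,) unconstrained mixture logits (trainable)."""
    w = softmax(theta)                  # convex weights, sum to 1
    return np.clip(w @ edge_scores, 0.0, 1.0)

rng = np.random.default_rng(0)
E = 8
# Two toy "heuristics": uniform edge sampling vs. degree-biased sampling.
uniform = np.full(E, 0.5)
degree_biased = rng.uniform(0.1, 0.9, size=E)
scores = np.stack([uniform, degree_biased])

theta = np.zeros(2)                     # start with an even mixture
p = mix_edge_probs(scores, theta)
# Discrete Bernoulli draw; the paper's continuous relaxation of this
# step is what makes the mixture logits trainable end to end.
sampled = rng.random(E) < p
print(p, sampled)
```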
Data Sampling using Locality Sensitive Hashing for Large Scale Graph Learning
Mohamed Farghal
Animesh Nandi
Sarath Shekkizhar
An important step in graph-based data analysis and processing is the construction of similarity graphs. Recent works have focused on the semi-supervised setting to learn an optimal similarity function for constructing a task-optimal graph. However, in many scenarios with billions of data points and trillions of potential edges, the run-time and computational requirements for training the similarity model make these approaches impractical.
In this work, we consider data sampling as a means to overcome this issue. Unlike typical sampling use cases that only seek diversity, learning a similarity function for graph construction requires data samples that are both diverse and representative of highly similar data points. We present an efficient sampling approach based on an adaptive-partition view of locality sensitive hashing. Theoretically, we show that, although the samples obtained are correlated and the sampling probabilities do not sum to one, the training loss estimated for learning the graph similarity model using our approach is unbiased, with a smaller variance than random sampling. Experiments on public datasets demonstrate the superior generalization of similarity models learned via our sampling. In a real large-scale industrial abuse-detection example, we observe an ≈10× increase in identifying abusive items along with a lower false-positive rate compared to the baseline.
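A toy sketch in the spirit of the abstract: hash points into LSH buckets, sample within buckets so dense regions stay represented, and reweight by inverse inclusion probability so the estimated loss is unbiased. The hash family, per-bucket budget, and stand-in loss are all illustrative assumptions.

```python
# Sketch: random-hyperplane LSH buckets + Horvitz-Thompson weighting.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 16))            # toy data points

# Random-hyperplane LSH: sign pattern of a few random projections.
planes = rng.normal(size=(16, 8))
codes = (X @ planes > 0).astype(np.int8)

buckets = {}
for i, key in enumerate(map(tuple, codes)):
    buckets.setdefault(key, []).append(i)

# Sample a fixed number of points per bucket so dense (highly similar)
# regions are represented, not just diverse outliers.
m = 2
sample, weights = [], []
for idx in buckets.values():
    take = rng.choice(idx, size=min(m, len(idx)), replace=False)
    p = min(m, len(idx)) / len(idx)        # inclusion probability
    for i in take:
        sample.append(int(i))
        weights.append(1.0 / p)            # inverse-probability weight

# The weighted sample loss is an unbiased estimate of the full loss.
losses = np.abs(X[sample, 0])              # stand-in per-point loss
estimate = float(np.dot(weights, losses) / len(X))
print(len(sample), estimate)
```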
Tackling Provably Hard Representative Selection via Graph Neural Networks
Transactions on Machine Learning Research (2023)
Representative Selection (RS) is the problem of finding a small subset of exemplars from a dataset that is representative of the dataset. In this paper, we study RS for attributed graphs, and focus on finding representative nodes that optimize the accuracy of a model trained on the selected representatives. Theoretically, we establish a new hardness result for RS (in the absence of a graph structure) by proving that a particular, highly practical variant of it (RS for Learning) is hard to approximate in polynomial time within any reasonable factor, which implies a significant potential gap between the optimum solution of widely-used surrogate functions and the actual accuracy of the model. We then study the setting where a (homophilous) graph structure is available, or can be constructed, between the data points. We show that with an appropriate modeling approach, the presence of such a structure can turn a hard RS (for learning) problem into one that can be effectively solved. To this end, we develop RS-GNN, a representation learning-based RS model based on Graph Neural Networks. Empirically, we demonstrate the effectiveness of RS-GNN on problems with predefined graph structures as well as problems with graphs induced from node feature similarities, by showing that RS-GNN achieves significant improvements over established baselines on a suite of eight benchmarks.
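One plausible reading of the pipeline, as a toy sketch: compute node embeddings with a single message-passing step over the (homophilous) graph, then pick representatives greedily by farthest-point traversal. The real RS-GNN learns the selection end to end; everything below is an illustrative assumption.

```python
# Sketch: GNN-style smoothing followed by greedy representative picking.
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 50, 8, 5
X = rng.normal(size=(n, d))                     # node features
A = (rng.random((n, n)) < 0.1).astype(float)    # toy adjacency
A = np.maximum(A, A.T)

# One message-passing step: average each node with its neighbors.
deg = A.sum(1, keepdims=True) + 1.0
H = (X + A @ X) / deg                            # smoothed embeddings

# Greedy selection on embeddings: each new representative is the node
# farthest from all representatives chosen so far.
reps = [0]
dist = np.linalg.norm(H - H[0], axis=1)
for _ in range(k - 1):
    nxt = int(dist.argmax())
    reps.append(nxt)
    dist = np.minimum(dist, np.linalg.norm(H - H[nxt], axis=1))
print(reps)
```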
Optimal Fully Dynamic k-Center Clustering for Adaptive and Oblivious Adversaries
Monika Henzinger
Andreas Wiese
Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)
Sequential Attention for Feature Selection
Taisuke Yasuda
Lin Chen
Proceedings of the 11th International Conference on Learning Representations (2023)
Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on L1 regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. The algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insight into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
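Since the abstract notes that the linear-regression adaptation of Sequential Attention recovers Orthogonal Matching Pursuit, a minimal OMP sketch conveys the greedy forward-selection intuition; this is the classical algorithm, not the paper's neural-network method.

```python
# Sketch: Orthogonal Matching Pursuit for feature selection.
import numpy as np

def omp(X, y, k):
    """Greedy forward selection: each step adds the feature most
    correlated with the current residual, then refits least squares."""
    selected, residual = [], y.copy()
    for _ in range(k):
        scores = np.abs(X.T @ residual)     # residual value of features
        scores[selected] = -np.inf          # don't re-pick features
        selected.append(int(scores.argmax()))
        XS = X[:, selected]
        beta, *_ = np.linalg.lstsq(XS, y, rcond=None)
        residual = y - XS @ beta
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = 3 * X[:, 4] - 2 * X[:, 11] + 0.1 * rng.normal(size=200)
print(omp(X, y, 2))                         # expect features 4 and 11
```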
Metric clustering is a fundamental primitive in machine learning with several applications to mining massive datasets. An important example of metric clustering is the $k$-center problem. While this problem has been extensively studied in distributed settings, all previous algorithms require $\Omega(k)$ space per machine and $\Omega(nk)$ total work.
In this paper, we develop the first highly scalable approximation algorithm for $k$-center clustering requiring $o(k)$ space per machine and $o(nk)$ total work. In particular, our algorithm needs $\widetilde{O}(n^{\epsilon})$ space per machine and $\widetilde{O}(n^{1+\epsilon})$ total work, and computes an $O(\log \log \log n)$-approximation of the problem by selecting $(1+o(1))k$ centers in $O(\log \log n)$ rounds. This is achieved by introducing core-sets of truly sublinear size.
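The composable-coreset pattern behind such results can be illustrated with a toy sketch: each "machine" summarizes its shard with a few greedy (Gonzalez) centers, and $k$-center is then solved on the union of the summaries alone. The shard count, the $3k$ summary size, and the data are arbitrary assumptions for illustration, not the paper's truly sublinear construction.

```python
# Sketch: per-shard greedy summaries composed into one small coreset.
import numpy as np

def greedy_centers(P, k):
    """Gonzalez farthest-point traversal: the classic 2-approx k-center."""
    centers = [P[0]]
    dist = np.linalg.norm(P - P[0], axis=1)
    for _ in range(k - 1):
        i = int(dist.argmax())
        centers.append(P[i])
        dist = np.minimum(dist, np.linalg.norm(P - P[i], axis=1))
    return np.array(centers)

rng = np.random.default_rng(4)
points = rng.normal(size=(3000, 2))
k, machines = 10, 6

shards = np.array_split(points, machines)          # "per-machine" data
coreset = np.vstack([greedy_centers(s, 3 * k) for s in shards])
centers = greedy_centers(coreset, k)               # solve on the union
print(coreset.shape, centers.shape)
```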
Categorical Feature Compression via Submodular Optimization
Lin Chen
International Conference on Machine Learning (2019), pp. 515-523
In modern machine learning tasks, categorical features with extremely large vocabularies are a reality. They become a bottleneck because the size of an ML model generally grows at least linearly with the vocabulary size, affecting memory, training and inference costs, as well as overfitting risk. In this work, we seek to compress the vocabulary by maximizing the mutual information between the compressed categorical feature and the target binary labels. We note the relationship of this problem to quantization in a discrete memoryless channel, where a quadratic-time algorithm exists to solve the problem. Unfortunately, such an algorithm does not scale to datasets with massive vocabularies, so in this paper we develop a distributed quasi-linear $O(n \log n)$ algorithm with provable approximation guarantees. We first observe that although entropy is a submodular function, the mutual information between a categorical feature and the label is not. To tackle this problem, we define a set function over a different space that still contains the optimal solution, and prove this function is submodular. We also provide a query oracle for the submodular function that runs in amortized logarithmic time and is easy to compute in a distributed fashion. Combining these results with a greedy algorithm yields a $(1-1/e)$-approximation in quasi-linear time. Finally, we compare our algorithm to several existing approaches on the large-scale Criteo learning task, demonstrate better performance in retaining mutual information, and verify the learning performance of the compressed vocabulary.
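A toy, hedged reading of the greedy idea: start with all categories merged into one bucket, then repeatedly split out the category whose own bucket most increases mutual information with the binary label. The paper's actual set function and amortized oracle are more refined; the data and budget below are illustrative assumptions.

```python
# Sketch: greedy vocabulary compression by mutual-information gain.
import numpy as np

def mutual_info(joint):
    """MI of a (buckets x 2) joint-probability table, in nats."""
    px = joint.sum(1, keepdims=True)
    py = joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(5)
V, budget = 50, 8
counts = rng.integers(1, 100, size=(V, 2)).astype(float)  # category x label
counts /= counts.sum()                      # joint distribution

chosen = []                                 # categories given own bucket
for _ in range(budget):
    best_gain, best_c = -1.0, None
    for c in range(V):
        if c in chosen:
            continue
        sel = chosen + [c]
        rest = counts[[i for i in range(V) if i not in sel]].sum(0)
        gain = mutual_info(np.vstack([counts[sel], rest]))
        if gain > best_gain:
            best_gain, best_c = gain, c
    chosen.append(best_c)
print(chosen)
```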
View details
Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems: in some cases, a big graph can be chopped into pieces that fit on one machine and are processed independently before stitching the results together, at the cost of some suboptimality arising from the interaction among the pieces. In other cases, links between different parts show up in the running time and/or network communication cost, hence the desire for a small cut size.
We study a distributed balanced-partitioning problem where the goal is to partition the vertices of a given graph into k pieces so as to minimize the total cut size. Our algorithm is composed of a few steps that are easily implementable in distributed computation frameworks such as MapReduce. The algorithm first embeds nodes of the graph onto a line, and then processes nodes in a distributed manner guided by the linear embedding order. We examine various ways to find the initial embedding, e.g., via hierarchical clustering or Hilbert curves, and then apply four refinement techniques: local swaps, minimum cuts on the boundaries of partitions, contraction, and dynamic programming.
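A minimal sketch of the embed-then-refine pipeline just described, under simplifying assumptions: the stand-in 1-D embedding, graph size, and the swap-only refinement are illustrative, not the paper's distributed implementation.

```python
# Sketch: embed nodes on a line, cut into k balanced chunks, refine
# with balance-preserving local swaps while the cut size improves.
import numpy as np

rng = np.random.default_rng(6)
n, k = 40, 4
A = (rng.random((n, n)) < 0.15).astype(float)
A = np.triu(A, 1); A = A + A.T                   # undirected toy graph

# Step 1: linear embedding (stand-in: sort by a random projection of
# adjacency rows; a real system uses hierarchical clustering or
# Hilbert curves).
order = np.argsort(A @ rng.normal(size=n))

# Step 2: k contiguous, balanced chunks along the line.
part = np.empty(n, dtype=int)
for p, chunk in enumerate(np.array_split(order, k)):
    part[chunk] = p

def cut_size(part):
    return A[part[:, None] != part[None, :]].sum() / 2

# Step 3: local swaps (one of the four refinement techniques).
best = cut_size(part)
improved = True
while improved:
    improved = False
    for u in range(n):
        for v in range(u + 1, n):
            if part[u] == part[v]:
                continue
            part[u], part[v] = part[v], part[u]      # keeps balance
            c = cut_size(part)
            if c < best:
                best, improved = c, True
            else:
                part[u], part[v] = part[v], part[u]  # undo
print(best)
```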
In our empirical study, we compare the above techniques with each other and with previous work in distributed graph algorithms, e.g., a label-propagation method [UB13], FENNEL [TGRV14], and Spinner [MLS14]. We report results on a private map graph and several public social networks, and show that our results beat previous distributed algorithms: for instance, compared to the label-propagation algorithm [UB13], we report an improvement of 15-25% in the cut value. We also observe that our algorithms admit scalable distributed implementations for any number of partitions.
Finally, we explain three applications of this work at Google:
• Balanced partitioning is used to route multi-term queries to different replicas in the Google Search backend in a way that reduces cache miss rates by ≈0.5%, which leads to a double-digit gain in the throughput of production clusters [AAB+19].
• Applied to Google Maps Driving Directions, balanced partitioning divides the world graph into several "shards" for load balancing and minimizes the number of cross-shard queries with the goal of saving CPU usage. Live experiments demonstrate an ≈40% drop in the number of cross-shard queries compared to a standard geography-based method.
• In a job scheduling problem for our data centers, we use balanced partitioning to evenly distribute the work while minimizing communication across geographically distant servers. In fact, the hierarchical nature of our solution fits well with the layering of data center servers, where certain machines are closer to each other and have faster links to one another.
Beating Approximation Factor 2 for Minimum k-way Cut in Planar and Minor-free Graphs
Alireza Farhadi
MohammadTaghi Hajiaghayi
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM (2019), pp. 1055-1068
The k-cut problem asks, given a connected graph G and a positive integer k, to find a minimum-weight set of edges whose removal splits G into k connected components. We give the first polynomial-time algorithm with approximation factor 2−ε (for a constant ε>0) for the k-cut problem in planar and minor-free graphs. Applying more complex techniques, we further improve our method and give a polynomial-time approximation scheme for the k-cut problem in both planar and minor-free graphs. Despite persistent effort, to the best of our knowledge, this is the first improvement over the standard approximation factor of 2 for the k-cut problem in any major class of graphs.
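For context, the "standard approximation factor of 2" comes from the classic greedy-splitting baseline: repeatedly apply a global minimum cut to whichever current component splits most cheaply, until there are k components. The sketch below shows that baseline (not the paper's improved algorithm) on a toy unit-weight instance.

```python
# Sketch: greedy splitting via repeated Stoer-Wagner minimum cuts.
import networkx as nx

def greedy_k_cut(G, k):
    components = [G.copy()]
    removed = 0.0
    while len(components) < k:
        # Cheapest global min cut over all splittable components.
        best = min(
            ((nx.stoer_wagner(C), C) for C in components if len(C) > 1),
            key=lambda t: t[0][0],
        )
        (value, (side_a, side_b)), C = best
        removed += value
        components.remove(C)
        components.append(C.subgraph(side_a).copy())
        components.append(C.subgraph(side_b).copy())
    return removed

G = nx.cycle_graph(6)                     # unit-weight toy instance
nx.set_edge_attributes(G, 1, "weight")
print(greedy_k_cut(G, 3))                 # total cut weight 3 here
```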
Coresets Meet EDCS: Algorithms for Matching and Vertex Cover on Massive Graphs
Aaron Bernstein
Cliff Stein
Sepehr Assadi
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM (2019), pp. 1616-1635
Maximum matching and minimum vertex cover are among the most fundamental graph optimization problems. Recently, randomized composable coresets were introduced as an effective technique for solving these problems in various models of computation on massive graphs. In this technique, one partitions the edges of an input graph randomly into multiple pieces, compresses each piece into a smaller subgraph, namely a coreset, and solves the problem on the union of these coresets to find the final solution. By designing small-size randomized composable coresets, one can obtain efficient algorithms, in a black-box way, in multiple computational models including streaming, distributed communication, and the massively parallel computation (MPC) model.
We develop randomized composable coresets of size $\widetilde{O}(n)$ that for any constant $\epsilon > 0$ give a $(3/2+\epsilon)$-approximation to matching and a $(3+\epsilon)$-approximation to vertex cover. Our coresets improve upon the previously best approximation ratios of $O(1)$ for matching and $O(\log n)$ for vertex cover. Most notably, our result for matching goes beyond a 2-approximation, which is a natural barrier for maximum matching in many models of computation. Our coresets lead to improved algorithms for the simultaneous communication model with randomly partitioned input, the streaming model when the input arrives in a random order, and the MPC model with $\widetilde{O}(n\sqrt{n})$ memory per machine and only two MPC rounds.
Furthermore, inspired by the recent work of Czumaj et al. (arXiv 2017), we study algorithms for matching and vertex cover in the MPC model with only $\widetilde{O}(n)$ memory per machine. Building on our coreset constructions, we develop parallel algorithms that give an $O(1)$-approximation to both matching and vertex cover in only $O(\log \log n)$ MPC rounds and $\widetilde{O}(n)$ memory per machine. We further improve the approximation ratio of our matching algorithm to $(1+\epsilon)$ for any constant $\epsilon > 0$. Our results settle multiple open questions posed by Czumaj et al.
A key technical ingredient of our paper is a novel application of edge degree constrained subgraphs (EDCS) that were previously introduced in the context of maintaining matchings in dynamic graphs. At the heart of our proofs are new structural properties of EDCS that identify these subgraphs as sparse certificates for large matchings and small vertex covers which are quite robust to sampling and composition.
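A minimal sketch of constructing an EDCS under the standard definition from the dynamic-matching literature: every edge $(u,v)$ kept in $H$ satisfies $\deg_H(u)+\deg_H(v) \le \beta$, and every edge left out satisfies $\deg_H(u)+\deg_H(v) \ge \beta-1$. Repeatedly fixing violations terminates; the toy graph and parameter below are illustrative assumptions.

```python
# Sketch: build an edge degree constrained subgraph by local fixing.
from collections import defaultdict

def edcs(edges, beta):
    H = set()
    deg = defaultdict(int)
    changed = True
    while changed:
        changed = False
        for (u, v) in edges:
            s = deg[u] + deg[v]
            if (u, v) in H and s > beta:            # too crowded: drop
                H.remove((u, v)); deg[u] -= 1; deg[v] -= 1
                changed = True
            elif (u, v) not in H and s < beta - 1:  # too sparse: add
                H.add((u, v)); deg[u] += 1; deg[v] += 1
                changed = True
    return H

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]    # toy graph
print(sorted(edcs(edges, beta=3)))
```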