Mohammadhossein Bateni

Mohammadhossein Bateni

MohammadHossein Bateni is a staff research scientist at Google, where he is a member of the NYC Algorithms and Optimization Team. He obtained his Ph.D. and M.A. in Computer Science from Princeton University in 2011 and 2008, respectively, after finishing his undergraduate studies with a B.Sc. in Computer Engineering at Sharif University of Technology in 2006. Hossein is broadly interested in combinatorics and combinatorial optimization. His research focuses on approximation algorithms, distributed computing, and analysis of game-theoretic models.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Sampling subgraphs for training Graph Neural Networks (GNNs) is receiving much attention from the GNN community. While a variety of methods have been proposed, each method samples the graph according to its own heuristic. However, there has been little work in mixing these heuristics in an end-to-end trainable manner. In this work, we design a generative framework for graph sampling. Our method, SubMix, parameterizes subgraph sampling as a convex combination of heuristics. We show that a continuous relaxation of the discrete sampling process allows us to efficiently obtain analytical gradients for training the sampling parameters. Our experimental results illustrate the usefulness of learning graph sampling in three scenarios: (1) robust training of GNNs by automatically learning to discard noisy edge sources; (2) improving model performance by trainable and online edge subset selection; and (3) by integrating our framework into decoupled GNN models improves their performance on standard benchmarks. View details
    Preview abstract An important step in graph-based data analysis and processing is the construction of similarity graphs. Recent works have focused on the semi-supervised setting to learn an optimal similarity function for constructing a task-optimal graph. However, in many scenarios with billions of data points and trillions of potential edges, the run-time and computational requirements for training the similarity model make these approaches impractical. In this work, we consider data sampling as a means to overcome this issue. Unlike typical sampling use-cases which only seek diversity, the similarity-learning for graph construction problem requires data samples that are both diverse and representative of highly similar data points. We present an efficient sampling approach by taking an adaptive partition view of locality sensitive hashing. Theoretically, we show that, though the samples obtained are correlated with sampling probabilities that do not sum to one, the training loss estimated for learning the graph similarity model using our approach is unbiased with a smaller variance compared to random sampling. Experiments on public datasets demonstrate the superior generalization of similarity models learned via our sampling. In a real large-scale industrial abuse-detection example, we observe ≈10× increase in identifying abusive items while having lower false positive rate compared to the baseline. View details
    Preview abstract Representative Selection (RS) is the problem of finding a small subset of exemplars from a dataset that is representative of the dataset. In this paper, we study RS for attributed graphs, and focus on finding representative nodes that optimize the accuracy of a model trained on the selected representatives. Theoretically, we establish a new hardness result for RS (in the absence of a graph structure) by proving that a particular, highly practical variant of it (RS for Learning) is hard to approximate in polynomial time within any reasonable factor, which implies a significant potential gap between the optimum solution of widely-used surrogate functions and the actual accuracy of the model. We then study the setting where a (homophilous) graph structure is available, or can be constructed, between the data points. We show that with an appropriate modeling approach, the presence of such a structure can turn a hard RS (for learning) problem into one that can be effectively solved. To this end, we develop RS-GNN, a representation learning-based RS model based on Graph Neural Networks. Empirically, we demonstrate the effectiveness of RS-GNN on problems with predefined graph structures as well as problems with graphs induced from node feature similarities, by showing that RS-GNN achieves significant improvements over established baselines on a suite of eight benchmarks. View details
    Sequential Attention for Feature Selection
    Taisuke Yasuda
    Lin Chen
    Proceedings of the 11th International Conference on Learning Representations (2023)
    Preview abstract Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on L1 regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations towards the effectiveness of attention and its connections to overparameterization, which may be of independent interest. View details
    Preview abstract Metric clustering is a fundamental primitive in machine learning with several applications for mining massive data-sets. An important example of metric clustering is the $k$-center problem. While this problem has been extensively studied in distributed settings, all previous algorithms require $\Omega(k)$ space per machine and $\Omega(n k)$ total work. In this paper, we develop the first highly scalable approximation algorithm for $k$-center clustering requiring $o(k)$ space per machine with $o(n k)$ total work. In particular, our algorithm needs $\widetilde{O}(n^{\eps})$ space per machine and $\tilde{O}(n^{1+\epsilon})$ total work, and computes an $O(\log \log \log n)$-approximation of the problem by selecting $(1+o(1))k$ centers in $O(\log \log n)$ rounds. This is achieved by introducing core-sets of truly sublinear size. View details
    Preview abstract In modern machine learning tasks, the presence of categorical features with extremely large vocabularies is a reality. This becomes a bottleneck when using an ML model, which generally grows at least linearly with the vocabulary size, affecting the memory, training and inference costs, as well as overfitting risk. In this work, we seek to compress the vocabulary by maximizing the mutual information between the compressed categorical feature and the target binary labels. We note the relationship of this problem to that of quantization in a discrete memoryless channel, where there exists a quadratic-time algorithm to solve the problem. Unfortunately, such an algorithm does not scale to data sets with massive vocabularies and, in this paper, we develop a distributed quasi-linear O(n log n) algorithm with provable approximation guarantees. We first observe that although entropy is a submodular function, this is not the case for mutual information between a categorical feature and label. To tackle this problem, we define a set function over a different space, which still contains the optimal solution, and prove this function is submodular. We also provide a query oracle to the submodular function that runs in amortized logarithmic time, and is easy to compute in a distributed fashion. Combining these results with a greedy algorithm allows us to achieve a (1-1/e)-approximation in quasi-linear time. Finally, we compare our proposed algorithm to several existing approaches using the large-scale Criteo learning task and demonstrate better performance in retaining mutual information and also verify the learning performance of the compressed vocabulary. View details
    Preview abstract Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems, e.g., in some cases, a big graph can be chopped into pieces that fit on one machine to be processed independently before stitching the results together, leading to certain suboptimality from the interaction among different pieces. In other cases, links between different parts may show up in the running time and/or network communications cost, hence the desire to have small cut size. We study a distributed balanced-partitioning problem where the goal is to partition the vertices of a given graph into k pieces so as to minimize the total cut size. Our algorithm is composed of a few steps that are easily implementable in distributed computation frameworks such as MapReduce. The algorithm first embeds nodes of the graph onto a line, and then processes nodes in a distributed manner guided by the linear embedding order. We examine various ways to find the first embedding, e.g., via a hierarchical clustering or Hilbert curves. Then we apply four different techniques including local swaps,minimum cuts on the boundaries of partitions, as well as contraction and dynamic programming. As our empirical study, we compare the above techniques with each other, and also to previous work in distributed graph algorithms, e.g., a label-propagation method [UB13], FENNEL [TGRV14] and Spinner [MLS14]. We report our results both on a private map graph and several public social networks,and show that our results beat previous distributed algorithms: For instance, compared to the label-propagation algorithm [UB13], we report an improvement of 15-25% in the cut value. We also observe that our algorithms admit scalable distributed implementation for any number of partitions. Finally, we explain three applications of this work at Google. •Balanced partitioning is used to route multi-term queries to different replicas in Google Search backend in a way that reduces the cache miss rates by≈0.5%, which leads to a double-digit gain in throughput of production clusters [AAB+19]. •Applied to the Google Maps Driving Directions, balanced partitioning minimizes the number of cross-shard queries with the goal of saving in CPU usage. This system achieves load balancing by dividing the world graph into several “shards.” Live experiments demonstrate an≈40% drop in the number of cross-shard queries when compared to a standard geography-based method. •In a job scheduling problem for our data centers, we use balanced partitioning to evenly distribute the work while minimizing the amount of communication across geographically distant servers. In fact, the hierarchical nature of our solution goes well with the layering of data center servers, where certain machines are closer to each other and have faster links to one another. View details
    Beating Approximation Factor 2 for Minimum k-way Cut in Planar and Minor-free Graphs
    Alireza Farhadi
    MohammadTaghi Hajiaghayi
    Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM (2019), pp. 1055-1068
    Preview abstract The k-cut problem asks, given a connected graph G and a positive integer k, to find a minimum-weight set of edges whose removal splits G into k connected components. We give the first polynomial-time algorithm with approximation factor 2−ε (with constant ε>0) for the k-cut problem in planar and minor-free graphs. Applying more complex techniques, we further improve our method and give a polynomial-time approximation scheme for the k-cut problem in both planar and minor-free graphs. Despite persistent effort, to the best of our knowledge, this is the first improvement for the k-cut problem over standard approximation factor of 2 in any major class of graphs. View details
    Coresets Meet EDCS: Algorithms for Matching and Vertex Cover on Massive Graphs
    Aaron Bernstein
    Cliff Stein
    Sepehr Assadi
    Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM (2019), pp. 1616-1635
    Preview abstract Maximum matching and minimum vertex cover are among the most fundamental graph optimization problems. Recently, randomized composable coresets were introduced as an effective technique for solving these problems in various models of computation on massive graphs. In this technique, one partitions the edges of an input graph randomly into multiple pieces, compresses each piece into a smaller subgraph, namely a coreset, and solves the problem on the union of these coresets to find the final solution. By designing small size randomized composable coresets, one can obtain efficient algorithms, in a black-box way, in multiple computational models including streaming, distributed communication, and the massively parallel computation (MPC) model. We develop randomized composable coresets of size Oe(n) that for any constant ε > 0, give a (3/2 + ε)-approximation to matching and a (3 + ε)-approximation to vertex cover. Our coresets improve upon the previously best approximation ratio of O(1) for matching and O(log n) for vertex cover. Most notably, our result for matching goes beyond a 2-approximation, which is a natural barrier for maximum matching in many models of computation. Our coresets lead to improved algorithms for the simultaneous communication model with randomly partitioned input, the streaming model when the input arrives in a random order, and the MPC model with O~(n√n) memory per machine and only two MPC rounds. Furthermore, inspired by the recent work of Czumaj et al. (arXiv 2017), we study algorithms for matching and vertex cover in the MPC model with only Oe(n) memory per machine. Building on our coreset constructions, we develop parallel algorithms that give an O(1)-approximation to both matching and vertex cover in only O(log log n) MPC rounds and O~(n) memory per machine. We further improve the approximation ratio of our matching algorithm to (1 + ε) for any constant ε > 0. Our results settle multiple open questions posed by Czumaj et al. A key technical ingredient of our paper is a novel application of edge degree constrained subgraphs (EDCS) that were previously introduced in the context of maintaining matchings in dynamic graphs. At the heart of our proofs are new structural properties of EDCS that identify these subgraphs as sparse certificates for large matchings and small vertex covers which are quite robust to sampling and composition. View details