MohammadHossein Bateni

MohammadHossein Bateni is a staff research scientist at Google, where he is a member of the NYC Algorithms and Optimization Team. He obtained his Ph.D. and M.A. in Computer Science from Princeton University in 2011 and 2008, respectively, after finishing his undergraduate studies with a B.Sc. in Computer Engineering at Sharif University of Technology in 2006. Hossein is broadly interested in combinatorics and combinatorial optimization. His research focuses on approximation algorithms, distributed computing, and analysis of game-theoretic models.
Authored Publications
    Representative Selection (RS) is the problem of finding a small subset of exemplars from a dataset that is representative of the dataset. In this paper, we study RS for attributed graphs, and focus on finding representative nodes that optimize the accuracy of a model trained on the selected representatives. Theoretically, we establish a new hardness result for RS (in the absence of a graph structure) by proving that a particular, highly practical variant of it (RS for Learning) is hard to approximate in polynomial time within any reasonable factor, which implies a significant potential gap between the optimum solution of widely-used surrogate functions and the actual accuracy of the model. We then study the setting where a (homophilous) graph structure is available, or can be constructed, between the data points. We show that with an appropriate modeling approach, the presence of such a structure can turn a hard RS (for learning) problem into one that can be effectively solved. To this end, we develop RS-GNN, a representation learning-based RS model based on Graph Neural Networks. Empirically, we demonstrate the effectiveness of RS-GNN on problems with predefined graph structures as well as problems with graphs induced from node feature similarities, by showing that RS-GNN achieves significant improvements over established baselines on a suite of eight benchmarks.
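    RS-GNN itself is a learned model; purely as a point of reference for the exemplar-selection objective, here is a minimal greedy farthest-point selection sketch over node embeddings. The `embeddings` input is assumed to come from some upstream encoder (e.g., a GNN); this is not the paper's algorithm.

```python
import numpy as np

def select_representatives(embeddings: np.ndarray, budget: int) -> list:
    """Greedy farthest-point selection: repeatedly pick the node farthest
    from the exemplars chosen so far, so the selection spreads out over
    the embedding space. A baseline sketch, not the RS-GNN model."""
    selected = [0]  # arbitrary first exemplar
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(dist))  # node farthest from current exemplars
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected
```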
    Sequential Attention for Feature Selection
    Taisuke Yasuda
    Proceedings of the 11th International Conference on Learning Representations (2023)
    Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on L1 regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations towards the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
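    Since the abstract shows that Sequential Attention specializes to Orthogonal Matching Pursuit in the linear-regression setting, that special case makes a compact illustration of the greedy forward-selection idea. A minimal NumPy sketch of OMP, not the neural-network algorithm itself:

```python
import numpy as np

def omp_select(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Orthogonal Matching Pursuit: greedy forward feature selection for
    linear regression, picking one feature per round by its marginal
    (residual) value -- the quantity Sequential Attention approximates
    with attention weights."""
    selected: list = []
    residual = y.astype(float).copy()
    for _ in range(k):
        scores = np.abs(X.T @ residual)  # correlation with current residual
        scores[selected] = -np.inf       # never re-pick a feature
        selected.append(int(np.argmax(scores)))
        # refit on the selected features, then update the residual
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected
```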
    Metric clustering is a fundamental primitive in machine learning with several applications for mining massive data-sets. An important example of metric clustering is the $k$-center problem. While this problem has been extensively studied in distributed settings, all previous algorithms require $\Omega(k)$ space per machine and $\Omega(n k)$ total work. In this paper, we develop the first highly scalable approximation algorithm for $k$-center clustering requiring $o(k)$ space per machine with $o(n k)$ total work. In particular, our algorithm needs $\widetilde{O}(n^{\epsilon})$ space per machine and $\widetilde{O}(n^{1+\epsilon})$ total work, and computes an $O(\log \log \log n)$-approximation of the problem by selecting $(1+o(1))k$ centers in $O(\log \log n)$ rounds. This is achieved by introducing core-sets of truly sublinear size.
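    For orientation only, the generic composable-coreset pattern that this result improves on can be sketched as follows. This naive version keeps $k$ points per machine (so $\Omega(k)$ space, unlike the paper's $o(k)$ guarantee) and uses the classic Gonzalez greedy as the local summarizer:

```python
import numpy as np

def greedy_k_center(points: np.ndarray, k: int) -> np.ndarray:
    """Gonzalez's 2-approximate greedy k-center: repeatedly take the
    point farthest from the centers chosen so far."""
    centers = [points[0]]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        far = int(np.argmax(dist))
        centers.append(points[far])
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return np.stack(centers)

def two_round_k_center(points: np.ndarray, k: int, machines: int) -> np.ndarray:
    """Composable-coreset pattern: each machine summarizes its shard with
    a local k-center solution; one machine then solves on the union."""
    shards = np.array_split(points, machines)  # round 1: local summaries
    coreset = np.concatenate([greedy_k_center(s, k) for s in shards])
    return greedy_k_center(coreset, k)         # round 2: solve on union
```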
    Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems. For example, in some cases a big graph can be chopped into pieces that fit on one machine to be processed independently before stitching the results together, leading to certain suboptimality from the interaction among different pieces. In other cases, links between different parts may show up in the running time and/or network communication cost, hence the desire for a small cut size. We study a distributed balanced-partitioning problem where the goal is to partition the vertices of a given graph into k pieces so as to minimize the total cut size. Our algorithm is composed of a few steps that are easily implementable in distributed computation frameworks such as MapReduce. The algorithm first embeds nodes of the graph onto a line, and then processes nodes in a distributed manner guided by the linear embedding order. We examine various ways to find the first embedding, e.g., via hierarchical clustering or Hilbert curves. Then we apply four different techniques, including local swaps, minimum cuts on the boundaries of partitions, as well as contraction and dynamic programming (see the sketch below). In our empirical study, we compare the above techniques with each other, and also to previous work in distributed graph algorithms, e.g., a label-propagation method [UB13], FENNEL [TGRV14] and Spinner [MLS14]. We report our results both on a private map graph and several public social networks, and show that our results beat previous distributed algorithms: for instance, compared to the label-propagation algorithm [UB13], we report an improvement of 15-25% in the cut value. We also observe that our algorithms admit scalable distributed implementation for any number of partitions. Finally, we explain three applications of this work at Google.
    • Balanced partitioning is used to route multi-term queries to different replicas in the Google Search backend in a way that reduces cache miss rates by ≈0.5%, which leads to a double-digit gain in the throughput of production clusters [AAB+19].
    • Applied to Google Maps Driving Directions, balanced partitioning minimizes the number of cross-shard queries with the goal of saving CPU usage. This system achieves load balancing by dividing the world graph into several "shards." Live experiments demonstrate an ≈40% drop in the number of cross-shard queries when compared to a standard geography-based method.
    • In a job-scheduling problem for our data centers, we use balanced partitioning to evenly distribute the work while minimizing the amount of communication across geographically distant servers. In fact, the hierarchical nature of our solution goes well with the layering of data center servers, where certain machines are closer to each other and have faster links to one another.
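    A minimal sketch of the post-embedding phase, assuming the linear order of the nodes is already computed (e.g., from a Hilbert curve or hierarchical clustering). It shows only the contiguous cut plus the local-swaps refinement, not the minimum-cut, contraction, or dynamic-programming techniques:

```python
def partition_by_order(order: list, k: int) -> dict:
    """Cut a linear embedding of the nodes into k contiguous, equal-size
    pieces; returns a node -> part-index map."""
    size = (len(order) + k - 1) // k
    return {v: i // size for i, v in enumerate(order)}

def local_cut(adj: dict, part: dict, around: tuple) -> int:
    """Cut edges touching the given nodes. An edge between the two nodes
    is counted twice, but consistently, so before/after comparisons are
    still valid."""
    return sum(1 for u in around for v in adj[u] if part[u] != part[v])

def local_swap_pass(adj: dict, part: dict) -> None:
    """One greedy pass of local swaps: exchange the parts of a cut edge's
    endpoints whenever that lowers the cut. Swapping (rather than moving
    a single node) keeps the part sizes balanced. `adj` maps each node to
    the set of its neighbors (assumed symmetric)."""
    for u in adj:
        for v in adj[u]:
            if part[u] == part[v]:
                continue
            before = local_cut(adj, part, (u, v))
            part[u], part[v] = part[v], part[u]
            if local_cut(adj, part, (u, v)) >= before:
                part[u], part[v] = part[v], part[u]  # no improvement: revert
```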
    Coresets Meet EDCS: Algorithms for Matching and Vertex Cover on Massive Graphs
    Aaron Bernstein
    Cliff Stein
    Sepehr Assadi
    Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM (2019), pp. 1616-1635
    Maximum matching and minimum vertex cover are among the most fundamental graph optimization problems. Recently, randomized composable coresets were introduced as an effective technique for solving these problems in various models of computation on massive graphs. In this technique, one partitions the edges of an input graph randomly into multiple pieces, compresses each piece into a smaller subgraph, namely a coreset, and solves the problem on the union of these coresets to find the final solution. By designing small-size randomized composable coresets, one can obtain efficient algorithms, in a black-box way, in multiple computational models including streaming, distributed communication, and the massively parallel computation (MPC) model. We develop randomized composable coresets of size Õ(n) that, for any constant ε > 0, give a (3/2 + ε)-approximation to matching and a (3 + ε)-approximation to vertex cover. Our coresets improve upon the previously best approximation ratio of O(1) for matching and O(log n) for vertex cover. Most notably, our result for matching goes beyond a 2-approximation, which is a natural barrier for maximum matching in many models of computation. Our coresets lead to improved algorithms for the simultaneous communication model with randomly partitioned input, the streaming model when the input arrives in a random order, and the MPC model with Õ(n√n) memory per machine and only two MPC rounds. Furthermore, inspired by the recent work of Czumaj et al. (arXiv 2017), we study algorithms for matching and vertex cover in the MPC model with only Õ(n) memory per machine. Building on our coreset constructions, we develop parallel algorithms that give an O(1)-approximation to both matching and vertex cover in only O(log log n) MPC rounds and Õ(n) memory per machine. We further improve the approximation ratio of our matching algorithm to (1 + ε) for any constant ε > 0. Our results settle multiple open questions posed by Czumaj et al. A key technical ingredient of our paper is a novel application of edge degree constrained subgraphs (EDCS) that were previously introduced in the context of maintaining matchings in dynamic graphs. At the heart of our proofs are new structural properties of EDCS that identify these subgraphs as sparse certificates for large matchings and small vertex covers which are quite robust to sampling and composition.
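    The overall randomized composable coreset recipe is simple to state in code. In the toy sketch below, a greedy maximal matching stands in for the per-piece coreset; the paper's actual coreset is an edge degree constrained subgraph (EDCS) with carefully chosen degree bounds, which this placeholder does not implement:

```python
import random

def greedy_matching(edges: list) -> list:
    """Greedy maximal matching (a 2-approximation), used here both as a
    toy per-piece 'coreset' and as the final solver."""
    matched, matching = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

def coreset_matching(edges: list, machines: int = 4, seed: int = 0) -> list:
    """Randomized composable coreset pattern: randomly partition the
    edges, compress each piece, then solve on the union of the pieces."""
    rng = random.Random(seed)
    pieces = [[] for _ in range(machines)]
    for e in edges:
        pieces[rng.randrange(machines)].append(e)  # random edge partition
    union = [e for p in pieces for e in greedy_matching(p)]  # coresets
    return greedy_matching(union)  # solve on the union of coresets
```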
    Cache-aware load balancing of data center applications
    Aaron Schild
    Ray Yang
    Richard Zhuang
    Proceedings of the VLDB Endowment, 12 (2019), pp. 709-723
    Our deployment of cache-aware load balancing in the Google web search backend reduced cache misses by ~0.5x, contributing to a double-digit percentage increase in the throughput of our serving clusters by relieving a bottleneck. This innovation has benefited all production workloads since 2015, serving billions of queries daily. A load balancer forwards each query to one of several identical serving replicas. The replica pulls each term's postings list into RAM from flash, either locally or over the network. Flash bandwidth is a critical bottleneck, motivating an application-directed RAM cache on each replica. Sending the same term reliably to the same replica would increase the chance it hits cache, and avoid polluting the other replicas' caches. However, most queries contain multiple terms and we have to send the whole query to one replica, so it is not possible to achieve a perfect partitioning of terms to replicas. We solve this via a voting scheme, whereby the load balancer conducts a weighted vote by the terms in each query, and sends the query to the winning replica. We develop a multi-stage scalable algorithm to learn these weights. We first construct a large-scale term-query graph from logs and apply a distributed balanced graph partitioning algorithm to cluster each term to a preferred replica. This yields a good but simplistic initial voting table, which we then iteratively refine via cache simulation to capture feedback effects.
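    A stripped-down sketch of the voting scheme, assuming a `weights` table mapping each term to its preferred replica and vote weight. In the deployed system this table is learned from the term-query graph and refined by cache simulation; the hash fallback for unseen terms is an illustrative assumption of this sketch:

```python
from collections import defaultdict

def route_query(terms: list, weights: dict, num_replicas: int) -> int:
    """Weighted vote: each term votes for its preferred replica with its
    learned weight; the query is sent to the replica winning the vote."""
    votes = defaultdict(float)
    for term in terms:
        # unseen terms fall back to a hash (an assumption of this sketch)
        replica, weight = weights.get(term, (hash(term) % num_replicas, 1.0))
        votes[replica] += weight
    return max(votes, key=votes.get)
```

    For instance, `route_query(["cheap", "flights", "nyc"], weights, 8)` sends the whole query wherever its heaviest terms point, so repeated occurrences of a rare, expensive term tend to land on the same replica's cache.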
    Beating Approximation Factor 2 for Minimum k-way Cut in Planar and Minor-free Graphs
    Alireza Farhadi
    MohammadTaghi Hajiaghayi
    Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM (2019), pp. 1055-1068
    The k-cut problem asks, given a connected graph G and a positive integer k, to find a minimum-weight set of edges whose removal splits G into k connected components. We give the first polynomial-time algorithm with approximation factor 2−ε (with constant ε>0) for the k-cut problem in planar and minor-free graphs. Applying more complex techniques, we further improve our method and give a polynomial-time approximation scheme for the k-cut problem in both planar and minor-free graphs. Despite persistent effort, to the best of our knowledge, this is the first improvement over the standard approximation factor of 2 for the k-cut problem in any major class of graphs.
    In modern machine learning tasks, the presence of categorical features with extremely large vocabularies is a reality. This becomes a bottleneck when using an ML model, which generally grows at least linearly with the vocabulary size, affecting the memory, training and inference costs, as well as overfitting risk. In this work, we seek to compress the vocabulary by maximizing the mutual information between the compressed categorical feature and the target binary labels. We note the relationship of this problem to that of quantization in a discrete memoryless channel, where there exists a quadratic-time algorithm to solve the problem. Unfortunately, such an algorithm does not scale to data sets with massive vocabularies and, in this paper, we develop a distributed quasi-linear O(n log n) algorithm with provable approximation guarantees. We first observe that although entropy is a submodular function, this is not the case for mutual information between a categorical feature and label. To tackle this problem, we define a set function over a different space, which still contains the optimal solution, and prove this function is submodular. We also provide a query oracle to the submodular function that runs in amortized logarithmic time, and is easy to compute in a distributed fashion. Combining these results with a greedy algorithm allows us to achieve a (1-1/e)-approximation in quasi-linear time. Finally, we compare our proposed algorithm to several existing approaches using the large-scale Criteo learning task and demonstrate better performance in retaining mutual information and also verify the learning performance of the compressed vocabulary.
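    For concreteness, the quantity being preserved is easy to compute from a bucket-by-label contingency table. Below is that computation plus a naive posterior-sorted bucketing baseline; the equal-size contiguous buckets are a heuristic for illustration only, not the paper's quasi-linear submodular-greedy algorithm with its (1-1/e) guarantee:

```python
import numpy as np

def mutual_information(counts: np.ndarray) -> float:
    """I(bucket; label) in nats, from a (num_buckets x 2) table of
    co-occurrence counts of feature buckets with the binary label."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)  # bucket marginal
    py = p.sum(axis=0, keepdims=True)  # label marginal
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def compress_by_posterior(counts: np.ndarray, target: int) -> np.ndarray:
    """Heuristic baseline: sort vocabulary items by the empirical
    P(label=1 | item) and merge them into `target` contiguous buckets.
    Assumes every item occurs at least once."""
    order = np.argsort(counts[:, 1] / counts.sum(axis=1))
    groups = np.array_split(counts[order], target)
    return np.stack([g.sum(axis=0) for g in groups])
```

    Comparing `mutual_information(counts)` against `mutual_information(compress_by_posterior(counts, 64))` then measures how much information a 64-bucket compression retains.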
    Fast Algorithms for Knapsack via Convolution and Prediction
    MohammadTaghi Hajiaghayi
    Saeed Seddighin
    Proceedings of the 50th Annual ACM Symposium on the Theory of Computing (STOC) (2018), pp. 1269-1282
    The knapsack problem is a fundamental problem in combinatorial optimization. It has been studied extensively from theoretical as well as practical perspectives as it is one of the most well-known NP-hard problems. The goal is to pack a knapsack of size t with the maximum value from a collection of n items with given sizes and values. Recent evidence suggests that a classic O(nt) dynamic-programming solution for the knapsack problem might be the fastest in the worst case. In fact, solving the knapsack problem was shown to be equivalent to the (min,+) convolution problem (Cygan et al., ICALP 2017), which is thought to be facing a quadratic-time barrier. This hardness is in contrast to the more famous (+,·) convolution (generally known as polynomial multiplication), which has an O(n log n)-time solution via the Fast Fourier Transform. Our main results are algorithms with near-linear running times for the knapsack problem, if either the values or sizes of items are small integers. More specifically, if item sizes are integers bounded by s_max, the running time of our algorithm is Õ((n + t) s_max). If the item values are integers bounded by v_max, our algorithm runs in time Õ(n + t v_max). The best previously known running times were O(nt), O(n^2 s_max) and O(n s_max v_max) (Pisinger, J. of Alg., 1999). At the core of our algorithms lies the prediction technique: roughly speaking, this new technique enables us to compute the convolution of two vectors in time O(n e_max) when an approximation of the solution within an additive error of e_max is available. Our results also have implications for algorithms for several other problems, including tree sparsity, tree separability, and the unbounded knapsack problem, in the case when some of the relevant numerical input values are bounded.
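    For reference, the classic O(nt) dynamic program that the abstract treats as the likely worst-case-optimal baseline is short enough to state directly:

```python
def knapsack_dp(sizes: list, values: list, t: int) -> int:
    """Classic O(n t) dynamic program for 0/1 knapsack: best[c] is the
    maximum value packable within capacity c."""
    best = [0] * (t + 1)
    for s, v in zip(sizes, values):
        for c in range(t, s - 1, -1):  # descending: each item used once
            best[c] = max(best[c], best[c - s] + v)
    return best[t]
```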