Vincent Pierre Cohen-addad

Vincent Pierre Cohen-addad

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We consider the classic Euclidean k-median and k-means objective on insertion-only streams, where the goal is to maintain a (1 + ε)-approximation to the k-median or k-means, while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, merge-and-reduce, bicriteria approximation, sensitivity sampling, etc. Despite this intense effort to obtain smaller and smaller sketches for these problems, all known techniques require storing at least Ω(log (n∆)) words of memory, where n is size of the input and ∆ is the aspect ratio. In this paper, we show how to finally bypass this bound and obtain the first algorithm that achieves a (1 + ε)-approximation to the more general (k, z)-clustering problem, using only ̃O ( dk / ε^2) · (2^{z log z} ) · min ( 1/ε^z , k) · poly(log log(n∆)) words of memory. View details
    Streaming Euclidean MST to a Constant Factor
    Amit Levi
    Erik Waingarten
    Xi Chen
    54rd Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
    Preview abstract We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \R^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+\epsilon)$ approximations are possible in sublinear space. However, for high dimensional space the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC'22], improving on the prior $O(\log^2 n)$ bound due to [Andoni, Indyk, Krauthgamer, SODA '08]. In this paper, we break the logarithmic barrier, and give the first constant factor sublinear space approximation to Euclidean MST. For any $\epsilon\geq 1$, our algorithm achieves an $\tilde{O}(\epsilon^{-2})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space. We complement this by demonstrating that any single pass algorithm which obtains a better than $1.10$-approximation must use $\Omega(\sqrt{n})$ space, demonstrating that $(1+\epsilon)$ approximations are not possible in high-dimensions, and that our algorithm is tight up to a constant. Nevertheless, we demonstrate that $(1+\epsilon)$ approximations are possible in sublinear space with $O(1/\epsilon)$ passes over the stream. More generally, for any $\alpha \geq 2$, we give a $\alpha$-pass streaming algorithm which achieves a $O(\frac{1}{ \alpha \epsilon})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space. All our streaming algorithms are linear sketches, and therefore extend to the massively-parallel computation model (MPC). Thus, our results imply the first $(1+\epsilon)$-approximation in a constant number of rounds in the MPC model. Previously, such a result was only known for low-dimensional space ([Andoni, Nikolov, Onak, Yaroslavtsev, STOC'15]), or either required $O(\log n)$ rounds or suffered a $O(\log n)$ approximation. View details
    Preview abstract We consider the classic correlation clustering problem: Given a complete graphs where edges are labelled either + or -, the goal is to find a partition of the vertices that minimizes the number of the + edges across parts plus the number of the - edges within parts. Recently, Cohen-Addad, Lee and Newman [FOCS’22] showed the first 1.995-approximation for the problem using the Sherali-Adams hierarchy hence breaking through the integrality gap of 2 of the clas- sic linear program and improving upon the 2.06-approximation of Chawla, Makarychev, Schramm and Yaroslavtsev [STOC’15]. We significantly improve the result by providing a 1.73-approximation for the problem, making a significant step below 2. Our approach brings together a new preprocessing of correlation clustering instances that enables a new LP formulation which combined with the approach of Cohen-Addad, Lee and Newman [FOCS’22] and a new rounding method yields the improved bound. View details
    Preview abstract We prove that there is a randomized polynomial-time algorithm that given an edge-weighted graph G excluding a fixed-minor Q on n vertices and an accuracy parameter ε > 0, constructs an edge-weighted graph H and an embedding η : V (G) → V (H) with the following properties: • For any constant size Q, the treewidth of H is polynomial in ε^−1, log n, and the logarithm of the stretch of the distance metric in G. • The expected multiplicative distortion is (1 + ε): for every pair of vertices u, v of G, we have dist_H(η(u), η(v)) ⩾ dist_G(u, v) always and E[distH(η(u), η(v))] ⩽ (1 + ε)dist_G(u, v). Our embedding is the first to achieve polylogarithmic treewidth of the host graph and comes close to the lower bound by Carroll and Goel, who showed that any embedding of a planar graph with O(1) expected distortion requires the host graph to have treewidth Ω(log n). It also provides a unified framework for obtaining randomized quasi-polynomial-time approximation schemes for a variety of problems including network design, clustering or routing problems, in minor-free metrics where the optimization goal is the sum of selected distances. Applications include the capacitated vehicle routing problem, and capacitated clustering problems. View details
    Preview abstract In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedure or coreset construction that avoid an exponential dependency on the input dimension $d$, the precision parameter $\varepsilon^{-1}$ or $k$. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points, $n$. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman WIREs Data Mining Knowl. Discov'20]. Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension are $2.633$ for $k$-median [Ahmadian, Norouzi-Fard, Svensson, Ward, SIAM J. Comput. 20], $9+\varepsilon$ for $k$-means [Kanungo et al. Comput. Geom. 04]. Those are the best results, even when allowing a complexity FPT in the number of clusters $k$: this stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case, when allowing randomization. In this paper, we provide deterministic sketches constructions for clustering, whose size bounds are close to the best-known randomized ones. We show how to compute a dimension reduction onto $\varepsilon^{-O(1)} \log k$ dimensions in time $k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$, and how to build a coreset of size $O\left( k^2 \log^3 k \varepsilon^{-O(1)}\right)$ in time $2^{\eps^{O(1)} k\log^3 k} + k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$. In the case where $k$ is small, this answers an open question of [Feldman WIDM'20] and [Munteanu and Schwiegelshohn, Künstliche Intell. 18] on whether it is possible to efficiently compute coresets deterministically. We also construct a deterministic algorithm for computing $(1+\varepsilon)$-approximation to $k$-median and $k$-means in high dimensional Euclidean spaces in time $2^{k^2 \log^3 k/\varepsilon^{O(1)}} \text{poly}(nd)$, close to the best randomized complexity of $2^{(k/\varepsilon)^{O(1)}} nd$ (see [Kumar, Sabharwal, Sen, JACM 10] and [Bhattacharya, Jaiswal, Kumar, TCS'18]). Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling, that immediately improves over the recent results of [Braverman et al. FOCS 22] by a factor $k$. View details
    Private estimation algorithms for stochastic block models and mixture models
    Hongjie Chen
    Tommaso D'Orsi
    Jacob Imola
    David Steurer
    Stefan Tiegel
    54rd Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
    Preview abstract We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient (ϵ,δ)-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an (ϵ,δ)-differentially private algorithm that recovers the centers of the k-mixture when the minimum separation is at least O(k^{1/t}/√t). For all choices of t, this algorithm requires sample complexity n≥k^O(1)d^O(t) and time complexity (nd)^O(t). Prior work required minimum separation at least O(√k) as well as an explicit upper bound on the Euclidean norm of the centers. View details
    Graph Searching with Predictions
    Anupam Gupta
    Siddhartha Banerjee
    Zhouzi Li
    ITCS'23 (2023) (to appear)
    Preview abstract Consider an agent (say a robot) exploring an unknown graph in search of some goal state. As it walks around the graph, it learns the nodes and their neighbors. The agent only knows where the goal state is when it reaches it. How do we reach this goal while moving only a small distance? This problem seems hopeless, even on trees of bounded degree, unless we give the agent some help. In our case, this help comes in the form of *predictions*: each node $v$ provides a prediction $f(v)$ of its distance to the goal vertex. Naturally if the predictions are correct, we can reach the goal quickly. What if some of the predictions are erroneous? This naturally models graph search problems where we are given some quality function to guide us towards the goal: in our framework the quality is given by the distance predictions. In this work, we initiate a study of this problem with predictions. We consider the problem on trees (where the problem is already non-trivial) and give deterministic realgorithms whose movement cost is only $O(OPT + ERR)$, where $OPT$ is the distance from the start to the goal vertex, and the $ERR$ is the total number of vertices whose predictions are erroneous. We also consider a ``planning'' version of the problem where the graph and predictions are known at the beginning, so the agent can use this global information to quickly reach the goal. For the planning version, we give a different algorithm for trees, and then an algorithm for all (weighted) graphs with bounded doubling dimension. View details
    Preview abstract We study the fine-grained complexity of the famous $k$-center problem in the metric induced by a graph with $n$ vertices and $m$ edges. The problem is NP-hard to approximate within a factor strictly better than $2$, and several $2$-approximation algorithms are known. Two of the most well-known approaches for the $2$-approximation are (1) finding a maximal distance $r$-independent set (where the minimum pairwise distance is greater than $r$) and (2) Gonzalez's algorithm that iteratively adds the center farthest from the currently chosen centers. For the approach based on distance-$r$ independent sets, Thorup [SIAM J. Comput. '05] already gave a nearly linear time algorithm. While Thorup's algorithm is not complicated, it still requires tools such as an approximate oracle for neighborhood size by Cohen [J. Comput. Syst. Sci. '97]. Our main result is a nearly straightforward algorithm that improves the running time by an $O(\log n$) factor. It results in an $(2+\eps)$-approximation for $k$-center in $O((m + n \log n)\log n \log(n/\eps))$ time. For Gonzalez's algorithm [Theor. Comput. Sci. 85], we show that the simple $\widetilde{O}(mk)$-time implementation is nearly optimal if we insist the {\em exact} implementation. On the other hand, we show that an $(1+\eps)$-approximate version of the algorithm is efficiently implementable, leading to an $(2+\eps)$-approximation algorithm in running time $O((m + n \log n)\log^2 n / \eps)$. We also show that, unlike in the distance $r$-independent set-based algorithm, the dependency of $1/\eps$ in the running time is essentially optimal for $(1 + \eps)$-approximate Gonzalez's. View details
    Preview abstract The $k$-Median problem is one the most fundamental clustering problems: Given $n$ points in a metric space and an integer parameter $k$ our goal is to select precisely $k$ points (called centers) so as to minimize the sum of the distances from each point to the closest center. Obtaining better and better approximation algorithms for $k$-Median is a central open problem. Over the years, two main approaches have emerged: the first one is the standard Local Search heuristic, yielding a $(3+\eps)$ approximation for any constant $\eps>0$ which is known to be tight. The second approach is based on a Lagrangian Multiplier Preserving (LMP) $\alpha$ approximation algorithm for the related facility location problem. This algorithm is used to build an $\alpha$ approximate fractional bipoint solution for $k$-Median, which is then rounded via a $\rho$ approximate bipoint rounding algorithm. Altogether this gives an $\alpha\cdot \rho$ approximation. A lot of progress was made on improving $\rho$, from the initial $2$ by Jain and Vazirani [FOCS'99, J.ACM'01], to $1.367$ by Li and Svensson [STOC'13, SICOMP'16], and finally to $1.338$ by Byrka et al. [SODA'15, TALG'17]. However for almost 20 years no progress was made on $\alpha$, where the current best result is the classical LMP $2$ approximation algorithm JMS for facility location by Jain et al. [STOC'02, \fab{J.ACM'03}] based on the Dual-Fitting technique. View details
    A Massively Parallel Modularity-Maximizing Algorithm With Provable Guarantees
    David Saulpic
    Frederik Mallmann-Trenn
    41st ACM Symposium on Principles of Distributed Computing 2022 (2022)
    Preview abstract Graph clustering is one of the most basic and pop- ular unsupervised learning problem. Among the different formulations of the problem, the modu- larity objective has been particularly successful for helping design impactful algorithms; Most notably the Louvain algorithm has become one of the most used algorithm for clustering graphs. Yet, one major limitation of the Louvain algorithm is its sequential nature which makes it impracti- cal in distributive environments and on massive datasets. In this paper, we provide a parallel version of Lou- vain which works in the massively parallel com- putation model (MPC). We show that it achieves optimal cluster recovery in the classic stochastic block model in only a constant number of parallel rounds, and so for the same regime of parameters than the standard Louvain algorithm as shown recently in Cohen-Addad et al. (2020). View details