Vincent Pierre Cohen-addad
Research Areas
Authored Publications
Sort By
Streaming Euclidean k-median and k-means to a (1 + ε)-approximation with o_{k,ε}(log n) Memory Words
Preview abstract
We consider the classic Euclidean k-median and k-means objective on insertion-only streams,
where the goal is to maintain a (1 + ε)-approximation to the k-median or k-means, while using
as little memory as possible. Over the last 20 years, clustering in data streams has received a
tremendous amount of attention and has been the test-bed for a large variety of new techniques,
including coresets, merge-and-reduce, bicriteria approximation, sensitivity sampling, etc. Despite
this intense effort to obtain smaller and smaller sketches for these problems, all known techniques
require storing at least Ω(log (n∆)) words of memory, where n is size of the input and ∆ is
the aspect ratio. In this paper, we show how to finally bypass this bound and obtain the first
algorithm that achieves a (1 + ε)-approximation to the more general (k, z)-clustering problem,
using only ̃O ( dk / ε^2) · (2^{z log z} ) · min ( 1/ε^z , k) · poly(log log(n∆)) words of memory.
View details
Streaming Euclidean MST to a Constant Factor
Amit Levi
Erik Waingarten
Xi Chen
54rd Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
Preview abstract
We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \R^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+\epsilon)$ approximations are possible in sublinear space. However, for high dimensional space the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC'22], improving on the prior $O(\log^2 n)$ bound due to [Andoni, Indyk, Krauthgamer, SODA '08]. In this paper, we break the logarithmic barrier, and give the first constant factor sublinear space approximation to Euclidean MST. For any $\epsilon\geq 1$, our algorithm achieves an $\tilde{O}(\epsilon^{-2})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space.
We complement this by demonstrating that any single pass algorithm which obtains a better than $1.10$-approximation must use $\Omega(\sqrt{n})$ space, demonstrating that $(1+\epsilon)$ approximations are not possible in high-dimensions, and that our algorithm is tight up to a constant. Nevertheless, we demonstrate that $(1+\epsilon)$ approximations are possible in sublinear space with $O(1/\epsilon)$ passes over the stream. More generally, for any $\alpha \geq 2$, we give a $\alpha$-pass streaming algorithm which achieves a $O(\frac{1}{ \alpha \epsilon})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space.
All our streaming algorithms are linear sketches, and therefore extend to the massively-parallel computation model (MPC). Thus, our results imply the first $(1+\epsilon)$-approximation in a constant number of rounds in the MPC model. Previously, such a result was only known for low-dimensional space ([Andoni, Nikolov, Onak, Yaroslavtsev, STOC'15]), or either required $O(\log n)$ rounds or suffered a $O(\log n)$ approximation.
View details
Preview abstract
We consider the classic correlation clustering problem: Given a complete graphs where edges are
labelled either + or -, the goal is to find a partition of the vertices that minimizes the number of the +
edges across parts plus the number of the - edges within parts.
Recently, Cohen-Addad, Lee and Newman [FOCS’22] showed the first 1.995-approximation for the
problem using the Sherali-Adams hierarchy hence breaking through the integrality gap of 2 of the clas-
sic linear program and improving upon the 2.06-approximation of Chawla, Makarychev, Schramm and
Yaroslavtsev [STOC’15].
We significantly improve the result by providing a 1.73-approximation for the problem, making a
significant step below 2. Our approach brings together a new preprocessing of correlation clustering
instances that enables a new LP formulation which combined with the approach of Cohen-Addad, Lee
and Newman [FOCS’22] and a new rounding method yields the improved bound.
View details
Preview abstract
We prove that there is a randomized polynomial-time algorithm that given an edge-weighted graph G
excluding a fixed-minor Q on n vertices and an accuracy parameter ε > 0, constructs an edge-weighted
graph H and an embedding η : V (G) → V (H) with the following properties:
• For any constant size Q, the treewidth of H is polynomial in ε^−1, log n, and the logarithm of the stretch of the distance metric in G.
• The expected multiplicative distortion is (1 + ε): for every pair of vertices u, v of G, we have
dist_H(η(u), η(v)) ⩾ dist_G(u, v) always and E[distH(η(u), η(v))] ⩽ (1 + ε)dist_G(u, v).
Our embedding is the first to achieve polylogarithmic treewidth of the host graph and comes close
to the lower bound by Carroll and Goel, who showed that any embedding of a planar graph with
O(1) expected distortion requires the host graph to have treewidth Ω(log n). It also provides a unified
framework for obtaining randomized quasi-polynomial-time approximation schemes for a variety of
problems including network design, clustering or routing problems, in minor-free metrics where the
optimization goal is the sum of selected distances. Applications include the capacitated vehicle routing
problem, and capacitated clustering problems.
View details
Preview abstract
In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role.
For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedure or coreset construction that avoid an exponential dependency on the input dimension $d$, the precision parameter $\varepsilon^{-1}$ or $k$. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points, $n$.
This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman WIREs Data Mining Knowl. Discov'20].
Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension are $2.633$ for $k$-median [Ahmadian, Norouzi-Fard, Svensson, Ward, SIAM J. Comput. 20], $9+\varepsilon$ for $k$-means [Kanungo et al. Comput. Geom. 04]. Those are the best results, even when allowing a complexity FPT in the number of clusters $k$: this stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case, when allowing randomization.
In this paper, we provide deterministic sketches constructions for clustering, whose size bounds are close to the best-known randomized ones. We show how to compute a dimension reduction onto $\varepsilon^{-O(1)} \log k$ dimensions in time $k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$, and how to build a coreset of size $O\left( k^2 \log^3 k \varepsilon^{-O(1)}\right)$ in time $2^{\eps^{O(1)} k\log^3 k} + k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$. In the case where $k$ is small, this answers an open question of [Feldman WIDM'20] and [Munteanu and Schwiegelshohn, Künstliche Intell. 18] on whether it is possible to efficiently compute coresets deterministically.
We also construct a deterministic algorithm for computing $(1+\varepsilon)$-approximation to $k$-median and $k$-means in high dimensional Euclidean spaces in time $2^{k^2 \log^3 k/\varepsilon^{O(1)}} \text{poly}(nd)$, close to the best randomized complexity of $2^{(k/\varepsilon)^{O(1)}} nd$ (see [Kumar, Sabharwal, Sen, JACM 10] and [Bhattacharya, Jaiswal, Kumar, TCS'18]).
Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling, that immediately improves over the recent results of [Braverman et al. FOCS 22] by a factor $k$.
View details
Private estimation algorithms for stochastic block models and mixture models
Hongjie Chen
Tommaso D'Orsi
Jacob Imola
David Steurer
Stefan Tiegel
54rd Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
Preview abstract
We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient (ϵ,δ)-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an (ϵ,δ)-differentially private algorithm that recovers the centers of the k-mixture when the minimum separation is at least O(k^{1/t}/√t). For all choices of t, this algorithm requires sample complexity n≥k^O(1)d^O(t) and time complexity (nd)^O(t). Prior work required minimum separation at least O(√k) as well as an explicit upper bound on the Euclidean norm of the centers.
View details
Preview abstract
Consider an agent (say a robot) exploring an unknown graph in search
of some goal state. As it walks around the graph, it learns the
nodes and their neighbors. The agent only knows where the goal state
is when it reaches it. How do we reach this goal while moving only a
small distance? This problem seems hopeless, even on trees of
bounded degree, unless we give the agent some help. In our case,
this help comes in the form of *predictions*: each node $v$
provides a prediction $f(v)$ of its distance to the goal
vertex. Naturally if the predictions are correct, we can reach the
goal quickly. What if some of the predictions are erroneous? This
naturally models graph search problems where we are given some
quality function to guide us towards the goal: in our framework the
quality is given by the distance predictions.
In this work, we initiate a study of this problem with
predictions. We consider the problem on trees (where the problem is
already non-trivial) and give deterministic realgorithms whose
movement cost is only $O(OPT + ERR)$, where $OPT$ is the distance
from the start to the goal vertex, and the $ERR$ is the total number
of vertices whose predictions are erroneous. We also consider a
``planning'' version of the problem where the graph and predictions
are known at the beginning, so the agent can use this global
information to quickly reach the goal. For the planning version, we
give a different algorithm for trees, and then an algorithm for all
(weighted) graphs with bounded doubling dimension.
View details
Preview abstract
We study the fine-grained complexity of the famous $k$-center problem in the metric induced by a graph with $n$ vertices and $m$ edges. The problem is NP-hard to approximate within a factor strictly better than $2$, and several $2$-approximation algorithms are known.
Two of the most well-known approaches for the $2$-approximation are (1) finding a maximal distance $r$-independent set (where the minimum pairwise distance is greater than $r$) and (2) Gonzalez's algorithm that iteratively adds the center farthest from the currently chosen centers.
For the approach based on distance-$r$ independent sets, Thorup [SIAM J. Comput. '05] already gave a nearly linear time algorithm. While Thorup's algorithm is not complicated, it still requires tools such as an approximate oracle for neighborhood size by Cohen [J. Comput. Syst. Sci. '97]. Our main result is a nearly straightforward algorithm that improves the running time by an $O(\log n$) factor. It results in an $(2+\eps)$-approximation for $k$-center in $O((m + n \log n)\log n \log(n/\eps))$ time.
For Gonzalez's algorithm [Theor. Comput. Sci. 85], we show that the simple $\widetilde{O}(mk)$-time implementation is nearly optimal if we insist the {\em exact} implementation. On the other hand, we show that an $(1+\eps)$-approximate version of the algorithm is efficiently implementable, leading to an $(2+\eps)$-approximation algorithm in running time $O((m + n \log n)\log^2 n / \eps)$. We also show that, unlike in the distance $r$-independent set-based algorithm, the dependency of $1/\eps$ in the running time is essentially optimal for $(1 + \eps)$-approximate Gonzalez's.
View details
Breaching the 2 LMP Approximation Barrier for Facility Location with Applications to k-Median
Chris Schwiegelshohn
Euiwoong Lee
Fabrizio Grandoni
SODA'23 (2023) (to appear)
Preview abstract
The $k$-Median problem is one the most fundamental clustering problems: Given $n$ points in a metric space and an integer parameter $k$ our goal is to select precisely $k$ points (called centers) so as to minimize the sum of the distances from each point to the closest center. Obtaining better and better approximation algorithms for $k$-Median
is a central open problem.
Over the years, two main approaches have emerged: the first one is the standard Local Search heuristic, yielding a $(3+\eps)$ approximation for any constant $\eps>0$ which is known to be tight.
The second approach is based on a Lagrangian Multiplier
Preserving (LMP) $\alpha$ approximation algorithm for the related facility location problem. This algorithm is used to build an $\alpha$ approximate fractional bipoint solution for $k$-Median, which is then rounded via a $\rho$ approximate bipoint rounding algorithm. Altogether this gives an $\alpha\cdot \rho$ approximation. A lot of progress was made on improving $\rho$, from the initial $2$ by Jain and Vazirani [FOCS'99, J.ACM'01], to $1.367$ by Li and Svensson [STOC'13, SICOMP'16], and finally to $1.338$ by Byrka et al. [SODA'15, TALG'17]. However for almost 20 years no progress was made on $\alpha$, where the current best result is the classical LMP $2$ approximation algorithm JMS for facility location by Jain et al. [STOC'02, \fab{J.ACM'03}] based on the Dual-Fitting technique.
View details
A Massively Parallel Modularity-Maximizing Algorithm With Provable Guarantees
David Saulpic
Frederik Mallmann-Trenn
41st ACM Symposium on Principles of Distributed Computing 2022 (2022)
Preview abstract
Graph clustering is one of the most basic and pop-
ular unsupervised learning problem. Among the
different formulations of the problem, the modu-
larity objective has been particularly successful
for helping design impactful algorithms; Most
notably the Louvain algorithm has become one
of the most used algorithm for clustering graphs.
Yet, one major limitation of the Louvain algorithm
is its sequential nature which makes it impracti-
cal in distributive environments and on massive
datasets.
In this paper, we provide a parallel version of Lou-
vain which works in the massively parallel com-
putation model (MPC). We show that it achieves
optimal cluster recovery in the classic stochastic
block model in only a constant number of parallel
rounds, and so for the same regime of parameters
than the standard Louvain algorithm as shown
recently in Cohen-Addad et al. (2020).
View details