Vincent Pierre Cohen-addad
Research Areas
Authored Publications
Google Publications
Other Publications
Sort By
Preview abstract
In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role.
For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedure or coreset construction that avoid an exponential dependency on the input dimension $d$, the precision parameter $\varepsilon^{-1}$ or $k$. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points, $n$.
This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman WIREs Data Mining Knowl. Discov'20].
Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension are $2.633$ for $k$-median [Ahmadian, Norouzi-Fard, Svensson, Ward, SIAM J. Comput. 20], $9+\varepsilon$ for $k$-means [Kanungo et al. Comput. Geom. 04]. Those are the best results, even when allowing a complexity FPT in the number of clusters $k$: this stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case, when allowing randomization.
In this paper, we provide deterministic sketches constructions for clustering, whose size bounds are close to the best-known randomized ones. We show how to compute a dimension reduction onto $\varepsilon^{-O(1)} \log k$ dimensions in time $k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$, and how to build a coreset of size $O\left( k^2 \log^3 k \varepsilon^{-O(1)}\right)$ in time $2^{\eps^{O(1)} k\log^3 k} + k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$. In the case where $k$ is small, this answers an open question of [Feldman WIDM'20] and [Munteanu and Schwiegelshohn, Künstliche Intell. 18] on whether it is possible to efficiently compute coresets deterministically.
We also construct a deterministic algorithm for computing $(1+\varepsilon)$-approximation to $k$-median and $k$-means in high dimensional Euclidean spaces in time $2^{k^2 \log^3 k/\varepsilon^{O(1)}} \text{poly}(nd)$, close to the best randomized complexity of $2^{(k/\varepsilon)^{O(1)}} nd$ (see [Kumar, Sabharwal, Sen, JACM 10] and [Bhattacharya, Jaiswal, Kumar, TCS'18]).
Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling, that immediately improves over the recent results of [Braverman et al. FOCS 22] by a factor $k$.
View details
Preview abstract
Consider an agent (say a robot) exploring an unknown graph in search
of some goal state. As it walks around the graph, it learns the
nodes and their neighbors. The agent only knows where the goal state
is when it reaches it. How do we reach this goal while moving only a
small distance? This problem seems hopeless, even on trees of
bounded degree, unless we give the agent some help. In our case,
this help comes in the form of *predictions*: each node $v$
provides a prediction $f(v)$ of its distance to the goal
vertex. Naturally if the predictions are correct, we can reach the
goal quickly. What if some of the predictions are erroneous? This
naturally models graph search problems where we are given some
quality function to guide us towards the goal: in our framework the
quality is given by the distance predictions.
In this work, we initiate a study of this problem with
predictions. We consider the problem on trees (where the problem is
already non-trivial) and give deterministic realgorithms whose
movement cost is only $O(OPT + ERR)$, where $OPT$ is the distance
from the start to the goal vertex, and the $ERR$ is the total number
of vertices whose predictions are erroneous. We also consider a
``planning'' version of the problem where the graph and predictions
are known at the beginning, so the agent can use this global
information to quickly reach the goal. For the planning version, we
give a different algorithm for trees, and then an algorithm for all
(weighted) graphs with bounded doubling dimension.
View details
Preview abstract
We study the fine-grained complexity of the famous $k$-center problem in the metric induced by a graph with $n$ vertices and $m$ edges. The problem is NP-hard to approximate within a factor strictly better than $2$, and several $2$-approximation algorithms are known.
Two of the most well-known approaches for the $2$-approximation are (1) finding a maximal distance $r$-independent set (where the minimum pairwise distance is greater than $r$) and (2) Gonzalez's algorithm that iteratively adds the center farthest from the currently chosen centers.
For the approach based on distance-$r$ independent sets, Thorup [SIAM J. Comput. '05] already gave a nearly linear time algorithm. While Thorup's algorithm is not complicated, it still requires tools such as an approximate oracle for neighborhood size by Cohen [J. Comput. Syst. Sci. '97]. Our main result is a nearly straightforward algorithm that improves the running time by an $O(\log n$) factor. It results in an $(2+\eps)$-approximation for $k$-center in $O((m + n \log n)\log n \log(n/\eps))$ time.
For Gonzalez's algorithm [Theor. Comput. Sci. 85], we show that the simple $\widetilde{O}(mk)$-time implementation is nearly optimal if we insist the {\em exact} implementation. On the other hand, we show that an $(1+\eps)$-approximate version of the algorithm is efficiently implementable, leading to an $(2+\eps)$-approximation algorithm in running time $O((m + n \log n)\log^2 n / \eps)$. We also show that, unlike in the distance $r$-independent set-based algorithm, the dependency of $1/\eps$ in the running time is essentially optimal for $(1 + \eps)$-approximate Gonzalez's.
View details
Streaming Euclidean MST to a Constant Factor
Amit Levi
Erik Waingarten
Xi Chen
54rd Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
Preview abstract
We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \R^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+\epsilon)$ approximations are possible in sublinear space. However, for high dimensional space the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC'22], improving on the prior $O(\log^2 n)$ bound due to [Andoni, Indyk, Krauthgamer, SODA '08]. In this paper, we break the logarithmic barrier, and give the first constant factor sublinear space approximation to Euclidean MST. For any $\epsilon\geq 1$, our algorithm achieves an $\tilde{O}(\epsilon^{-2})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space.
We complement this by demonstrating that any single pass algorithm which obtains a better than $1.10$-approximation must use $\Omega(\sqrt{n})$ space, demonstrating that $(1+\epsilon)$ approximations are not possible in high-dimensions, and that our algorithm is tight up to a constant. Nevertheless, we demonstrate that $(1+\epsilon)$ approximations are possible in sublinear space with $O(1/\epsilon)$ passes over the stream. More generally, for any $\alpha \geq 2$, we give a $\alpha$-pass streaming algorithm which achieves a $O(\frac{1}{ \alpha \epsilon})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space.
All our streaming algorithms are linear sketches, and therefore extend to the massively-parallel computation model (MPC). Thus, our results imply the first $(1+\epsilon)$-approximation in a constant number of rounds in the MPC model. Previously, such a result was only known for low-dimensional space ([Andoni, Nikolov, Onak, Yaroslavtsev, STOC'15]), or either required $O(\log n)$ rounds or suffered a $O(\log n)$ approximation.
View details
Preview abstract
We consider the classic Euclidean k-median and k-means objective on insertion-only streams,
where the goal is to maintain a (1 + ε)-approximation to the k-median or k-means, while using
as little memory as possible. Over the last 20 years, clustering in data streams has received a
tremendous amount of attention and has been the test-bed for a large variety of new techniques,
including coresets, merge-and-reduce, bicriteria approximation, sensitivity sampling, etc. Despite
this intense effort to obtain smaller and smaller sketches for these problems, all known techniques
require storing at least Ω(log (n∆)) words of memory, where n is size of the input and ∆ is
the aspect ratio. In this paper, we show how to finally bypass this bound and obtain the first
algorithm that achieves a (1 + ε)-approximation to the more general (k, z)-clustering problem,
using only ̃O ( dk / ε^2) · (2^{z log z} ) · min ( 1/ε^z , k) · poly(log log(n∆)) words of memory.
View details
Preview abstract
We prove that there is a randomized polynomial-time algorithm that given an edge-weighted graph G
excluding a fixed-minor Q on n vertices and an accuracy parameter ε > 0, constructs an edge-weighted
graph H and an embedding η : V (G) → V (H) with the following properties:
• For any constant size Q, the treewidth of H is polynomial in ε^−1, log n, and the logarithm of the stretch of the distance metric in G.
• The expected multiplicative distortion is (1 + ε): for every pair of vertices u, v of G, we have
dist_H(η(u), η(v)) ⩾ dist_G(u, v) always and E[distH(η(u), η(v))] ⩽ (1 + ε)dist_G(u, v).
Our embedding is the first to achieve polylogarithmic treewidth of the host graph and comes close
to the lower bound by Carroll and Goel, who showed that any embedding of a planar graph with
O(1) expected distortion requires the host graph to have treewidth Ω(log n). It also provides a unified
framework for obtaining randomized quasi-polynomial-time approximation schemes for a variety of
problems including network design, clustering or routing problems, in minor-free metrics where the
optimization goal is the sum of selected distances. Applications include the capacitated vehicle routing
problem, and capacitated clustering problems.
View details
Private estimation algorithms for stochastic block models and mixture models
Hongjie Chen
Tommaso D'Orsi
Jacob Imola
David Steurer
Stefan Tiegel
54rd Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
Preview abstract
We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient (ϵ,δ)-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an (ϵ,δ)-differentially private algorithm that recovers the centers of the k-mixture when the minimum separation is at least O(k^{1/t}/√t). For all choices of t, this algorithm requires sample complexity n≥k^O(1)d^O(t) and time complexity (nd)^O(t). Prior work required minimum separation at least O(√k) as well as an explicit upper bound on the Euclidean norm of the centers.
View details
Breaching the 2 LMP Approximation Barrier for Facility Location with Applications to k-Median
Chris Schwiegelshohn
Euiwoong Lee
Fabrizio Grandoni
SODA'23 (2023) (to appear)
Preview abstract
The $k$-Median problem is one the most fundamental clustering problems: Given $n$ points in a metric space and an integer parameter $k$ our goal is to select precisely $k$ points (called centers) so as to minimize the sum of the distances from each point to the closest center. Obtaining better and better approximation algorithms for $k$-Median
is a central open problem.
Over the years, two main approaches have emerged: the first one is the standard Local Search heuristic, yielding a $(3+\eps)$ approximation for any constant $\eps>0$ which is known to be tight.
The second approach is based on a Lagrangian Multiplier
Preserving (LMP) $\alpha$ approximation algorithm for the related facility location problem. This algorithm is used to build an $\alpha$ approximate fractional bipoint solution for $k$-Median, which is then rounded via a $\rho$ approximate bipoint rounding algorithm. Altogether this gives an $\alpha\cdot \rho$ approximation. A lot of progress was made on improving $\rho$, from the initial $2$ by Jain and Vazirani [FOCS'99, J.ACM'01], to $1.367$ by Li and Svensson [STOC'13, SICOMP'16], and finally to $1.338$ by Byrka et al. [SODA'15, TALG'17]. However for almost 20 years no progress was made on $\alpha$, where the current best result is the classical LMP $2$ approximation algorithm JMS for facility location by Jain et al. [STOC'02, \fab{J.ACM'03}] based on the Dual-Fitting technique.
View details
Preview abstract
We consider the classic correlation clustering problem: Given a complete graphs where edges are
labelled either + or -, the goal is to find a partition of the vertices that minimizes the number of the +
edges across parts plus the number of the - edges within parts.
Recently, Cohen-Addad, Lee and Newman [FOCS’22] showed the first 1.995-approximation for the
problem using the Sherali-Adams hierarchy hence breaking through the integrality gap of 2 of the clas-
sic linear program and improving upon the 2.06-approximation of Chawla, Makarychev, Schramm and
Yaroslavtsev [STOC’15].
We significantly improve the result by providing a 1.73-approximation for the problem, making a
significant step below 2. Our approach brings together a new preprocessing of correlation clustering
instances that enables a new LP formulation which combined with the approach of Cohen-Addad, Lee
and Newman [FOCS’22] and a new rounding method yields the improved bound.
View details
Near-Optimal Correlation Clustering with Privacy
Ashkan Norouzi Fard
Chenglin Fan
Jakub Tarnawski
Slobodan Mitrović
NeurIPS 2022 (2022) (to appear)
Preview abstract
Correlation clustering is a central problem in unsupervised learning, with applications spanning community detection, duplicate detection, automated labeling and many more. In the correlation clustering problem one receives as input a set of nodes and for each node a list of co-clustering preferences, and the goal is to output a clustering that minimizes the disagreement with the specified nodes' preferences. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees. Our additive error is stronger than the one shown in prior work and is optimal up to polylogarithmic factors for fixed privacy parameters.
View details
Towards Optimal Coreset Bounds for Euclidean k-Median and k-Means
Chris Schwiegelshohn
David Saulpic
Kasper Green Larsen
54rd Annual ACM Symposium on Theory of Computing (STOC'22) (2022)
Preview abstract
The classic Euclidean $k$-Median and $k$-means problems consists of finding a set of $k$ points, such that the sum of Euclidean distances, respectively squared Euclidean distances, of every data point to its closest center is minimized. Coresets are a small, usually weighted subset of the input point set preserving the cost of any candidate solution up to a multiplicative $(1\pm \varepsilon)$ factor. Most coreset constructions focus on improving the size, i.e. the number of distinct points in the coreset. The arguably biggest open question in this line of research is to find tight bounds for Euclidean $k$-median and $k$-means. In this paper, we show the following.
\begin{itemize}
\item A coreset construction consisting of $O(k\cdot \log k \cdot \varepsilon^{-3}\cdot \log^{O(1)}\varepsilon^{-1})$ points for the k-median problem in arbitrary dimensions.
\item An unconditional lower bound showing that $\Omega(k\cdot \varepsilon^{-2})$ points are necessary for any coreset for either problem.
\item An unconditional lower bound showing that $\Omega(k \cdot \varepsilon^{-2} \cdot \log n}
points are necessary for any coreset of either problem in general metric spacesé
\end{itemize}
The two bounds for Euclidean space are tight up to a factor (epsilon^{-1} polylog (k/epsilon). In addition, our techniques improve state of the art bounds for clustering with higher powers of Euclidean distances as well.
The bound for general metrics in tight up to a factor polylog (k/epsilon).
View details
The Power of Uniform Sampling for Coresets
Chris Schwiegelshohn
Mads Toftrup
Robert Krauthgamer
Shaofeng Jiang
Vladimir Braverman
Xuan Wu
FOCS'22 (2022) (to appear)
Preview abstract
We develop a new coreset framework which has many advantages comparing with the well-
studied Feldman-Langberg framework [FL11] and the recent new framework by [CASS21b]. An
important feature of our framework is that it relates a uniform-version shattering dimension
of the ambient metric space with the existence of small coresets. While in [FL11], a weighted
version of such dimension is required. Reasonable upper bounds of such weighted shattering
dimension are often hard to prove or even do not exist.
Based on our new framework, we construct coresets for multiple important problems. First,
We use our framework to construct improved coresets for Barycenter. Secondly, we construct
improved coresets for 1-Median in low-dimensional Euclidean space. Specially, we construct
coresets of size Õ(ϵ^−1.5 ) in R^2 and coresets of size Õ(ϵ^−1.6 ) in R^3 .
View details
Improved Approximations for Euclidean k-means and k-median, via Nested Quasi-Independent Sets
Shyam Narayanan
54rd Annual ACM Symposium on Theory of Computing (STOC'22) (2022)
Preview abstract
Motivated by data analysis and machine learning applications, we consider the popular high-dimensional Euclidean $k$-median and $k$-means problems.
We propose a new primal-dual algorithm, inspired by the classic algorithm of Jain and Vazirani and the recent algorithm of Ahmadian et al.. Our algorithm achieves
an approximation ratio of respectively 2.40... and 5.95... for Euclidean $k$-median and $k$-means improving upon the
2.60... of Ahmadian et al. and the 6.12.. of Grandoni et al..
View details
A 2-Approximation for the Bounded Treewidth Sparsest Cut Problem in FPT Time
Tobias Mömke
Victor
23rd Conference on Integer Programming and Combinatorial Optimization (IPCO'22) (2022)
Preview abstract
In the non-uniform sparsest cut problem, we are given a supply graph $G$ and a demand graph $D$, both with the same vertex set $V$. The goal is to partition the set of vertices $V$ into two subsets in order to minimize the ratio of the total capacity on the supply edges crossing from one part to the other over the total demand of the crossing edges. In this work, we study the sparsest cut problem for supply graphs with bounded treewidth $k$. For this case, Gupta, Talwar, and Witmer [STOC 2013] obtained a 2-approximation with polynomial running time for fixed $k$, and it remained open the question of whether there exists a $c$-approximation algorithm for a constant $c$ independent of $k$ that runs in FPT time. We answer the above question in the affirmative. We show that there exists a 2-approximation algorithm for the non-uniform sparsest cut with bounded treewidth supply graphs that runs in FPT time. Our algorithm is based on rounding the optimal solution of a linear programming relaxation inspired by the Sherali-Adams hierarchy. In contrast to the classic Sherali-Adams construction, we construct a relaxation driven by a tree decomposition of the supply graph, and we include a carefully chosen set of lifting variables to encode information of subsets of nodes with super-constant size, while at the same time having a sufficiently small linear program that can be solved in FPT time.
View details
Improved Coresets for Euclidean k-Means
Chris Schwiegelshohn
David Saulpic
Kasper Green Larsen
Omar Ali Sheikh-Omar
Neurips'22 (2022)
Preview abstract
Given a set of n points in d dimensions, the Euclidean k-means problem consists of finding k centers such that the sum of squared distances from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between coreset cost and the cost of the original instance is less than a (1+epsilon) factor. The current state of the art coreset size for Euclidean k-means is O(k epsilon^{-2} min(k, epsilon^{-2})). This matches the lower bound of Omega(k epsilon^{-2}) up to the min(k, epsilon^{-2}) term. In this paper, we improve this bound to min(sqrt(k), epsilon^{-2}). In the regime where k = o(epsilon^{-2}), this is a strict improvement over the state of the art. In particular, ours is the first provable bound that breaks through the k^2 barrier while retaining an optimal dependency on epsilon.
View details
Preview abstract
Given n x n recording pairwise distances x, the Metric Violation Distance prob-
lem asks to compute the L_0 distance between x and the metric cone; i.e., modify the minimum
number of entries of x to make it a metric. Due to its large number of applications in various
data analysis and optimization tasks, this problem has been actively studied recently.
We present an Oplog nq-approximation algorithm for Metric Violation Distance, expo-
nentially improving the previous best approximation ratio of O(OPT ^{1/3}) of Fan, Raichel, and
Van Buskirk [SODA, 2018]. Furthermore, a major strength of our algorithm is its simplicity
and running time. We also study the related problem of Ultrametric Violation Distance,
where the goal is to compute the L0 distance to the cone of ultrametrics, and achieve a constant
factor approximation algorithm. The Ultrametric Violation Distance problem can be
regarded as an extension of the problem of fitting ultrametrics studied by Ailon and Charikar
[SIAM J. Computing, 2011] and by Cohen-Addad, Das, Kipouridis, Parotsidis, and Thorup
[FOCS, 2021] from L1 norm to L0 norm. We show that this problem can be favorably interpreted
as an instance of Correlation Clustering with an additional hierarchical structure, which
we solve using a new O(1)-approximation algorithm for correlation clustering that has the struc-
tural property that it outputs a refinement of the optimum clusters. An algorithm satisfying
such a property can be considered of independent interest. We also provide an O(log n log log n)
approximation algorithm for weighted instances. Finally, we investigate the complementary
version of these problems where one aims at choosing a maximum number of entries of x form-
ing an (ultra-)metric. In stark contrast with the minimization versions, we prove that these
maximization versions are hard to approximate within any constant factor assuming the Unique
Games Conjecture.
View details
Bypassing Surface Embeddings: Approximation Schemes for Network Design in Minor-free Metrics
54rd Annual ACM Symposium on Theory of Computing (STOC'22) (2022)
Preview abstract
Since the mid 90s, the study of the complexity of classic network design
problems such as the traveling salesman problem (TSP), the Steiner tree problem (ST),
or the $k$-MST problem on metric spaces such as low-dimensional Euclidean spaces,
doubling metrics, planar or minor-free metrics,
has led to major improvements of our understanding
of the structure of both these important metric spaces, and the underlying problems.
In their celebrated work, Arora and Mitchell gave quasi-polynomial-time approximation schemes (QPTAS)
for several network design problems in Euclidean space, that have later been improved
to polynomial and (near-)linear in some cases. Arora and Mitchell's QPTAS result has been
extended by Talwar [STOC'04] to doubling metrics showing that the Euclidean embedding
is not a necessary condition to design approximation schemes for network design problems.
In the case of planar metrics, the picture is already less satisfactory: The first
approximation schemes for Steiner tree only dates back to [] and nothing better than the general
case is known for $k$-MST, an open question that has thus been repeatedly asked.
Even when approximation schemes are known for the planar case, generalizing them to the minor-free
setting is often a major challenge because most of the topological properties are lost.
In this case, one of the most important question of the field is whether
we bypass these topological arguments to obtain a general framework for network
design problems in minor-free metrics?
In this paper, we answer by the affirmative by proposing a framework for network design
problems in minor-free metrics.
We give the first approximation schemes, with quasi-polynomial running time,
for $k$-MST, Steiner tree, Steiner forest, Prize-collecting Steiner tree in minor-free metrics.
The result on $k$-MST was not known before even for the planar case, while Steiner tree on
minor-free metrics has been one of the major challenge. Here we also obtain approximation
schmes for the two non-trivial generalizations Steiner forest and Prize-collecting Steiner tree.
View details
Improved Approximation Algorithms and Lower Bounds for Search-Diversification Problems
Amir Abboud
Euiwoong Lee
49th International Colloquium on Automata, Languages and Programming (ICALP) (2022)
An Improved Local Search Algorithm for k-Median
Anupam Gupta
David Saulpic
Hoon Oh
Lunjia Hu
Symposium On Discrete Algorithms SODA'22 (2022)
Preview abstract
We present a new local-search algorithm for the k-median clustering problem.
We show that local optima for this algorithm give an
(2.836+eps)-approximation; our result improves upon the (3+eps)-approximate
local-search algorithm of Arya et al. (2001). Moreover, a computer-aided
analysis of a natural extension suggests that this approach may lead to a
improvement over the best-known approximation guarantee for the problem (which
is 2.67).
The new ingredient in our algorithm is the use of a potential function based
on both the closest and second-closest facilities to each client.
Specifically, the potential is the sum over all clients, of the distance of
the client to its closest facility, plus (a small constant times) the
truncated distance to its second-closest facility. We move from one solution
to another only if the latter can be obtained by swapping a constant number of
facilities, and has a smaller potential than the former. This refined
potential allows us to avoid the bad local optima given by Arya et al. for the
local-search algorithm based only on the cost of the solution.
View details
A Massively Parallel Modularity-Maximizing Algorithm With Provable Guarantees
David Saulpic
Frederik Mallmann-Trenn
41st ACM Symposium on Principles of Distributed Computing 2022 (2022)
Preview abstract
Graph clustering is one of the most basic and pop-
ular unsupervised learning problem. Among the
different formulations of the problem, the modu-
larity objective has been particularly successful
for helping design impactful algorithms; Most
notably the Louvain algorithm has become one
of the most used algorithm for clustering graphs.
Yet, one major limitation of the Louvain algorithm
is its sequential nature which makes it impracti-
cal in distributive environments and on massive
datasets.
In this paper, we provide a parallel version of Lou-
vain which works in the massively parallel com-
putation model (MPC). We show that it achieves
optimal cluster recovery in the classic stochastic
block model in only a constant number of parallel
rounds, and so for the same regime of parameters
than the standard Louvain algorithm as shown
recently in Cohen-Addad et al. (2020).
View details
Preview abstract
Given a complete graph G = (V, E) where each edge is labeled + or −, the Correlation
Clustering problem asks to partition V into clusters to minimize the number of +edges be-
tween different clusters plus the number of −edges within the same cluster. Correlation
Clustering has been used to model a large number of clustering problems in practice, making
it one of the most widely studied clustering formulations. The approximability of Correla-
tion Clustering has been actively investigated [BBC04, CGW05, ACN08], culminating in a
2.06-approximation algorithm by Chawla, Makarychev, Schramm, and Yaroslavtsev [CMSY15].
The state-of-the-art result of Chawla, Makarychev, Schramm, and Yaroslavtsev is obtained by
designing a clever rounding scheme for the standard LP relaxation. Yet, the integrality gap of
this formulation is 2 and so it has remained a major open question to determine whether one
can even breach this bound of 2 for Correlation Clustering.
In this paper, we answer the question affirmatively by showing that there exists a (1.95 +ε)-
approximation algorithm based on O(1/ε^2) rounds of the Sherali-Adams hierarchy. In order to
round a solution to Sherali-Adams, we adapt the correlated rounding originally developed for
CSPs [BRS11, GS11, RT12]. Beyond the traditional triangle-based analysis, we also employ a
global charging scheme that amortizes the total cost of the rounding across different triangles.
We believe our results
could pave the way for applications of LP hierarchies beyond CSPs.
View details
Community Detection in the Heterogeneous Stochastic Block Model
David Saulpic
Frederik Mallmann-Trenn
35th Annual Conference on Learning Theory (COLT 2022) (2022)
Preview abstract
We consider the Heterogeneous Stochastic Block
Model (HeSBM) consisting of two unknown
planted communities C1 and C2 . Each node u
has two (unknown) parameters pu and qu with
pu > qu , describing the probability of having a
directed edges with nodes from the same com-
munity and with nodes from the other commu-
nity, respectively. This model allows for almost
arbitrary degree distributions.
We present a simple, linear time algorithm
that recovers the communities whenever
(pu −qu)/sqrt(pu) = Ω(log n/n). We also show
that this condition is necessary. Note that this
recovery threshold is the same as from the
standard Stochastic Block Model, in which all
nodes have the same pu and qu .
Our algorithm is based on local greedy improvement,
where nodes are moved into the part where
it has more edges. We show how to implement
those moves in parallel, such that each vertex
needs to be considered at most twice. We believe
that this algorithm can be easily implemented in
parallel, making it highly practical; and the theoretical
guarantee we demonstrate show a desirable robustness
to perturbations in the degree distribution
View details
On Facility Location Problem in the Local Differential Privacy Model
Chenglin Fan
Di Wang
Marco Gaboradi
Shi Li
Yunus Esencayi
25th International Conference on Artificial Intelligence and Statistics (AISTATS 2022) (2022)
Preview abstract
We study the uncapacitated facility location (UFL) problem under the constraints imposed by the local differential privacy (LDP). Recently, Gupta et al. (2009) and Esencayi et al. (2019) proposed lower and upper bounds for the UFL problem on the central differential privacy (DP) model where a curator first collects all data before being processed. In this paper, we focus on the LDP model, where we protect a client's participation in the facility location instance. Under the HST metric, we show that there is a non-interactive $\epsilon$-LDP algorithm achieving $O(n^{1/4}/\epsilon^2)$-approximation ratio, where n is the size of the metric. On the negative side, we show a lower bound of $\Omega(n^{1/4}/\sqrt{\epsilon})$ on the approximation ratio for any non-interactive $\epsilon$-LDP algorithm. Thus, our results are tight up to a factor of $\epsilon$. Moreover, unlike previous results, our results generalize for non-uniform facility costs.
View details
Scalable Differentially Private Clustering via Hierarchically Separated Trees
Chris Schwiegelshohn
David Saulpic
2022 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2022) (to appear)
Preview abstract
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space.
By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods.
We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime.
Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.
View details
Massively Parallel k-Means Clustering for Perturbation Resilient Instances
Peilin Zhong
International Conference on Machine Learning, ICML 2022 (2022)
Preview abstract
We consider $k$-means clustering of $n$ data points in Euclidean space in the Massively Parallel Computation (MPC) model, a computational model which is an abstraction of modern massively parallel computing system such as MapReduce.
There was a conditional lower bound showing that getting $O(1)$-approximate $k$-means solution for general input points using $o(\log n)$ rounds in the MPC model is impossible.
However, the real-world data points usually have better structures.
One instance of interest is the set of data points which is perturbation resilient [Bilu \& Linial'2010].
In particular, a point set is $\alpha$-perturbation resilient for $k$-means if perturbing pairwise distances by multiplicative factors in the range $[1,\alpha]$ does not change the optimum $k$-means clusters.
We bypass the worst case lower bound by considering the perturbation resilient input points and showing $o(\log n)$ rounds $k$-means clustering algorithms for these instances in the MPC model.
Specifically, we show a fully scalable $(1+\varepsilon)$-approximate $k$-means clustering algorithm for $O(\alpha)$-perturbation resilient instance in the MPC model using $O(\log\log n)$ rounds and $\widetilde{O}_{\varepsilon,d}(n^{1+1/\alpha^2})$ total space.
If the space per machine is sufficient larger than $k$, i.e., at least $k\cdot n^{\Omega(1)}$, we also develop an optimal $k$-means clustering algorithm for $O(\alpha)$-perturbation resilient instance in the MPC model using $O(\log\log n)$ rounds and $\widetilde{O}_d(n\cdot(n^{1/\alpha^2}+k))$ total space.
View details
Correlation Clustering in Constant Many Parallel Rounds
Ashkan Norouzi Fard
Jakub Tarnawski
Slobodan Mitrović
ICML (2022) (to appear)
Preview abstract
Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental scalability\nnote{I would remove "scalability": it is not clear that this will be demonstrated with mid-sized graphs} evaluation of our techniques.
View details
Preview abstract
In the correlation clustering problem the input is a signed graph where the sign indicates whether each pair of points should be placed in the same cluster or not. The goal of the problem is to compute a clustering which minimizes the number of disagreements with such recommendation. Thanks to its many practical applications, correlation clustering is a fundamental unsupervised learning problem and has been extensively studied in many different settings. In this paper we study the problem in the classic online setting with recourse; The vertices of the graphs arrive in an online manner and the goal is to maintain an approximate clustering while minimizing the number of times each vertex changes cluster.
Our main contribution is an algorithm that achieves logarithmic recourse per vertex in the worst case. We also complement this result with a tight lower bound. Finally we show experimentally that our algorithm achieves better performances than state-of-the-art algorithms on real world data.
View details
Near-Optimal Private and Scalable k-Clustering
Shyam Narayanan
NeurIPS 2022 (2022)
Preview abstract
We study the differentially private (DP) $k$-means and $k$-median clustering problems of $n$ points in $d$-dimensional Euclidean space in the massively parallel computation (MPC) model. We provide two near-optimal algorithms where the near-optimality is in three aspects: they both achieve (1). $O(1)$ parallel computation rounds, (2). near-linear in $n$ and polynomial in $k$ total computational work (i.e., near-linear running time in the sequential setting), (3). $O(1)$ relative approximation and $\text{poly}(k, d)$ additive error, where $\Omega(1)$ relative approximation is provably necessary even for any polynomial-time non-private algorithm, and $\Omega(k)$ additive error is a provable lower bound for any polynomial-time DP $k$-means/median algorithm. Our two algorithms provide a tradeoff between the relative approximation and the additive error: the first has $O(1)$ relative approximation and $\sim (k^{2.5} + k^{1.01} \sqrt{d})$ additive error, and the second one achieves $(1+\gamma)$ relative approximation to the optimal non-private algorithm for an arbitrary small constant $\gamma>0$ and with $\text{poly}(k, d)$ additive error for a larger polynomial dependence on $k$ and $d$.
To achieve our result, we develop a general framework which partitions the data and reduces the DP clustering problem for the entire dataset to the DP clustering problem for each part. To control the blow-up of the additive error introduced by each part, we develop a novel charging argument which might be of independent interest.
View details
Preview abstract
We study in this paper the geometric $(1, z)$-clustering problem: given $n$ points in $\R^d$, find the point $x$ that minimizes the sum of Euclidean distance, raised to the power $z$, over all input points. This problem interpolates between the well-known Fermat-Weber problem -- or geometric median problem-- where $z = 1$, and the Minimum Enclosing Ball problem, where $z = \infty$.
Our contribution is the design of a precise estimator that sample only a constant number of points. Namely, for any $\eps > 0$, we show that sampling uniformly at random $O(\eps^{-z})$ input points is enough to find a center such that the sum of distances to the power $z$ to that center is within a $(1+\eps)$-factor of the optimum. We also provide a lower bound, showing that any such algorithm must sample at least $\Omega\left(eps^{-z}\right)$ points.
This implies an algorithm that computes a $(1+\eps)$-approximation running in time $0(d \eps^{-z-5})$, generalizing the result from Cohen et al [STOC '16] to arbitrary $z$. Furthermore, an algorithm with low query complexity has good privacy guarantee: we show that our algorithm is $(0, O(1/n))$-differentially private. This can be used to construct the first differentially-private algorithm for $(k, z)$-clustering with approximation term independent of the dimension $d$, improving on the algorithm of Ghazi et al. [Neurips '20] that has an additive error $\sqrt d \cdot \poly(k, \log n)$.
View details
Parallel and Efficient Hierarchical k-Median Clustering
Ashkan Norouzi Fard
Christian Sohler
Ola Svensson
NeurIPS 2021
Preview abstract
As a fundamental unsupervised learning task, hierarchical
clustering has been extensively studied in the past decade. In
particular, standard metric formulations as hierarchical $k$-center, $k$-means,
and $k$-median received a lot of attention and the problems
have been studied extensively in different computation model.
Despite all this interest, not many efficient parallel algorithms
are known for those problems. In this paper we introduce a
new parallel algorithm for Euclidean
hierarchical $k$-median problem that using machine with memory $s$ (for $s\in
\Omega(\log^2 (n+\Delta+d))$) runs in $O\left(\log_{s} nd
\right)$ rounds, where $d$ is the dimension of the data set and
$\Delta$ is a polynomial upper-bound on the ratio between the maximum
and minimum
distance of two points in the input dataset. To the best of our
knowledge, this is the first \emph{parallel} algorithm
for the hierarchical $k$-median problem with theoretical guarantees.
We further complement our theoretical results with an empirical study
of our algorithm that shows its effectiveness in practice.
View details
Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor
Debarati Das
Evangelos Kipouridis
Mikkel Thorup
FOCS 2021
Preview abstract
We consider the numerical taxonomy problem of fitting a positive distance function $\calD:{S\choose 2}\rightarrow \mathbb R_{>0}$ by a tree metric. We want a tree $T$ with positive edge weights and including $S$ among the vertices so that their distances in $T$ match
those in $\calD$. A nice application is in evolutionary biology
where the tree $T$ aims to approximate the branching process leading
to the observed distances in $\calD$ [Cavalli-Sforza and Edwards 1967].
We consider the total error, that is the sum of distance errors over
all pairs of points. We present a polynomial time algorithm
minimizing the total error within a constant factor. We can do this
both for general trees, and for the special case of ultrametrics with
a root having the same distance to all vertices in $S$.
The problems are known to be APX-hard, so a constant factor is the best we can hope for.
The best previous approximation factor was
$\calO((\log n)(\log \log n))$ by Ailon and Charikar [2005] who wrote ``Determining whether an
$\calO(1)$ approximation can be obtained is a fascinating question".
View details
A New Coreset Framework for Clustering
Chris Schwiegelshohn
David Saulpic
53rd Annual ACM Symposium on Theory of Computing (STOC'21) (2021)
Preview abstract
Given a metric space, the (k, z)-clustering problem consists of finding k centers such that the
sum of the of distances raised to the power z of every point to its closest center is minimized. This
encapsulates the famous k-median (z = 1) and k-means (z = 2) clustering problems. Designing
small-space sketches of the data that approximately preserves the cost of the solutions, also
known as coresets, has been an important research direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously improves upon
the best known bounds for a large variety of settings, ranging from Euclidean space, doubling
metric, minor-free metric, and the general metric cases: with Γ = min(ε^−2+ε^−z, k^ε−2) polylog(ε^−1), this framework constructs coreset with size O (Γ · k(d + log k)) in doubling metrics, improving upon the recent breakthrough of [Huang, Jiang, Li, Wu, FOCS’ 18], who presented a coreset with size O(k^3 d/ε^2) – hence an improvement of Ω(k^2/polylog(ε^−1)).
O(Γ · k · min(d, ε^−2 log k)) in d-dimensional Euclidean space, improving on the recent results of [Huang, Vishnoi, STOC’ 20], who presented a coreset of size O(k log k · ε^−2z · min(d, ε^−2 log k)) – hence an improvement of Ω(log k/(ε^z−1 polylog(ε^−1))).
O(Γ · k(t + log k)) for graphs with treewidth t, improving on [Baker, Braverman, Huang,
Jiang, Krauthgamer, Wu, ICML’20], who presented a coreset of size O(k^3 t/ε^2) for z = 1 only – hence an improvement of Ω(k/polylog(ε^−1)), and a generalization to general z.
O(Γ · k (log^2 k + log k/ε^4)) for shortest paths metrics of graphs excluding a fixed minor. This
improves on [Braverman, Jiang, Krauthgamer, Wu, SODA’21], who presented a coreset of
size O(k^3/ε^4) – hence an improvement of Ω(k^2/polylog(ε^−1)).
Size O(Γ · k log n) in general discrete metric spaces, improving on the results of [Feldman, Lamberg, STOC’11], who presented a coreset of size O(kε^−2z log n log k) – hence an improvement of Ω(log k/(ε^z−1polylog(ε^−1))).
A lower bound Ω( k log n/ ε) for k-Median in general metric spaces [Baker, Braverman, Huang,
Jiang, Krauthgamer, Wu, ICML’20] implies that in general metrics as well as metrics with
doubling dimension d, our bounds are optimal up to a poly log(1/ε)/ε factor. For graphs with
treewidth t, the lower bound of Ω(kt/ε) of [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML’20] shows that our bounds are optimal up to the same factor.
View details
On Approximability of Clustering Problems Without Candidate Centers
Euiwoong Lee
Karthik C. S.
Symposium On Discrete Algorithms SODA'21 (2021)
Preview abstract
The k-means objective is arguably the most widely-used cost function for modeling clustering tasks in a metric space. In practice and historically, k-means is thought of in a continuous setting, namely where the centers can be located anywhere in the metric space. For
example, the popular Lloyd’s heuristic locates a center at the mean of each cluster.
Despite persistent efforts on understanding the approximability of k-means, and other
classic clustering problems such as k-median and k-minsum, our knowledge of the hardness
of approximation factors of these problems remains quite poor. In this paper, we significantly improve upon the hardness of approximation factors known in the literature for
these objectives. We show that if the input lies in a general metric space, it is NP-hard to
approximate:
• Continuous k-median to a factor of 2 − o(1); this improves upon the previous inapproximability factor of 1.36 shown by Guha and Khuller (J. Algorithms ’99).
• Continuous k-means to a factor of 4 − o(1); this improves upon the previous inapproximability factor of 2.10 shown by Guha and Khuller (J. Algorithms ’99).
• k-minsum to a factor of 1.415; this improves upon the APX-hardness shown by Guruswami and Indyk (SODA ’03).
Our results shed new and perhaps counter-intuitive light on the differences between
clustering problems in the continuous setting versus the discrete setting (where the candidate centers are given as part of the input).
View details
Improving Ultrametrics Embeddings Through Coresets
Guillaume Lagarde
Rémi De Joannis de Verclos
International Conference on Machine Learning, ICML 2021 (2021)
Preview abstract
To tackle the curse of dimensionality in data analysis and
unsupervised learning, it is critical to be able to efficiently compute ``simple'' faithful representations of
the data that helps extract information, improves understanding and visualization of the structure.
When the dataset consists of $d$-dimensional vectors, simple representations
of the data may consist in trees or ultrametrics, and the goal is to best preserve the distances (i.e.: dissimilarity values)
between data elements.
To circumvent the quadratic running times of the most popular methods for fitting ultrametrics,
such as average, single, or complete linkage,~\citet{CKL20} recently presented a new algorithm
that for any $c \ge 1$, outputs in time $n^{1+O(1/c^2)}$ an ultrametric $\Delta$ such that for any two
points $u, v$, $\Delta(u, v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in
the ``best'' ultrametric representation.
We improve the above result and show how to improve the above guarantee from $5c$ to $\sqrt{2}c + \varepsilon$ while achieving
the same asymptotic running time. To further show the advantage of our new method, we experimentally analyze the performances
of our algorithm which indeed yields embeddings of significantly better quality for various real-world datasets.
View details
A Quasipolynomial (2 + ε)-Approximation for Planar Sparsest Cut
Anupam Gupta
Jason Li
Philip N. Klein
53rd Annual ACM Symposium on Theory of Computing (STOC'21) (2021)
Preview abstract
The (non-uniform) sparsest cut problem is the following classical
graph-partitioning problem: given a ``supply'' graph, and demands on
pairs of vertices, delete some subset of supply edges to minimize
the ratio of the supply edges cut to the total demand of the pairs
separated by this deletion. The sparsest cut problem has been a
driver for new algorithmic approaches. Despite much effort,
there are only a handful of nontrivial classes of supply graphs for
which constant-factor approximations are known.
We consider the problem for planar graphs, and give a
$(2+\eps)$-approximation algorithm that runs in quasipolynomial
time. Our approach defines a new structural decomposition of an
optimal solution using a ``radial cuts'' primitive. We combine this
decomposition with a Sherali-Adams-style linear programming
relaxation of the problem, which we then round. This should be
compared with the polynomial-time approximation algorithm of Rao~(1999),
which
uses the metric linear programming relaxation and $\ell_1$-embeddings,
and achieves an $O(\sqrt{\log n})$-approximation in polynomial time.
View details
Fast and Accurate k-means++ via Rejection Sampling
Ashkan Norouzi Fard
Christian Sohler
Ola Svensson
NeurIPS 2020
Preview abstract
We present the first distributed approximation algorithm for the Euclidean $k$-median problem with an
optimal trade-off between memory usage and the number of parallel rounds. Our algorithm even works in the setting where each machine has very limited memory $s\in \Omega(\log n)$ and it is work efficient. In the future, it would be interesting to obtain similar results for other clustering problems and to improve the approximation factor of our algorithm.
View details
Preview abstract
A classic problem in machine learning and data analysis it to partition
the vertices of a network in such a way that vertices in the same set are densely connected.
In practice, the most popular approaches rely on local search algorithms;
not only for the ease of implementation and the efficiency, but also because of the accuracy of these methods on
many real-world graphs. For example, the Louvain algorithm -- a local search based algorithm -- has quickly become
the method of choice for clustering in social networks, accumulating more than 10700 citations over the past 10 years.
However, explaining the success of these methods remains an open problem: in the worst-case, both
their running time and output can be arbitrarily bad.
In this work, we study the Louvain heuristic and aim at explaining its success and identifying the mechanisms that makes
it successful through a classic model for social networks, the Stochastic Block Model.
For a wide range of parameters, we give the first theoretical evidence that Louvain identifies the underlying natural
partition of a network in near-linear time.
View details
Approximation Schemes for Routing Problems in Minor-Free Metrics
Arnold Filtser
Hung Le
Philip N. Klein
IEEE Symposium on Foundations of Computer Science (FOCS'20) (2020)
Preview abstract
Understanding the structure of minor-free metrics, namely shortest path
metrics obtained over a weighted graph excluding a fixed minor, has been an
important research direction since the fundamental work of Robertson and
Seymour. A fundamental idea that helps both to understand the structural
properties of these metrics and lead to strong algorithmic results is to
construct a ``small-complexity'' graph that approximately preserves distances
between pairs of points of the metric. We show the two following structural
results for minor-free metrics:
_ Construction of a \emph{light} subset spanner. Given a subset of vertices
called terminals, and $\epsilon$, in polynomial time we construct a subgraph
that preserves all pairwise distances between terminals up to a multiplicative
$1+\epsilon$ factor, of total weight at most $O_{\epsilon}(1)$ times the
weight of the minimal Steiner tree spanning the terminals.
_ Construction of a stochastic metric embedding into low treewidth graphs with
expected additive distortion $\epsilon D$. Namely, given a minor free graph
$G=(V,E,w)$ of diameter $D$, and parameter $\epsilon$, we construct a
distribution $\mathcal{D}$ over dominating metric embeddings into
treewidth-$O_{\epsilon}(\log n)$ graphs such that $\forall u,v\in V$,
$\mathbb{E}_{f\sim\mathcal{D}}[d_H(f(u),f(v))]\le d_G(u,v)+\epsilon D$.
One of our important technical contributions is a novel framework that allows
us to reduce \emph{both problems} to problems on simpler graphs of
\emph{bounded diameter} that we solve using a new decomposition. Our results
have the following algorithmic consequences: (1) the first efficient
approximation scheme for subset TSP in minor-free metrics; (2) the first
approximation scheme for vehicle routing with bounded capacity in minor-free
metrics; (3) the first efficient approximation scheme for vehicle routing with
bounded capacity on bounded genus metrics.
View details
No Results Found