# Vincent Pierre Cohen-addad

### Research Areas

Authored Publications

Google Publications

Other Publications

Sort By

Streaming Euclidean MST to a Constant Factor

Amit Levi

Erik Waingarten

Xi Chen

54rd Annual ACM Symposium on Theory of Computing (STOC'23) (2023)

Preview abstract
We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \R^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+\epsilon)$ approximations are possible in sublinear space. However, for high dimensional space the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC'22], improving on the prior $O(\log^2 n)$ bound due to [Andoni, Indyk, Krauthgamer, SODA '08]. In this paper, we break the logarithmic barrier, and give the first constant factor sublinear space approximation to Euclidean MST. For any $\epsilon\geq 1$, our algorithm achieves an $\tilde{O}(\epsilon^{-2})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space.
We complement this by demonstrating that any single pass algorithm which obtains a better than $1.10$-approximation must use $\Omega(\sqrt{n})$ space, demonstrating that $(1+\epsilon)$ approximations are not possible in high-dimensions, and that our algorithm is tight up to a constant. Nevertheless, we demonstrate that $(1+\epsilon)$ approximations are possible in sublinear space with $O(1/\epsilon)$ passes over the stream. More generally, for any $\alpha \geq 2$, we give a $\alpha$-pass streaming algorithm which achieves a $O(\frac{1}{ \alpha \epsilon})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space.
All our streaming algorithms are linear sketches, and therefore extend to the massively-parallel computation model (MPC). Thus, our results imply the first $(1+\epsilon)$-approximation in a constant number of rounds in the MPC model. Previously, such a result was only known for low-dimensional space ([Andoni, Nikolov, Onak, Yaroslavtsev, STOC'15]), or either required $O(\log n)$ rounds or suffered a $O(\log n)$ approximation.
View details

Breaching the 2 LMP Approximation Barrier for Facility Location with Applications to k-Median

Chris Schwiegelshohn

Euiwoong Lee

Fabrizio Grandoni

SODA'23 (2023) (to appear)

Preview abstract
The $k$-Median problem is one the most fundamental clustering problems: Given $n$ points in a metric space and an integer parameter $k$ our goal is to select precisely $k$ points (called centers) so as to minimize the sum of the distances from each point to the closest center. Obtaining better and better approximation algorithms for $k$-Median
is a central open problem.
Over the years, two main approaches have emerged: the first one is the standard Local Search heuristic, yielding a $(3+\eps)$ approximation for any constant $\eps>0$ which is known to be tight.
The second approach is based on a Lagrangian Multiplier
Preserving (LMP) $\alpha$ approximation algorithm for the related facility location problem. This algorithm is used to build an $\alpha$ approximate fractional bipoint solution for $k$-Median, which is then rounded via a $\rho$ approximate bipoint rounding algorithm. Altogether this gives an $\alpha\cdot \rho$ approximation. A lot of progress was made on improving $\rho$, from the initial $2$ by Jain and Vazirani [FOCS'99, J.ACM'01], to $1.367$ by Li and Svensson [STOC'13, SICOMP'16], and finally to $1.338$ by Byrka et al. [SODA'15, TALG'17]. However for almost 20 years no progress was made on $\alpha$, where the current best result is the classical LMP $2$ approximation algorithm JMS for facility location by Jain et al. [STOC'02, \fab{J.ACM'03}] based on the Dual-Fitting technique.
View details

Correlation Clustering in Constant Many Parallel Rounds

Ashkan Norouzi Fard

Jakub Tarnawski

Nikos Parotsidis

Slobodan Mitrović

ICML (2022) (to appear)

Preview abstract
Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental scalability\nnote{I would remove "scalability": it is not clear that this will be demonstrated with mid-sized graphs} evaluation of our techniques.
View details

Massively Parallel k-Means Clustering for Perturbation Resilient Instances

Peilin Zhong

International Conference on Machine Learning, ICML 2022 (2022)

Preview abstract
We consider $k$-means clustering of $n$ data points in Euclidean space in the Massively Parallel Computation (MPC) model, a computational model which is an abstraction of modern massively parallel computing system such as MapReduce.
There was a conditional lower bound showing that getting $O(1)$-approximate $k$-means solution for general input points using $o(\log n)$ rounds in the MPC model is impossible.
However, the real-world data points usually have better structures.
One instance of interest is the set of data points which is perturbation resilient [Bilu \& Linial'2010].
In particular, a point set is $\alpha$-perturbation resilient for $k$-means if perturbing pairwise distances by multiplicative factors in the range $[1,\alpha]$ does not change the optimum $k$-means clusters.
We bypass the worst case lower bound by considering the perturbation resilient input points and showing $o(\log n)$ rounds $k$-means clustering algorithms for these instances in the MPC model.
Specifically, we show a fully scalable $(1+\varepsilon)$-approximate $k$-means clustering algorithm for $O(\alpha)$-perturbation resilient instance in the MPC model using $O(\log\log n)$ rounds and $\widetilde{O}_{\varepsilon,d}(n^{1+1/\alpha^2})$ total space.
If the space per machine is sufficient larger than $k$, i.e., at least $k\cdot n^{\Omega(1)}$, we also develop an optimal $k$-means clustering algorithm for $O(\alpha)$-perturbation resilient instance in the MPC model using $O(\log\log n)$ rounds and $\widetilde{O}_d(n\cdot(n^{1/\alpha^2}+k))$ total space.
View details

Towards Optimal Coreset Bounds for Euclidean k-Median and k-Means

Chris Schwiegelshohn

David Saulpic

Kasper Green Larsen

54rd Annual ACM Symposium on Theory of Computing (STOC'22) (2022)

Preview abstract
The classic Euclidean $k$-Median and $k$-means problems consists of finding a set of $k$ points, such that the sum of Euclidean distances, respectively squared Euclidean distances, of every data point to its closest center is minimized. Coresets are a small, usually weighted subset of the input point set preserving the cost of any candidate solution up to a multiplicative $(1\pm \varepsilon)$ factor. Most coreset constructions focus on improving the size, i.e. the number of distinct points in the coreset. The arguably biggest open question in this line of research is to find tight bounds for Euclidean $k$-median and $k$-means. In this paper, we show the following.
\begin{itemize}
\item A coreset construction consisting of $O(k\cdot \log k \cdot \varepsilon^{-3}\cdot \log^{O(1)}\varepsilon^{-1})$ points for the k-median problem in arbitrary dimensions.
\item An unconditional lower bound showing that $\Omega(k\cdot \varepsilon^{-2})$ points are necessary for any coreset for either problem.
\item An unconditional lower bound showing that $\Omega(k \cdot \varepsilon^{-2} \cdot \log n}
points are necessary for any coreset of either problem in general metric spacesé
\end{itemize}
The two bounds for Euclidean space are tight up to a factor (epsilon^{-1} polylog (k/epsilon). In addition, our techniques improve state of the art bounds for clustering with higher powers of Euclidean distances as well.
The bound for general metrics in tight up to a factor polylog (k/epsilon).
View details

Improved Approximation Algorithms and Lower Bounds for Search-Diversification Problems

Amir Abboud

Euiwoong Lee

49th International Colloquium on Automata, Languages and Programming (ICALP) (2022)

Community Detection in the Heterogeneous Stochastic Block Model

David Saulpic

Frederik Mallmann-Trenn

35th Annual Conference on Learning Theory (COLT 2022) (2022)

Preview abstract
We consider the Heterogeneous Stochastic Block
Model (HeSBM) consisting of two unknown
planted communities C1 and C2 . Each node u
has two (unknown) parameters pu and qu with
pu > qu , describing the probability of having a
directed edges with nodes from the same com-
munity and with nodes from the other commu-
nity, respectively. This model allows for almost
arbitrary degree distributions.
We present a simple, linear time algorithm
that recovers the communities whenever
(pu −qu)/sqrt(pu) = Ω(log n/n). We also show
that this condition is necessary. Note that this
recovery threshold is the same as from the
standard Stochastic Block Model, in which all
nodes have the same pu and qu .
Our algorithm is based on local greedy improvement,
where nodes are moved into the part where
it has more edges. We show how to implement
those moves in parallel, such that each vertex
needs to be considered at most twice. We believe
that this algorithm can be easily implemented in
parallel, making it highly practical; and the theoretical
guarantee we demonstrate show a desirable robustness
to perturbations in the degree distribution
View details

Near-Optimal Correlation Clustering with Privacy

Ashkan Norouzi Fard

Chenglin Fan

Jakub Tarnawski

Nikos Parotsidis

Slobodan Mitrović

NeurIPS 2022 (2022) (to appear)

Preview abstract
Correlation clustering is a central problem in unsupervised learning, with applications spanning community detection, duplicate detection, automated labeling and many more. In the correlation clustering problem one receives as input a set of nodes and for each node a list of co-clustering preferences, and the goal is to output a clustering that minimizes the disagreement with the specified nodes' preferences. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees. Our additive error is stronger than the one shown in prior work and is optimal up to polylogarithmic factors for fixed privacy parameters.
View details

An Improved Local Search Algorithm for k-Median

Anupam Gupta

David Saulpic

Hoon Oh

Lunjia Hu

Symposium On Discrete Algorithms SODA'22 (2022)

Preview abstract
We present a new local-search algorithm for the k-median clustering problem.
We show that local optima for this algorithm give an
(2.836+eps)-approximation; our result improves upon the (3+eps)-approximate
local-search algorithm of Arya et al. (2001). Moreover, a computer-aided
analysis of a natural extension suggests that this approach may lead to a
improvement over the best-known approximation guarantee for the problem (which
is 2.67).
The new ingredient in our algorithm is the use of a potential function based
on both the closest and second-closest facilities to each client.
Specifically, the potential is the sum over all clients, of the distance of
the client to its closest facility, plus (a small constant times) the
truncated distance to its second-closest facility. We move from one solution
to another only if the latter can be obtained by swapping a constant number of
facilities, and has a smaller potential than the former. This refined
potential allows us to avoid the bad local optima given by Arya et al. for the
local-search algorithm based only on the cost of the solution.
View details

Near-Optimal Private and Scalable k-Clustering

Shyam Narayanan

NeurIPS 2022 (2022)

Preview abstract
We study the differentially private (DP) $k$-means and $k$-median clustering problems of $n$ points in $d$-dimensional Euclidean space in the massively parallel computation (MPC) model. We provide two near-optimal algorithms where the near-optimality is in three aspects: they both achieve (1). $O(1)$ parallel computation rounds, (2). near-linear in $n$ and polynomial in $k$ total computational work (i.e., near-linear running time in the sequential setting), (3). $O(1)$ relative approximation and $\text{poly}(k, d)$ additive error, where $\Omega(1)$ relative approximation is provably necessary even for any polynomial-time non-private algorithm, and $\Omega(k)$ additive error is a provable lower bound for any polynomial-time DP $k$-means/median algorithm. Our two algorithms provide a tradeoff between the relative approximation and the additive error: the first has $O(1)$ relative approximation and $\sim (k^{2.5} + k^{1.01} \sqrt{d})$ additive error, and the second one achieves $(1+\gamma)$ relative approximation to the optimal non-private algorithm for an arbitrary small constant $\gamma>0$ and with $\text{poly}(k, d)$ additive error for a larger polynomial dependence on $k$ and $d$.
To achieve our result, we develop a general framework which partitions the data and reduces the DP clustering problem for the entire dataset to the DP clustering problem for each part. To control the blow-up of the additive error introduced by each part, we develop a novel charging argument which might be of independent interest.
View details

A Massively Parallel Modularity-Maximizing Algorithm With Provable Guarantees

David Saulpic

Frederik Mallmann-Trenn

41st ACM Symposium on Principles of Distributed Computing 2022 (2022)

Preview abstract
Graph clustering is one of the most basic and pop-
ular unsupervised learning problem. Among the
different formulations of the problem, the modu-
larity objective has been particularly successful
for helping design impactful algorithms; Most
notably the Louvain algorithm has become one
of the most used algorithm for clustering graphs.
Yet, one major limitation of the Louvain algorithm
is its sequential nature which makes it impracti-
cal in distributive environments and on massive
datasets.
In this paper, we provide a parallel version of Lou-
vain which works in the massively parallel com-
putation model (MPC). We show that it achieves
optimal cluster recovery in the classic stochastic
block model in only a constant number of parallel
rounds, and so for the same regime of parameters
than the standard Louvain algorithm as shown
recently in Cohen-Addad et al. (2020).
View details

Bypassing Surface Embeddings: Approximation Schemes for Network Design in Minor-free Metrics

54rd Annual ACM Symposium on Theory of Computing (STOC'22) (2022)

Preview abstract
Since the mid 90s, the study of the complexity of classic network design
problems such as the traveling salesman problem (TSP), the Steiner tree problem (ST),
or the $k$-MST problem on metric spaces such as low-dimensional Euclidean spaces,
doubling metrics, planar or minor-free metrics,
has led to major improvements of our understanding
of the structure of both these important metric spaces, and the underlying problems.
In their celebrated work, Arora and Mitchell gave quasi-polynomial-time approximation schemes (QPTAS)
for several network design problems in Euclidean space, that have later been improved
to polynomial and (near-)linear in some cases. Arora and Mitchell's QPTAS result has been
extended by Talwar [STOC'04] to doubling metrics showing that the Euclidean embedding
is not a necessary condition to design approximation schemes for network design problems.
In the case of planar metrics, the picture is already less satisfactory: The first
approximation schemes for Steiner tree only dates back to [] and nothing better than the general
case is known for $k$-MST, an open question that has thus been repeatedly asked.
Even when approximation schemes are known for the planar case, generalizing them to the minor-free
setting is often a major challenge because most of the topological properties are lost.
In this case, one of the most important question of the field is whether
we bypass these topological arguments to obtain a general framework for network
design problems in minor-free metrics?
In this paper, we answer by the affirmative by proposing a framework for network design
problems in minor-free metrics.
We give the first approximation schemes, with quasi-polynomial running time,
for $k$-MST, Steiner tree, Steiner forest, Prize-collecting Steiner tree in minor-free metrics.
The result on $k$-MST was not known before even for the planar case, while Steiner tree on
minor-free metrics has been one of the major challenge. Here we also obtain approximation
schmes for the two non-trivial generalizations Steiner forest and Prize-collecting Steiner tree.
View details

Scalable Differentially Private Clustering via Hierarchically Separated Trees

Chris Schwiegelshohn

David Saulpic

2022 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2022) (to appear)

Preview abstract
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space.
By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods.
We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime.
Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.
View details

Improved Approximations for Euclidean k-means and k-median, via Nested Quasi-Independent Sets

Shyam Narayanan

54rd Annual ACM Symposium on Theory of Computing (STOC'22) (2022)

Preview abstract
Motivated by data analysis and machine learning applications, we consider the popular high-dimensional Euclidean $k$-median and $k$-means problems.
We propose a new primal-dual algorithm, inspired by the classic algorithm of Jain and Vazirani and the recent algorithm of Ahmadian et al.. Our algorithm achieves
an approximation ratio of respectively 2.40... and 5.95... for Euclidean $k$-median and $k$-means improving upon the
2.60... of Ahmadian et al. and the 6.12.. of Grandoni et al..
View details

A 2-Approximation for the Bounded Treewidth Sparsest Cut Problem in FPT Time

Tobias Mömke

Victor

23rd Conference on Integer Programming and Combinatorial Optimization (IPCO'22) (2022)

Preview abstract
In the non-uniform sparsest cut problem, we are given a supply graph $G$ and a demand graph $D$, both with the same vertex set $V$. The goal is to partition the set of vertices $V$ into two subsets in order to minimize the ratio of the total capacity on the supply edges crossing from one part to the other over the total demand of the crossing edges. In this work, we study the sparsest cut problem for supply graphs with bounded treewidth $k$. For this case, Gupta, Talwar, and Witmer [STOC 2013] obtained a 2-approximation with polynomial running time for fixed $k$, and it remained open the question of whether there exists a $c$-approximation algorithm for a constant $c$ independent of $k$ that runs in FPT time. We answer the above question in the affirmative. We show that there exists a 2-approximation algorithm for the non-uniform sparsest cut with bounded treewidth supply graphs that runs in FPT time. Our algorithm is based on rounding the optimal solution of a linear programming relaxation inspired by the Sherali-Adams hierarchy. In contrast to the classic Sherali-Adams construction, we construct a relaxation driven by a tree decomposition of the supply graph, and we include a carefully chosen set of lifting variables to encode information of subsets of nodes with super-constant size, while at the same time having a sufficiently small linear program that can be solved in FPT time.
View details

Improved Coresets for Euclidean k-Means

Chris Schwiegelshohn

David Saulpic

Kasper Green Larsen

Omar Ali Sheikh-Omar

Neurips'22 (2022)

Preview abstract
Given a set of n points in d dimensions, the Euclidean k-means problem consists of finding k centers such that the sum of squared distances from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between coreset cost and the cost of the original instance is less than a (1+epsilon) factor. The current state of the art coreset size for Euclidean k-means is O(k epsilon^{-2} min(k, epsilon^{-2})). This matches the lower bound of Omega(k epsilon^{-2}) up to the min(k, epsilon^{-2}) term. In this paper, we improve this bound to min(sqrt(k), epsilon^{-2}). In the regime where k = o(epsilon^{-2}), this is a strict improvement over the state of the art. In particular, ours is the first provable bound that breaks through the k^2 barrier while retaining an optimal dependency on epsilon.
View details

The Power of Uniform Sampling for Coresets

Chris Schwiegelshohn

Mads Toftrup

Robert Krauthgamer

Shaofeng Jiang

Vladimir Braverman

Xuan Wu

FOCS'22 (2022) (to appear)

Preview abstract
We develop a new coreset framework which has many advantages comparing with the well-
studied Feldman-Langberg framework [FL11] and the recent new framework by [CASS21b]. An
important feature of our framework is that it relates a uniform-version shattering dimension
of the ambient metric space with the existence of small coresets. While in [FL11], a weighted
version of such dimension is required. Reasonable upper bounds of such weighted shattering
dimension are often hard to prove or even do not exist.
Based on our new framework, we construct coresets for multiple important problems. First,
We use our framework to construct improved coresets for Barycenter. Secondly, we construct
improved coresets for 1-Median in low-dimensional Euclidean space. Specially, we construct
coresets of size Õ(ϵ^−1.5 ) in R^2 and coresets of size Õ(ϵ^−1.6 ) in R^3 .
View details

On Facility Location Problem in the Local Differential Privacy Model

Chenglin Fan

Di Wang

Marco Gaboradi

Shi Li

Yunus Esencayi

25th International Conference on Artificial Intelligence and Statistics (AISTATS 2022) (2022)

Preview abstract
We study the uncapacitated facility location (UFL) problem under the constraints imposed by the local differential privacy (LDP). Recently, Gupta et al. (2009) and Esencayi et al. (2019) proposed lower and upper bounds for the UFL problem on the central differential privacy (DP) model where a curator first collects all data before being processed. In this paper, we focus on the LDP model, where we protect a client's participation in the facility location instance. Under the HST metric, we show that there is a non-interactive $\epsilon$-LDP algorithm achieving $O(n^{1/4}/\epsilon^2)$-approximation ratio, where n is the size of the metric. On the negative side, we show a lower bound of $\Omega(n^{1/4}/\sqrt{\epsilon})$ on the approximation ratio for any non-interactive $\epsilon$-LDP algorithm. Thus, our results are tight up to a factor of $\epsilon$. Moreover, unlike previous results, our results generalize for non-uniform facility costs.
View details

On Approximability of Clustering Problems Without Candidate Centers

Euiwoong Lee

Karthik C. S.

Symposium On Discrete Algorithms SODA'21 (2021)

Preview abstract
The k-means objective is arguably the most widely-used cost function for modeling clustering tasks in a metric space. In practice and historically, k-means is thought of in a continuous setting, namely where the centers can be located anywhere in the metric space. For
example, the popular Lloyd’s heuristic locates a center at the mean of each cluster.
Despite persistent efforts on understanding the approximability of k-means, and other
classic clustering problems such as k-median and k-minsum, our knowledge of the hardness
of approximation factors of these problems remains quite poor. In this paper, we significantly improve upon the hardness of approximation factors known in the literature for
these objectives. We show that if the input lies in a general metric space, it is NP-hard to
approximate:
• Continuous k-median to a factor of 2 − o(1); this improves upon the previous inapproximability factor of 1.36 shown by Guha and Khuller (J. Algorithms ’99).
• Continuous k-means to a factor of 4 − o(1); this improves upon the previous inapproximability factor of 2.10 shown by Guha and Khuller (J. Algorithms ’99).
• k-minsum to a factor of 1.415; this improves upon the APX-hardness shown by Guruswami and Indyk (SODA ’03).
Our results shed new and perhaps counter-intuitive light on the differences between
clustering problems in the continuous setting versus the discrete setting (where the candidate centers are given as part of the input).
View details

Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor

Debarati Das

Evangelos Kipouridis

Mikkel Thorup

Nikos Parotsidis

FOCS 2021

Preview abstract
We consider the numerical taxonomy problem of fitting a positive distance function $\calD:{S\choose 2}\rightarrow \mathbb R_{>0}$ by a tree metric. We want a tree $T$ with positive edge weights and including $S$ among the vertices so that their distances in $T$ match
those in $\calD$. A nice application is in evolutionary biology
where the tree $T$ aims to approximate the branching process leading
to the observed distances in $\calD$ [Cavalli-Sforza and Edwards 1967].
We consider the total error, that is the sum of distance errors over
all pairs of points. We present a polynomial time algorithm
minimizing the total error within a constant factor. We can do this
both for general trees, and for the special case of ultrametrics with
a root having the same distance to all vertices in $S$.
The problems are known to be APX-hard, so a constant factor is the best we can hope for.
The best previous approximation factor was
$\calO((\log n)(\log \log n))$ by Ailon and Charikar [2005] who wrote ``Determining whether an
$\calO(1)$ approximation can be obtained is a fascinating question".
View details

Improving Ultrametrics Embeddings Through Coresets

Guillaume Lagarde

Rémi De Joannis de Verclos

International Conference on Machine Learning, ICML 2021 (2021)

Preview abstract
To tackle the curse of dimensionality in data analysis and
unsupervised learning, it is critical to be able to efficiently compute ``simple'' faithful representations of
the data that helps extract information, improves understanding and visualization of the structure.
When the dataset consists of $d$-dimensional vectors, simple representations
of the data may consist in trees or ultrametrics, and the goal is to best preserve the distances (i.e.: dissimilarity values)
between data elements.
To circumvent the quadratic running times of the most popular methods for fitting ultrametrics,
such as average, single, or complete linkage,~\citet{CKL20} recently presented a new algorithm
that for any $c \ge 1$, outputs in time $n^{1+O(1/c^2)}$ an ultrametric $\Delta$ such that for any two
points $u, v$, $\Delta(u, v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in
the ``best'' ultrametric representation.
We improve the above result and show how to improve the above guarantee from $5c$ to $\sqrt{2}c + \varepsilon$ while achieving
the same asymptotic running time. To further show the advantage of our new method, we experimentally analyze the performances
of our algorithm which indeed yields embeddings of significantly better quality for various real-world datasets.
View details

A Quasipolynomial (2 + ε)-Approximation for Planar Sparsest Cut

Anupam Gupta

Jason Li

Philip N. Klein

53rd Annual ACM Symposium on Theory of Computing (STOC'21) (2021)

Preview abstract
The (non-uniform) sparsest cut problem is the following classical
graph-partitioning problem: given a ``supply'' graph, and demands on
pairs of vertices, delete some subset of supply edges to minimize
the ratio of the supply edges cut to the total demand of the pairs
separated by this deletion. The sparsest cut problem has been a
driver for new algorithmic approaches. Despite much effort,
there are only a handful of nontrivial classes of supply graphs for
which constant-factor approximations are known.
We consider the problem for planar graphs, and give a
$(2+\eps)$-approximation algorithm that runs in quasipolynomial
time. Our approach defines a new structural decomposition of an
optimal solution using a ``radial cuts'' primitive. We combine this
decomposition with a Sherali-Adams-style linear programming
relaxation of the problem, which we then round. This should be
compared with the polynomial-time approximation algorithm of Rao~(1999),
which
uses the metric linear programming relaxation and $\ell_1$-embeddings,
and achieves an $O(\sqrt{\log n})$-approximation in polynomial time.
View details

A New Coreset Framework for Clustering

Chris Schwiegelshohn

David Saulpic

53rd Annual ACM Symposium on Theory of Computing (STOC'21) (2021)

Preview abstract
Given a metric space, the (k, z)-clustering problem consists of finding k centers such that the
sum of the of distances raised to the power z of every point to its closest center is minimized. This
encapsulates the famous k-median (z = 1) and k-means (z = 2) clustering problems. Designing
small-space sketches of the data that approximately preserves the cost of the solutions, also
known as coresets, has been an important research direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously improves upon
the best known bounds for a large variety of settings, ranging from Euclidean space, doubling
metric, minor-free metric, and the general metric cases: with Γ = min(ε^−2+ε^−z, k^ε−2) polylog(ε^−1), this framework constructs coreset with size O (Γ · k(d + log k)) in doubling metrics, improving upon the recent breakthrough of [Huang, Jiang, Li, Wu, FOCS’ 18], who presented a coreset with size O(k^3 d/ε^2) – hence an improvement of Ω(k^2/polylog(ε^−1)).
O(Γ · k · min(d, ε^−2 log k)) in d-dimensional Euclidean space, improving on the recent results of [Huang, Vishnoi, STOC’ 20], who presented a coreset of size O(k log k · ε^−2z · min(d, ε^−2 log k)) – hence an improvement of Ω(log k/(ε^z−1 polylog(ε^−1))).
O(Γ · k(t + log k)) for graphs with treewidth t, improving on [Baker, Braverman, Huang,
Jiang, Krauthgamer, Wu, ICML’20], who presented a coreset of size O(k^3 t/ε^2) for z = 1 only – hence an improvement of Ω(k/polylog(ε^−1)), and a generalization to general z.
O(Γ · k (log^2 k + log k/ε^4)) for shortest paths metrics of graphs excluding a fixed minor. This
improves on [Braverman, Jiang, Krauthgamer, Wu, SODA’21], who presented a coreset of
size O(k^3/ε^4) – hence an improvement of Ω(k^2/polylog(ε^−1)).
Size O(Γ · k log n) in general discrete metric spaces, improving on the results of [Feldman, Lamberg, STOC’11], who presented a coreset of size O(kε^−2z log n log k) – hence an improvement of Ω(log k/(ε^z−1polylog(ε^−1))).
A lower bound Ω( k log n/ ε) for k-Median in general metric spaces [Baker, Braverman, Huang,
Jiang, Krauthgamer, Wu, ICML’20] implies that in general metrics as well as metrics with
doubling dimension d, our bounds are optimal up to a poly log(1/ε)/ε factor. For graphs with
treewidth t, the lower bound of Ω(kt/ε) of [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML’20] shows that our bounds are optimal up to the same factor.
View details

Parallel and Efficient Hierarchical k-Median Clustering

Ashkan Norouzi Fard

Christian Sohler

Ola Svensson

NeurIPS 2021

Preview abstract
As a fundamental unsupervised learning task, hierarchical
clustering has been extensively studied in the past decade. In
particular, standard metric formulations as hierarchical $k$-center, $k$-means,
and $k$-median received a lot of attention and the problems
have been studied extensively in different computation model.
Despite all this interest, not many efficient parallel algorithms
are known for those problems. In this paper we introduce a
new parallel algorithm for Euclidean
hierarchical $k$-median problem that using machine with memory $s$ (for $s\in
\Omega(\log^2 (n+\Delta+d))$) runs in $O\left(\log_{s} nd
\right)$ rounds, where $d$ is the dimension of the data set and
$\Delta$ is a polynomial upper-bound on the ratio between the maximum
and minimum
distance of two points in the input dataset. To the best of our
knowledge, this is the first \emph{parallel} algorithm
for the hierarchical $k$-median problem with theoretical guarantees.
We further complement our theoretical results with an empirical study
of our algorithm that shows its effectiveness in practice.
View details

Approximation Schemes for Routing Problems in Minor-Free Metrics

Arnold Filtser

Hung Le

Philip N. Klein

IEEE Symposium on Foundations of Computer Science (FOCS'20) (2020)

Preview abstract
Understanding the structure of minor-free metrics, namely shortest path
metrics obtained over a weighted graph excluding a fixed minor, has been an
important research direction since the fundamental work of Robertson and
Seymour. A fundamental idea that helps both to understand the structural
properties of these metrics and lead to strong algorithmic results is to
construct a ``small-complexity'' graph that approximately preserves distances
between pairs of points of the metric. We show the two following structural
results for minor-free metrics:
_ Construction of a \emph{light} subset spanner. Given a subset of vertices
called terminals, and $\epsilon$, in polynomial time we construct a subgraph
that preserves all pairwise distances between terminals up to a multiplicative
$1+\epsilon$ factor, of total weight at most $O_{\epsilon}(1)$ times the
weight of the minimal Steiner tree spanning the terminals.
_ Construction of a stochastic metric embedding into low treewidth graphs with
expected additive distortion $\epsilon D$. Namely, given a minor free graph
$G=(V,E,w)$ of diameter $D$, and parameter $\epsilon$, we construct a
distribution $\mathcal{D}$ over dominating metric embeddings into
treewidth-$O_{\epsilon}(\log n)$ graphs such that $\forall u,v\in V$,
$\mathbb{E}_{f\sim\mathcal{D}}[d_H(f(u),f(v))]\le d_G(u,v)+\epsilon D$.
One of our important technical contributions is a novel framework that allows
us to reduce \emph{both problems} to problems on simpler graphs of
\emph{bounded diameter} that we solve using a new decomposition. Our results
have the following algorithmic consequences: (1) the first efficient
approximation scheme for subset TSP in minor-free metrics; (2) the first
approximation scheme for vehicle routing with bounded capacity in minor-free
metrics; (3) the first efficient approximation scheme for vehicle routing with
bounded capacity on bounded genus metrics.
View details

Fast and Accurate k-means++ via Rejection Sampling

Ashkan Norouzi Fard

Christian Sohler

Ola Svensson

NeurIPS 2020

Preview abstract
We present the first distributed approximation algorithm for the Euclidean $k$-median problem with an
optimal trade-off between memory usage and the number of parallel rounds. Our algorithm even works in the setting where each machine has very limited memory $s\in \Omega(\log n)$ and it is work efficient. In the future, it would be interesting to obtain similar results for other clustering problems and to improve the approximation factor of our algorithm.
View details

No Results Found