Vincent Pierre Cohen-Addad

Authored Publications
    In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedures or coreset constructions that avoid an exponential dependency on the input dimension $d$, the precision parameter $\varepsilon^{-1}$, or $k$. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points $n$. This has led researchers in the area to ask what the power of randomness for clustering sketches is [Feldman, WIREs Data Mining Knowl. Discov. '20]. Similarly, the best approximation ratios achievable deterministically without a complexity exponential in the dimension are $2.633$ for $k$-median [Ahmadian, Norouzi-Fard, Svensson, Ward, SIAM J. Comput. '20] and $9+\varepsilon$ for $k$-means [Kanungo et al., Comput. Geom. '04]. These are the best known results even when allowing a complexity FPT in the number of clusters $k$: this stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case when allowing randomization. In this paper, we provide deterministic sketch constructions for clustering whose size bounds are close to the best known randomized ones. We show how to compute a dimension reduction onto $\varepsilon^{-O(1)} \log k$ dimensions in time $k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$, and how to build a coreset of size $O\left( k^2 \log^3 k \cdot \varepsilon^{-O(1)}\right)$ in time $2^{\varepsilon^{O(1)} k\log^3 k} + k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$. In the case where $k$ is small, this answers an open question of [Feldman, WIDM '20] and [Munteanu and Schwiegelshohn, Künstliche Intell. '18] on whether it is possible to efficiently compute coresets deterministically. We also construct a deterministic algorithm for computing a $(1+\varepsilon)$-approximation to $k$-median and $k$-means in high-dimensional Euclidean spaces in time $2^{k^2 \log^3 k/\varepsilon^{O(1)}} \text{poly}(nd)$, close to the best randomized complexity of $2^{(k/\varepsilon)^{O(1)}} nd$ (see [Kumar, Sabharwal, Sen, J. ACM '10] and [Bhattacharya, Jaiswal, Kumar, TCS '18]). Furthermore, our new insights on sketches also yield a randomized coreset construction based on uniform sampling that immediately improves over the recent results of [Braverman et al., FOCS '22] by a factor of $k$.
    Graph Searching with Predictions
    Anupam Gupta
    Siddhartha Banerjee
    Zhouzi Li
    ITCS'23 (2023) (to appear)
    Consider an agent (say, a robot) exploring an unknown graph in search of some goal state. As it walks around the graph, it learns the nodes and their neighbors. The agent only knows where the goal state is when it reaches it. How do we reach this goal while moving only a small distance? This problem seems hopeless, even on trees of bounded degree, unless we give the agent some help. In our case, this help comes in the form of *predictions*: each node $v$ provides a prediction $f(v)$ of its distance to the goal vertex. Naturally, if the predictions are correct, we can reach the goal quickly. What if some of the predictions are erroneous? This naturally models graph search problems where we are given some quality function to guide us towards the goal: in our framework the quality is given by the distance predictions. In this work, we initiate a study of this problem with predictions. We consider the problem on trees (where it is already non-trivial) and give deterministic algorithms whose movement cost is only $O(OPT + ERR)$, where $OPT$ is the distance from the start to the goal vertex and $ERR$ is the total number of vertices whose predictions are erroneous. We also consider a ``planning'' version of the problem where the graph and predictions are known at the beginning, so the agent can use this global information to quickly reach the goal. For the planning version, we give a different algorithm for trees, and then an algorithm for all (weighted) graphs with bounded doubling dimension.
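    To make the setting concrete, here is a minimal Python sketch of the natural greedy baseline, not the paper's $O(OPT + ERR)$ algorithm: the agent repeatedly walks to the unexplored frontier vertex with the smallest predicted distance. The toy tree, the prediction map, and all names are hypothetical.
    ```python
    # Greedy prediction-guided search: a baseline sketch of the setting above,
    # NOT the paper's O(OPT + ERR) algorithm. Graph and names are hypothetical.
    from collections import deque

    def tree_dist(adj, u, v):
        """BFS distance between u and v (unit edge lengths)."""
        queue, seen = deque([(u, 0)]), {u}
        while queue:
            x, d = queue.popleft()
            if x == v:
                return d
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append((y, d + 1))
        raise ValueError("disconnected")

    def greedy_search(adj, f, start, goal):
        """Walk to the frontier vertex with smallest prediction f(v);
        returns (total movement cost, visit order)."""
        current, visited, cost, order = start, {start}, 0, [start]
        while current != goal:
            frontier = {y for x in visited for y in adj[x]} - visited
            target = min(frontier, key=lambda v: f[v])  # trust the predictions
            cost += tree_dist(adj, current, target)     # walk there
            visited.add(target)
            current = target
            order.append(target)
        return cost, order

    # Toy tree 0-1, 1-2, 1-3, 3-4 with goal 4; the prediction f(2) is erroneous.
    adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
    f = {0: 3, 1: 2, 2: 0, 3: 1, 4: 0}
    print(greedy_search(adj, f, 0, 4))  # (5, [0, 1, 2, 3, 4]): the error costs extra walking
    ```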
    We study the fine-grained complexity of the famous $k$-center problem in the metric induced by a graph with $n$ vertices and $m$ edges. The problem is NP-hard to approximate within a factor strictly better than $2$, and several $2$-approximation algorithms are known. Two of the most well-known approaches for the $2$-approximation are (1) finding a maximal distance-$r$ independent set (where the minimum pairwise distance is greater than $r$) and (2) Gonzalez's algorithm, which iteratively adds the center farthest from the currently chosen centers. For the approach based on distance-$r$ independent sets, Thorup [SIAM J. Comput. '05] already gave a nearly linear time algorithm. While Thorup's algorithm is not complicated, it still requires tools such as an approximate oracle for neighborhood size by Cohen [J. Comput. Syst. Sci. '97]. Our main result is a nearly straightforward algorithm that improves the running time by an $O(\log n)$ factor. It yields a $(2+\eps)$-approximation for $k$-center in $O((m + n \log n)\log n \log(n/\eps))$ time. For Gonzalez's algorithm [Theor. Comput. Sci. '85], we show that the simple $\widetilde{O}(mk)$-time implementation is nearly optimal if we insist on an {\em exact} implementation. On the other hand, we show that a $(1+\eps)$-approximate version of the algorithm is efficiently implementable, leading to a $(2+\eps)$-approximation algorithm with running time $O((m + n \log n)\log^2 n / \eps)$. We also show that, unlike in the distance-$r$ independent set based algorithm, the dependency on $1/\eps$ in the running time is essentially optimal for $(1 + \eps)$-approximate Gonzalez's algorithm.
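    For reference, a short sketch of Gonzalez's farthest-point heuristic, here on hypothetical Euclidean points rather than the graph metric studied above; only the textbook exact version is shown, not the approximate implementation analyzed in the paper.
    ```python
    # Gonzalez's farthest-point heuristic for k-center: a sketch on toy
    # Euclidean points (hypothetical data), giving a 2-approximation.
    import math

    def gonzalez(points, k):
        """Iteratively pick the point farthest from the centers chosen so far."""
        dist = lambda p, q: math.dist(p, q)
        centers = [points[0]]                       # arbitrary first center
        d = [dist(p, centers[0]) for p in points]   # distance to nearest center
        while len(centers) < k:
            i = max(range(len(points)), key=lambda j: d[j])  # farthest point
            centers.append(points[i])
            d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
        return centers

    points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
    print(gonzalez(points, 3))  # e.g. [(0, 0), (10, 1), (5, 8)]
    ```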
    Streaming Euclidean MST to a Constant Factor
    Amit Levi
    Erik Waingarten
    Xi Chen
    55th Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
    We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \R^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+\epsilon)$ approximations are possible in sublinear space. However, for high-dimensional spaces the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC'22], improving on the prior $O(\log^2 n)$ bound due to [Andoni, Indyk, Krauthgamer, SODA'08]. In this paper, we break the logarithmic barrier and give the first constant factor sublinear space approximation to Euclidean MST. For any $\epsilon\geq 1$, our algorithm achieves an $\tilde{O}(\epsilon^{-2})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space. We complement this by demonstrating that any single pass algorithm which obtains a better than $1.10$-approximation must use $\Omega(\sqrt{n})$ space, showing that $(1+\epsilon)$ approximations are not possible in high dimensions and that our algorithm is tight up to a constant. Nevertheless, we demonstrate that $(1+\epsilon)$ approximations are possible in sublinear space with $O(1/\epsilon)$ passes over the stream. More generally, for any $\alpha \geq 2$, we give an $\alpha$-pass streaming algorithm which achieves an $O(\frac{1}{ \alpha \epsilon})$ approximation in $n^{O(\epsilon)} d^{O(1)}$ space. All our streaming algorithms are linear sketches and therefore extend to the massively parallel computation (MPC) model. Thus, our results imply the first $(1+\epsilon)$-approximation in a constant number of rounds in the MPC model. Previously, such a result was only known for low-dimensional space [Andoni, Nikolov, Onak, Yaroslavtsev, STOC'15], or required $O(\log n)$ rounds or suffered an $O(\log n)$ approximation.
    We consider the classic Euclidean $k$-median and $k$-means objectives on insertion-only streams, where the goal is to maintain a $(1+\varepsilon)$-approximation to the $k$-median or $k$-means, while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, merge-and-reduce, bicriteria approximation, sensitivity sampling, etc. Despite this intense effort to obtain smaller and smaller sketches for these problems, all known techniques require storing at least $\Omega(\log(n\Delta))$ words of memory, where $n$ is the size of the input and $\Delta$ is the aspect ratio. In this paper, we show how to finally bypass this bound and obtain the first algorithm that achieves a $(1+\varepsilon)$-approximation to the more general $(k, z)$-clustering problem, using only $\tilde{O}(dk/\varepsilon^2) \cdot 2^{z\log z} \cdot \min(1/\varepsilon^z, k) \cdot \mathrm{poly}(\log\log(n\Delta))$ words of memory.
    We prove that there is a randomized polynomial-time algorithm that, given an edge-weighted graph $G$ excluding a fixed minor $Q$ on $n$ vertices and an accuracy parameter $\varepsilon > 0$, constructs an edge-weighted graph $H$ and an embedding $\eta : V(G) \to V(H)$ with the following properties:
    • For any constant size $Q$, the treewidth of $H$ is polynomial in $\varepsilon^{-1}$, $\log n$, and the logarithm of the stretch of the distance metric in $G$.
    • The expected multiplicative distortion is $(1 + \varepsilon)$: for every pair of vertices $u, v$ of $G$, we have $\mathrm{dist}_H(\eta(u), \eta(v)) \geq \mathrm{dist}_G(u, v)$ always, and $\mathbb{E}[\mathrm{dist}_H(\eta(u), \eta(v))] \leq (1 + \varepsilon)\,\mathrm{dist}_G(u, v)$.
    Our embedding is the first to achieve polylogarithmic treewidth of the host graph and comes close to the lower bound by Carroll and Goel, who showed that any embedding of a planar graph with $O(1)$ expected distortion requires the host graph to have treewidth $\Omega(\log n)$. It also provides a unified framework for obtaining randomized quasi-polynomial-time approximation schemes for a variety of problems, including network design, clustering, or routing problems, in minor-free metrics where the optimization goal is the sum of selected distances. Applications include the capacitated vehicle routing problem and capacitated clustering problems.
    Private estimation algorithms for stochastic block models and mixture models
    Hongjie Chen
    Tommaso D'Orsi
    Jacob Imola
    David Steurer
    Stefan Tiegel
    55th Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
    We introduce general tools for designing efficient private estimation algorithms in high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(\epsilon,\delta)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(\epsilon,\delta)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $O(k^{1/t}/\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.
    The $k$-Median problem is one of the most fundamental clustering problems: given $n$ points in a metric space and an integer parameter $k$, our goal is to select precisely $k$ points (called centers) so as to minimize the sum of the distances from each point to its closest center. Obtaining better and better approximation algorithms for $k$-Median is a central open problem. Over the years, two main approaches have emerged: the first one is the standard Local Search heuristic, yielding a $(3+\eps)$ approximation for any constant $\eps>0$, which is known to be tight. The second approach is based on a Lagrangian Multiplier Preserving (LMP) $\alpha$ approximation algorithm for the related facility location problem. This algorithm is used to build an $\alpha$ approximate fractional bipoint solution for $k$-Median, which is then rounded via a $\rho$ approximate bipoint rounding algorithm. Altogether this gives an $\alpha\cdot \rho$ approximation. A lot of progress was made on improving $\rho$, from the initial $2$ by Jain and Vazirani [FOCS'99, J.ACM'01], to $1.367$ by Li and Svensson [STOC'13, SICOMP'16], and finally to $1.338$ by Byrka et al. [SODA'15, TALG'17]. However, for almost 20 years no progress was made on $\alpha$, where the current best result is the classical LMP $2$ approximation algorithm JMS for facility location by Jain et al. [STOC'02, J.ACM'03], based on the Dual-Fitting technique.
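    To see why progress on $\alpha$ is the bottleneck, note that the framework above simply multiplies the two guarantees: combining the best known LMP factor $\alpha = 2$ (JMS) with the best known bipoint rounding factor $\rho = 1.338$ (Byrka et al.) gives an approximation of $\alpha \cdot \rho = 2 \times 1.338 = 2.676$ for $k$-Median, so any improvement on $\alpha$ translates directly into a better overall ratio.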
    We consider the classic correlation clustering problem: given a complete graph where edges are labelled either + or −, the goal is to find a partition of the vertices that minimizes the number of + edges across parts plus the number of − edges within parts. Recently, Cohen-Addad, Lee and Newman [FOCS'22] showed the first 1.995-approximation for the problem using the Sherali-Adams hierarchy, hence breaking through the integrality gap of 2 of the classic linear program and improving upon the 2.06-approximation of Chawla, Makarychev, Schramm and Yaroslavtsev [STOC'15]. We significantly improve this result by providing a 1.73-approximation for the problem, making a significant step below 2. Our approach brings together a new preprocessing of correlation clustering instances that enables a new LP formulation, which, combined with the approach of Cohen-Addad, Lee and Newman [FOCS'22] and a new rounding method, yields the improved bound.
    Near-Optimal Correlation Clustering with Privacy
    Ashkan Norouzi Fard
    Chenglin Fan
    Jakub Tarnawski
    Slobodan Mitrović
    NeurIPS 2022 (2022) (to appear)
    Correlation clustering is a central problem in unsupervised learning, with applications spanning community detection, duplicate detection, automated labeling, and many more. In the correlation clustering problem, one receives as input a set of nodes and, for each node, a list of co-clustering preferences, and the goal is to output a clustering that minimizes the disagreement with the specified nodes' preferences. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees. Our additive error bound improves on the one shown in prior work and is optimal up to polylogarithmic factors for fixed privacy parameters.
    Towards Optimal Coreset Bounds for Euclidean k-Median and k-Means
    Chris Schwiegelshohn
    David Saulpic
    Kasper Green Larsen
    54th Annual ACM Symposium on Theory of Computing (STOC'22) (2022)
    The classic Euclidean $k$-median and $k$-means problems consist of finding a set of $k$ points, such that the sum of Euclidean distances, respectively squared Euclidean distances, of every data point to its closest center is minimized. Coresets are a small, usually weighted, subset of the input point set preserving the cost of any candidate solution up to a multiplicative $(1\pm \varepsilon)$ factor. Most coreset constructions focus on improving the size, i.e., the number of distinct points in the coreset. The arguably biggest open question in this line of research is to find tight bounds for Euclidean $k$-median and $k$-means. In this paper, we show the following.
    \begin{itemize}
    \item A coreset construction consisting of $O(k\cdot \log k \cdot \varepsilon^{-3}\cdot \log^{O(1)}\varepsilon^{-1})$ points for the $k$-median problem in arbitrary dimensions.
    \item An unconditional lower bound showing that $\Omega(k\cdot \varepsilon^{-2})$ points are necessary for any coreset for either problem.
    \item An unconditional lower bound showing that $\Omega(k \cdot \varepsilon^{-2} \cdot \log n)$ points are necessary for any coreset of either problem in general metric spaces.
    \end{itemize}
    The two bounds for Euclidean space are tight up to a factor $\varepsilon^{-1}\,\mathrm{polylog}(k/\varepsilon)$. In addition, our techniques improve state-of-the-art bounds for clustering with higher powers of Euclidean distances as well. The bound for general metrics is tight up to a factor $\mathrm{polylog}(k/\varepsilon)$.
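    To make the coreset guarantee concrete, here is a small sketch (with hypothetical data and names) of what it means for a weighted subset to preserve the cost of a candidate solution up to a $(1\pm \varepsilon)$ factor:
    ```python
    # Sketch: the coreset guarantee, stated operationally. For a candidate set
    # of centers, the weighted coreset cost must be within (1 ± eps) of the
    # full-data cost. Data, weights, and eps here are hypothetical.
    import math

    def cost(points, centers, weights=None, z=2):
        """Sum of (weighted) z-th power distances to the closest center
        (z = 2 is k-means, z = 1 is k-median)."""
        if weights is None:
            weights = [1.0] * len(points)
        return sum(w * min(math.dist(p, c) for c in centers) ** z
                   for p, w in zip(points, weights))

    def preserves_cost(points, coreset, weights, centers, eps, z=2):
        full = cost(points, centers, z=z)
        small = cost(coreset, centers, weights, z)
        return (1 - eps) * full <= small <= (1 + eps) * full
    ```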
    The Power of Uniform Sampling for Coresets
    Chris Schwiegelshohn
    Mads Toftrup
    Robert Krauthgamer
    Shaofeng Jiang
    Vladimir Braverman
    Xuan Wu
    FOCS'22 (2022) (to appear)
    We develop a new coreset framework which has many advantages compared with the well-studied Feldman-Langberg framework [FL11] and the recent framework of [CASS21b]. An important feature of our framework is that it relates a uniform version of the shattering dimension of the ambient metric space to the existence of small coresets, whereas [FL11] requires a weighted version of this dimension. Reasonable upper bounds on such weighted shattering dimensions are often hard to prove or may not even exist. Based on our new framework, we construct coresets for multiple important problems. First, we use our framework to construct improved coresets for Barycenter. Second, we construct improved coresets for 1-Median in low-dimensional Euclidean space. Specifically, we construct coresets of size $\tilde{O}(\epsilon^{-1.5})$ in $\mathbb{R}^2$ and coresets of size $\tilde{O}(\epsilon^{-1.6})$ in $\mathbb{R}^3$.
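    As a concrete illustration of the uniform-sampling principle behind the framework (a sketch, not the paper's construction: whether a small uniform sample suffices depends on the shattering-dimension conditions above), one samples $m$ points uniformly and reweights each by $n/m$ so that weighted costs are unbiased estimates of true costs:
    ```python
    # A minimal uniform-sampling coreset sketch: sizes and data hypothetical.
    import random

    def uniform_coreset(points, m, seed=0):
        rng = random.Random(seed)
        sample = rng.choices(points, k=m)   # sample uniformly with replacement
        weight = len(points) / m            # reweight to keep costs unbiased
        return sample, [weight] * m
    ```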
    Motivated by data analysis and machine learning applications, we consider the popular high-dimensional Euclidean $k$-median and $k$-means problems. We propose a new primal-dual algorithm, inspired by the classic algorithm of Jain and Vazirani and the recent algorithm of Ahmadian et al. Our algorithm achieves approximation ratios of 2.40... and 5.95... for Euclidean $k$-median and $k$-means respectively, improving upon the 2.60... of Ahmadian et al. and the 6.12... of Grandoni et al.
    A 2-Approximation for the Bounded Treewidth Sparsest Cut Problem in FPT Time
    Tobias Mömke
    Victor
    23rd Conference on Integer Programming and Combinatorial Optimization (IPCO'22) (2022)
    In the non-uniform sparsest cut problem, we are given a supply graph $G$ and a demand graph $D$, both with the same vertex set $V$. The goal is to partition the set of vertices $V$ into two subsets so as to minimize the ratio of the total capacity of the supply edges crossing from one part to the other to the total demand of the crossing demand edges. In this work, we study the sparsest cut problem for supply graphs with bounded treewidth $k$. For this case, Gupta, Talwar, and Witmer [STOC 2013] obtained a 2-approximation with polynomial running time for fixed $k$, and it remained an open question whether there exists a $c$-approximation algorithm for a constant $c$ independent of $k$ that runs in FPT time. We answer this question in the affirmative. We show that there exists a 2-approximation algorithm for the non-uniform sparsest cut with bounded-treewidth supply graphs that runs in FPT time. Our algorithm is based on rounding the optimal solution of a linear programming relaxation inspired by the Sherali-Adams hierarchy. In contrast to the classic Sherali-Adams construction, we construct a relaxation driven by a tree decomposition of the supply graph, and we include a carefully chosen set of lifting variables to encode information about subsets of nodes of super-constant size, while at the same time keeping the linear program small enough to be solved in FPT time.
    Improved Coresets for Euclidean k-Means
    Chris Schwiegelshohn
    David Saulpic
    Kasper Green Larsen
    Omar Ali Sheikh-Omar
    NeurIPS'22 (2022)
    Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem consists of finding $k$ centers such that the sum of squared distances from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big-data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that, for any candidate solution, the ratio between the coreset cost and the cost of the original instance is within a $(1+\varepsilon)$ factor. The current state-of-the-art coreset size for Euclidean $k$-means is $O(k \varepsilon^{-2} \cdot \min(k, \varepsilon^{-2}))$. This matches the lower bound of $\Omega(k\varepsilon^{-2})$ up to the $\min(k, \varepsilon^{-2})$ term. In this paper, we improve this term to $\min(\sqrt{k}, \varepsilon^{-2})$. In the regime where $k = o(\varepsilon^{-2})$, this is a strict improvement over the state of the art. In particular, ours is the first provable bound that breaks through the $k^2$ barrier while retaining an optimal dependency on $\varepsilon$.
    Fitting Metrics and Ultrametrics with Minimum Disagreements
    Arnaud de Mesmay
    Chenglin Fan
    Euiwoong Lee
    FOCS'22 (2022) (to appear)
    Given an $n \times n$ matrix $x$ recording pairwise distances, the Metric Violation Distance problem asks to compute the $L_0$ distance between $x$ and the metric cone; i.e., to modify the minimum number of entries of $x$ to make it a metric. Due to its large number of applications in various data analysis and optimization tasks, this problem has been actively studied recently. We present an $O(\log n)$-approximation algorithm for Metric Violation Distance, exponentially improving the previous best approximation ratio of $O(OPT^{1/3})$ of Fan, Raichel, and Van Buskirk [SODA, 2018]. Furthermore, a major strength of our algorithm is its simplicity and running time. We also study the related problem of Ultrametric Violation Distance, where the goal is to compute the $L_0$ distance to the cone of ultrametrics, and achieve a constant factor approximation algorithm. The Ultrametric Violation Distance problem can be regarded as an extension of the problem of fitting ultrametrics, studied by Ailon and Charikar [SIAM J. Computing, 2011] and by Cohen-Addad, Das, Kipouridis, Parotsidis, and Thorup [FOCS, 2021], from the $L_1$ norm to the $L_0$ norm. We show that this problem can be favorably interpreted as an instance of Correlation Clustering with an additional hierarchical structure, which we solve using a new $O(1)$-approximation algorithm for correlation clustering that has the structural property that it outputs a refinement of the optimum clusters. An algorithm satisfying such a property can be considered of independent interest. We also provide an $O(\log n \log \log n)$ approximation algorithm for weighted instances. Finally, we investigate the complementary version of these problems, where one aims at choosing a maximum number of entries of $x$ forming an (ultra-)metric. In stark contrast with the minimization versions, we prove that these maximization versions are hard to approximate within any constant factor assuming the Unique Games Conjecture.
    Since the mid 90s, the study of the complexity of classic network design problems such as the traveling salesman problem (TSP), the Steiner tree problem (ST), or the $k$-MST problem on metric spaces such as low-dimensional Euclidean spaces, doubling metrics, and planar or minor-free metrics has led to major improvements of our understanding of the structure of both these important metric spaces and the underlying problems. In their celebrated work, Arora and Mitchell gave quasi-polynomial-time approximation schemes (QPTAS) for several network design problems in Euclidean space, which were later improved to polynomial and (near-)linear time in some cases. Arora and Mitchell's QPTAS result has been extended by Talwar [STOC'04] to doubling metrics, showing that a Euclidean embedding is not a necessary condition for designing approximation schemes for network design problems. In the case of planar metrics, the picture is less satisfactory: the first approximation scheme for Steiner tree dates back only to [], and nothing better than the general case is known for $k$-MST, an open question that has thus been repeatedly asked. Even when approximation schemes are known for the planar case, generalizing them to the minor-free setting is often a major challenge because most of the topological properties are lost. Hence, one of the most important questions of the field is whether we can bypass these topological arguments to obtain a general framework for network design problems in minor-free metrics. In this paper, we answer in the affirmative by proposing such a framework. We give the first approximation schemes, with quasi-polynomial running time, for $k$-MST, Steiner tree, Steiner forest, and Prize-collecting Steiner tree in minor-free metrics. The result on $k$-MST was not known before even for the planar case, while Steiner tree on minor-free metrics has been one of the major challenges. Here we also obtain approximation schemes for the two non-trivial generalizations, Steiner forest and Prize-collecting Steiner tree.
    Improved Approximation Algorithms and Lower Bounds for Search-Diversification Problems
    Amir Abboud
    Euiwoong Lee
    49th International Colloquium on Automata, Languages and Programming (ICALP) (2022)
    We study several questions related to diversifying search results. We give improved approximation algorithms for each of these problems, together with some lower bounds. \begin{enumerate} \item We give a polynomial-time approximation scheme (PTAS) for a diversified search ranking problem~\cite{BansalJKN10} whose objective is to minimize the discounted cumulative gain. Our PTAS runs in time $n^{2^{O(\log(1/\eps)/\eps)}} \cdot m^{O(1)}$, where $n$ denotes the number of elements in the database and $m$ denotes the number of constraints. Complementing this result, we show that no PTAS can run in time $f(\eps) \cdot (nm)^{2^{o(1/\eps)}}$ assuming Gap-ETH; therefore our running time is nearly tight. Both our upper and lower bounds answer open questions from~\cite{BansalJKN10}. \item We next consider the Max-Sum Dispersion problem, whose objective is to select $k$ out of $n$ elements from a database that maximize the dispersion, defined as the sum of the pairwise distances under a given metric. We give a quasipolynomial-time approximation scheme (QPTAS) for the problem which runs in time $n^{O_{\eps}(\log n)}$. This improves upon previously known polynomial-time algorithms with approximation ratio $0.5$~\cite{HassinRT97,BorodinJLY17}. Furthermore, we observe that reductions from previous work rule out approximation schemes that run in $n^{\tilde{o}_\eps(\log n)}$ time assuming ETH. \item Finally, we consider a generalization of Max-Sum Dispersion called Max-Sum Diversification. In addition to the sum of pairwise distances, the objective also includes another function $f$. For monotone submodular $f$, we give a quasipolynomial-time algorithm with approximation ratio arbitrarily close to $(1 - 1/e)$. This improves upon the best polynomial-time algorithm, which has approximation ratio $0.5$~\cite{BorodinJLY17}. Furthermore, the $(1 - 1/e)$ factor is tight, as achieving a better-than-$(1 - 1/e)$ approximation is NP-hard~\cite{Feige98}. \end{enumerate}
    An Improved Local Search Algorithm for k-Median
    Anupam Gupta
    David Saulpic
    Hoon Oh
    Lunjia Hu
    Symposium On Discrete Algorithms SODA'22 (2022)
    We present a new local-search algorithm for the $k$-median clustering problem. We show that local optima for this algorithm give a $(2.836+\eps)$-approximation; our result improves upon the $(3+\eps)$-approximate local-search algorithm of Arya et al. (2001). Moreover, a computer-aided analysis of a natural extension suggests that this approach may lead to an improvement over the best-known approximation guarantee for the problem (which is 2.67). The new ingredient in our algorithm is the use of a potential function based on both the closest and second-closest facilities to each client. Specifically, the potential is the sum, over all clients, of the distance of the client to its closest facility, plus (a small constant times) the truncated distance to its second-closest facility. We move from one solution to another only if the latter can be obtained by swapping a constant number of facilities and has a smaller potential than the former. This refined potential allows us to avoid the bad local optima given by Arya et al. for the local-search algorithm based only on the cost of the solution.
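    For context, here is a sketch of the classic single-swap local search that the potential-based rule above refines; this is the Arya et al.-style baseline rather than the new algorithm, and the data and candidate set are hypothetical.
    ```python
    # Classic single-swap local search for k-median: a baseline sketch,
    # NOT the paper's potential-based variant. Toy data is hypothetical.
    import math

    def kmedian_cost(points, centers):
        return sum(min(math.dist(p, c) for c in centers) for p in points)

    def local_search(points, candidates, k, eps=1e-9):
        centers = list(candidates[:k])            # arbitrary initial solution
        improved = True
        while improved:
            improved = False
            best = kmedian_cost(points, centers)
            for i in range(k):                    # try every single swap
                for new in candidates:
                    if new in centers:
                        continue
                    trial = centers[:i] + [new] + centers[i + 1:]
                    c = kmedian_cost(points, trial)
                    if c < best - eps:            # accept strictly improving swap
                        centers, best, improved = trial, c, True
        return centers
    ```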
    A Massively Parallel Modularity-Maximizing Algorithm With Provable Guarantees
    David Saulpic
    Frederik Mallmann-Trenn
    41st ACM Symposium on Principles of Distributed Computing 2022 (2022)
    Graph clustering is one of the most basic and popular unsupervised learning problems. Among the different formulations of the problem, the modularity objective has been particularly successful in helping design impactful algorithms; most notably, the Louvain algorithm has become one of the most used algorithms for clustering graphs. Yet, one major limitation of the Louvain algorithm is its sequential nature, which makes it impractical in distributed environments and on massive datasets. In this paper, we provide a parallel version of Louvain which works in the massively parallel computation model (MPC). We show that it achieves optimal cluster recovery in the classic stochastic block model in only a constant number of parallel rounds, and this for the same regime of parameters as the standard Louvain algorithm, as shown recently in Cohen-Addad et al. (2020).
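    For reference, a small sketch of the modularity objective that Louvain (and this parallel variant) maximizes: for each community, the fraction of edges inside it minus the expected such fraction under a random graph with the same degree sequence. The toy graph and partition are hypothetical.
    ```python
    # Modularity of a partition of an undirected graph: a minimal sketch.
    def modularity(adj, community):
        """adj: undirected adjacency dict; community: node -> community id."""
        m = sum(len(nbrs) for nbrs in adj.values()) / 2   # number of edges
        inside, degree = {}, {}
        for u, nbrs in adj.items():
            degree[community[u]] = degree.get(community[u], 0) + len(nbrs)
            for v in nbrs:
                if community[u] == community[v]:
                    inside[community[u]] = inside.get(community[u], 0) + 1
        # each intra-community edge is counted twice in the loop above
        return sum(inside.get(c, 0) / (2 * m) - (deg / (2 * m)) ** 2
                   for c, deg in degree.items())

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    print(modularity(adj, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}))  # ≈ 0.357
    ```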
    Correlation Clustering with Sherali-Adams
    Alantha Newman
    Euiwoong Lee
    FOCS'22 (2022) (to appear)
    Given a complete graph $G = (V, E)$ where each edge is labeled + or −, the Correlation Clustering problem asks to partition $V$ into clusters so as to minimize the number of + edges between different clusters plus the number of − edges within the same cluster. Correlation Clustering has been used to model a large number of clustering problems in practice, making it one of the most widely studied clustering formulations. The approximability of Correlation Clustering has been actively investigated [BBC04, CGW05, ACN08], culminating in a 2.06-approximation algorithm by Chawla, Makarychev, Schramm, and Yaroslavtsev [CMSY15]. Their state-of-the-art result is obtained by designing a clever rounding scheme for the standard LP relaxation. Yet, the integrality gap of this formulation is 2, and so it has remained a major open question whether one can even breach this bound of 2 for Correlation Clustering. In this paper, we answer the question affirmatively by showing that there exists a $(1.95+\varepsilon)$-approximation algorithm based on $O(1/\varepsilon^2)$ rounds of the Sherali-Adams hierarchy. In order to round a solution to Sherali-Adams, we adapt the correlated rounding originally developed for CSPs [BRS11, GS11, RT12]. Beyond the traditional triangle-based analysis, we also employ a global charging scheme that amortizes the total cost of the rounding across different triangles. We believe our results could pave the way for applications of LP hierarchies beyond CSPs.
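    To fix notation, here is a minimal sketch of the objective being approximated: given the + edges of the complete signed graph (every other pair is a − edge), count the + edges across clusters plus the − edges inside clusters. The toy instance is hypothetical.
    ```python
    # Correlation Clustering disagreement count on a complete signed graph.
    from itertools import combinations

    def disagreements(nodes, plus_edges, cluster):
        """plus_edges: set of frozensets; every other pair is a '−' edge.
        cluster: node -> cluster id. Returns the number of disagreements."""
        cost = 0
        for u, v in combinations(nodes, 2):
            positive = frozenset((u, v)) in plus_edges
            same = cluster[u] == cluster[v]
            cost += (positive and not same) or (not positive and same)
        return cost

    nodes = [0, 1, 2, 3]
    plus = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (2, 3)]}
    print(disagreements(nodes, plus, {0: 0, 1: 0, 2: 0, 3: 1}))  # 1: the + edge {2,3} is cut
    ```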
    Community Detection in the Heterogeneous Stochastic Block Model
    David Saulpic
    Frederik Mallmann-Trenn
    35th Annual Conference on Learning Theory (COLT 2022) (2022)
    We consider the Heterogeneous Stochastic Block Model (HeSBM) consisting of two unknown planted communities $C_1$ and $C_2$. Each node $u$ has two (unknown) parameters $p_u$ and $q_u$ with $p_u > q_u$, describing the probability of having a directed edge to nodes from the same community and to nodes from the other community, respectively. This model allows for almost arbitrary degree distributions. We present a simple, linear time algorithm that recovers the communities whenever $(p_u - q_u)/\sqrt{p_u} = \Omega(\log n/n)$. We also show that this condition is necessary. Note that this recovery threshold is the same as for the standard Stochastic Block Model, in which all nodes have the same $p_u$ and $q_u$. Our algorithm is based on local greedy improvement, where each node is moved into the part in which it has more edges. We show how to implement these moves in parallel, such that each vertex needs to be considered at most twice. We believe that this algorithm can be easily implemented in parallel, making it highly practical, and the theoretical guarantees we demonstrate show a desirable robustness to perturbations in the degree distribution.
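    A minimal sequential sketch of the local greedy improvement described above; the paper's parallel scheduling, which considers each vertex at most twice, is not reproduced here, and the toy graph and starting partition are hypothetical.
    ```python
    # Local greedy improvement for two communities: move any node to the side
    # where it has more neighbors, until no node wants to move.
    def greedy_two_communities(adj, side):
        """adj: adjacency dict; side: node -> 0 or 1 (initial guess)."""
        changed = True
        while changed:
            changed = False
            for u, nbrs in adj.items():
                same = sum(1 for v in nbrs if side[v] == side[u])
                if len(nbrs) - same > same:    # more edges across: move u
                    side[u] = 1 - side[u]
                    changed = True
        return side

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    print(greedy_two_communities(adj, {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}))
    # recovers {0, 1, 2} vs {3, 4, 5}
    ```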
    On Facility Location Problem in the Local Differential Privacy Model
    Chenglin Fan
    Di Wang
    Marco Gaboradi
    Shi Li
    Yunus Esencayi
    25th International Conference on Artificial Intelligence and Statistics (AISTATS 2022) (2022)
    We study the uncapacitated facility location (UFL) problem under the constraints imposed by local differential privacy (LDP). Recently, Gupta et al. (2009) and Esencayi et al. (2019) proposed lower and upper bounds for the UFL problem in the central differential privacy (DP) model, where a curator first collects all data before processing it. In this paper, we focus on the LDP model, where we protect a client's participation in the facility location instance. Under the HST metric, we show that there is a non-interactive $\epsilon$-LDP algorithm achieving an $O(n^{1/4}/\epsilon^2)$-approximation ratio, where $n$ is the size of the metric. On the negative side, we show a lower bound of $\Omega(n^{1/4}/\sqrt{\epsilon})$ on the approximation ratio of any non-interactive $\epsilon$-LDP algorithm. Thus, our results are tight up to a factor of $\epsilon$. Moreover, unlike previous results, our results generalize to non-uniform facility costs.
    We study the private $k$-median and $k$-means clustering problems in $d$-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term $d$ can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time, and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
    We consider $k$-means clustering of $n$ data points in Euclidean space in the Massively Parallel Computation (MPC) model, a computational model which is an abstraction of modern massively parallel computing systems such as MapReduce. There is a conditional lower bound showing that obtaining an $O(1)$-approximate $k$-means solution for general input points using $o(\log n)$ rounds in the MPC model is impossible. However, real-world data points usually have additional structure. One instance of interest is the set of data points which is perturbation resilient [Bilu \& Linial '2010]. In particular, a point set is $\alpha$-perturbation resilient for $k$-means if perturbing pairwise distances by multiplicative factors in the range $[1,\alpha]$ does not change the optimum $k$-means clusters. We bypass the worst-case lower bound by considering perturbation-resilient input points and showing $o(\log n)$-round $k$-means clustering algorithms for these instances in the MPC model. Specifically, we show a fully scalable $(1+\varepsilon)$-approximate $k$-means clustering algorithm for $O(\alpha)$-perturbation resilient instances in the MPC model using $O(\log\log n)$ rounds and $\widetilde{O}_{\varepsilon,d}(n^{1+1/\alpha^2})$ total space. If the space per machine is sufficiently larger than $k$, i.e., at least $k\cdot n^{\Omega(1)}$, we also develop an optimal $k$-means clustering algorithm for $O(\alpha)$-perturbation resilient instances in the MPC model using $O(\log\log n)$ rounds and $\widetilde{O}_d(n\cdot(n^{1/\alpha^2}+k))$ total space.
    Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running for only a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental evaluation of our techniques.
    In the correlation clustering problem, the input is a signed graph where the sign indicates whether each pair of points should be placed in the same cluster or not. The goal is to compute a clustering which minimizes the number of disagreements with these recommendations. Thanks to its many practical applications, correlation clustering is a fundamental unsupervised learning problem and has been extensively studied in many different settings. In this paper we study the problem in the classic online setting with recourse: the vertices of the graph arrive in an online manner, and the goal is to maintain an approximate clustering while minimizing the number of times each vertex changes cluster. Our main contribution is an algorithm that achieves logarithmic recourse per vertex in the worst case. We also complement this result with a tight lower bound. Finally, we show experimentally that our algorithm achieves better performance than state-of-the-art algorithms on real-world data.
    We study the differentially private (DP) $k$-means and $k$-median clustering problems of $n$ points in $d$-dimensional Euclidean space in the massively parallel computation (MPC) model. We provide two near-optimal algorithms, where the near-optimality is in three aspects: they both achieve (1) $O(1)$ parallel computation rounds, (2) near-linear-in-$n$ and polynomial-in-$k$ total computational work (i.e., near-linear running time in the sequential setting), and (3) $O(1)$ relative approximation and $\text{poly}(k, d)$ additive error, where $\Omega(1)$ relative approximation is provably necessary even for any polynomial-time non-private algorithm, and $\Omega(k)$ additive error is a provable lower bound for any polynomial-time DP $k$-means/median algorithm. Our two algorithms provide a tradeoff between the relative approximation and the additive error: the first has $O(1)$ relative approximation and $\sim (k^{2.5} + k^{1.01} \sqrt{d})$ additive error, and the second achieves a $(1+\gamma)$ relative approximation to the optimal non-private algorithm for an arbitrarily small constant $\gamma>0$, with $\text{poly}(k, d)$ additive error for a larger polynomial dependence on $k$ and $d$. To achieve our results, we develop a general framework which partitions the data and reduces the DP clustering problem for the entire dataset to the DP clustering problem for each part. To control the blow-up of the additive error introduced by each part, we develop a novel charging argument which might be of independent interest.
    We study in this paper the geometric $(1, z)$-clustering problem: given $n$ points in $\R^d$, find the point $x$ that minimizes the sum, over all input points, of the Euclidean distance to $x$ raised to the power $z$. This problem interpolates between the well-known Fermat-Weber problem -- or geometric median problem -- where $z = 1$, and the Minimum Enclosing Ball problem, where $z = \infty$. Our contribution is the design of a precise estimator that samples only a constant number of points. Namely, for any $\eps > 0$, we show that sampling uniformly at random $O(\eps^{-z})$ input points is enough to find a center such that the sum of distances to the power $z$ to that center is within a $(1+\eps)$-factor of the optimum. We also provide a lower bound, showing that any such algorithm must sample at least $\Omega(\eps^{-z})$ points. This implies an algorithm that computes a $(1+\eps)$-approximation running in time $O(d \eps^{-z-5})$, generalizing the result of Cohen et al. [STOC '16] to arbitrary $z$. Furthermore, an algorithm with low query complexity has good privacy guarantees: we show that our algorithm is $(0, O(1/n))$-differentially private. This can be used to construct the first differentially private algorithm for $(k, z)$-clustering with approximation term independent of the dimension $d$, improving on the algorithm of Ghazi et al. [NeurIPS '20] that has an additive error of $\sqrt{d} \cdot \poly(k, \log n)$.
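    As an illustration of the sampling idea (a simplified stand-in, not the paper's exact estimator), one can draw a uniform sample of size $O(\eps^{-z})$ and pick a center using the sample alone; the sample-size constant and the toy data below are hypothetical.
    ```python
    # Uniform-sampling sketch for the (1, z)-clustering problem: pick the
    # sample point with the smallest sum of z-th power distances to the
    # rest of the sample. A stand-in for the paper's estimator.
    import math, random

    def sampled_center(points, z=1, eps=0.2, seed=0):
        rng = random.Random(seed)
        m = max(1, int(eps ** (-z)))        # O(eps^{-z}) points, independent of n
        sample = rng.choices(points, k=m)
        return min(sample,
                   key=lambda c: sum(math.dist(p, c) ** z for p in sample))

    points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    print(sampled_center(points, z=1))
    ```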
    As a fundamental unsupervised learning task, hierarchical clustering has been extensively studied in the past decade. In particular, the standard metric formulations as hierarchical $k$-center, $k$-means, and $k$-median have received a lot of attention, and the problems have been studied extensively in different computation models. Despite all this interest, not many efficient parallel algorithms are known for these problems. In this paper we introduce a new parallel algorithm for the Euclidean hierarchical $k$-median problem that, using machines with memory $s$ (for $s\in \Omega(\log^2 (n+\Delta+d))$), runs in $O\left(\log_{s} nd \right)$ rounds, where $d$ is the dimension of the data set and $\Delta$ is a polynomial upper bound on the ratio between the maximum and minimum distance between two points in the input dataset. To the best of our knowledge, this is the first \emph{parallel} algorithm for the hierarchical $k$-median problem with theoretical guarantees. We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice.
    We consider the numerical taxonomy problem of fitting a positive distance function $\mathcal{D}:{S\choose 2}\rightarrow \mathbb{R}_{>0}$ by a tree metric. We want a tree $T$ with positive edge weights, including $S$ among its vertices, so that the distances in $T$ match those in $\mathcal{D}$. A nice application is in evolutionary biology, where the tree $T$ aims to approximate the branching process leading to the observed distances in $\mathcal{D}$ [Cavalli-Sforza and Edwards 1967]. We consider the total error, that is, the sum of distance errors over all pairs of points. We present a polynomial time algorithm minimizing the total error within a constant factor. We can do this both for general trees and for the special case of ultrametrics with a root having the same distance to all vertices in $S$. The problems are known to be APX-hard, so a constant factor is the best we can hope for. The best previous approximation factor was $\mathcal{O}((\log n)(\log \log n))$ by Ailon and Charikar [2005], who wrote ``Determining whether an $\mathcal{O}(1)$ approximation can be obtained is a fascinating question''.
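    In symbols, the objective minimized (within a constant factor) by the algorithm above can be restated as follows, where $d_T$ denotes the tree distance; this merely restates the total-error definition from the abstract.
    ```latex
    % Total error objective from the abstract: the sum of distance errors
    % over all pairs of points of S.
    \[
      \mathrm{err}(T) \;=\; \sum_{\{u,v\} \in \binom{S}{2}}
        \bigl|\, d_T(u,v) - \mathcal{D}(u,v) \,\bigr|
    \]
    ```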
    A New Coreset Framework for Clustering
    Chris Schwiegelshohn
    David Saulpic
    53rd Annual ACM Symposium on Theory of Computing (STOC'21) (2021)
    Given a metric space, the $(k, z)$-clustering problem consists of finding $k$ centers such that the sum of the distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z = 1$) and $k$-means ($z = 2$) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known as coresets, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean space, doubling metrics, and minor-free metrics to the general metric case. With Γ = min(ε^−2 + ε^−z, kε^−2) · polylog(ε^−1), this framework constructs coresets of size:
    • O(Γ · k(d + log k)) in doubling metrics of dimension d, improving upon the recent breakthrough of [Huang, Jiang, Li, Wu, FOCS'18], who presented a coreset of size O(k^3 d/ε^2) -- hence an improvement of Ω(k^2/polylog(ε^−1)).
    • O(Γ · k · min(d, ε^−2 log k)) in d-dimensional Euclidean space, improving on the recent results of [Huang, Vishnoi, STOC'20], who presented a coreset of size O(k log k · ε^−2z · min(d, ε^−2 log k)) -- hence an improvement of Ω(log k/(ε^{z−1} polylog(ε^−1))).
    • O(Γ · k(t + log k)) for graphs with treewidth t, improving on [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML'20], who presented a coreset of size O(k^3 t/ε^2) for z = 1 only -- hence an improvement of Ω(k/polylog(ε^−1)) and a generalization to general z.
    • O(Γ · k(log^2 k + log k/ε^4)) for shortest-path metrics of graphs excluding a fixed minor. This improves on [Braverman, Jiang, Krauthgamer, Wu, SODA'21], who presented a coreset of size O(k^3/ε^4) -- hence an improvement of Ω(k^2/polylog(ε^−1)).
    • O(Γ · k log n) in general discrete metric spaces, improving on the results of [Feldman, Langberg, STOC'11], who presented a coreset of size O(kε^−2z log n log k) -- hence an improvement of Ω(log k/(ε^{z−1} polylog(ε^−1))).
    A lower bound of Ω(k log n/ε) for k-median in general metric spaces [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML'20] implies that in general metrics, as well as in metrics with doubling dimension d, our bounds are optimal up to a polylog(1/ε)/ε factor. For graphs with treewidth t, the lower bound of Ω(kt/ε) of [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML'20] shows that our bounds are optimal up to the same factor.
    On Approximability of Clustering Problems Without Candidate Centers
    Euiwoong Lee
    Karthik C. S.
    Symposium On Discrete Algorithms SODA'21 (2021)
    The $k$-means objective is arguably the most widely used cost function for modeling clustering tasks in a metric space. In practice and historically, $k$-means is thought of in a continuous setting, namely where the centers can be located anywhere in the metric space. For example, the popular Lloyd's heuristic locates a center at the mean of each cluster. Despite persistent efforts on understanding the approximability of $k$-means, and of other classic clustering problems such as $k$-median and $k$-minsum, our knowledge of the hardness of approximation factors of these problems remains quite poor. In this paper, we significantly improve upon the hardness of approximation factors known in the literature for these objectives. We show that if the input lies in a general metric space, it is NP-hard to approximate:
    • Continuous $k$-median to a factor of $2 - o(1)$; this improves upon the previous inapproximability factor of 1.36 shown by Guha and Khuller (J. Algorithms '99).
    • Continuous $k$-means to a factor of $4 - o(1)$; this improves upon the previous inapproximability factor of 2.10 shown by Guha and Khuller (J. Algorithms '99).
    • $k$-minsum to a factor of 1.415; this improves upon the APX-hardness shown by Guruswami and Indyk (SODA '03).
    Our results shed new and perhaps counter-intuitive light on the differences between clustering problems in the continuous setting versus the discrete setting (where the candidate centers are given as part of the input).
    Improving Ultrametrics Embeddings Through Coresets
    Guillaume Lagarde
    Rémi De Joannis de Verclos
    International Conference on Machine Learning, ICML 2021 (2021)
    To tackle the curse of dimensionality in data analysis and unsupervised learning, it is critical to be able to efficiently compute ``simple'' faithful representations of the data that help extract information and improve understanding and visualization of the structure. When the dataset consists of $d$-dimensional vectors, simple representations of the data may consist of trees or ultrametrics, and the goal is to best preserve the distances (i.e., dissimilarity values) between data elements. To circumvent the quadratic running times of the most popular methods for fitting ultrametrics, such as average, single, or complete linkage, \citet{CKL20} recently presented a new algorithm that, for any $c \ge 1$, outputs in time $n^{1+O(1/c^2)}$ an ultrametric $\Delta$ such that for any two points $u, v$, $\Delta(u, v)$ is within a multiplicative factor of $5c$ of the distance between $u$ and $v$ in the ``best'' ultrametric representation. We improve the above result, showing how to strengthen the guarantee from $5c$ to $\sqrt{2}c + \varepsilon$ while achieving the same asymptotic running time. To further show the advantage of our new method, we experimentally analyze the performance of our algorithm, which indeed yields embeddings of significantly better quality for various real-world datasets.
    A Quasipolynomial (2 + ε)-Approximation for Planar Sparsest Cut
    Anupam Gupta
    Jason Li
    Philip N. Klein
    53rd Annual ACM Symposium on Theory of Computing (STOC'21) (2021)
    The (non-uniform) sparsest cut problem is the following classical graph-partitioning problem: given a ``supply'' graph, and demands on pairs of vertices, delete some subset of supply edges to minimize the ratio of the supply edges cut to the total demand of the pairs separated by this deletion. The sparsest cut problem has been a driver for new algorithmic approaches. Despite much effort, there are only a handful of nontrivial classes of supply graphs for which constant-factor approximations are known. We consider the problem for planar graphs, and give a $(2+\eps)$-approximation algorithm that runs in quasipolynomial time. Our approach defines a new structural decomposition of an optimal solution using a ``radial cuts'' primitive. We combine this decomposition with a Sherali-Adams-style linear programming relaxation of the problem, which we then round. This should be compared with the approximation algorithm of Rao~(1999), which uses the metric linear programming relaxation and $\ell_1$-embeddings, and achieves an $O(\sqrt{\log n})$-approximation in polynomial time.
    We present the first distributed approximation algorithm for the Euclidean $k$-median problem with an optimal trade-off between memory usage and the number of parallel rounds. Our algorithm works even in the setting where each machine has very limited memory $s\in \Omega(\log n)$, and it is work-efficient. In the future, it would be interesting to obtain similar results for other clustering problems and to improve the approximation factor of our algorithm.
    On the Power of Louvain for Graph Clustering
    Adrian Kosowski
    David Saulpic
    Frederik Mallmann-Trenn
    NeurIPS'20 (2020)
    A classic problem in machine learning and data analysis is to partition the vertices of a network in such a way that vertices in the same set are densely connected. In practice, the most popular approaches rely on local search algorithms, not only for their ease of implementation and efficiency, but also because of the accuracy of these methods on many real-world graphs. For example, the Louvain algorithm -- a local search based algorithm -- has quickly become the method of choice for clustering in social networks, accumulating more than 10700 citations over the past 10 years. However, explaining the success of these methods remains an open problem: in the worst case, both their running time and their output can be arbitrarily bad. In this work, we study the Louvain heuristic and aim at explaining its success and identifying the mechanisms that make it successful, through a classic model for social networks, the Stochastic Block Model. For a wide range of parameters, we give the first theoretical evidence that Louvain identifies the underlying natural partition of a network in near-linear time.
    Approximation Schemes for Routing Problems in Minor-Free Metrics
    Arnold Filtser
    Hung Le
    Philip N. Klein
    IEEE Symposium on Foundations of Computer Science (FOCS'20) (2020)
    Understanding the structure of minor-free metrics, namely shortest path metrics obtained over a weighted graph excluding a fixed minor, has been an important research direction since the fundamental work of Robertson and Seymour. A fundamental idea that both helps understand the structural properties of these metrics and leads to strong algorithmic results is to construct a ``small-complexity'' graph that approximately preserves distances between pairs of points of the metric. We show the following two structural results for minor-free metrics:
    • Construction of a \emph{light} subset spanner. Given a subset of vertices called terminals, and $\epsilon$, in polynomial time we construct a subgraph that preserves all pairwise distances between terminals up to a multiplicative $1+\epsilon$ factor, of total weight at most $O_{\epsilon}(1)$ times the weight of the minimal Steiner tree spanning the terminals.
    • Construction of a stochastic metric embedding into low-treewidth graphs with expected additive distortion $\epsilon D$. Namely, given a minor-free graph $G=(V,E,w)$ of diameter $D$, and a parameter $\epsilon$, we construct a distribution $\mathcal{D}$ over dominating metric embeddings into treewidth-$O_{\epsilon}(\log n)$ graphs such that $\forall u,v\in V$, $\mathbb{E}_{f\sim\mathcal{D}}[d_H(f(u),f(v))]\le d_G(u,v)+\epsilon D$.
    One of our important technical contributions is a novel framework that allows us to reduce \emph{both problems} to problems on simpler graphs of \emph{bounded diameter}, which we solve using a new decomposition. Our results have the following algorithmic consequences: (1) the first efficient approximation scheme for subset TSP in minor-free metrics; (2) the first approximation scheme for vehicle routing with bounded capacity in minor-free metrics; (3) the first efficient approximation scheme for vehicle routing with bounded capacity on bounded genus metrics.