Jump to Content

Massively Parallel k-Means Clustering for Perturbation Resilient Instances

Peilin Zhong
International Conference on Machine Learning, ICML 2022 (2022)
Google Scholar

Abstract

We consider $k$-means clustering of $n$ data points in Euclidean space in the Massively Parallel Computation (MPC) model, a computational model which is an abstraction of modern massively parallel computing system such as MapReduce. There was a conditional lower bound showing that getting $O(1)$-approximate $k$-means solution for general input points using $o(\log n)$ rounds in the MPC model is impossible. However, the real-world data points usually have better structures. One instance of interest is the set of data points which is perturbation resilient [Bilu \& Linial'2010]. In particular, a point set is $\alpha$-perturbation resilient for $k$-means if perturbing pairwise distances by multiplicative factors in the range $[1,\alpha]$ does not change the optimum $k$-means clusters. We bypass the worst case lower bound by considering the perturbation resilient input points and showing $o(\log n)$ rounds $k$-means clustering algorithms for these instances in the MPC model. Specifically, we show a fully scalable $(1+\varepsilon)$-approximate $k$-means clustering algorithm for $O(\alpha)$-perturbation resilient instance in the MPC model using $O(\log\log n)$ rounds and $\widetilde{O}_{\varepsilon,d}(n^{1+1/\alpha^2})$ total space. If the space per machine is sufficient larger than $k$, i.e., at least $k\cdot n^{\Omega(1)}$, we also develop an optimal $k$-means clustering algorithm for $O(\alpha)$-perturbation resilient instance in the MPC model using $O(\log\log n)$ rounds and $\widetilde{O}_d(n\cdot(n^{1/\alpha^2}+k))$ total space.