Jump to Content

Parallel and Efficient Hierarchical k-Median Clustering

Ashkan Norouzi Fard
Christian Sohler
Ola Svensson
NeurIPS 2021
Google Scholar

Abstract

As a fundamental unsupervised learning task, hierarchical clustering has been extensively studied in the past decade. In particular, standard metric formulations as hierarchical $k$-center, $k$-means, and $k$-median received a lot of attention and the problems have been studied extensively in different computation model. Despite all this interest, not many efficient parallel algorithms are known for those problems. In this paper we introduce a new parallel algorithm for Euclidean hierarchical $k$-median problem that using machine with memory $s$ (for $s\in \Omega(\log^2 (n+\Delta+d))$) runs in $O\left(\log_{s} nd \right)$ rounds, where $d$ is the dimension of the data set and $\Delta$ is a polynomial upper-bound on the ratio between the maximum and minimum distance of two points in the input dataset. To the best of our knowledge, this is the first \emph{parallel} algorithm for the hierarchical $k$-median problem with theoretical guarantees. We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice.