Parallel and Efficient Hierarchical k-Median Clustering
Abstract
As a fundamental unsupervised learning task, hierarchical
clustering has been extensively studied in the past decade. In
particular, standard metric formulations as hierarchical $k$-center, $k$-means,
and $k$-median received a lot of attention and the problems
have been studied extensively in different computation model.
Despite all this interest, not many efficient parallel algorithms
are known for those problems. In this paper we introduce a
new parallel algorithm for Euclidean
hierarchical $k$-median problem that using machine with memory $s$ (for $s\in
\Omega(\log^2 (n+\Delta+d))$) runs in $O\left(\log_{s} nd
\right)$ rounds, where $d$ is the dimension of the data set and
$\Delta$ is a polynomial upper-bound on the ratio between the maximum
and minimum
distance of two points in the input dataset. To the best of our
knowledge, this is the first \emph{parallel} algorithm
for the hierarchical $k$-median problem with theoretical guarantees.
We further complement our theoretical results with an empirical study
of our algorithm that shows its effectiveness in practice.