Google Research

Parallel and Efficient Hierarchical k-Median Clustering

NeurIPS 2021

Abstract

As a fundamental unsupervised learning task, hierarchical clustering has been extensively studied in the past decade. In particular, standard metric formulations as hierarchical $k$-center, $k$-means, and $k$-median received a lot of attention and the problems have been studied extensively in different computation model. Despite all this interest, not many efficient parallel algorithms are known for those problems. In this paper we introduce a new parallel algorithm for Euclidean hierarchical $k$-median problem that using machine with memory $s$ (for $s\in \Omega(\log^2 (n+\Delta+d))$) runs in $O\left(\log_{s} nd \right)$ rounds, where $d$ is the dimension of the data set and $\Delta$ is a polynomial upper-bound on the ratio between the maximum and minimum distance of two points in the input dataset. To the best of our knowledge, this is the first \emph{parallel} algorithm for the hierarchical $k$-median problem with theoretical guarantees. We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work