Jump to Content

A Framework for Parallelizing Hierarchical Clustering Methods

Benjamin Moseley
Kefu Lu
Thomas Lavastida
ECML PKDD 2019
Google Scholar

Abstract

Hierarchical clustering is a widely used tool in machine learning, several sequential hierarchical clustering algorithms are known and well-studied. These algorithms include top down divisive approaches such as bisecting $k$-means, $k$-median or $k$-center, and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. As data sets are becoming larger, these sequential methods are becoming ineffective. \comment{Most of these methods have serial dependencies and are difficult to adapt to parallel and distributed architectures. Due to this, a limited number of parallel and distributed implementations have been developed.} In this paper we develop distributed algorithms for bisecting $k$-means, $k$-median and $k$-center as well as centroid-linkage. We first establishes strong worst case bounds on the running time of the methods. Then we show experimentally that the algorithms have also good performance in practice.