Google Research

A Framework for Parallelizing Hierarchical Clustering Methods

ECML PKDD 2019

Abstract

Hierarchical clustering is a widely used tool in machine learning, several sequential hierarchical clustering algorithms are known and well-studied. These algorithms include top down divisive approaches such as bisecting $k$-means, $k$-median or $k$-center, and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage.

As data sets are becoming larger, these sequential methods are becoming ineffective. \comment{Most of these methods have serial dependencies and are difficult to adapt to parallel and distributed architectures. Due to this, a limited number of parallel and distributed implementations have been developed.} In this paper we develop distributed algorithms for bisecting $k$-means, $k$-median and $k$-center as well as centroid-linkage. We first establishes strong worst case bounds on the running time of the methods. Then we show experimentally that the algorithms have also good performance in practice.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work