Google Research

Parallel Hierarchical Agglomerative Clustering

KDD 2020 (2020) (to appear)

Abstract

Hierarchical agglomerative clustering (HAC) is one of the most preferred clustering algorithms due to its unmatched effectiveness over a wide variety of applications and its ability to use any custom linkage function. A major blocker for adopting HAC is the computational cost -- HAC requires sequential $O(n^2 \log n)$ linkage function evaluations, making it infeasible to run on datasets of even moderate size. In this paper, we present PHAC, a new agglomerative clustering method that scales more gracefully to massive datasets than its widely-used sister algorithm, hierarchical agglomerative clustering (HAC). Like HAC, PHAC builds tree structures in a bottom-up way by means of any, user-defined, linkage function. PHAC uses a threshold-based round scheme that allows multiple nodes to be agglomerated independently in parallel in a given round, giving its computational advantage over HAC. We show empirically that our algorithm produces high quality clustering as well as scales to huge datasets. We achieve new state-of-the-art dendogram purity on many public large scale datasets such as ALOI or Imagenet ILSVRC12, improving the previous results by up to 7 points. We additionally apply our algorithm to a massive dataset of 36B user web search queries and show via human cluster coherence evaluation the effectiveness of our algorithm.

Research Areas

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work