Affinity Clustering: Hierarchical Clustering at Scale
Graph clustering is a fundamental task in many data-mining and machine-learning pipelines. In particular, identifying good hierarchical clustering structure is at the same time a fundamental and challenging problem for several applications. In many applications, the amount of data to analyze is increasing at an astonishing rate each day. Hence there is a need for new solutions to efficiently compute effective hierarchical clusterings on such huge data. In this paper, we propose algorithms to address this problem. First, we analyze minimum spanning tree-based clustering algorithms and their corresponding hierarchical clusterings. In particular we consider classic single-linkage clustering based on Kruskal's algorithm and a variation of Boruvka algorithm that we call affinity clustering and prove new interesting properties of these clusterings via the concept of certificates. Then we present new algorithms in the MapReduce model and their efficient real world implementations via Distributed Hash Tables (DHTs). Our MapReduce algorithms indeed improve upon the previous MapReduce algorithms for finding a minimum spanning tree in graphs as well. Finally we show experimentally that our algorithms are scalable for huge data and competitive with state-of-the-art algorithms. In particular we show that Affinity Clustering is in practice superior to several state-of-the-art clustering algorithms.