Scalable K-Means by ranked retrieval

Andrei Broder; Lluis Garcia-Pueyo; Vanja Josifovski; Sergei Vassilvitskii; Srihari Venkatesan

Scalable K-Means by ranked retrieval

Andrei Broder

Lluis Garcia-Pueyo

Vanja Josifovski

Sergei Vassilvitskii

Srihari Venkatesan

Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242

Google Scholar

Abstract

The k-means clustering algorithm has a long history and a proven practical performance, however it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, aprohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.

The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the results of the intermediate results of nearest neighbor computations to improve performance.

Research Areas

Information retrieval

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Scalable K-Means by ranked retrieval

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs