Scalable diversity maximization via small-size composable core-sets
Maximizing diversity is a central problem in information re- trieval and data mining systems with prominent applications in web search, recommender systems, news aggregators and product search. In this paper, we study a diversity maxi- mization problem (a.k.a. maximum dispersion problem) in which given a set of n objects in a metric space, one wants to find a subset of k objects with the maximum sum of pairwise distances. To solve this problem in a scalable distributed manner, we apply a novel distributed framework for tackling large-scale problems known as randomized composable core-sets: divide the big data set into smaller parts, solve the problem for each part, combine the solutions from each part, and solve the problem on the union of these solutions. Our algorithms improve significantly over the approximation guarantees of state-of-the-art core-set-based algorithms while using min- imum possible intermediate output size. In particular, we present a simple distributed algorithm that achieves an al- most optimal communication complexity, and moreover, it asymptotically achieves approximation factor of 1/2 which is the best possible approximation factor for the global opti- mization problem under certain complexity theory assump- tions. Our algorithms are scalable and practical as shown by our extensive empirical evaluation with large datasets and they can be easily adapted to the major distributed comput- ing systems like MapReduce. Furthermore, we show empir- ically that, in real-life instances, our algorithms reach close- to-optimal solutions with approximation factor of > 90%. This approximation factor is far exceeding the approxima- tion barrier for the problem and provide useful output.