Scalable diversity maximization via small-size composable core-sets
Abstract
Maximizing diversity is a central problem in information retrieval and data mining systems, with prominent applications in web search, recommender systems, news aggregators, and product search. In this paper, we study a diversity maximization problem (a.k.a. the maximum dispersion problem) in which, given a set of n objects in a metric space, one wants to find a subset of k objects with the maximum sum of pairwise distances.
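Concretely, writing X for the ground set of n objects and d for the metric (notation chosen here for exposition; the paper's own symbols may differ), the objective is

\[
  \max_{\substack{S \subseteq X \\ |S| = k}} \; \mathrm{div}(S),
  \qquad
  \mathrm{div}(S) \;=\; \sum_{\{u,v\} \subseteq S} d(u,v).
\]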
To solve this problem in a scalable distributed manner, we apply a novel distributed framework for tackling large-scale problems known as randomized composable core-sets: divide the big data set into smaller parts, solve the problem on each part, combine the solutions from the parts, and solve the problem on the union of these solutions. Our algorithms improve significantly over the approximation guarantees of state-of-the-art core-set-based algorithms while using the minimum possible intermediate output size. In particular, we present a simple distributed algorithm that achieves almost optimal communication complexity and, moreover, asymptotically achieves an approximation factor of 1/2, which is the best possible approximation factor for the global optimization problem under certain complexity-theoretic assumptions.
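The divide/solve/combine/solve pipeline described above can be sketched in a few lines. The sketch below is only an illustration under stated assumptions, not the paper's exact algorithms: greedy_k stands in for a generic greedy local solver (repeatedly pick the point with the largest total distance to the points chosen so far), and the function names, round-robin partitioning, and num_parts parameter are invented for this example.

```python
import random
from itertools import combinations


def diversity(S, d):
    """Sum of pairwise distances of a candidate set S under the metric d."""
    return sum(d(u, v) for u, v in combinations(S, 2))


def greedy_k(points, k, d):
    """Illustrative local solver: repeatedly add the point whose total
    distance to the already-chosen points is largest (first pick arbitrary)."""
    chosen = []
    while len(chosen) < k:
        candidates = [p for p in points if p not in chosen]
        if not candidates:
            break
        chosen.append(max(candidates,
                          key=lambda p: sum(d(p, c) for c in chosen)))
    return chosen


def coreset_diversity(points, k, d, num_parts):
    """Randomized composable core-set pipeline (sketch):
    random partition -> solve each part -> union the solutions -> solve again."""
    points = list(points)
    random.shuffle(points)                                   # random partition
    parts = [points[i::num_parts] for i in range(num_parts)]
    coresets = [greedy_k(part, k, d) for part in parts]      # per-part solve
    union = [p for cs in coresets for p in cs]               # combine core-sets
    return greedy_k(union, k, d)                             # final solve
```

For example, with 2-D points represented as tuples one could call coreset_diversity(points, k=10, d=math.dist, num_parts=8) and score the result with diversity. In a MapReduce-style deployment, the per-part greedy_k calls correspond to the map phase and the final greedy_k to a single reduce, so each machine emits only the k points it selected, which is what keeps the intermediate output size small.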
Our algorithms are scalable and practical, as demonstrated by our extensive empirical evaluation on large datasets, and they can be easily adapted to major distributed computing systems such as MapReduce. Furthermore, we show empirically that, on real-life instances, our algorithms reach close-to-optimal solutions with an approximation factor above 90%. This far exceeds the approximation barrier for the problem and yields useful output in practice.