Distributed submodular cover: Succinctly summarizing massive data
Abstract
How can one find a subset, ideally as small as possible, that well represents a
massive dataset? I.e., its corresponding utility, measured according to a suitable
utility function, should be comparable to that of the whole dataset. In this paper,
we formalize this challenge as a submodular cover problem. Here, the utility is
assumed to exhibit submodularity, a natural diminishing returns condition prevalent
in many data summarization applications. The classical greedy algorithm is
known to provide solutions with logarithmic approximation guarantees compared
to the optimum solution. However, this sequential, centralized approach is impractical
for truly large-scale problems. In this work, we develop the first distributed
algorithm – DISCOVER – for submodular set cover that is easily implementable
using MapReduce-style computations. We theoretically analyze our approach,
and present approximation guarantees for the solutions returned by DISCOVER.
We also study a natural trade-off between the communication cost and the number
of rounds required to obtain such a solution. In our extensive experiments,
we demonstrate the effectiveness of our approach on several applications, including
active set selection, exemplar based clustering, and vertex cover on tens of
millions of data points using Spark.
massive dataset? I.e., its corresponding utility, measured according to a suitable
utility function, should be comparable to that of the whole dataset. In this paper,
we formalize this challenge as a submodular cover problem. Here, the utility is
assumed to exhibit submodularity, a natural diminishing returns condition prevalent
in many data summarization applications. The classical greedy algorithm is
known to provide solutions with logarithmic approximation guarantees compared
to the optimum solution. However, this sequential, centralized approach is impractical
for truly large-scale problems. In this work, we develop the first distributed
algorithm – DISCOVER – for submodular set cover that is easily implementable
using MapReduce-style computations. We theoretically analyze our approach,
and present approximation guarantees for the solutions returned by DISCOVER.
We also study a natural trade-off between the communication cost and the number
of rounds required to obtain such a solution. In our extensive experiments,
we demonstrate the effectiveness of our approach on several applications, including
active set selection, exemplar based clustering, and vertex cover on tens of
millions of data points using Spark.