Jump to Content

A New Coreset Framework for Clustering

Chris Schwiegelshohn
David Saulpic
53rd Annual ACM Symposium on Theory of Computing (STOC'21) (2021)
Google Scholar

Abstract

Given a metric space, the (k, z)-clustering problem consists of finding k centers such that the sum of the of distances raised to the power z of every point to its closest center is minimized. This encapsulates the famous k-median (z = 1) and k-means (z = 2) clustering problems. Designing small-space sketches of the data that approximately preserves the cost of the solutions, also known as coresets, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean space, doubling metric, minor-free metric, and the general metric cases: with Γ = min(ε^−2+ε^−z, k^ε−2) polylog(ε^−1), this framework constructs coreset with size O (Γ · k(d + log k)) in doubling metrics, improving upon the recent breakthrough of [Huang, Jiang, Li, Wu, FOCS’ 18], who presented a coreset with size O(k^3 d/ε^2) – hence an improvement of Ω(k^2/polylog(ε^−1)). O(Γ · k · min(d, ε^−2 log k)) in d-dimensional Euclidean space, improving on the recent results of [Huang, Vishnoi, STOC’ 20], who presented a coreset of size O(k log k · ε^−2z · min(d, ε^−2 log k)) – hence an improvement of Ω(log k/(ε^z−1 polylog(ε^−1))). O(Γ · k(t + log k)) for graphs with treewidth t, improving on [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML’20], who presented a coreset of size O(k^3 t/ε^2) for z = 1 only – hence an improvement of Ω(k/polylog(ε^−1)), and a generalization to general z. O(Γ · k (log^2 k + log k/ε^4)) for shortest paths metrics of graphs excluding a fixed minor. This improves on [Braverman, Jiang, Krauthgamer, Wu, SODA’21], who presented a coreset of size O(k^3/ε^4) – hence an improvement of Ω(k^2/polylog(ε^−1)). Size O(Γ · k log n) in general discrete metric spaces, improving on the results of [Feldman, Lamberg, STOC’11], who presented a coreset of size O(kε^−2z log n log k) – hence an improvement of Ω(log k/(ε^z−1polylog(ε^−1))). A lower bound Ω( k log n/ ε) for k-Median in general metric spaces [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML’20] implies that in general metrics as well as metrics with doubling dimension d, our bounds are optimal up to a poly log(1/ε)/ε factor. For graphs with treewidth t, the lower bound of Ω(kt/ε) of [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML’20] shows that our bounds are optimal up to the same factor.