A New Coreset Framework for Clustering
Abstract
Given a metric space, the (k, z)-clustering problem consists of finding k centers such that the
sum of the of distances raised to the power z of every point to its closest center is minimized. This
encapsulates the famous k-median (z = 1) and k-means (z = 2) clustering problems. Designing
small-space sketches of the data that approximately preserves the cost of the solutions, also
known as coresets, has been an important research direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously improves upon
the best known bounds for a large variety of settings, ranging from Euclidean space, doubling
metric, minor-free metric, and the general metric cases: with Γ = min(ε^−2+ε^−z, k^ε−2) polylog(ε^−1), this framework constructs coreset with size O (Γ · k(d + log k)) in doubling metrics, improving upon the recent breakthrough of [Huang, Jiang, Li, Wu, FOCS’ 18], who presented a coreset with size O(k^3 d/ε^2) – hence an improvement of Ω(k^2/polylog(ε^−1)).
O(Γ · k · min(d, ε^−2 log k)) in d-dimensional Euclidean space, improving on the recent results of [Huang, Vishnoi, STOC’ 20], who presented a coreset of size O(k log k · ε^−2z · min(d, ε^−2 log k)) – hence an improvement of Ω(log k/(ε^z−1 polylog(ε^−1))).
O(Γ · k(t + log k)) for graphs with treewidth t, improving on [Baker, Braverman, Huang,
Jiang, Krauthgamer, Wu, ICML’20], who presented a coreset of size O(k^3 t/ε^2) for z = 1 only – hence an improvement of Ω(k/polylog(ε^−1)), and a generalization to general z.
O(Γ · k (log^2 k + log k/ε^4)) for shortest paths metrics of graphs excluding a fixed minor. This
improves on [Braverman, Jiang, Krauthgamer, Wu, SODA’21], who presented a coreset of
size O(k^3/ε^4) – hence an improvement of Ω(k^2/polylog(ε^−1)).
Size O(Γ · k log n) in general discrete metric spaces, improving on the results of [Feldman, Lamberg, STOC’11], who presented a coreset of size O(kε^−2z log n log k) – hence an improvement of Ω(log k/(ε^z−1polylog(ε^−1))).
A lower bound Ω( k log n/ ε) for k-Median in general metric spaces [Baker, Braverman, Huang,
Jiang, Krauthgamer, Wu, ICML’20] implies that in general metrics as well as metrics with
doubling dimension d, our bounds are optimal up to a poly log(1/ε)/ε factor. For graphs with
treewidth t, the lower bound of Ω(kt/ε) of [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML’20] shows that our bounds are optimal up to the same factor.
sum of the of distances raised to the power z of every point to its closest center is minimized. This
encapsulates the famous k-median (z = 1) and k-means (z = 2) clustering problems. Designing
small-space sketches of the data that approximately preserves the cost of the solutions, also
known as coresets, has been an important research direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously improves upon
the best known bounds for a large variety of settings, ranging from Euclidean space, doubling
metric, minor-free metric, and the general metric cases: with Γ = min(ε^−2+ε^−z, k^ε−2) polylog(ε^−1), this framework constructs coreset with size O (Γ · k(d + log k)) in doubling metrics, improving upon the recent breakthrough of [Huang, Jiang, Li, Wu, FOCS’ 18], who presented a coreset with size O(k^3 d/ε^2) – hence an improvement of Ω(k^2/polylog(ε^−1)).
O(Γ · k · min(d, ε^−2 log k)) in d-dimensional Euclidean space, improving on the recent results of [Huang, Vishnoi, STOC’ 20], who presented a coreset of size O(k log k · ε^−2z · min(d, ε^−2 log k)) – hence an improvement of Ω(log k/(ε^z−1 polylog(ε^−1))).
O(Γ · k(t + log k)) for graphs with treewidth t, improving on [Baker, Braverman, Huang,
Jiang, Krauthgamer, Wu, ICML’20], who presented a coreset of size O(k^3 t/ε^2) for z = 1 only – hence an improvement of Ω(k/polylog(ε^−1)), and a generalization to general z.
O(Γ · k (log^2 k + log k/ε^4)) for shortest paths metrics of graphs excluding a fixed minor. This
improves on [Braverman, Jiang, Krauthgamer, Wu, SODA’21], who presented a coreset of
size O(k^3/ε^4) – hence an improvement of Ω(k^2/polylog(ε^−1)).
Size O(Γ · k log n) in general discrete metric spaces, improving on the results of [Feldman, Lamberg, STOC’11], who presented a coreset of size O(kε^−2z log n log k) – hence an improvement of Ω(log k/(ε^z−1polylog(ε^−1))).
A lower bound Ω( k log n/ ε) for k-Median in general metric spaces [Baker, Braverman, Huang,
Jiang, Krauthgamer, Wu, ICML’20] implies that in general metrics as well as metrics with
doubling dimension d, our bounds are optimal up to a poly log(1/ε)/ε factor. For graphs with
treewidth t, the lower bound of Ω(kt/ε) of [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML’20] shows that our bounds are optimal up to the same factor.