Google Research

Towards Optimal Coreset Bounds for Euclidean k-Median and k-Means

54rd Annual ACM Symposium on Theory of Computing (STOC'22) (2022) (to appear)

Abstract

The classic Euclidean $k$-Median and $k$-means problems consists of finding a set of $k$ points, such that the sum of Euclidean distances, respectively squared Euclidean distances, of every data point to its closest center is minimized. Coresets are a small, usually weighted subset of the input point set preserving the cost of any candidate solution up to a multiplicative $(1\pm \varepsilon)$ factor. Most coreset constructions focus on improving the size, i.e. the number of distinct points in the coreset. The arguably biggest open question in this line of research is to find tight bounds for Euclidean $k$-median and $k$-means. In this paper, we show the following. \begin{itemize} \item A coreset construction consisting of $O(k\cdot \log k \cdot \varepsilon^{-3}\cdot \log^{O(1)}\varepsilon^{-1})$ points for the k-median problem in arbitrary dimensions.

\item An unconditional lower bound showing that $\Omega(k\cdot \varepsilon^{-2})$ points are necessary for any coreset for either problem.

\item An unconditional lower bound showing that $\Omega(k \cdot \varepsilon^{-2} \cdot \log n} points are necessary for any coreset of either problem in general metric spacesé \end{itemize} The two bounds for Euclidean space are tight up to a factor (epsilon^{-1} polylog (k/epsilon). In addition, our techniques improve state of the art bounds for clustering with higher powers of Euclidean distances as well. The bound for general metrics in tight up to a factor polylog (k/epsilon).

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work