Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

David Saulpic
Chris Schwiegelshohn
FOCS'23 (2023) (to appear)
Google Scholar

Abstract

In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role.
For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedure or coreset construction that avoid an exponential dependency on the input dimension $d$, the precision parameter $\varepsilon^{-1}$ or $k$. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points, $n$.
This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman WIREs Data Mining Knowl. Discov'20].


Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension are $2.633$ for $k$-median [Ahmadian, Norouzi-Fard, Svensson, Ward, SIAM J. Comput. 20], $9+\varepsilon$ for $k$-means [Kanungo et al. Comput. Geom. 04]. Those are the best results, even when allowing a complexity FPT in the number of clusters $k$: this stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case, when allowing randomization.


In this paper, we provide deterministic sketches constructions for clustering, whose size bounds are close to the best-known randomized ones. We show how to compute a dimension reduction onto $\varepsilon^{-O(1)} \log k$ dimensions in time $k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$, and how to build a coreset of size $O\left( k^2 \log^3 k \varepsilon^{-O(1)}\right)$ in time $2^{\eps^{O(1)} k\log^3 k} + k^{O\left( \varepsilon^{-O(1)} + \log \log k\right)} \text{poly}(nd)$. In the case where $k$ is small, this answers an open question of [Feldman WIDM'20] and [Munteanu and Schwiegelshohn, Künstliche Intell. 18] on whether it is possible to efficiently compute coresets deterministically.


We also construct a deterministic algorithm for computing $(1+\varepsilon)$-approximation to $k$-median and $k$-means in high dimensional Euclidean spaces in time $2^{k^2 \log^3 k/\varepsilon^{O(1)}} \text{poly}(nd)$, close to the best randomized complexity of $2^{(k/\varepsilon)^{O(1)}} nd$ (see [Kumar, Sabharwal, Sen, JACM 10] and [Bhattacharya, Jaiswal, Kumar, TCS'18]).

Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling, that immediately improves over the recent results of [Braverman et al. FOCS 22] by a factor $k$.