Streaming Euclidean k-median and k-means to a (1 + ε)-approximation with o_{k,ε}(log n) Memory Words

David P. Woodruff
Samson Zhou
FOCS 2023 (to appear)


We consider the classic Euclidean k-median and k-means objectives on insertion-only streams, where the goal is to maintain a (1 + ε)-approximation to the optimal k-median or k-means cost while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, merge-and-reduce, bicriteria approximations, and sensitivity sampling. Despite this intense effort to obtain smaller and smaller sketches for these problems, all known techniques require storing at least Ω(log(n∆)) words of memory, where n is the size of the input and ∆ is the aspect ratio. In this paper, we show how to finally bypass this bound and obtain the first algorithm that achieves a (1 + ε)-approximation to the more general (k, z)-clustering problem, using only Õ(dk/ε²) · 2^{z log z} · min(1/ε^z, k) · poly(log log(n∆)) words of memory.
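For reference, the (k, z)-clustering objective mentioned above (which specializes to k-median at z = 1 and k-means at z = 2) can be written, for a point set X ⊆ ℝ^d and a center set C, as:

\[
  \mathrm{cost}_z(X, C) \;=\; \sum_{x \in X} \min_{c \in C} \|x - c\|_2^{\,z},
  \qquad |C| = k,
\]

so a (1 + ε)-approximation means maintaining a set C of k centers whose cost is at most (1 + ε) times that of an optimal k-center set.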
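To make the sensitivity-sampling technique cited above concrete, here is a minimal illustrative sketch (not the paper's algorithm) of how a weighted coreset for k-means (z = 2) is typically drawn: each point gets a standard sensitivity upper bound of the dist²/cost + 1/cluster-size form computed from a rough (e.g., bicriteria) solution, points are sampled proportionally to it, and weights are set to keep the cost estimate unbiased. All function names and parameters here are our own choices for illustration.

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    # Sum of squared Euclidean distances to the nearest center (k-means, z = 2).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return d2.sum() if weights is None else (weights * d2).sum()

def sensitivity_coreset(points, rough_centers, m, rng):
    """Draw a weighted coreset of size m via (simplified) sensitivity sampling.

    `rough_centers` is any constant-factor / bicriteria solution; the bound
    below is the standard dist^2/cost + 1/|cluster| sensitivity upper bound.
    """
    d2 = ((points[:, None, :] - rough_centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)          # nearest rough center per point
    nearest = d2.min(axis=1)
    cluster_sizes = np.bincount(assign, minlength=len(rough_centers))
    # Per-point sensitivity upper bound, normalized into a distribution.
    s = nearest / max(nearest.sum(), 1e-12) + 1.0 / cluster_sizes[assign]
    p = s / s.sum()
    idx = rng.choice(len(points), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])        # reweight so the cost estimate is unbiased
    return points[idx], weights
```

Because each sampled point carries weight 1/(m·p), the weighted coreset cost is an unbiased estimator of the full k-means cost, and the sensitivity-proportional sampling keeps its variance small; note the streaming bounds in the abstract are obtained by very different means.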