Truly Perfect Samplers for Data Streams and Sliding Windows

David P. Woodruff
Samson Zhou
Principles of Database Systems (PODS) 2022 (2022)

Abstract

In the G-sampling problem, the goal is to output an index i of a vector f∈ℝn, such that for all coordinates j∈[n],
Pr[i=j]=(1±ϵ)G(fj)∑k∈[n]G(fk)+γ,
where G:ℝ→ℝ≥0 is some non-negative function. If ϵ=0 and γ=1/poly(n), the sampler is called perfect. In the data stream model, f is defined implicitly by a sequence of updates to its coordinates, and the goal is to design such a sampler in small space. Jayaram and Woodruff (FOCS 2018) gave the first perfect Lp samplers in turnstile streams, where G(x)=|x|p, using polylog(n) space for p∈(0,2]. However, to date all known sampling algorithms are not truly perfect, since their output distribution is only point-wise γ=1/poly(n) close to the true distribution. This small error can be significant when samplers are run many times on successive portions of a stream, and leak potentially sensitive information about the data stream. In this work, we initiate the study of truly perfect samplers, with ϵ=γ=0, and comprehensively investigate their complexity in the data stream and sliding window models.