Google Research

Succinct approximate counting of skewed data

  • David Talbot
IJCAI-09 Proceedings (2009), pp. 1243-1248

Abstract

Practical data analysis relies on the ability to count observations of objects succinctly and efficiently. Unfortunately the space usage of an exact estimator grows with the size of the a priori set from which objects are drawn while the time required to maintain such an estimator grows with the size of the data set. We present static and on-line approximation schemes that avoid these limitations when approximate frequency estimates are acceptable. Our Log-Frequency Sketch extends the approximate counting algorithm of Morris [1978] to estimate frequencies with bounded relative error via a single pass over a data set. It uses constant space per object when the frequencies follow a power law and can be maintained in constant time per observation. We give an (ε,δ)-approximation scheme which we verify empirically on a large natural language data set where, for instance, 95 percent of frequencies are estimated with relative error less than 0.25 using fewer than 11 bits per object in the static case and 15 bits per object on-line.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work