Estimating Uncertainty for Massive Data Streams

Nicholas Chamandy; Omkar Muralidharan; Amir Najmi; Siddartha Naidu

Google (2012)

Abstract

We address the problem of estimating the variability of an estimator computed from a massive data stream. While nearly-linear statistics can be computed exactly or approximately from “Google-scale” data, second-order analysis is a challenge. Unfortunately, massive sample sizes do not obviate the need for uncertainty calculations: modern data often have heavy tails, large coefficients of variation, tiny effect sizes, and generally exhibit bad behaviour. We describe in detail this New Frontier in statistics, outline the computing infrastructure required, and motivate the need for modification of existing methods. We introduce two procedures for basic uncertainty estimation, one derived from the bootstrap and the other from a form of subsampling. Their costs and theoretical properties are briefly discussed, and their use is demonstrated using Google data.
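
The abstract names a bootstrap-derived procedure but does not spell it out. As a rough illustration of how a bootstrap can be run in a single pass over a stream, the Python sketch below assumes that each record receives an independent Poisson(1) weight in every replicate, so no record needs to be stored or revisited. That weighting scheme, the choice of the mean as the statistic, and the names `poisson_draw` and `streaming_bootstrap_mean` are assumptions made for this example, not details taken from the abstract.

```python
import math
import random


def poisson_draw(rng, lam=1.0):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def streaming_bootstrap_mean(stream, n_replicates=200, seed=0):
    """One-pass estimate of the mean and its bootstrap standard error.

    Assumption for this sketch: each record gets an independent Poisson(1)
    weight per replicate, which stands in for resampling with replacement
    when the total sample size is not known in advance.
    """
    rng = random.Random(seed)
    weighted_sums = [0.0] * n_replicates  # per-replicate sum of w * x
    weight_totals = [0] * n_replicates    # per-replicate sum of w
    total, count = 0.0, 0

    for x in stream:  # single pass; no record is kept in memory
        total += x
        count += 1
        for b in range(n_replicates):
            w = poisson_draw(rng)
            weighted_sums[b] += w * x
            weight_totals[b] += w

    estimate = total / count
    # Skip the (very unlikely) replicates that received zero total weight.
    replicate_means = [
        s / t for s, t in zip(weighted_sums, weight_totals) if t > 0
    ]
    center = sum(replicate_means) / len(replicate_means)
    variance = sum((m - center) ** 2 for m in replicate_means) / (
        len(replicate_means) - 1
    )
    return estimate, math.sqrt(variance)


if __name__ == "__main__":
    rng = random.Random(1)
    data = (rng.gauss(0.0, 2.0) for _ in range(2000))  # simulated stream
    est, se = streaming_bootstrap_mean(data)
    print(f"mean = {est:.4f}, bootstrap standard error = {se:.4f}")
```

Memory use in this sketch grows with the number of replicates rather than with the length of the stream, which is the property that makes this kind of scheme attractive for data that cannot be held or re-read.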

Research Areas

Algorithms and theory
