
On Supplementing Training Data by Half-Sampling

American Statistical Association, Alexandria, VA (2019) (to appear)

Abstract

Machine learning (ML) models are typically trained on one dataset and then assessed on another. We consider the case of training on a given dataset and then determining which large batch of unlabeled candidates to label in order to improve the model further. We score each candidate by its associated prediction error(s). We concentrate on the large-batch case for two reasons: (1) while choose-1-then-update (batch of size 1) naturally avoids near-duplicates, choose-N-then-update (batch of size N) needs additional constraints to avoid over-selecting near-duplicates; (2) just as large data volumes enable ML, updates to those data volumes also tend to arrive in large batches. We estimate model uncertainty from 50-percent samples drawn without replacement: a two-level orthogonal array with n − 1 columns yields maximally balanced half-samples with high efficiency, giving one model for each column of the orthogonal array. We use the associated n-dimensional representation of prediction uncertainty to choose which N candidates to label. We illustrate the method by fitting keras-based neural networks to about 20 percent of the MNIST handwritten digit dataset.
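The following is a minimal sketch of the half-sampling construction and batch selection described above, not the paper's implementation. It assumes a Sylvester Hadamard matrix (via scipy.linalg.hadamard) as the two-level orthogonal array, a variance-based uncertainty score, and keras-style models with a predict method; the function names and the random row assignment are illustrative, and the diversity constraints mentioned for avoiding near-duplicates are omitted.

```python
import numpy as np
from scipy.linalg import hadamard


def half_sample_masks(num_examples, n=16, seed=0):
    """Build maximally balanced half-samples from a two-level orthogonal array.

    A Sylvester Hadamard matrix of order n (n a power of two) has an all-ones
    first column; dropping it leaves n - 1 balanced, mutually orthogonal
    +/-1 columns. Each column defines one half-sample: examples assigned to
    a +1 row are included, examples assigned to a -1 row are excluded.
    """
    H = hadamard(n)                      # n x n matrix of +/-1 entries
    oa = H[:, 1:]                        # drop the constant column -> n - 1 columns
    rng = np.random.default_rng(seed)
    # Assign every training example (approximately evenly) to one of the
    # n orthogonal-array rows; the permutation randomizes the assignment.
    rows = rng.permutation(num_examples) % n
    # masks[j] is a boolean mask selecting the half-sample for column j.
    return oa[rows, :].T == 1            # shape (n - 1, num_examples)


def select_batch(models, candidates, batch_size):
    """Pick the batch_size candidates whose predictions disagree most.

    models: one model fit per half-sample (one per orthogonal-array column).
    The score here is the prediction variance across the half-sample models,
    averaged over output dimensions -- one plausible reading of "associated
    prediction error(s)", not necessarily the paper's exact criterion.
    """
    preds = np.stack([m.predict(candidates) for m in models])
    scores = preds.var(axis=0).mean(axis=-1)
    return np.argsort(scores)[-batch_size:]
```

In this sketch, masks[j] selects the training subset for the j-th model, so each candidate receives n − 1 predictions whose spread serves as its uncertainty score; any other per-candidate dispersion measure could be substituted for the variance.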
