Residual Based Sampling for Online Low Rank Approximation

Google Scholar

Abstract

We propose online algorithms for Column Subset Selection (CSS) and Principal Component
Analysis (PCA), two methods that are widely employed for data analysis, summarization, and
visualization. Given a data matrix A that is revealed one column at a time, the online CSS
problems asks to keep a small set of columns, S, that best approximates the space spanned by
the columns of A. As each column arrives, the algorithm must irrevocably decide whether to
add it to S, or to ignore it. In the online PCA problem, the goal is to output a projection of
each column to a low dimensional subspace. In other words, the algorithm must provide an
embedding for each column as it arrives, which cannot be changed as new columns arrive.

While both of these problems have been studied in the online setting, only additive approximations were known prior to our work. The core of our approach is an adaptive sampling technique that gives a practical and efficient algorithm for both of these problems. We prove that by sampling columns using their “residual norm” (i.e. their norm orthogonal to directions sampled so far), we end up with a significantly better dependence between the number of columns sampled, and the desired error in the approximation.

We further show how to combine our algorithm “in series” with prior algorithms. In particular, using the results of Boutsidis et al. [4] and Frieze et al. [14] that have additive guarantees, we show how to improve the bounds on the error of our algorithm.