Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible
Abstract
We show theoretically and experimentally that both data whitening and second order optimization erase information about the training dataset, and can prevent any generalization for high dimensional datasets. First, we show that if the input layer of a model is a dense linear layer, then the datapoint-datapoint second moment matrix contains all of the information that can be used to make predictions. Second, we show that for high dimensional datasets, where the number of features is at least as large as the number of datapoints, and where the whitening transform is computed on the full (train+test) dataset, whitening erases all information in this datapoint-datapoint second moment matrix. Generalization is thus completely impossible for models trained on high dimensional whitened datasets. Second order optimization of a linear model is identical to first order optimization of the same model after data whitening. Second order optimization can thus also prevent any generalization in similar situations. We experimentally verify these predictions for models trained on whitened data, and for linear models trained with an online Newton optimizer. We further experimentally demonstrate that generalization continues to be harmed even when the theoretical constraints on input dimensionality (for whitening) or on model linearity (for second order optimization) are relaxed.
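As a minimal numerical sketch of the central claim (this is not code from the paper; the array names, the choice of uncentered PCA whitening, and the random data are illustrative assumptions), the snippet below whitens a dataset with at least as many features as datapoints using statistics of the full dataset, and checks that the datapoint-datapoint second moment matrix collapses to a multiple of the identity, so that it no longer carries any information about the individual datapoints:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                    # n datapoints, d >= n features (high dimensional regime)
X = rng.normal(size=(n, d))      # placeholder data; any rank-n dataset behaves the same way

# PCA whitening transform computed on the full dataset (train + test together).
# For simplicity we whiten the raw second moment and skip mean-centering.
# Only the n non-trivial principal directions are kept, since rank(X) <= n.
U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(S) @ Vt
W = Vt.T @ np.diag(np.sqrt(n) / S)                 # d x n whitening matrix
Xw = X @ W                                          # whitened data

# Feature-feature second moment is the identity (the definition of whitening) ...
print(np.allclose(Xw.T @ Xw / n, np.eye(n)))        # True

# ... and the datapoint-datapoint second moment collapses to n * I as well:
# every pairwise relationship between datapoints has been erased.
print(np.allclose(Xw @ Xw.T, n * np.eye(n)))        # True
```

Under these assumptions the whitened Gram matrix is the same (a multiple of the identity) for every dataset of this shape, which is the sense in which whitening computed on the full high dimensional dataset leaves a model with nothing to generalize from.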