Google Research

Self-Supervised Learning of Video-Induced Visual Invariances

Conference on Computer Vision and Pattern Recognition (2020)


We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We make use of the natural hierarchy consisting of (i) frame-level invariances (e.g., color and contrast robustness), (ii) shot/clip-level invariances (e.g., robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips) to define a holistic self-supervised loss. We train the proposed model on the YouTube-8M dataset and show that this approach leads to state-of-the-art self-supervised results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB). We then show how to co-train the model jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 with $10\times$ fewer labeled images.
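
The three-level hierarchy can be illustrated with a toy sketch. The abstract does not specify how each term is computed or how the terms are combined, so everything below is an assumption made for illustration: the squared-distance frame term, the pooled-embedding shot term, the margin-based video term, the weighted sum, and all function names are hypothetical and are not the paper's actual losses.

```python
# Toy illustration only: every loss term and name here is a hypothetical
# stand-in, not the formulation used in the VIVI paper.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def frame_loss(view_a, view_b):
    """Frame level: two augmented views of one frame should embed close together."""
    return sq_dist(view_a, view_b)


def shot_loss(frame_embeddings):
    """Shot level: frames of one shot should stay near the shot's pooled embedding."""
    center = mean_vec(frame_embeddings)
    return sum(sq_dist(f, center) for f in frame_embeddings) / len(frame_embeddings)


def video_loss(anchor_shot, same_video_shot, other_video_shot, margin=1.0):
    """Video level: a shot should be closer to another shot of the same video
    than to a shot from a different video, by at least `margin` (assumed)."""
    gap = sq_dist(anchor_shot, same_video_shot) - sq_dist(anchor_shot, other_video_shot)
    return max(0.0, margin + gap)


def holistic_loss(frame_term, shot_term, video_term, weights=(1.0, 1.0, 1.0)):
    """Assumed combination: a weighted sum of the three per-level terms."""
    w_frame, w_shot, w_video = weights
    return w_frame * frame_term + w_shot * shot_term + w_video * video_term
```

For example, `holistic_loss(frame_loss(a, b), shot_loss(frames), video_loss(s0, s1, s_other))` yields a single scalar; the weights control how strongly each level of the hierarchy contributes in this sketch.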
