Self-Supervised Learning of Video-Induced Visual Invariances

Michael Tobias Tschannen; Josip Djolonga; Marvin Ritter; Aravindh Mahendran; Neil Houlsby; Sylvain Gelly; Mario Lučić

Conference on Computer Vision and Pattern Recognition (2020)

Abstract

We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We exploit the natural hierarchy of invariances present in video to define a holistic self-supervised loss: (i) frame-level invariances (e.g., robustness to color and contrast changes), (ii) shot/clip-level invariances (e.g., robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips). We train the proposed model on the YouTube-8M dataset and show that this approach leads to state-of-the-art self-supervised results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB). We then show how to co-train the model jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 with $10\times$ fewer labeled images.
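
One way to read the "holistic" loss is as a weighted combination of one term per level of the hierarchy. The expression below is only a sketch under that assumption. The symbols $\mathcal{L}_{\text{frame}}$, $\mathcal{L}_{\text{shot}}$, $\mathcal{L}_{\text{video}}$ and the weights $\lambda_{\text{frame}}$, $\lambda_{\text{shot}}$, $\lambda_{\text{video}}$ are hypothetical names introduced here for illustration; the abstract does not specify the individual terms or how they are weighted.

$$
\mathcal{L}_{\text{total}} \;=\; \lambda_{\text{frame}}\,\mathcal{L}_{\text{frame}} \;+\; \lambda_{\text{shot}}\,\mathcal{L}_{\text{shot}} \;+\; \lambda_{\text{video}}\,\mathcal{L}_{\text{video}}, \qquad \lambda_{\text{frame}},\ \lambda_{\text{shot}},\ \lambda_{\text{video}} \ge 0 .
$$

Under this reading, each term would encourage invariance at its own level of the hierarchy (i) to (iii) listed in the abstract, and setting a weight to zero would drop that level from training.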
