Contextualized Spatial-Temporal Contrastive Learning with Self-Supervision

Liangzhe Yuan; Rui Qian; Yin Cui; Boqing Gong; Florian Schroff; Ming-Hsuan Yang; Hartwig Adam; Ting Liu

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), pp. 13977-13986

Abstract

Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While very effective at learning holistic image and video representations, such an objective becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn spatio-temporally fine-grained video representations via self-supervision. We first design a region-based pretext task that requires the model to transform instance representations from one view to another, guided by context features. Further, we introduce a simple network design that successfully reconciles the simultaneous learning of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and show that ConST-CL achieves competitive results on 6 datasets, including Kinetics, UCF, HMDB, AVA-Kinetics, AVA and OTB.
