VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan; Tao Zhu; Zirui Wang; Yuan Cao; Mi Zhang; Soham Ghosh; Yonghui Wu; Jiahui Yu

arxiv.org, Cornell University (2023)

Abstract

We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
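The core idea in the abstract, applying attentional pooling to flattened frame embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections or normalization, uses random numbers in place of real encoder outputs and learned queries, and applies both poolers directly to the flattened tokens as a simplification. All names (`flatten_frames`, `attentional_pool`, the shapes `T`, `N`, `D`, and the query counts) are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_frames(frame_tokens):
    """Turn per-frame tokens of shape (T, N, D) into one sequence (T*N, D)."""
    return [tok for frame in frame_tokens for tok in frame]


def attentional_pool(queries, tokens):
    """Single-head cross-attention: each query attends over all tokens.

    queries: (Q, D) nested lists, tokens: (L, D) nested lists.
    Returns (Q, D): one attention-weighted average of the tokens per query.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, tok)) for tok in tokens]
        weights = softmax(scores)
        pooled.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


def random_matrix(rows, cols):
    """Stand-in for learned parameters or encoder outputs (illustrative only)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, N, D = 4, 3, 8  # frames, tokens per frame, embedding width (all hypothetical)

# Stand-in for the image encoder applied to each frame independently.
frame_tokens = [random_matrix(N, D) for _ in range(T)]

# Flatten across time so the poolers see one long token sequence.
video_tokens = flatten_frames(frame_tokens)

# One query for a contrastive-style pooler, several for a generative-style pooler.
contrastive_query = random_matrix(1, D)
generative_queries = random_matrix(5, D)

video_embedding = attentional_pool(contrastive_query, video_tokens)
caption_memory = attentional_pool(generative_queries, video_tokens)
```

Because the pooler only attends over a sequence of tokens, it does not depend on how many tokens there are; this is what lets the same pooling layer accept `T * N` flattened video tokens in place of the `N` tokens of a single image in this sketch.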
