Google Research

End-to-end Generative Pretraining for Multimodal Video Captioning


Recent video and language pretraining frameworks lack the ability to generate sentences, which limits their transfer to generative tasks such as multimodal video captioning. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework that learns from unlabelled instructional videos and whose pretrained model transfers effectively to video captioning tasks. Unlike recent video-language pretraining frameworks, our framework trains a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and also the present utterance given future observations. We use this objective to train an encoder-decoder model end-to-end to generate a caption directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance for video captioning on four standard benchmarks, as well as on other video understanding tasks such as VideoQA, video retrieval and action classification.
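As an illustration of the bidirectional generation objective, the following minimal sketch shows one way training pairs could be built from consecutive clips of an unlabelled video. It is not the released implementation. The names `Clip`, `Example` and `bidirectional_examples` are hypothetical, frames are represented by placeholder strings, and the sketch assumes that "future observations" means the future utterance paired with the present clip's frames, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    frames: str      # placeholder standing in for the clip's raw pixels
    utterance: str   # transcribed speech aligned with the clip


@dataclass
class Example:
    input_frames: str
    input_text: str
    target_text: str
    direction: str   # "forward" or "backward"


def bidirectional_examples(clips: List[Clip]) -> List[Example]:
    """Build forward and backward generation pairs from consecutive clips."""
    examples = []
    for present, future in zip(clips, clips[1:]):
        # Forward: present frames + present utterance -> future utterance.
        examples.append(
            Example(present.frames, present.utterance, future.utterance, "forward")
        )
        # Backward (assumed input layout): present frames + future utterance
        # -> present utterance.
        examples.append(
            Example(present.frames, future.utterance, present.utterance, "backward")
        )
    return examples


if __name__ == "__main__":
    video = [
        Clip("frames_0", "chop the onions"),
        Clip("frames_1", "now fry them"),
    ]
    for ex in bidirectional_examples(video):
        print(ex)
```

In a sketch like this, each pair would be fed to the encoder-decoder model with the target text as the decoding objective. No manual captions are needed, because both targets come from the transcribed speech.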
