MAGVIT: Masked Generative Video Transformer

Lijun Yu; Yong Cheng; Kihyuk Sohn; José Lezama; Han Zhang; Huiwen Chang; Alex Hauptmann; Ming-Hsuan Yang; Yuan Hao; Irfan Essa; Lu Jiang

CVPR (2023)

Abstract

This paper introduces a Masked Generative Video Transformer, named MAGVIT, for multi-task video generation. We train a single MAGVIT model and apply it to multiple video generation tasks at inference time. To this end, two new designs are proposed: an improved 3D tokenizer model to quantize a video into spatial-temporal visual tokens, and a novel technique to embed conditions inside the mask to facilitate multi-task training.
We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of the proposed model.
In terms of quality, MAGVIT substantially improves on the previous best fidelity on two video generation tasks.
In terms of efficiency, MAGVIT offers leading video generation speed at inference time, estimated to be one to two orders of magnitude faster than other models. As for flexibility, we verify that a single trained MAGVIT model can perform 8+ tasks on several video benchmarks from drastically different visual domains. We will open-source our framework and models.
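
To make the idea of quantizing a video into spatial-temporal tokens more concrete, the following is a minimal sketch and not the MAGVIT implementation. It assumes a single-channel video stored as nested lists, cuts it into non-overlapping 3D patches, and maps each raw patch to the index of its nearest codebook entry. The function names, the toy codebook, and the patch sizes are all hypothetical. A real tokenizer would quantize learned encoder features rather than raw pixels. The final step shows plain random token masking as used in generic masked token modeling. It does not reproduce the paper's technique of embedding conditions inside the mask.

```python
import random

def split_into_patches(video, pt, ph, pw):
    """Cut a [T][H][W] video into flattened 3D patches in (t, h, w) raster order."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    patches = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                patches.append([
                    video[t][h][w]
                    for t in range(t0, t0 + pt)
                    for h in range(h0, h0 + ph)
                    for w in range(w0, w0 + pw)
                ])
    return patches

def quantize(patches, codebook):
    """Replace each patch by the index of the closest codebook vector (squared L2)."""
    tokens = []
    for patch in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(patch, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of tokens with a special mask id."""
    n_mask = int(len(tokens) * mask_ratio)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_id if i in positions else tok for i, tok in enumerate(tokens)]

# Toy 4x4x4 video whose pixel value grows with t + h + w (illustrative data only).
video = [[[float(t + h + w) for w in range(4)] for h in range(4)] for t in range(4)]

# Hypothetical codebook: four constant vectors, one per 2x2x2 patch dimension (8 values).
codebook = [[v] * 8 for v in (1.5, 3.5, 5.5, 7.5)]
MASK_ID = len(codebook)  # an id outside the codebook range, reserved for masked slots

patches = split_into_patches(video, 2, 2, 2)
tokens = quantize(patches, codebook)
masked = mask_tokens(tokens, 0.5, MASK_ID, random.Random(0))

print("tokens:", tokens)
print("masked:", masked)
```

In this sketch the 4x4x4 video yields eight tokens, one per 2x2x2 patch, and half of them are replaced by the mask id. A masked generative model would then be trained to predict the original token at each masked position.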
