This paper introduces a Masked Generative Video Transformer, named MAGVIT, for multi-task video generation. We train a single MAGVIT model and apply it to multiple video generation tasks at inference time. To this end, two new designs are proposed: an improved 3D tokenizer model to quantize a video into spatial-temporal visual tokens, and a novel technique to embed conditions inside the mask to facilitate multi-task training.
We conduct extensive experiments to demonstrate the compelling quality, efficiency, and flexibility of the proposed model.
First, MAGVIT radically improves the previous best fidelity on two video generation tasks.
In terms of efficiency, MAGVIT offers leading video generation speed at inference time, which is estimated to be one or two orders of magnitude faster than other models. As for flexibility, we verified that a single trained MAGVIT is able to generically perform 8+ tasks on several video benchmarks from drastically different visual domains. We will open source our framework and models.
This paper studies non-autoregressive transformers for the image synthesis task through the lens of discrete diffusion models. We find that generative methods based on non-autoregressive transformers suffer from compounding decoding errors due to the parallel sampling of visual tokens. To alleviate this, we introduce discrete predictor-corrector diffusion models (DPC). Predictor-corrector samplers are a recently introduced class of samplers for diffusion models which improve upon ancestral samplers by correcting the sampling distribution of intermediate diffusion states using MCMC methods. In DPC, the Langevin corrector, which does not have a direct counterpart in discrete space, is replaced with a discrete MCMC transition defined by a learned corrector kernel. The corrector kernel is trained to make the correction steps achieve asymptotic convergence, in distribution, to the real marginal of the intermediate diffusion states. Our experiments show that, equipped with DPC, discrete diffusion models can achieve comparable quality to continuous diffusion models, while having orders of magnitude faster sampling times. DPC improves upon existing discrete latent space models for class-conditional image generation on ImageNet, and outperforms recent diffusion models and GANs, according to visual evaluation user studies.
Recent conditional image generation methods produce images of
remarkable diversity, fidelity and realism. However, the majority of
these methods allow conditioning only on labels or text prompts, which
limits their level of control over the generation result. In this
paper, we introduce MaskSketch, a masked image generation method that
allows spatial conditioning of the generation result, using a guiding
sketch as an extra conditioning signal during sampling. MaskSketch
utilizes a pre-trained masked image generator, requires no model
training or paired supervision, and works with input sketches of
different levels of abstraction. We propose a novel parallel sampling
scheme that leverages the structural information encoded in the
intermediate self-attention maps of a masked generative transformer,
such as scene layout and object shape. Our results show that
MaskSketch achieves high image realism and fidelity to the guiding
structure. Evaluated on standard benchmark datasets, MaskSketch
outperforms state-of-the-art methods for sketch-to-image translation,
as well as generic image-to-image translation approaches.
Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens fed to autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens, called a prompt, to the image token sequence, and introduce a new prompt design for our task. We study a variety of visual domains, including the visual task adaptation benchmark, with varying amounts of training images, and show the effectiveness of knowledge transfer and significantly better image generation quality than existing works.
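The prompt-tuning step described above, prepending learnable tokens to the image token sequence, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the helper names `init_prompt` and `prepend_prompt`, the embedding size, and the initialization scale are hypothetical, and the abstract's new prompt design is not reproduced. In standard prompt tuning the pretrained transformer weights are typically kept frozen and only the prompt vectors are updated.

```python
import random

def init_prompt(num_prompt_tokens, dim, seed=0):
    """Create small random prompt embeddings (hypothetical initialization).

    In prompt tuning these vectors would be the trainable parameters.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_prompt_tokens)]

def prepend_prompt(prompt, token_embeddings):
    """Return a new sequence: prompt embeddings followed by image token embeddings."""
    return list(prompt) + list(token_embeddings)

# Usage: 2 prompt tokens prepended to 3 image token embeddings of size 4.
prompt = init_prompt(num_prompt_tokens=2, dim=4)
image_tokens = [[1.0] * 4 for _ in range(3)]
sequence = prepend_prompt(prompt, image_tokens)
```

The resulting sequence would then be passed to the transformer in place of the original image token embeddings.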
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
Non-autoregressive generative transformers have recently demonstrated impressive image generation performance, and orders of magnitude faster sampling than their autoregressive counterparts. However, optimal parallel sampling from the true joint distribution of visual tokens remains an open challenge. In this paper we introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer. Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer. During non-autoregressive iterative sampling, Token-Critic is used to select which tokens to accept and which to reject and resample. Coupled with Token-Critic, a state-of-the-art generative transformer significantly improves its performance, and outperforms recent diffusion models and GANs in terms of the trade-off between generated image quality and diversity, on the challenging class-conditional ImageNet generation task.
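The accept/reject loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_generator` and `toy_critic` are hypothetical stand-ins that return random tokens and random scores, `MASK` is a made-up mask token id, and the linear rejection schedule is an assumption.

```python
import random

MASK = -1  # hypothetical id for a masked (not yet accepted) token

def toy_generator(tokens, vocab_size, rng):
    """Stand-in for the generative transformer: fills every masked slot in parallel."""
    return [rng.randrange(vocab_size) if t == MASK else t for t in tokens]

def toy_critic(tokens, rng):
    """Stand-in for the critic: one score per token, higher means 'looks real'."""
    return [rng.random() for _ in tokens]

def critic_guided_sampling(seq_len, vocab_size, steps, seed=0):
    """Iteratively sample all tokens, then re-mask the lowest-scored ones.

    Assumes steps >= 1. The number of rejected tokens shrinks linearly
    to zero (an assumed schedule), so the final step keeps every token.
    """
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        proposal = toy_generator(tokens, vocab_size, rng)
        if step == steps:
            return proposal
        scores = toy_critic(proposal, rng)
        num_reject = int(seq_len * (1 - step / steps))
        # Reject the tokens the critic trusts least; they are resampled next step.
        rejected = sorted(range(seq_len), key=lambda i: scores[i])[:num_reject]
        tokens = list(proposal)
        for i in rejected:
            tokens[i] = MASK
    return tokens

sample = critic_guided_sampling(seq_len=16, vocab_size=8, steps=4)
```

In this sketch the critic scores every position of the proposal, so a token accepted in an earlier step can still be rejected and resampled later.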