- Jinwoo Shin
- Kihyuk Sohn
- Sihyun Yu
- Subin Kim
Abstract
Despite the remarkable progress of deep generative models, synthesizing high-resolution and temporally coherent videos remains a challenge due to their high dimensionality and complex temporal dynamics along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computational inefficiency during generation, which limits their scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion model (PVDM), a probabilistic diffusion model that learns a video distribution in a low-dimensional latent space. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video into 2D-shaped latent vectors that factorize the complex cubic structure of video pixels, and (b) a diffusion model architecture specialized for our new factorized latent space, together with a training/sampling procedure to synthesize videos of arbitrary length with a single model. Experiments on various benchmarks demonstrate the effectiveness of PVDM compared with previous video generation methods; e.g., PVDM obtains an FVD score of 548.1 on UCF-101, a 61.7% improvement over the prior state-of-the-art score of 1431.0.
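To make the factorization idea concrete, below is a minimal sketch (not the authors' implementation) of one way a video tensor of shape (batch, channels, time, height, width) could be projected into three 2D-shaped latent maps, one per axis pair. The channel sizes, the mean-pooling projection, and the per-plane 2D convolutions are illustrative assumptions, not details taken from the paper.

# Minimal sketch, assuming mean-pooling projections and small 2D encoders;
# this is not PVDM's actual autoencoder, only an illustration of mapping a
# cubic video tensor to 2D-shaped latents.
import torch
import torch.nn as nn


class Factorized2DLatents(nn.Module):
    """Projects a video (B, C, T, H, W) into three 2D latent maps."""

    def __init__(self, in_channels: int = 3, latent_channels: int = 4):
        super().__init__()
        # One lightweight 2D encoder per projected plane (assumed design).
        self.enc_hw = nn.Conv2d(in_channels, latent_channels, 3, padding=1)  # spatial plane
        self.enc_th = nn.Conv2d(in_channels, latent_channels, 3, padding=1)  # time-height plane
        self.enc_tw = nn.Conv2d(in_channels, latent_channels, 3, padding=1)  # time-width plane

    def forward(self, video: torch.Tensor):
        # video: (B, C, T, H, W)
        z_hw = self.enc_hw(video.mean(dim=2))  # average over time   -> (B, c, H, W)
        z_th = self.enc_th(video.mean(dim=4))  # average over width  -> (B, c, T, H)
        z_tw = self.enc_tw(video.mean(dim=3))  # average over height -> (B, c, T, W)
        return z_hw, z_th, z_tw


if __name__ == "__main__":
    x = torch.randn(2, 3, 16, 64, 64)  # toy batch: 16 frames of 64x64 RGB
    z_hw, z_th, z_tw = Factorized2DLatents()(x)
    print(z_hw.shape, z_th.shape, z_tw.shape)
    # torch.Size([2, 4, 64, 64]) torch.Size([2, 4, 16, 64]) torch.Size([2, 4, 16, 64])

A diffusion model can then operate on these 2D latents with standard image-style architectures, which is far cheaper than denoising the full 3D pixel volume.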