Video Probabilistic Diffusion Models in Projected Latent Space

Jinwoo Shin
Kihyuk Sohn
Sihyun Yu
Subin Kim
ICLR 2023 (2023)


Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos remains a challenge due to their high dimensionality and complex temporal dynamics, along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computational inefficiency during generation, which limits their scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion model (PVDM), a probabilistic diffusion model that learns a video distribution in a low-dimensional latent space. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video into 2D-shaped latent vectors that factorize the complex cubic structure of video pixels, and (b) a diffusion model architecture specialized for our new factorized latent space, together with a training/sampling procedure that synthesizes videos of arbitrary length with a single model. Experiments on various benchmarks demonstrate the effectiveness of PVDM compared with previous video generation methods; e.g., PVDM obtains an FVD score of 548.1 on UCF-101, a 61.7% improvement over the previous state-of-the-art score of 1431.0.
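To make the factorization idea concrete, here is a minimal sketch of how a cubic video tensor can be projected into three 2D-shaped latent maps, one per axis. This is an illustration under simplifying assumptions, not the paper's method: PVDM's autoencoder learns these projections, whereas the sketch below simply averages along each axis to show the shape and dimensionality reduction involved.

```python
import numpy as np

def project_to_2d_latents(video):
    """Project a (T, H, W, C) video tensor onto three 2D-shaped maps,
    collapsing one axis of the cubic structure at a time.
    Hypothetical illustration only; PVDM learns these projections."""
    z_hw = video.mean(axis=0)  # (H, W, C): spatial content, time collapsed
    z_tw = video.mean(axis=1)  # (T, W, C): temporal-horizontal dynamics
    z_th = video.mean(axis=2)  # (T, H, C): temporal-vertical dynamics
    return z_hw, z_tw, z_th

video = np.random.rand(16, 64, 64, 3)  # a 16-frame 64x64 RGB clip
z_hw, z_tw, z_th = project_to_2d_latents(video)
print(z_hw.shape, z_tw.shape, z_th.shape)
# -> (64, 64, 3) (16, 64, 3) (16, 64, 3)

# Per channel, the cube has 16*64*64 = 65,536 values; the three 2D maps
# total 64*64 + 16*64 + 16*64 = 6,144 — roughly a 10x reduction, which
# is what makes running the diffusion model in this space cheaper.
print(16 * 64 * 64, 64 * 64 + 16 * 64 + 16 * 64)
# -> 65536 6144
```

A diffusion model operating on such 2D-shaped latents can reuse efficient image-style architectures instead of costly 3D ones, which is the source of the efficiency gain the abstract describes.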
