DreamPose: Fashion Video Synthesis with Stable Diffusion

Johanna Karras
Aleksander Hołyński
Ting-Chun Wang


We present DreamPose, a diffusion-based method for generating fashion videos from still images. Given an image and a pose sequence, our method realistically animates both human and fabric motion as a function of body pose. Unlike past image-to-video approaches, we transform a pretrained text-to-image (T2I) Stable Diffusion model into a pose-guided video synthesis model, achieving high-quality results at a fraction of the computational cost of traditional video diffusion methods [13]. In our approach, we introduce a novel encoder architecture that enables Stable Diffusion to be conditioned directly on image embeddings, eliminating the need for intermediate text embeddings of any kind. We additionally demonstrate that concatenating target poses with the input noise is a simple yet effective means of conditioning the output frame on pose. Our quantitative and qualitative results show that DreamPose achieves state-of-the-art performance on fashion video synthesis.
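The pose-conditioning mechanism described above can be sketched as a channel-wise concatenation of a rendered pose map with the noise latent before it enters the denoising network. The shapes below (4 latent channels at 64×64, as in Stable Diffusion's latent space, plus a 3-channel pose map) are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Hypothetical shapes: one noisy latent frame (4 channels, 64x64) and a
# target pose map rendered at the same spatial resolution (3 channels).
noise_latent = np.random.randn(1, 4, 64, 64)
pose_map = np.random.randn(1, 3, 64, 64)

# Pose conditioning as described in the abstract: concatenate the target
# pose with the input noise along the channel axis, so the denoiser sees
# the pose signal at every denoising step.
denoiser_input = np.concatenate([noise_latent, pose_map], axis=1)
print(denoiser_input.shape)  # (1, 7, 64, 64)
```

In practice this only requires widening the denoiser's first convolution to accept the extra input channels; the rest of the pretrained weights are reused unchanged.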
