TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

Kai-Wei Chang
Hritik Bansal
Aditya Grover
Michal Yarom
2024

Abstract

Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., "a red panda climbing a tree"). However, it is important to synthesize multi-scene videos, since they are ubiquitous in the real world (e.g., "a red panda climbing a tree" followed by "the red panda sleeps on the top of the tree"). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. In particular, we introduce a simple and novel inductive bias into the text-conditioning mechanism of the T2V architecture that makes the model aware of the temporal alignment between the video scenes and their respective scene descriptions. As a result, we find that the T2V model can generate visually consistent (e.g., entity and background) videos that adhere to the multi-scene text descriptions. Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We observe that the TALC-finetuned model outperforms the baseline methods by 15.5 points on the overall score (the average of visual consistency and text adherence) across diverse task prompts and numbers of generated scenes, under both automatic and human evaluations.
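The core idea described above, making the text-conditioning mechanism aware of which caption describes which scene, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it assumes a hypothetical cross-attention layer inside a diffusion T2V backbone in which the frames are split evenly across scenes and each chunk of video tokens attends only to the embedding of its own scene caption. All module names, shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of time-aligned text conditioning (illustrative, not the TALC code).
# Each temporal chunk of video latent tokens cross-attends only to the text
# embedding of its own scene caption, instead of one concatenated prompt.

import torch
import torch.nn as nn


class TimeAlignedCrossAttention(nn.Module):
    def __init__(self, dim: int = 320, text_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Hypothetical cross-attention block; dims chosen only for the example.
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=num_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True,
        )

    def forward(self, video_tokens: torch.Tensor, scene_texts: list) -> torch.Tensor:
        # video_tokens: (batch, frames, tokens_per_frame, dim) latent video tokens
        # scene_texts:  list of per-scene caption embeddings, each (batch, seq, text_dim)
        b, f, t, d = video_tokens.shape
        n_scenes = len(scene_texts)
        frames_per_scene = f // n_scenes
        outputs = []
        for i, text_emb in enumerate(scene_texts):
            start = i * frames_per_scene
            end = f if i == n_scenes - 1 else (i + 1) * frames_per_scene
            # Flatten this scene's frames and condition them on its own caption only.
            chunk = video_tokens[:, start:end].reshape(b, -1, d)
            attended, _ = self.attn(chunk, text_emb, text_emb)
            outputs.append(attended.reshape(b, end - start, t, d))
        # Concatenate the per-scene chunks back along the time axis.
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    layer = TimeAlignedCrossAttention()
    video = torch.randn(2, 16, 64, 320)                      # 16 frames, 64 tokens/frame
    captions = [torch.randn(2, 77, 768) for _ in range(2)]   # two scene descriptions
    print(layer(video, captions).shape)                      # torch.Size([2, 16, 64, 320])
```

This contrasts with the baseline of concatenating all scene captions into a single prompt, where every frame attends to the full multi-scene description and the model receives no signal about which caption governs which time segment.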