VideoCon: Robust Video-Language Alignment Evaluation via Contrast Captions
Abstract
Aligning diverse data modalities, especially video and text, remains a significant challenge in AI. This study introduces VideoCon, a novel dataset for robust video-language alignment evaluation. It provides contrast captions for originally matched video-caption pairs, complemented with natural language explanations (NLEs) that describe the differences between the video and the contrast captions. Notably, VideoCon emphasizes temporally challenging scenarios to make evaluation more robust. To address the misalignments observed in previous models, we propose AlignVideo, a video-language model trained on VideoCon that demonstrates improved alignment capabilities. Experiments show that AlignVideo surpasses existing baselines in video-text alignment and generates more precise NLEs. Moreover, it achieves state-of-the-art zero-shot performance on downstream tasks that require complex video understanding, such as action recognition and temporal event sequencing. Our work paves the way for advances in video-text alignment evaluation and model development.