Google Research



VideoCC contains roughly 10M video-caption pairs drawn from 6M unique videos. It was created with an automatic pipeline that starts from the Conceptual Captions image-captioning dataset: each caption is transferred to a short, 10-second video clip using image similarity alone. The dataset covers a diverse set of topics and can be used to train video-captioning or video-generation models.
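The caption-transfer step described above can be sketched as follows. This is a minimal, hypothetical illustration (not the actual pipeline code): it assumes precomputed image embeddings for the Conceptual Captions images and for candidate video frames, matches them by cosine similarity, and attaches the matched image's caption to a 10-second clip centered on the frame. The function name, threshold value, and inputs are all illustrative assumptions.

```python
import numpy as np

def mine_video_captions(image_embs, frame_embs, captions, frame_times,
                        threshold=0.9, clip_len=10.0):
    """Hypothetical sketch of caption transfer via image similarity.

    image_embs:  (num_images, d) embeddings of Conceptual Captions images
    frame_embs:  (num_frames, d) embeddings of sampled video frames
    captions:    caption string for each image
    frame_times: timestamp (seconds) of each sampled frame in its video
    Returns a list of (clip_start, clip_end, caption) tuples.
    """
    # Normalize rows so dot products become cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    frm = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = frm @ img.T                 # (num_frames, num_images)
    best = sims.argmax(axis=1)         # best-matching image per frame
    pairs = []
    for f, i in enumerate(best):
        if sims[f, i] >= threshold:
            t = frame_times[f]
            # Transfer the caption to a clip_len-second window around the frame.
            pairs.append((max(0.0, t - clip_len / 2),
                          t + clip_len / 2,
                          captions[i]))
    return pairs
```

For example, with two reference images and two sampled frames, a frame whose embedding closely matches an image yields a 10-second captioned clip around its timestamp, while a frame below the similarity threshold is discarded.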