
Speech2Action: Cross-modal Supervision for Action Recognition


Can we guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie scripts describe actions and also contain the characters' speech, and hence can be used to learn this correlation with no additional supervision. We train a speech-to-action classifier on 1k movie scripts downloaded from IMSDb and show that such a classifier performs well for certain classes and, when applied to the speech segments of a large unlabelled movie corpus (288k videos, 188M speech segments), provides weak labels for over 800k video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single labelled action example.
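The weak-labelling idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the classifier, the confidence threshold, or the data format, so a tiny bag-of-words naive Bayes model stands in for the speech-to-action classifier, and the function names, the 0.8 threshold, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(pairs):
    """Fit a toy naive Bayes model on (speech_text, action_label) pairs.

    Stand-in for the speech-to-action classifier; not the paper's model.
    """
    word_counts = defaultdict(Counter)  # per-action token counts
    class_counts = Counter()            # number of examples per action
    vocab = set()
    for text, label in pairs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_proba(model, text):
    """Return a dict mapping each action to its posterior probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        # Laplace-smoothed denominator for this action's token likelihoods.
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for tok in text.lower().split():
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}


def weak_label(model, segments, threshold=0.8):
    """Keep only the segments whose top prediction is confident enough."""
    labelled = []
    for seg in segments:
        probs = predict_proba(model, seg)
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:  # hypothetical threshold, chosen for the demo
            labelled.append((seg, label))
    return labelled


# Toy (speech, action) pairs standing in for pairs mined from scripts.
script_pairs = [
    ("hello who is this", "phone"),
    ("can you hear me hello", "phone"),
    ("step on the gas", "drive"),
    ("pull over the car", "drive"),
]
model = train(script_pairs)

# Speech segments standing in for an unlabelled video corpus.
segments = ["hello who is calling", "the weather is nice"]
labelled = weak_label(model, segments)
```

In this toy run only the first segment clears the threshold and receives the weak label `phone`; the second is ambiguous and is discarded. Filtering by confidence in this way trades coverage for label quality, which is one plausible reason a corpus of many speech segments would yield a much smaller set of labelled clips.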
