Enhancing Video Summarization via Vision-Language Embedding
Abstract
This paper addresses video summarization, or the problem
of distilling a raw video into a shorter form while still
capturing the original story. We show that visual representations supervised by freeform language are well suited to this application: we extend a recent submodular summarization approach with representativeness and interestingness objectives computed on features from a joint
vision-language embedding space. We perform an evaluation
on two diverse datasets, UT Egocentric and
TV Episodes, and show that our new objectives give
improved summarization ability compared to standard visual
features alone. Our experiments also show that the
vision-language embedding need not be trained on domain-specific data, but can be learned from standard still-image
vision-language datasets and transferred to video. A further
benefit of our model is the ability to guide a summary using
freeform text input at test time, allowing user customization.
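To make the ingredients concrete, the following is a minimal sketch of a submodular representativeness objective over joint-embedding features, together with the standard greedy maximizer. The facility-location form, the `representativeness` and `greedy_summary` names, and the random embeddings are illustrative assumptions, not the exact formulation used in this paper; in practice the embeddings would come from the learned vision-language encoder.

```python
import numpy as np

def representativeness(selected, embeddings):
    """Facility-location representativeness (an assumed, illustrative form):
    each frame is credited with its best cosine similarity to any frame
    already chosen for the summary.

    selected   -- list of indices into `embeddings` (the candidate summary)
    embeddings -- (n_frames, d) array of L2-normalized frame embeddings
    """
    if not selected:
        return 0.0
    # Cosine similarity of every frame to every selected frame.
    sims = embeddings @ embeddings[selected].T  # shape (n_frames, k)
    return float(sims.max(axis=1).sum())

def greedy_summary(embeddings, budget):
    """Greedy maximization for a monotone submodular objective:
    repeatedly add the frame with the largest marginal gain."""
    selected = []
    for _ in range(budget):
        candidates = [i for i in range(len(embeddings)) if i not in selected]
        gains = [representativeness(selected + [i], embeddings)
                 - representativeness(selected, embeddings)
                 for i in candidates]
        selected.append(candidates[int(np.argmax(gains))])
    return selected

# Toy usage with random unit-norm embeddings standing in for encoder output.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(greedy_summary(emb, budget=5))
```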