What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision

Jonathan Malmaud
Jonathan Huang
Vivek Rathod
Andrew Rabinovich
North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015) (to appear)

Abstract

We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on a hidden Markov model (HMM) to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector based on a deep convolutional neural network. We show that our technique outperforms simpler baselines based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes and searching within a video for events of interest.
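
To make the HMM alignment idea concrete, here is a minimal sketch of Viterbi decoding over a left-to-right HMM in which hidden states are recipe steps and observations are transcript segments. This is not the authors' implementation: the word-overlap emission model, the stay/advance transition probability `p_advance`, and all function names are illustrative assumptions chosen for clarity.

```python
import math


def word_overlap_score(step, segment, eps=1e-3):
    """Emission score: fraction of segment words that also appear in the step."""
    step_words = set(step.lower().split())
    seg_words = segment.lower().split()
    if not seg_words:
        return eps
    overlap = sum(1 for w in seg_words if w in step_words) / len(seg_words)
    return max(overlap, eps)


def align_steps_to_transcript(steps, segments, p_advance=0.3):
    """Viterbi decoding: transitions either stay on the current recipe step
    or advance to the next one, enforcing a monotonic alignment."""
    n_steps, n_segs = len(steps), len(segments)
    NEG_INF = float("-inf")
    log_stay = math.log(1.0 - p_advance)
    log_adv = math.log(p_advance)

    # viterbi[t][s]: best log-score of explaining segments 0..t with
    # segment t assigned to step s.
    viterbi = [[NEG_INF] * n_steps for _ in range(n_segs)]
    backptr = [[0] * n_steps for _ in range(n_segs)]

    viterbi[0][0] = math.log(word_overlap_score(steps[0], segments[0]))
    for t in range(1, n_segs):
        for s in range(n_steps):
            emit = math.log(word_overlap_score(steps[s], segments[t]))
            stay = viterbi[t - 1][s] + log_stay
            adv = viterbi[t - 1][s - 1] + log_adv if s > 0 else NEG_INF
            if stay >= adv:
                viterbi[t][s], backptr[t][s] = stay + emit, s
            else:
                viterbi[t][s], backptr[t][s] = adv + emit, s - 1

    # Backtrace from the best final state to recover the step for each segment.
    best = max(range(n_steps), key=lambda s: viterbi[-1][s])
    alignment = [best]
    for t in range(n_segs - 1, 0, -1):
        best = backptr[t][best]
        alignment.append(best)
    alignment.reverse()
    return alignment  # alignment[t] = index of the recipe step for segment t


if __name__ == "__main__":
    recipe = [
        "Chop the onions and garlic",
        "Saute the onions in olive oil",
        "Add tomatoes and simmer the sauce",
    ]
    transcript = [
        "first we chop up our onions and some garlic",
        "now the onions go into the pan with olive oil",
        "keep stirring so they don't burn",
        "then add the tomatoes and let the sauce simmer",
    ]
    print(align_steps_to_transcript(recipe, transcript))  # e.g. [0, 1, 1, 2]
```

In the paper's setting, the emission model would score how well a transcript segment (and, after refinement, the visual food detector's output) matches a recipe step, rather than the simple word overlap used here; the monotone stay-or-advance transition structure is what encodes the assumption that cooks follow the recipe roughly in order.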