What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision
Abstract
We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
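To make the HMM alignment step concrete, below is a minimal sketch, not the paper's implementation, of aligning recipe steps to transcript sentences with a left-to-right HMM decoded by Viterbi. The word-overlap emission score and the stay/advance transition probabilities are illustrative assumptions; the paper's actual emission and transition models are not reproduced here.

# Minimal sketch (illustrative, not the authors' system): align recipe steps
# to ASR transcript sentences with a monotone left-to-right HMM + Viterbi.
# Hidden states = recipe steps; observations = transcript sentences.
# The word-overlap emission score below is an assumption made for illustration.
import numpy as np


def word_overlap(step: str, sentence: str) -> float:
    """Fraction of recipe-step words that also appear in the sentence."""
    step_words = set(step.lower().split())
    sent_words = set(sentence.lower().split())
    return len(step_words & sent_words) / max(len(step_words), 1)


def align_steps_to_transcript(steps, sentences, stay_prob=0.7):
    """Return, for each transcript sentence, the index of the recipe step
    it is aligned to under a left-to-right HMM (Viterbi decoding)."""
    n_steps, n_obs = len(steps), len(sentences)
    # Log-emission scores, smoothed so zero overlap remains possible.
    emit = np.log(1e-3 + np.array(
        [[word_overlap(s, o) for o in sentences] for s in steps]))
    # Transitions: either stay on the current step or advance to the next.
    stay, advance = np.log(stay_prob), np.log(1.0 - stay_prob)

    score = np.full((n_steps, n_obs), -np.inf)
    back = np.zeros((n_steps, n_obs), dtype=int)
    score[0, 0] = emit[0, 0]          # force the path to start at step 0
    for t in range(1, n_obs):
        for j in range(n_steps):
            best_prev, best = j, score[j, t - 1] + stay
            if j > 0 and score[j - 1, t - 1] + advance > best:
                best_prev, best = j - 1, score[j - 1, t - 1] + advance
            score[j, t] = best + emit[j, t]
            back[j, t] = best_prev

    # Backtrace from the best final state.
    path = [int(np.argmax(score[:, -1]))]
    for t in range(n_obs - 1, 0, -1):
        path.append(back[path[-1], t])
    return list(reversed(path))


if __name__ == "__main__":
    recipe = ["chop the onions", "fry the onions in butter", "add the rice"]
    transcript = ["first I chop two onions",
                  "now the onions go into the pan with butter",
                  "they fry for a minute",
                  "then I add the rice"]
    # Prints one recipe-step index per transcript sentence, e.g. [0, 0, 1, 2].
    print(align_steps_to_transcript(recipe, transcript))

The strictly left-to-right transition structure encodes the simplifying assumption, made here only for the sketch, that the narrator works through the recipe in order without skipping back.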