What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision

Jonathan Malmaud; Jonathan Huang; Vivek Rathod; Nicholas Johnston; Andrew Rabinovich; Kevin Murphy

North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015) (to appear)

Abstract

We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
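
To make the HMM alignment idea concrete, here is a minimal sketch of how recipe steps might be aligned to transcript segments with Viterbi decoding. It is an illustration under stated assumptions, not the model from the paper: the abstract does not specify the state space, transition structure, or emission model. The sketch assumes that hidden states are recipe step indices, that the video starts at the first step, that each transcript segment either stays on the current step or advances to the next one, and that emissions are a smoothed word-overlap score. The names `tokenize`, `emission_logprob`, and `align`, along with the example recipe and transcript, are hypothetical.

```python
import math

NEG_INF = float("-inf")

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,!?;:").lower() for w in text.split()} - {""}

def emission_logprob(step, segment, smoothing=0.1):
    """Toy emission model (an assumption): smoothed fraction of the
    segment's words that also occur in the recipe step."""
    step_words, seg_words = tokenize(step), tokenize(segment)
    overlap = len(step_words & seg_words)
    return math.log((overlap + smoothing) / (len(seg_words) + 2 * smoothing))

def align(steps, segments, stay_prob=0.5):
    """Viterbi decoding over a left-to-right HMM.

    Returns one recipe step index per transcript segment. The path is
    forced to start at step 0 and may only stay or advance by one step.
    """
    if not steps or not segments:
        return []
    log_stay, log_advance = math.log(stay_prob), math.log(1.0 - stay_prob)
    n_steps, n_segs = len(steps), len(segments)
    score = [[NEG_INF] * n_steps for _ in range(n_segs)]
    back = [[0] * n_steps for _ in range(n_segs)]
    score[0][0] = emission_logprob(steps[0], segments[0])

    for t in range(1, n_segs):
        for s in range(n_steps):
            stay = score[t - 1][s] + log_stay
            advance = score[t - 1][s - 1] + log_advance if s > 0 else NEG_INF
            best, prev = (advance, s - 1) if advance > stay else (stay, s)
            if best == NEG_INF:
                continue  # step s is not reachable by segment t
            score[t][s] = best + emission_logprob(steps[s], segments[t])
            back[t][s] = prev

    # Backtrace from the best-scoring final state.
    s = max(range(n_steps), key=lambda k: score[-1][k])
    path = [s]
    for t in range(n_segs - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

# Hypothetical recipe and transcript, for illustration only.
steps = [
    "Chop the onions",
    "Fry the onions in butter",
    "Add the rice and stir",
]
segments = [
    "first we chop the onions",
    "now fry them in butter",
    "keep frying the onions",
    "add the rice",
    "and stir well",
]
print(align(steps, segments))  # [0, 1, 1, 2, 2]
```

The third segment overlaps equally with the first two steps, so a keyword match alone cannot place it. The monotonic transition structure resolves it in favor of the second step, which is one way a sequence model can do better than spotting keywords in isolation. Refining the alignment with a visual detector, as the abstract describes, is outside the scope of this sketch.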
