Looking to Listen at the Cocktail Party: Audio-visual Speech Separation

Ariel Ephrat; Inbar Mosseri; Oran Lang; Tali Dekel; Kevin Wilson; Bill Freeman; Miki Rubinstein

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

Abstract

We present a model for isolating and enhancing the speech of desired speakers in a video. The input is a video with one or more people speaking, where other speakers and/or background noise interfere with the speech of interest. We leverage both audio and visual features for this task, feeding them into a joint audio-visual source separation model that we designed and trained using thousands of hours of video segments with clean speech from our new dataset, AVSpeech-90K. We present results for various real, practical scenarios involving heated debates and interviews, noisy bars, and screaming children, requiring users only to specify the face of the person in the video whose speech they would like to isolate.

Research Areas

Machine perception
