Looking to Listen at the Cocktail Party: Audio-visual Speech Separation

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

Abstract

We present a model for isolating and enhancing the speech of desired speakers in a video. The input is a video with one or more people speaking, where the speech of interest is interfered with by other speakers and/or background noise. We leverage both audio and visual features for this task, feeding them into a joint audio-visual source separation model that we designed and trained on thousands of hours of video segments with clean speech from our new dataset, AVSpeech-90K. We present results for a variety of real, practical scenarios, including heated debates and interviews, noisy bars, and screaming children, requiring only that users specify the face of the person in the video whose speech they would like to isolate.
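The separation stage can be illustrated with a small sketch. Audio-visual separation models of this kind typically operate on the mixture's complex spectrogram and predict, for each selected speaker, a complex ratio mask that is multiplied with the mixture to recover that speaker's spectrogram. The NumPy example below is a hypothetical illustration of that masking step only, using an ideal (oracle) mask in place of a network prediction; the array shapes and variable names are assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complex spectrograms (freq_bins x time_frames) for two speakers.
# In a real system these would come from an STFT of each source signal.
spec_a = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
spec_b = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
mixture = spec_a + spec_b

# Ideal complex ratio mask for speaker A. A trained audio-visual network
# would instead predict this mask from the mixture audio plus the
# selected speaker's face embeddings.
mask_a = spec_a / mixture

# Applying the mask to the mixture recovers speaker A's spectrogram;
# an inverse STFT would then yield the isolated waveform.
separated_a = mask_a * mixture
print(np.allclose(separated_a, spec_a))  # True for the ideal mask
```

With a learned (imperfect) mask, the recovery is approximate rather than exact, which is why separation quality is usually reported with metrics such as signal-to-distortion ratio.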