Looking to Listen at the Cocktail Party: Audio-visual Speech Separation

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)


We present a model for isolating and enhancing the speech of desired speakers in a video. The input is a video with one or more people speaking, where the speech of interest is degraded by interference from other speakers and/or background noise. We leverage both audio and visual features for this task, feeding them into a joint audio-visual source separation model that we designed and trained on thousands of hours of video segments with clean speech from our new dataset, AVSpeech-90K. We present results for a variety of real, practical scenarios, including heated debates and interviews, noisy bars, and screaming children. The only input required from users is the face of the person in the video whose speech they would like to isolate.