Don’t Listen to What You Can’t See: The Importance of Negative Examples for Audio-Visual On-Screen Sound Separation

ECCV 2022 Workshop on AV4D: Visual Learning of Sounds in Spaces


For the task of audio-visual on-screen sound separation, we illustrate the importance of using evaluation sets that include not only positive examples (videos with on-screen sounds), but also negative examples (videos that contain only off-screen sounds). Given an evaluation set that includes such examples, we provide metrics and a calibration procedure that allow fair comparison of different models with a single metric, analogous to calibrating binary classifiers to achieve a desired false alarm rate. In addition, we propose a method of probing on-screen sound separation models by masking objects in input video frames. Using this method, we probe the sensitivity of our recently proposed AudioScopeV2 model, and discover that its robustness to removing on-screen sound objects is improved by providing supervised examples in training.
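
To make the binary-classifier analogy concrete, the following is a minimal sketch of threshold calibration to a target false alarm rate. It illustrates only the analogy, not the calibration procedure used for the separation models. The function names (`calibrate_threshold`, `false_alarm_rate`, `detection_rate`) and the scores are hypothetical, chosen for illustration.

```python
import numpy as np


def calibrate_threshold(negative_scores, target_far):
    """Pick a threshold so that at most `target_far` of negatives score above it.

    Hypothetical helper: a prediction counts as positive when score > threshold.
    """
    scores = np.sort(np.asarray(negative_scores, dtype=float))
    n = scores.size
    # Number of negative examples allowed to exceed the threshold.
    allowed = int(np.floor(target_far * n))
    if allowed >= n:
        return -np.inf  # every negative may be a false alarm
    # The (n - allowed)-th smallest score; at most `allowed` scores lie strictly above it.
    return scores[n - allowed - 1]


def false_alarm_rate(negative_scores, threshold):
    """Fraction of negative examples incorrectly scored above the threshold."""
    return float(np.mean(np.asarray(negative_scores, dtype=float) > threshold))


def detection_rate(positive_scores, threshold):
    """Fraction of positive examples correctly scored above the threshold."""
    return float(np.mean(np.asarray(positive_scores, dtype=float) > threshold))


# Illustrative scores only (not results from any model).
negatives = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
positives = [0.5, 0.65, 0.7, 0.9]

threshold = calibrate_threshold(negatives, target_far=0.25)
far = false_alarm_rate(negatives, threshold)
det = detection_rate(positives, threshold)
```

Once each model is calibrated to the same false alarm rate on the negative examples, the detection rate on the positive examples serves as a single number for comparing models. Without negative examples, no such operating point can be fixed.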

