Attention Bottlenecks for Multimodal Fusion
Abstract
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks. A common approach for building multimodal models is to simply combine multiple of these modality-specific architectures using late-stage fusion of final representations or predictions (\textit{`late-fusion'}). Instead, we propose a new architecture that learns to model both unimodal and cross-modal information at earlier stages, without imposing any modality specific priors. We investigate two pathways for the exchange of cross-modal information, \textit{vertical attention} (by restricting crossmodal fusion to certain layers) and \textit{horizontal attention}, via the use of `fusion bottleneck' tokens, that encourage the model to extract and exchange relevant information between modalities in an efficient manner.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks. A common approach for building multimodal models is to simply combine multiple of these modality-specific architectures using late-stage fusion of final representations or predictions (\textit{`late-fusion'}). Instead, we propose a new architecture that learns to model both unimodal and cross-modal information at earlier stages, without imposing any modality specific priors. We investigate two pathways for the exchange of cross-modal information, \textit{vertical attention} (by restricting crossmodal fusion to certain layers) and \textit{horizontal attention}, via the use of `fusion bottleneck' tokens, that encourage the model to extract and exchange relevant information between modalities in an efficient manner.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.