# Attention Bottlenecks for Multimodal Fusion

(2021)

## Abstract

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks. A common approach to building multimodal models is therefore to simply combine several of these modality-specific architectures via late-stage fusion of final representations or predictions ('late fusion'). Instead, we propose a new architecture that learns to model both unimodal and cross-modal information at earlier stages, without imposing any modality-specific priors. We investigate two pathways for the exchange of cross-modal information: *vertical attention* (restricting cross-modal fusion to certain layers) and *horizontal attention*, via the use of 'fusion bottleneck' tokens that encourage the model to extract and exchange relevant information between modalities in an efficient manner. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
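
As a rough illustration of the bottleneck pathway, the sketch below shows one possible fusion layer in which audio and video tokens exchange information only through a small set of shared bottleneck tokens. This is a minimal PyTorch-style example under our own assumptions, not the authors' released code; the class name, hyperparameters (e.g. `dim=256`, four bottleneck tokens) and the averaging of bottleneck updates are illustrative choices.

```python
# Minimal sketch of attention-bottleneck fusion between two modalities.
# Assumption: each modality gets its own standard transformer encoder layer,
# and cross-modal information flows only via a few shared bottleneck tokens.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    """One fusion layer: each modality attends to its own tokens plus the
    shared bottleneck tokens, never directly to the other modality."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Per-modality transformer layers (weights not shared in this sketch).
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio, video, bottleneck):
        # audio: (B, Na, D), video: (B, Nv, D), bottleneck: (B, Nb, D)
        nb = bottleneck.shape[1]

        # Each modality processes [its own tokens, bottleneck tokens].
        a_out = self.audio_layer(torch.cat([audio, bottleneck], dim=1))
        v_out = self.video_layer(torch.cat([video, bottleneck], dim=1))

        audio, bn_a = a_out[:, :-nb], a_out[:, -nb:]
        video, bn_v = v_out[:, :-nb], v_out[:, -nb:]

        # Merge the per-modality bottleneck updates so that information is
        # exchanged between modalities only through these few tokens.
        bottleneck = 0.5 * (bn_a + bn_v)
        return audio, video, bottleneck


if __name__ == "__main__":
    layer = BottleneckFusionLayer()
    audio = torch.randn(2, 16, 256)      # e.g. audio spectrogram patch tokens
    video = torch.randn(2, 32, 256)      # e.g. video frame patch tokens
    bottleneck = torch.randn(2, 4, 256)  # small number of fusion tokens
    audio, video, bottleneck = layer(audio, video, bottleneck)
    print(audio.shape, video.shape, bottleneck.shape)
```

Stacking several such layers, and applying them only from a chosen depth onwards, would correspond to combining the bottleneck pathway with the layer-restricted (vertical) pathway described above.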