MuRAG: Multimodal Retrieval-Augmented Generator
Abstract
Language models have been shown to store massive amounts of world knowledge implicitly in their parameters. However, even ever-larger networks often fail to encode infrequent information such as rare entities and events, while incurring massively increased computational costs. Recently, retrieval-augmented models such as REALM, RAG, and RETRO were proposed to incorporate world knowledge into language models by leveraging an external non-parametric index, achieving impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the abundant knowledge available in other modalities such as images, much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language model pre-training. MuRAG is pre-trained on a mixture of large-scale image-text and text-only corpora using a joint contrastive and generative loss. In experiments, we evaluate MuRAG on two downstream datasets, WebQA and MultimodalQA, that require retrieving and reasoning over both images and text to answer a given query. Our results show that MuRAG outperforms competitive baselines by more than 10% in accuracy, achieving the best known performance on those tasks.