Jump to Content

Wenhu Chen

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Language Models have been shown to store massive amounts of world knowledge implicitly in their parameters. However, even with ever-larger networks, models often fail to encode infrequent information such as rare entities/events, while paying the price of massively increasing computational costs. Recently, retrieval-augmented models, such as REALM, RAG, and RETRO, were proposed to incorporate world knowledge into language models by leveraging an external non-parametric index, achieving impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the ubiquitous amount of knowledge in other modalities like images - much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language model pre-training. MuRAG is pre-trained with a mixture of large-scale image-text and text-only corpora using a joint contrastive and generative loss. In experiments, we evaluate MuRAG's performance on two downstream datasets that require retrieving and reasoning over both images and text to answer a given query, WebQA, and MultimodalQA. Our results show that MuRAG's outperforms competitive baselines by more than 10\% accuracy - achieving the best-known performance on those tasks. View details
    Preview abstract Visual language such as charts and plots are ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art models are end-to-end multimodal Transformers pretrained with dedicated plot derendering and numerical reasoning objectives. However, the models reasoning capabilities still fall short and will generally fail on complex queries. In this paper, we decompose the multimodal reasoning problem into first, a modality conversion problem from image to text, then a purely textual reasoning problem. Through combining a pretrained image-to-text model and an LLM for the task of chart/figure reasoning. Compared with a SOTA model finetuned on >10k data points, our plug-and-play model DePlot-LLM achieves >20% improvement over finetuned SOTA with just one-shot prompting. View details
    No Results Found