DePlot: One-shot visual language understanding by plot-to-text translation

Chenxi Pang; Fangyu Liu; Francesco Piccinno; Julian Martin Eisenschlos; Kenton Lee; Mandar Joshi; Nigel Collier; Syrine Krichene; Wenhu Chen; Yasemin Altun

Under review (2022)

Abstract

Visual language, such as charts and plots, is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art models are end-to-end multimodal Transformers pretrained with dedicated plot derendering and numerical reasoning objectives. However, the models' reasoning capabilities still fall short, and they generally fail on complex queries.
In this paper, we decompose the multimodal reasoning problem into two steps: first, a modality conversion problem from image to text, and then a purely textual reasoning problem. We do this by combining a pretrained image-to-text model with a large language model (LLM) for the task of chart/figure reasoning. Compared with a state-of-the-art (SOTA) model finetuned on >10k data points, our plug-and-play model DePlot-LLM achieves a >20% improvement with just one-shot prompting.
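
The two-step decomposition described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`linearize_table`, `build_one_shot_prompt`, `answer_chart_question`), the pipe-separated table format, and the prompt layout are hypothetical and are not taken from the paper. The image-to-text model and the LLM are passed in as plain callables, so no particular library API is implied.

```python
from typing import Any, Callable, Sequence, Tuple


def linearize_table(header: Sequence[Any], rows: Sequence[Sequence[Any]]) -> str:
    """Flatten a table into text, one row per line (format is an assumption)."""
    lines = [" | ".join(str(cell) for cell in header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def build_one_shot_prompt(
    example: Tuple[str, str, str], table: str, question: str
) -> str:
    """Build a prompt with exactly one solved example followed by the new query."""
    ex_table, ex_question, ex_answer = example
    return (
        f"Table:\n{ex_table}\nQuestion: {ex_question}\nAnswer: {ex_answer}\n\n"
        f"Table:\n{table}\nQuestion: {question}\nAnswer:"
    )


def answer_chart_question(
    image: Any,
    question: str,
    plot_to_table: Callable[[Any], str],  # hypothetical image-to-text model
    llm: Callable[[str], str],            # hypothetical text-only LLM
    example: Tuple[str, str, str],
) -> str:
    """Step 1: convert the chart image to text. Step 2: reason over text only."""
    table = plot_to_table(image)
    prompt = build_one_shot_prompt(example, table, question)
    return llm(prompt).strip()
```

Because the second step sees only text, either component can be swapped without retraining the other, which is the sense in which the abstract calls the approach plug-and-play.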
