MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Chenxi Pang; Fangyu Liu; Francesco Piccinno; Julian Martin Eisenschlos; Kenton Lee; Mandar Joshi; Nigel Collier; Syrine Krichene; Yasemin Altun

Under review (2022)

Abstract

Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose a set of pretraining tasks to enhance visual language models' capabilities in jointly modeling charts/plots and language data. We initialize with Pix2Struct, a recently proposed image-to-text visual language model, and continue pretraining with our proposed objectives. We argue that numerical reasoning and plot deconstruction equip a model with two key capabilities: (1) extracting key information and (2) reasoning on the extracted information. On standard benchmarks such as PlotQA and ChartQA, our continually pretrained MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook figures, and poster figures. We observe an average improvement of 1.2% over the base Pix2Struct checkpoint, verifying the usefulness of MatCha pretraining on broader visual language tasks.
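
To make the two capabilities concrete, the following is a minimal Python sketch of the kind of text a model might produce when deconstructing a plot, followed by a toy reasoning step over the extracted values. The abstract does not specify any target format, so the `linearize_table` and `average` helpers, the ` | ` cell separator, and the example data are all hypothetical illustrations, not the authors' implementation.

```python
from typing import Sequence


def linearize_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Flatten a chart's underlying table into one text string.

    Hypothetical format (an assumption, not taken from the paper):
    cells joined by " | ", one table row per line.
    """
    lines = [" | ".join(str(cell) for cell in header)]
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row width does not match header width")
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)


def average(values: Sequence[float]) -> float:
    """Toy numerical-reasoning step over the extracted values."""
    return sum(values) / len(values)


# Made-up data standing in for the table behind a bar chart.
header = ["Year", "Sales"]
rows = [[2020, 10], [2021, 14], [2022, 18]]

# Capability (1): extract the key information as text.
target = linearize_table(header, rows)

# Capability (2): reason on the extracted information.
mean_sales = average([row[1] for row in rows])
```

In this sketch, `target` plays the role of the text output paired with a chart image, and `mean_sales` stands in for the answer to a question such as "What is the average of the Sales values?".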
