LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

Cheng Li

Mike Bendersky

Mingyang Zhang

Spurthi Amba Hombaiah

Tao Chen

Te-Lin Wu

VIGIL-NAACL21 (2021)

Google Scholar

Abstract

Document layout comprises both structural and visual (\eg font size) information that are vital but often ignored by machine learning models. The few existing models which do use layout information only consider \textit{textual} contents, and overlook the existence of contents in other modalities such as images. Additionally, spatial interactions of presented contents in a layout was never fully exploited. On the other hand, a series of document understanding tasks are calling out for layout information. One example is given a position in a document, which image is the best to fit in. To address current models' limitations and tackle layout-aware document understanding tasks, we first parse a document into blocks whose content can be textual, tabular, or multimedia (\eg images) using a proprietary tool. We then propose a novel hierarchical framework, LAMPreT, to encode the blocks. Our LAMPreT model encodes each block with a multimodal transformer in the lower-level, and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level. We design hierarchical pre-training objectives where the lower-level model is trained with the standard masked language modeling (MLM) loss and the multimodal alignment loss, and the higher-level model is trained with three layout-aware objectives: (1) block-order predictions, (2) masked block predictions, and (3) image fitting predictions. We test the proposed model on two layout-aware tasks -- image suggestions and text block filling, and show the effectiveness of our proposed hierarchical architecture as well as pre-training techniques.

Research Areas

Natural Language Processing

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations  & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

Abstract

Research Areas

Learn more about how we conduct our research

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

Abstract

Research Areas

Learn more about how we conduct our research

AI/ML Foundations  & Capabilities