FormNetV2: Inductive Multimodal Graph Contrastive Learning for Form Document Information Extraction

Chen-Yu Lee; Chun-Liang Li; Hao Zhang; Timothy Dozat; Vincent Perot; Guolong Su; Xiang Zhang; Kihyuk Sohn; Nikolai Glushnev; Renshen Wang; Joshua Ainslie; Shangbang Long; Siyang Qin; Yasuhisa Fujii; Nan Hua; Tomas Pfister

ACL (2023)

Abstract

The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
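The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the FormNetV2 implementation: the abstract does not specify the loss form, the graph augmentations, or the feature extractor, so the code below assumes a generic InfoNCE-style objective over two views of the same node embeddings and a plain axis-aligned union box for an edge. The names `union_box`, `cosine`, `contrastive_loss`, and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import math


def union_box(box_a, box_b):
    """Smallest axis-aligned box covering two token boxes.

    Boxes are (x_min, y_min, x_max, y_max). This is one plausible reading of
    "the bounding box that joins a pair of tokens connected by a graph edge".
    """
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(view_a, view_b, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    Node i in view_a is pulled toward node i in view_b and pushed away from
    every other node in view_b. Lower values mean the two views agree more.
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


# Usage: two neighboring tokens and two toy "views" of a two-node graph.
edge_region = union_box((10, 20, 50, 30), (60, 22, 120, 32))
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup, `edge_region` is the image crop from which edge-level visual features would be read, and `aligned` comes out far smaller than `shuffled` because matching nodes agree across the two views. In a multimodal setting, each node vector would combine text, layout, and image features, so a single loss of this shape covers all modalities at once.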
