- Chen-Yu Lee
- Chun-Liang Li
- Hao Zhang
- Timothy Dozat
- Vincent Perot
- Guolong Su
- Xiang Zhang
- Kihyuk Sohn
- Nikolai Glushnev
- Renshen Wang
- Joshua Ainslie
- Shangbang Long
- Siyang Qin
- Yasuhisa Fujii
- Nan Hua
- Tomas Pfister
Abstract
The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
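To make the graph contrastive objective concrete, below is a minimal sketch of an NT-Xent-style agreement loss between node embeddings produced from two corrupted views of the same multimodal graph. This is not the authors' implementation: the function name, tensor shapes, temperature value, and the use of PyTorch are illustrative assumptions, and FormNetV2's specific graph corruption and multimodal feature dropping scheme is described in the paper itself.

```python
import torch
import torch.nn.functional as F

def graph_contrastive_loss(z1: torch.Tensor,
                           z2: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent-style contrastive loss between node embeddings of two
    corrupted views of the same graph.

    z1, z2: [num_nodes, dim] embeddings from a shared encoder applied to
    two stochastically corrupted views. Node i in z1 and node i in z2 form
    the positive pair; all other cross-view nodes serve as negatives.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine similarity matrix between the two views, scaled by temperature.
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize: view 1 -> view 2 and view 2 -> view 1.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Illustrative usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    num_tokens, dim = 32, 64
    view1 = torch.randn(num_tokens, dim)
    view2 = torch.randn(num_tokens, dim)
    print(f"contrastive loss: {graph_contrastive_loss(view1, view2).item():.4f}")
```

In this sketch, maximizing agreement means pulling together the two views' embeddings of the same node while pushing apart embeddings of different nodes, which is the general mechanism the abstract refers to when it says the objective "maximizes the agreement of multimodal representations."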