VRDU: A Benchmark for Visually-rich Document Understanding
Abstract
Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention both in academia and industry. Although recent multi-modal language models have achieved impressive results, we argue that existing benchmarks do not reflect the complexity of real documents seen in industry, and therefore not suitable for measuring progress in practical settings.
In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call VRDU for Visually Rich Document Understanding. VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as nested entities, complex templates including tables and multi-column layouts and diversity of different layouts within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and observe three conclusions: (1) generalizing to new templates from a document type is still very challenging, (2) few-shot performance continues to have a lot of headroom, and (3) models struggle with nested repeated fields such as line-items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope that it helps inspire and guide future research in this challenging area.
In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call VRDU for Visually Rich Document Understanding. VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as nested entities, complex templates including tables and multi-column layouts and diversity of different layouts within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and observe three conclusions: (1) generalizing to new templates from a document type is still very challenging, (2) few-shot performance continues to have a lot of headroom, and (3) models struggle with nested repeated fields such as line-items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope that it helps inspire and guide future research in this challenging area.