Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models

Yichao Zhou; James Wendt; Navneet Potti; Jing Xie; Sandeep Tata

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, ACL, pp. 3847-3860

Abstract

Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.
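The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's custom active learning strategy, so plain uncertainty sampling (distance of the model score from 0.5) is used here as a stand-in. The names `Candidate`, `uncertainty`, `select_for_review`, and `record_answer` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate extraction predicted by a model (hypothetical structure)."""
    doc_id: str
    field: str
    text: str
    score: float  # assumed: model's estimated probability that the extraction is correct


def uncertainty(candidate: Candidate) -> float:
    """Return 1.0 when the score is 0.5 and 0.0 when it is 0 or 1.

    This is generic uncertainty sampling, not the paper's custom strategy.
    """
    return 1.0 - 2.0 * abs(candidate.score - 0.5)


def select_for_review(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Pick the `budget` candidates the model is least sure about."""
    return sorted(candidates, key=uncertainty, reverse=True)[:budget]


def record_answer(candidate: Candidate, is_correct: bool) -> dict:
    """Turn a human "yes/no" answer into a labeled training example."""
    return {
        "doc_id": candidate.doc_id,
        "field": candidate.field,
        "text": candidate.text,
        "label": is_correct,
    }


candidates = [
    Candidate("doc-1", "total_amount", "$1,204.00", 0.98),
    Candidate("doc-1", "due_date", "03/04/2023", 0.52),
    Candidate("doc-2", "total_amount", "$17.50", 0.10),
    Candidate("doc-2", "due_date", "April 2", 0.45),
]

# Only the two most uncertain candidates are sent to a human reviewer.
to_review = select_for_review(candidates, budget=2)
```

In this sketch the reviewer answers a single "yes/no" question per selected candidate instead of annotating every field in every document, which is where the labeling savings described in the abstract would come from.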

Research Areas

Machine intelligence
