Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models

Yichao Zhou

James Wendt

Navneet Potti

Jing Xie

Sandeep Tata

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, ACL, pp. 3847-3860

Download Google Scholar

Abstract

Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations  & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models

Abstract

Research Areas

Meet the teams driving innovation

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models

Abstract

Research Areas

Meet the teams driving innovation

AI/ML Foundations  & Capabilities