Representation Learning for Information Extraction from Form-like Documents

Bodhisattwa Majumder; Navneet Potti; Sandeep Tata; James B. Wendt; Qi Zhao; Marc Najork

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pp. 6495-6504

Abstract

We propose a novel approach that uses representation learning to extract structured information from form-like document images. The extraction system uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable, as we show using loss cases.
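
The two stages named in the abstract, type-based candidate generation and a candidate's neighborhood of nearby words, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Token` type, the regular expressions in `TYPE_PATTERNS`, and the function names are hypothetical and are not taken from the paper, and the neural network that turns a neighborhood into a dense representation is not reproduced.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x: float  # horizontal center, normalized to [0, 1] (assumed convention)
    y: float  # vertical center, normalized to [0, 1] (assumed convention)


# Hypothetical type-based candidate generators: one pattern per field type.
TYPE_PATTERNS = {
    "date": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
    "amount": re.compile(r"^\$?\d+(?:,\d{3})*\.\d{2}$"),
}


def generate_candidates(tokens, field_type):
    """Return indices of tokens whose text matches the field's type pattern."""
    pattern = TYPE_PATTERNS[field_type]
    return [i for i, tok in enumerate(tokens) if pattern.match(tok.text)]


def neighborhood(tokens, cand_idx, max_neighbors=3):
    """Return the nearest tokens as (text, dx, dy) offsets from the candidate."""
    cand = tokens[cand_idx]
    others = [tok for i, tok in enumerate(tokens) if i != cand_idx]
    # Sort by squared Euclidean distance to the candidate's center.
    others.sort(key=lambda t: (t.x - cand.x) ** 2 + (t.y - cand.y) ** 2)
    return [
        (t.text, round(t.x - cand.x, 2), round(t.y - cand.y, 2))
        for t in others[:max_neighbors]
    ]


# Toy document: a few OCR-style tokens with made-up positions.
tokens = [
    Token("Invoice", 0.10, 0.10),
    Token("Date:", 0.20, 0.10),
    Token("01/15/2020", 0.32, 0.10),
    Token("Total", 0.10, 0.80),
    Token("$1,250.00", 0.30, 0.80),
]

date_candidates = generate_candidates(tokens, "date")
date_context = neighborhood(tokens, date_candidates[0], max_neighbors=2)
```

In this toy document the only date candidate is the token `01/15/2020`, and its two nearest neighbors are `Date:` and `Invoice`, both to its left on the same line. A learned model would consume such (word, relative position) pairs to score each candidate for a target field.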
