Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Ahmad Abdulkader; Matthew R. Casey

Proceedings of the 10th International Conference on Document Analysis and Recognition, IEEE (2009)

Abstract

We propose a low-cost method for correcting the output
of OCR engines using human labor. The method employs
an error-estimator neural network that learns from
ground-truth data to assess the error probability of
every word. The error estimator uses features computed
from the outputs of multiple OCR engines. The estimated
error probability is used to decide which words are
inspected by humans. The error estimator is trained to
optimize the area under the word-error ROC curve, which
improves the efficiency of the human correction process.
A significant reduction in cost is achieved by clustering
similar words together during the correction process.
We also show how active learning techniques are used
to further improve the efficiency of the error estimator.
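
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the agreement feature, the fixed threshold, and the choice to cluster by identical OCR string are all assumptions made for illustration. The abstract does not specify how features are computed or how similar words are clustered.

```python
from collections import defaultdict


def agreement_feature(engine_outputs):
    """Hypothetical feature: fraction of engines agreeing with the first engine.

    engine_outputs: list of strings, one per OCR engine, for the same word.
    """
    primary = engine_outputs[0]
    matches = sum(1 for out in engine_outputs if out == primary)
    return matches / len(engine_outputs)


def select_clusters(words, error_probs, threshold):
    """Group the words flagged for human inspection.

    words: OCR strings, one per word position.
    error_probs: estimated error probability per word (from any estimator).
    threshold: assumed cut-off; words at or above it are flagged.

    Returns a dict mapping each flagged OCR string to the list of positions
    where it occurs. Grouping by identical string is an assumed stand-in for
    the clustering of similar words, so a reviewer sees each group once.
    """
    clusters = defaultdict(list)
    for idx, (word, prob) in enumerate(zip(words, error_probs)):
        if prob >= threshold:
            clusters[word].append(idx)
    return dict(clusters)


# Illustrative, made-up inputs.
words = ["the", "tbe", "cat", "tbe", "sat"]
probs = [0.02, 0.91, 0.10, 0.88, 0.40]
clusters = select_clusters(words, probs, threshold=0.5)
```

In this toy input the two flagged occurrences of "tbe" fall into one group, so a single human decision could cover both positions, which is the intuition behind the cost reduction from clustering.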

Research Areas

Machine perception
