Improving Book OCR by Adaptive Language and Image Models

Dar-Shyang Lee; Ray Smith

Proceedings of 2012 10th IAPR International Workshop on Document Analysis Systems, IEEE, pp. 115-119

Abstract

In order to cope with the vast diversity of book content
and typefaces, it is important for OCR systems to leverage the
strong consistency within a book but adapt to variations across
books. In this work, we describe a system that combines two
parallel correction paths using document-specific image and
language models. Each model adapts to shapes and vocabularies
within a book to identify inconsistencies as correction hypotheses,
but relies on the other for effective cross-validation. Using the
open source Tesseract engine as baseline, results on a large
dataset of scanned books demonstrate that word error rates can
be reduced by 25% using this approach.
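
To make the headline figure concrete, the following is a minimal sketch of how a word error rate and its reduction could be computed. It assumes the 25% figure is a relative reduction and that word error rate is word-level edit distance divided by the number of reference words; the abstract states neither, and the paper's exact metric may differ. The function names and the sample strings are hypothetical, chosen only for illustration, and are not taken from the paper or from Tesseract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: this is a common definition of WER, not necessarily the
    one used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative strings only; not data from the paper.
    truth = "it was the best of times"
    ocr_output = "it was tbe best of tirnes"
    print(word_error_rate(truth, ocr_output))
    # Hypothetical baseline and improved rates, for the arithmetic only.
    print(relative_reduction(0.04, 0.03))
```

Under these assumptions, a baseline that misreads 4 words in 100 and an adapted system that misreads 3 in 100 would correspond to a relative reduction of about 25%.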

Research Areas

Machine perception
