Improving Book OCR by Adaptive Language and Image Models
Abstract
To cope with the vast diversity of book content
and typefaces, it is important for OCR systems to leverage the
strong consistency within a book but adapt to variations across
books. In this work, we describe a system that combines two
parallel correction paths using document-specific image and
language models. Each model adapts to shapes and vocabularies
within a book to identify inconsistencies as correction hypotheses,
but relies on the other for effective cross-validation. With the
open-source Tesseract engine as the baseline, results on a large
dataset of scanned books demonstrate that word error rates can
be reduced by 25% with this approach.
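
The interplay between the two correction paths can be illustrated with a minimal sketch. It shows only one direction (the language path proposes, the image path validates) and is not the system's actual implementation. Every name here (book_vocabulary, language_hypotheses, cross_validate, image_agrees) and both thresholds are hypothetical. The string-similarity test is a stand-in for a real document-specific language model, and image_agrees is a placeholder callable for a shape-based check.

```python
from collections import Counter
from difflib import SequenceMatcher


def book_vocabulary(words, min_count=2):
    """Hypothetical document-specific vocabulary: words that recur in this book."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= min_count}


def language_hypotheses(words, vocab, min_similarity=0.75):
    """Language path (assumed form): for each out-of-vocabulary word,
    propose the most similar in-vocabulary word as a correction hypothesis."""
    hypotheses = {}
    for i, w in enumerate(words):
        if w in vocab:
            continue
        best = max(
            vocab,
            key=lambda v: SequenceMatcher(None, w, v).ratio(),
            default=None,
        )
        if best is not None and SequenceMatcher(None, w, best).ratio() >= min_similarity:
            hypotheses[i] = best
    return hypotheses


def cross_validate(hypotheses, image_agrees):
    """Keep only the hypotheses that the other path accepts.
    image_agrees(index, candidate) is a placeholder for an image-model check."""
    return {i: w for i, w in hypotheses.items() if image_agrees(i, w)}


# Toy OCR output for one "book"; "qu1ck" is a likely misrecognition of "quick".
words = ["the", "the", "quick", "quick", "qu1ck", "fox"]
vocab = book_vocabulary(words)
proposed = language_hypotheses(words, vocab)
accepted = cross_validate(proposed, lambda i, w: True)  # stub image model accepts all
```

In a symmetric arrangement, the image path would likewise propose corrections from inconsistent glyph shapes, and the language model would play the validating role.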