HMM-based script identification for OCR

Dmitriy Genzel
Remco Teunen
Proceedings of the 4th International Workshop on Multilingual OCR, ACM, New York, NY, US (2013), 2:1-2:5

Abstract

While current OCR systems are able to recognize text in
an increasing number of scripts and languages, typically
they still need to be told in advance what those scripts and
languages are. We propose an approach that repurposes
the same HMM-based system used for OCR to the task of
script/language ID, by replacing character labels with script
class labels. We apply it in a multi-pass overall OCR process
which achieves “universal” OCR over 54 tested languages
in 18 distinct scripts, over a wide variety of typefaces in
each. For comparison we also consider a brute-force approach,
wherein a single HMM-based OCR system is trained
to recognize all considered scripts. Results are presented on
a large and diverse evaluation set extracted from book images,
both for script identification accuracy and for overall
OCR accuracy. On this evaluation data, the script ID system
achieved a script ID error rate of 1.73% over 18 distinct
scripts. The end-to-end OCR system with the script ID system
achieved a character error rate of 4.05%, an increase of
0.77% over the case where the languages are known a priori.
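
The multi-pass process described above can be illustrated with a minimal sketch. It assumes a script ID model that emits one script class label per frame of a text line (in place of character labels), and it collapses those labels into a single line-level decision by majority vote. The majority vote and the names `identify_script`, `two_pass_ocr`, `script_id_model`, and `recognizers` are hypothetical and chosen for illustration; the abstract does not specify how the label sequence is aggregated or how the passes are wired together.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple


def identify_script(frame_labels: List[str]) -> str:
    """Collapse per-frame script labels into one script for the line.

    Majority vote is an assumption made for this sketch, not the
    aggregation rule described in the paper.
    """
    if not frame_labels:
        raise ValueError("empty label sequence")
    return Counter(frame_labels).most_common(1)[0][0]


def two_pass_ocr(
    line_image: Any,
    script_id_model: Callable[[Any], List[str]],
    recognizers: Dict[str, Callable[[Any], str]],
) -> Tuple[str, str]:
    """Pass 1: identify the script. Pass 2: run the matching recognizer."""
    # Hypothetical script ID model: returns script class labels, not characters.
    labels = script_id_model(line_image)
    script = identify_script(labels)
    # Hypothetical per-script recognizers keyed by script name.
    text = recognizers[script](line_image)
    return script, text
```

By contrast, the brute-force alternative mentioned in the abstract would skip the first pass and use one recognizer trained on all scripts.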