Sequence-to-Label Script Identification for Multilingual OCR

Jonathan Michael Baccash
Patrick Michael Hurst
Proceedings of the 14th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2017)

Abstract

We describe a novel line-level script identification method. In multilingual OCR, script identification is a crucial component because it automates the provision of a language hint. Previous work repurposed an OCR model that generates per-character script codes, which a counting heuristic then aggregates into a line-level script ID. This baseline has two shortcomings. First, as a sequence-to-sequence model it is more complex than necessary for the sequence-to-label problem of line script ID, making it hard to train and inefficient to run. Second, the counting heuristic may be suboptimal compared to a learned model. We therefore reframe line script identification as a sequence-to-label problem and solve it with two components trained end-to-end: an encoder and a summarizer. The encoder converts a line image into a sequence of features. The summarizer aggregates this sequence to classify the line. We test various summarizers while keeping identical inception-style convolutional networks as encoders. Experiments on scanned books and photos containing 232 languages in 30 scripts show a 16% reduction in script ID error rate compared to the baseline. This improved script ID reduces the character error rate attributable to script misidentification by 33%.
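The contrast between the two aggregation styles can be illustrated with a small sketch. It is not the authors' implementation. The function names, the mean-pooling summarizer, the hand-set weights, and the toy two-dimensional features are all assumptions made for illustration; the abstract says only that various summarizers were tested, and in the described system the summarizer is trained jointly with the encoder rather than configured by hand.

```python
from collections import Counter
import math


def count_vote(char_scripts):
    """Baseline-style aggregation: the most frequent per-character script code wins."""
    counts = Counter(char_scripts)
    # most_common(1) returns the single highest-count (code, count) pair.
    return counts.most_common(1)[0][0]


def mean_pool_summarizer(features, weights, biases, labels):
    """Hypothetical summarizer: average the feature sequence, then linear layer + softmax.

    features: list of per-position feature vectors (stand-in for encoder output)
    weights:  one row of weights per label (assumed values, not learned here)
    biases:   one bias per label
    """
    dim = len(features[0])
    # Collapse the variable-length sequence into one fixed-size vector.
    pooled = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    # Numerically stable softmax over the label logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy inputs: per-character codes for the baseline, feature vectors for the summarizer.
baseline_label = count_vote(["Latn", "Latn", "Cyrl", "Latn"])
line_label, line_probs = mean_pool_summarizer(
    features=[[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]],
    weights=[[1.0, 0.0], [0.0, 1.0]],
    biases=[0.0, 0.0],
    labels=["Latn", "Cyrl"],
)
```

The baseline needs a per-character decision before it can vote, whereas the summarizer maps the whole feature sequence directly to one label distribution, which is the sequence-to-label framing described above.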