Google Research

Sequence-to-Label Script Identification for Multilingual OCR

Proceedings of the 14th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2017)

Abstract

We describe a novel line-level script identification method. In multilingual OCR, script identification is a crucial component as it automates the provision of a language hint. Previous work repurposed an OCR model that generates per-character script codes, aggregated by a counting heuristic to obtain line-level script ID. This baseline has two shortcomings. First, as a sequence-to-sequence model it is more complex than necessary for the sequence-to-label problem of line script ID, making it hard to train and inefficient to run. Second, the counting heuristic may be suboptimal compared to a learned model. Therefore we reframe line script identification as a sequence-to-label problem and solve it using two components trained end-to-end: an encoder and a summarizer. The encoder converts a line image into a sequence of features. The summarizer aggregates this sequence to classify the line. We test various summarizers while keeping identical inception-style convolutional networks as encoders. Experiments on scanned books and photos containing 232 languages in 30 scripts show a 16% reduction in script ID error rate compared to the baseline. This improved script ID reduces the character error rate attributable to script misidentification by 33%.
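The contrast between the two aggregation styles in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `count_vote` assumes the counting heuristic is a simple majority vote over per-character script codes, and `mean_pool_summarizer` is a hypothetical summarizer (mean pooling over the encoder's feature sequence followed by a linear softmax classifier) chosen only because it is simple. The abstract does not say which summarizers were tested, and all names, shapes, and weights below are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def count_vote(char_scripts: Sequence[str]) -> str:
    """Counting-style aggregation (assumed form): return the most frequent
    per-character script code on the line."""
    if not char_scripts:
        raise ValueError("line has no per-character script codes")
    return Counter(char_scripts).most_common(1)[0][0]


def mean_pool_summarizer(
    features: Sequence[Sequence[float]],  # T feature vectors of size D (encoder output)
    weights: Sequence[Sequence[float]],   # K rows of size D, one row per script
    bias: Sequence[float],                # K biases
    labels: Sequence[str],                # K script labels
) -> Tuple[str, List[float]]:
    """Hypothetical summarizer: average the feature sequence over time,
    then apply a linear layer and softmax to get one label per line."""
    if not features:
        raise ValueError("empty feature sequence")
    num_steps = len(features)
    dim = len(features[0])

    # Collapse the variable-length sequence into one fixed-size vector.
    pooled = [sum(step[d] for step in features) / num_steps for d in range(dim)]

    # Linear classifier over the pooled vector.
    logits = [
        sum(row[d] * pooled[d] for d in range(dim)) + b
        for row, b in zip(weights, bias)
    ]

    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    print(count_vote(["Latn", "Latn", "Cyrl"]))
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    label, probs = mean_pool_summarizer(
        feats, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], ["Latn", "Cyrl"]
    )
    print(label, probs)
```

In the sketch the weights are hand-set; in a sequence-to-label setup as described above, the classifier parameters would instead be learned jointly with the encoder, which is what distinguishes a learned summarizer from a fixed counting rule.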
