Sequence-to-Label Script Identification for Multilingual OCR

Ashok C Popat; Jonathan Michael Baccash; Karel Driesen; Patrick Michael Hurst; Yasuhisa Fujii

Proceedings of the 14th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2017)

Abstract

We describe a novel line-level script identification
method. In multilingual OCR, script identification is a crucial
component as it automates the provision of a language hint.
Previous work repurposed an OCR model that generates per-character
script codes, aggregated by a counting heuristic to
obtain line-level script ID. This baseline has two shortcomings.
First, as a sequence-to-sequence model it is more complex than
necessary for the sequence-to-label problem of line script ID,
making it hard to train and inefficient to run. Second, the counting
heuristic may be suboptimal compared to a learned model.
Therefore we reframe line script identification as a
sequence-to-label problem and solve it using two components trained
end-to-end: an encoder and a summarizer. The encoder converts a line
image into a sequence of features. The summarizer aggregates
this sequence to classify the line. We test various summarizers
while keeping identical inception-style convolutional networks as
encoders. Experiments on scanned books and photos containing
232 languages in 30 scripts show a 16% reduction in script ID error
rate compared to the baseline. This improved script ID reduces
the character error rate attributable to script misidentification
by 33%.
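
To make the contrast between the two aggregation strategies concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the ISO 15924-style labels ("Latn", "Cyrl"), and the hand-written logits are all assumptions made for illustration. In particular, mean pooling is only one plausible summarizer; the abstract does not say which summarizers were tested, and in the described system the encoder and summarizer are trained end-to-end rather than fed fixed numbers.

```python
from collections import Counter
import math


def count_vote(char_scripts):
    """Baseline-style aggregation: majority vote over per-character script codes.

    `char_scripts` is a hypothetical list of script codes, one per recognized
    character, as a sequence-to-sequence model might emit them.
    """
    if not char_scripts:
        raise ValueError("need at least one per-character script code")
    return Counter(char_scripts).most_common(1)[0][0]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool_summarizer(frame_logits, labels):
    """Sequence-to-label aggregation: average per-frame logits, then classify.

    `frame_logits` stands in for the encoder output: one list of per-script
    scores for each horizontal position in the line image (assumed shape).
    Returns the predicted label and the probability of every label.
    """
    if not frame_logits:
        raise ValueError("need at least one frame of logits")
    n = len(frame_logits)
    # Average each script's score across all frames in the sequence.
    pooled = [sum(column) / n for column in zip(*frame_logits)]
    probs = softmax(pooled)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


if __name__ == "__main__":
    labels = ["Latn", "Cyrl"]
    # Counting heuristic over hypothetical per-character codes.
    print(count_vote(["Latn", "Latn", "Cyrl"]))
    # Mean pooling over hypothetical encoder features for three frames.
    label, probs = mean_pool_summarizer(
        [[2.0, 0.0], [1.0, 0.5], [3.0, -1.0]], labels
    )
    print(label, probs)
```

The counting heuristic discards per-character confidence and treats every character equally, whereas a summarizer operating on the feature sequence can weigh evidence across the whole line. A learned summarizer would replace the fixed average with trainable parameters.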
