Sequence-to-Label Script Identification for Multilingual OCR
Abstract
We describe a novel line-level script identification
method. In multilingual OCR, script identification is a crucial
component as it automates the provision of a language hint.
Previous work repurposed an OCR model that generates per-character
script codes, which are then aggregated by a counting heuristic to
obtain a line-level script ID. This baseline has two shortcomings.
First, as a sequence-to-sequence model it is more complex than
necessary for the sequence-to-label problem of line script ID,
making it hard to train and inefficient to run. Second, the counting
heuristic may be suboptimal compared to a learned model.
Therefore we reframe line script identification as a
sequence-to-label problem and solve it using two components, trained
end-to-end: Encoder and Summarizer. The encoder converts a line
image into a sequence of features. The summarizer aggregates
this sequence to classify the line. We test various summarizers
while keeping identical Inception-style convolutional networks as
encoders. Experiments on scanned books and photos containing
232 languages in 30 scripts show a 16% reduction in script ID error
rate compared to the baseline. This improved script ID reduces
the character error rate attributable to script misidentification
by 33%.
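
To make the encoder-summarizer framing concrete, the following is a minimal sketch, not the authors' implementation. It assumes a PyTorch-style model in which a small stand-in CNN plays the role of the Inception-style encoder and a simple max-over-time pooling plays the role of the summarizer; all module names, layer sizes, and hyperparameters here are illustrative.

    # Minimal sketch (not the paper's code): encoder-summarizer pipeline for
    # line-level script identification, trained end-to-end with cross-entropy.
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Maps a line image (B, 1, H, W) to a feature sequence (B, T, C)."""
        def __init__(self, channels=64):
            super().__init__()
            # Stand-in for the Inception-style convolutional encoder.
            self.conv = nn.Sequential(
                nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
            )

        def forward(self, x):
            f = self.conv(x)           # (B, C, H', W')
            f = f.mean(dim=2)          # collapse height -> (B, C, W')
            return f.transpose(1, 2)   # sequence over width -> (B, T, C)

    class MaxPoolSummarizer(nn.Module):
        """Aggregates the feature sequence into one script label per line."""
        def __init__(self, channels=64, num_scripts=30):
            super().__init__()
            self.classifier = nn.Linear(channels, num_scripts)

        def forward(self, seq):
            pooled, _ = seq.max(dim=1)      # max over the time/width axis
            return self.classifier(pooled)  # (B, num_scripts) logits

    class ScriptIdentifier(nn.Module):
        """Encoder and summarizer composed into a sequence-to-label model."""
        def __init__(self, num_scripts=30):
            super().__init__()
            self.encoder = Encoder()
            self.summarizer = MaxPoolSummarizer(num_scripts=num_scripts)

        def forward(self, line_image):
            return self.summarizer(self.encoder(line_image))

    if __name__ == "__main__":
        model = ScriptIdentifier(num_scripts=30)
        dummy = torch.randn(2, 1, 40, 600)  # two grayscale text-line crops
        print(model(dummy).shape)           # torch.Size([2, 30])

As in the abstract, the summarizer is the component that would be varied while the encoder stays fixed; the max pooling over the width axis above is only one plausible aggregator, chosen here for brevity.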