Ashok Popat
Ashok C. Popat received the SB and SM degrees in Electrical Engineering from the Massachusetts Institute of Technology in 1986 and 1990, respectively, and the PhD from the MIT Media Lab in 1997. He is a Research Scientist at Google in Mountain View, CA. Prior to joining Google in 2005, he worked at Xerox PARC. His interests include signal processing, data compression, machine translation, and pattern recognition. He enjoys running, skiing, sailing, hiking, and spending time with his wife and two daughters.
Authored Publications
Post-OCR Paragraph Recognition by Graph Convolutional Networks
Winter Conference on Applications of Computer Vision (WACV) 2022
Abstract
Paragraphs are an important class of document entities. We propose a new approach to paragraph recognition using spatial graph convolutional networks (GCNs) applied to OCR text boxes. Two steps, line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a beta-skeleton graph constructed from bounding boxes, whose edges provide efficient support for graph convolution operations. With purely layout-based input features, the GCN model is three to four orders of magnitude smaller than R-CNN based models, while achieving comparable or better accuracy on PubLayNet and other datasets. Furthermore, the GCN models generalize well from synthetic training data to real-world images and adapt well to variable document styles.
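As a toy illustration of the graph-construction step the abstract describes, here is a minimal builder for the circle-based beta = 1 skeleton (the Gabriel graph) over box centers. This is a hedged sketch: the paper's actual skeleton parameters, input features, and GCN layers are not reproduced here.

```python
from itertools import combinations

def gabriel_graph(points):
    # Edge (i, j) exists iff no third point lies strictly inside the
    # circle whose diameter is the segment between points i and j
    # (the circle-based beta-skeleton with beta = 1).
    edges = []
    for (i, u), (j, v) in combinations(enumerate(points), 2):
        ok = True
        for k, w in enumerate(points):
            if k in (i, j):
                continue
            # w lies strictly inside the diameter circle iff the angle
            # u-w-v exceeds 90 degrees, i.e. (u - w) . (v - w) < 0.
            if (u[0] - w[0]) * (v[0] - w[0]) + (u[1] - w[1]) * (v[1] - w[1]) < 0:
                ok = False
                break
        if ok:
            edges.append((i, j))
    return edges

# Centers of four OCR word boxes on two slightly offset text lines.
centers = [(0, 0), (10, 0), (1, 6), (11, 6)]
edges = gabriel_graph(centers)
```

On this layout the long diagonal (0, 3) is pruned because box 1 falls inside its diameter circle, while within-line and near-vertical neighbors are kept, which is the sparsity property that makes graph convolutions cheap.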
ROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction
Chun-Liang Li
Chu Wang
Association for Computational Linguistics (ACL) (2021)
Abstract
Natural reading orders of words are crucial for information extraction from form-like documents. Despite recent advances in Graph Convolutional Networks (GCNs) for modeling the spatial layout patterns of documents, they have limited ability to capture the reading order of given word-level node representations in a graph. We propose Reading Order Equivariant Positional Encoding (ROPE), a new positional encoding technique designed to capture the sequential presentation of words in documents. ROPE generates unique reading order codes for neighboring words relative to the target word, given a word-level graph connectivity. We study two fundamental document entity extraction tasks, word labeling and word grouping, on the public FUNSD dataset and a large-scale payment dataset, and show that ROPE consistently improves existing GCNs by margins of up to 8.4% F1-score.
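One way to picture "unique reading order codes for neighboring words relative to the target word" is to rank each neighbor by its reading position relative to the target. This is a simplified sketch assuming each word carries a global reading-order index; the helper name and encoding are illustrative, not the paper's exact formulation.

```python
def rope_codes(reading_order, neighbors, target):
    # reading_order: word id -> position in the natural reading sequence.
    # Each neighbor receives its rank among the neighbors when sorted by
    # reading position relative to the target, so the code depends only
    # on reading order, not on how node ids happened to be assigned.
    t = reading_order[target]
    offsets = sorted(reading_order[n] - t for n in neighbors)
    return {n: offsets.index(reading_order[n] - t) for n in neighbors}
```

For example, with reading order a, b, c, d and target c whose graph neighbors are a, d, and b, the neighbors receive codes 0, 2, and 1 respectively, regardless of the order in which they were listed.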
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Proceedings of ICASSP 2020 (to appear)
Abstract
Humans do not acquire perceptual abilities the way we train machines. While machine learning algorithms typically operate on large collections of randomly chosen, explicitly labeled examples, human acquisition relies far more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a novel clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate hypothesized categories into relevant semantic classes. By jointly training a single sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the labels required to reach a desired classification performance.
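The coincidence objective in (i) can be pictured as a binary prediction of whether two embedded clips were observed together. This is a much-simplified sketch using a logistic loss on an embedding dot product, not the paper's exact loss:

```python
import math

def coincidence_loss(z_a, z_b, coincident):
    # Self-supervised "coincidence" objective (simplified): embeddings of
    # two clips should score high when the clips co-occurred (in time, or
    # across modalities) and low otherwise.
    score = sum(x * y for x, y in zip(z_a, z_b))
    p = 1.0 / (1.0 + math.exp(-score))  # probability of coincidence
    return -math.log(p) if coincident else -math.log(1.0 - p)
```

Under this objective, well-aligned embeddings incur low loss when labeled coincident and high loss otherwise, pushing co-occurring sounds together in embedding space without any class labels.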
Abstract
Many studies of (offline) Handwritten Text Recognition (HTR) systems have focused on building state-of-the-art models for line recognition on small corpora. However, adding HTR capability to a large-scale multilingual OCR system poses new challenges. This paper addresses three problems in building such systems: data, efficiency, and integration. First, one of the biggest challenges is obtaining sufficient amounts of high-quality training data. We address this by using online handwriting data collected for a large-scale production online handwriting recognition system. We describe our image data generation pipeline and study how online data can be used to build HTR models. We show that the data improve the models significantly when only a small number of real images is available, which is usually the case for HTR models, enabling us to support a new script at substantially lower cost. Second, we propose a line recognition model based on neural networks without recurrent connections. The model achieves accuracy comparable to LSTM-based models while allowing for better parallelism in training and inference. Finally, we present a simple way to integrate HTR models into an OCR system. Together, these constitute a solution for bringing HTR capability into a large-scale OCR system.
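The online-to-offline data idea can be pictured as rasterizing recorded pen trajectories into line images for training an image-based recognizer. This toy renderer is a hypothetical stand-in for the paper's image data generation pipeline:

```python
def render_strokes(strokes, width, height):
    # Rasterize online handwriting (lists of (x, y) pen positions, one
    # list per stroke) into a binary image by linearly interpolating
    # between consecutive pen samples.
    img = [[0] * width for _ in range(height)]
    for stroke in strokes:
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            steps = max(abs(x1 - x0), abs(y1 - y0), 1)
            for s in range(steps + 1):
                x = round(x0 + (x1 - x0) * s / steps)
                y = round(y0 + (y1 - y0) * s / steps)
                if 0 <= x < width and 0 <= y < height:
                    img[y][x] = 1
    return img
```

A real pipeline would additionally normalize, add background and distortions, and render with realistic pen width, but the core step is the same: ink trajectories in, training images out.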
Sequence-to-Label Script Identification for Multilingual OCR
Jonathan Michael Baccash
Patrick Michael Hurst
Proceedings of the 14th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2017)
Abstract
We describe a novel line-level script identification method. In multilingual OCR, script identification is a crucial component, as it automates the provision of a language hint. Previous work repurposed an OCR model that generates per-character script codes, aggregated by a counting heuristic to obtain a line-level script ID. This baseline has two shortcomings. First, as a sequence-to-sequence model it is more complex than necessary for the sequence-to-label problem of line script ID, making it hard to train and inefficient to run. Second, the counting heuristic may be suboptimal compared to a learned model. We therefore reframe line script identification as a sequence-to-label problem and solve it using two components trained end-to-end: an encoder and a summarizer. The encoder converts a line image into a sequence of features. The summarizer aggregates this sequence to classify the line. We test various summarizers while keeping identical inception-style convolutional networks as encoders. Experiments on scanned books and photos covering 232 languages in 30 scripts show a 16% reduction in script ID error rate compared to the baseline. This improved script ID reduces the character error rate attributable to script misidentification by 33%.
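The encoder/summarizer split can be sketched as follows, with max-pooling as one candidate summarizer. The feature sequence and class weights here are stand-ins for the paper's learned networks:

```python
def max_pool_summarizer(features, class_weights):
    # features: per-timestep feature vectors produced by an encoder.
    # class_weights: script label -> linear weights over pooled features.
    # Summarize the variable-length sequence by max-pooling each feature
    # dimension over time, then score each script with a linear layer.
    pooled = [max(f[d] for f in features) for d in range(len(features[0]))]
    scores = {c: sum(w * p for w, p in zip(ws, pooled))
              for c, ws in class_weights.items()}
    return max(scores, key=scores.get)
```

Because the summarizer emits a single label for the whole line, the system trains directly against the line-level script label, instead of decoding a per-character sequence and counting.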
Label Transition and Selection Pruning and Automatic Decoding Parameter Optimization for Time-Synchronous Viterbi Decoding
Dmitriy Genzel
Remco Teunen
13th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2015), pp. 756-760
HMM-based script identification for OCR
Dmitriy Genzel
Remco Teunen
Proceedings of the 4th International Workshop on Multilingual OCR, ACM, New York, NY, US (2013), 2:1-2:5
Abstract
While current OCR systems are able to recognize text in an increasing number of scripts and languages, they typically still need to be told in advance what those scripts and languages are. We propose an approach that repurposes the same HMM-based system used for OCR to the task of script/language ID by replacing character labels with script class labels. We apply it in a multi-pass overall OCR process which achieves "universal" OCR over 54 tested languages in 18 distinct scripts, over a wide variety of typefaces in each. For comparison we also consider a brute-force approach, wherein a single HMM-based OCR system is trained to recognize all considered scripts. Results are presented on a large and diverse evaluation set extracted from book images, both for script identification accuracy and for overall OCR accuracy. On this evaluation data, the script ID system achieved a script ID error rate of 1.73% over the 18 distinct scripts. The end-to-end OCR system with the script ID system achieved a character error rate of 4.05%, an increase of 0.77% over the case where the languages are known a priori.
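The multi-pass process can be sketched as below. The function names and models are hypothetical stand-ins: a first pass reuses the recognizer with script labels in place of character labels, then a second pass runs the OCR model for the winning script.

```python
def multipass_ocr(image, script_id_model, ocr_models):
    # Pass 1: the repurposed recognizer emits one script label per
    # recognized region; take the most frequent label as the page script.
    script_labels = script_id_model(image)
    script = max(set(script_labels), key=script_labels.count)
    # Pass 2: run the OCR model trained for the identified script.
    return ocr_models[script](image)
```

This is why the approach needs no separate script classifier: the same recognition machinery serves both passes, just with different label inventories.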
Abstract
Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages, as they cannot deal with more complex compound-forming processes. We present a novel, unsupervised method to learn the compound parts and morphological operations needed to split compounds into their parts. The method uses a bilingual corpus to learn the morphological operations required to split a compound into its parts, while monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate our method within a machine translation task and show significant improvements for various languages, demonstrating the versatility of the approach.
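For intuition, here is a much-simplified monolingual splitter over a known vocabulary of parts. The paper's method instead learns the parts and morphological operations from bilingual and monolingual corpora; this sketch ignores morphology entirely.

```python
def split_compound(word, vocab):
    # Greedily split a compound into known parts, preferring the longest
    # prefix found in the vocabulary; return None if no full split exists.
    if word in vocab:
        return [word]
    for i in range(len(word) - 1, 0, -1):
        if word[:i] in vocab:
            rest = split_compound(word[i:], vocab)
            if rest:
                return [word[:i]] + rest
    return None
```

Splitting "krankenwagen" with parts "kranken" and "wagen" yields the two compound parts a translation system can then translate individually, which is what makes unseen compounds tractable.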
Translation-Inspired OCR
Dmitriy Genzel
Nemanja Spasojevic
Michael Jahr
Frank Yung-Fong Tang
ICDAR-2011
Abstract
Optical character recognition is carried out using techniques borrowed from statistical machine translation. In particular, the use of multiple simple feature functions in linear combination, along with minimum-error-rate training, integrated decoding, and $N$-gram language modeling, is found to be remarkably effective across several scripts and languages. Results are presented using both synthetic and real data in five languages.
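The linear combination of feature functions can be sketched as follows. The feature functions and weights below are illustrative; in the paper the weights would be tuned by minimum-error-rate training.

```python
def best_hypothesis(hypotheses, feature_fns, weights):
    # SMT-style linear model: score each candidate transcription as a
    # weighted sum of simple feature functions (e.g. an image match score
    # and a language-model score), then pick the highest-scoring one.
    def score(h):
        return sum(w * f(h) for w, f in zip(weights, feature_fns))
    return max(hypotheses, key=score)
```

Keeping each feature function simple and letting the tuned weights arbitrate between them is what the abstract credits for the method's effectiveness across scripts.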
Large Scale Parallel Document Mining for Machine Translation
Jakob Uszkoreit
Jay Ponte
Moshe Dubiner
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Coling 2010 Organizing Committee, Beijing, China, pp. 1101-1109