Ray Smith
Ray developed the Tesseract OCR engine at HP Labs Bristol for 10 years, followed by a 3-year term developing the text and line-drawing pipelines for the HP PrecisionScan product in Greeley, Colorado. After spending a further 7 years developing a new architecture for the OmniPage OCR product for Caere/ScanSoft/Nuance, Ray is now at Google, working on Tesseract again.
Authored Publications
Improving Book OCR by Adaptive Language and Image Models
Dar-Shyang Lee
Proceedings of 2012 10th IAPR International Workshop on Document Analysis Systems, IEEE, pp. 115-119
In order to cope with the vast diversity of book content and typefaces, it is important for OCR systems to leverage the strong consistency within a book but adapt to variations across books. In this work, we describe a system that combines two parallel correction paths using document-specific image and language models. Each model adapts to shapes and vocabularies within a book to identify inconsistencies as correction hypotheses, but relies on the other for effective cross-validation. Using the open source Tesseract engine as baseline, results on a large dataset of scanned books demonstrate that word error rates can be reduced by 25% using this approach.
Limits on the Application of Frequency-based Language Models to OCR
ICDAR, IEEE (2011), pp. 538-542
Although large language models are used in speech recognition and machine translation applications, OCR systems are “far behind” in their use of language models. The reason for this is not the laggardness of the OCR community, but the fact that, at high accuracies, a frequency-based language model can do more damage than good, unless carefully applied. This paper presents an analysis of this discrepancy with the help of the Google Books n-gram Corpus, and concludes that noisy-channel models that closely model the underlying classifier and segmentation errors are required.
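The effect described in this abstract can be illustrated with a small worked example. This is only a sketch and not the paper's analysis: the candidate words, classifier probabilities, word frequencies, and the `rescore` helper are all hypothetical values chosen for illustration. It shows how an unweighted frequency prior can overturn a confident and correct classifier result in favor of a more common word.

```python
import math

def rescore(candidates, word_freq, lm_weight):
    """Return the word maximizing log P(classifier) + lm_weight * log P(word).

    candidates: list of (word, classifier_probability) pairs (hypothetical).
    word_freq:  dict mapping word -> relative corpus frequency (hypothetical).
    lm_weight:  0.0 ignores the language model, 1.0 applies it at full strength.
    """
    def score(item):
        word, p_classifier = item
        return math.log(p_classifier) + lm_weight * math.log(word_freq[word])
    return max(candidates, key=score)[0]

# Assumed scenario: the page really says "modem", and the classifier is 90% sure.
candidates = [("modem", 0.90), ("modern", 0.10)]
# Assumed frequencies: "modern" is 100 times more common than "modem".
word_freq = {"modem": 1e-6, "modern": 1e-4}

without_lm = rescore(candidates, word_freq, lm_weight=0.0)  # classifier only
full_lm = rescore(candidates, word_freq, lm_weight=1.0)     # frequency prior at full weight
print(without_lm, full_lm)
```

With these assumed numbers the classifier alone keeps "modem", while the full-strength frequency prior replaces it with "modern", which is the kind of damage the abstract warns about when the prior is not matched to the actual error behavior of the classifier.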
Table Detection in Heterogeneous Documents
Faisal Shafait
Document Analysis Systems 2010, ACM International Conference Proceedings series
Detecting tables in document images is important since not only do tables contain important information, but also most of the layout analysis methods fail in the presence of tables in the document image. Existing approaches for table detection mainly focus on detecting tables in single columns of text and do not work reliably on documents with varying layouts. This paper presents a practical algorithm for table detection that works with high accuracy on documents with varying layouts (company reports, newspaper articles, magazine pages, ...). An open source implementation of the algorithm is provided as part of the Tesseract OCR engine. Evaluation of the algorithm on document images from the publicly available UNLV dataset shows competitive performance in comparison to the table detection module of a commercial OCR system.
Hybrid Page Layout Analysis via Tab-Stop Detection
Proceedings of the 10th international conference on document analysis and recognition, IEEE (2009)
A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading-order on the detected regions.
The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr.
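As a rough illustration of the bottom-up idea of locating tab-stops, the sketch below groups the left edges of text bounding boxes that share nearly the same x-coordinate. It is a heavily simplified assumption and not the algorithm from the paper or the Tesseract source: the function name `candidate_tab_stops`, the `tolerance` and `min_support` parameters, and the input values are all hypothetical.

```python
def candidate_tab_stops(left_edges, tolerance=3, min_support=3):
    """Cluster x-coordinates of box left edges into candidate tab-stop positions.

    left_edges:  x-coordinates (pixels) of the left edges of text boxes.
    tolerance:   maximum gap between neighboring sorted edges in one cluster.
    min_support: minimum number of aligned edges needed to accept a cluster.
    Returns the mean x-coordinate of each accepted cluster, left to right.
    """
    stops = []
    cluster = []
    for x in sorted(left_edges):
        # A gap larger than the tolerance closes the current cluster.
        if cluster and x - cluster[-1] > tolerance:
            if len(cluster) >= min_support:
                stops.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(x)
    # Flush the final cluster.
    if len(cluster) >= min_support:
        stops.append(sum(cluster) / len(cluster))
    return stops

# Assumed input: two columns starting near x=50 and x=300, plus one stray indent.
edges = [50, 51, 49, 50, 300, 301, 299, 120]
print(candidate_tab_stops(edges))
```

In this toy input the isolated edge at x=120 has too little support to count as a tab-stop, so only the two column starts are reported. A column layout could then be hypothesized from such positions and applied top-down, as the abstract describes.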
Combined Orientation and Script Detection using the Tesseract OCR Engine
Ranjith Unnikrishnan
Workshop on Multilingual OCR (MOCR), Proc. 10th Intl. Conf. on Document Analysis and Recognition (ICDAR), (2009)
This paper proposes a simple but effective algorithm to estimate the script and dominant page orientation of the text contained in an image. A candidate set of shape classes for each script is generated using synthetically rendered text and used to train a fast shape classifier. At run time, the classifier is applied independently to connected components in the image for each possible orientation of the component, and the accumulated confidence scores are used to determine the best estimate of page orientation and script. Results demonstrate the effectiveness of the approach on a dataset of 1846 documents containing a diverse set of images in 14 scripts and any of four possible page orientations.
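The accumulation step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `classify` callback stands in for the trained shape classifier, and `toy_classify`, the component representation, and all confidence values are hypothetical.

```python
from collections import defaultdict

ORIENTATIONS = (0, 90, 180, 270)  # the four page orientations, in degrees

def estimate_orientation_and_script(components, classify):
    """Accumulate per-component confidences and pick the best orientation and script.

    components: iterable of connected components (any representation).
    classify:   hypothetical callback (component, orientation) -> (script, confidence).
    """
    orientation_totals = defaultdict(float)
    script_totals = defaultdict(float)
    for component in components:
        # Each component is classified independently at every orientation.
        for orientation in ORIENTATIONS:
            script, confidence = classify(component, orientation)
            orientation_totals[orientation] += confidence
            script_totals[(orientation, script)] += confidence
    best_orientation = max(orientation_totals, key=orientation_totals.get)
    # Among scripts seen at the winning orientation, take the highest total.
    scripts_at_best = [s for (o, s) in script_totals if o == best_orientation]
    best_script = max(scripts_at_best, key=lambda s: script_totals[(best_orientation, s)])
    return best_orientation, best_script

def toy_classify(component, orientation):
    # Stand-in classifier: each toy component is a precomputed lookup table.
    return component[orientation]

# Assumed classifier outputs for three components.
components = [
    {0: ("Latin", 0.9), 90: ("Latin", 0.2), 180: ("Cyrillic", 0.4), 270: ("Latin", 0.1)},
    {0: ("Latin", 0.7), 90: ("Cyrillic", 0.3), 180: ("Latin", 0.5), 270: ("Latin", 0.2)},
    {0: ("Cyrillic", 0.3), 90: ("Latin", 0.1), 180: ("Latin", 0.6), 270: ("Latin", 0.2)},
]
print(estimate_orientation_and_script(components, toy_classify))
```

Summing confidences over many components means that a single misclassified component, such as the third one here, does not decide the outcome on its own.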
Adapting the Tesseract Open Source OCR Engine for Multilingual OCR
Daria Antonova
Dar-Shyang Lee
MOCR '09: Proceedings of the International Workshop on Multilingual OCR (2009)
We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
©ACM, 2009. This is the authors’ version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the International Workshop on Multilingual OCR 2009, Barcelona, Spain July 25, 2009.
An Overview of the Tesseract OCR Engine
Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society (2007), pp. 629-633
The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy [1], is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.