Alessandro Bissacco
Alessandro Bissacco received his B.S. in Computer Engineering from the University of Padua, Italy, in 1997, and his Ph.D. in Computer Science from the University of California, Los Angeles (UCLA), in 2006.
He has been a software engineer at Google since January 2007, where he has worked on projects involving image matching, landmark recognition, object detection, text detection, and OCR. His contributions are in use in several Google services, such as Street View, Google Goggles, and Image Search. He currently leads Google's efforts to develop new technology for reading text from camera images in unconstrained environments, for products such as Google Goggles and Street View.
His research interests include image segmentation, object recognition, video segmentation, deep learning, and the application of Bayesian probabilistic models in computer vision.
Authored Publications
Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis
Winter Conference on Applications of Computer Vision 2024 (2024) (to appear)
We propose Hierarchical Text Spotter (HTS), the first method for the joint task of word-level text spotting and geometric layout analysis.
HTS can annotate text in images with a hierarchical representation of 4 levels: character, word, line, and paragraph.
The proposed HTS is characterized by two novel components:
(1) a Unified-Detector-Polygon (UDP) that produces Bezier Curve polygons of text lines and an affinity matrix for paragraph grouping between detected lines;
(2) a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and further merges them back into words.
HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks.
Code will be released upon acceptance.
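The paragraph-grouping step mentioned in the abstract above (an affinity matrix between detected lines) can be illustrated with a minimal sketch. It assumes, without confirmation from the abstract, that lines are grouped by thresholding the affinity matrix and taking connected components; the function name `group_lines_into_paragraphs` and the threshold of 0.5 are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Dict, List


def group_lines_into_paragraphs(
    affinity: List[List[float]], threshold: float = 0.5
) -> List[List[int]]:
    """Group line indices into paragraphs.

    Assumption (not from the paper's abstract): two lines belong to the
    same paragraph if their affinity exceeds `threshold`, and paragraphs
    are the connected components of the resulting graph.
    """
    n = len(affinity)
    parent = list(range(n))  # union-find forest, one node per text line

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so group ids are stable.
            parent[max(ri, rj)] = min(ri, rj)

    # Link every pair of lines whose affinity exceeds the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if affinity[i][j] > threshold:
                union(i, j)

    # Collect line indices by their root.
    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical affinity scores for four detected lines.
example_affinity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
paragraphs = group_lines_into_paragraphs(example_affinity)
```

With these made-up scores, lines 0 and 1 form one paragraph and lines 2 and 3 form another.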
Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation
International Conference on Document Analysis and Recognition (ICDAR) (2023) (to appear)
Text reading order is a crucial aspect in the output of an OCR engine, with a large impact on downstream tasks. Its difficulty lies in the large variation of domain specific layout structures, and is further exacerbated by real-world image degradations such as perspective distortions. We propose a lightweight, scalable and generalizable approach to identify text reading order with a multi-modal, multi-task graph convolutional network (GCN) running on a sparse layout based graph. Predictions from the model provide hints of bidimensional relations among text lines and layout region structures, upon which a post-processing cluster-and-sort algorithm generates an ordered sequence of all the text lines. The model is language-agnostic and runs effectively across multi-language datasets that contain various types of images taken in uncontrolled conditions, and it is small enough to be deployed on virtually any platform including mobile devices.
ICDAR 2023 Competition on Hierarchical Text Detection and Recognition
Dmitry Panteleev
ICDAR 2023: International Conference on Document Analysis and Recognition (2023)
We organize a competition on hierarchical text detection and recognition. The competition aims to promote research into deep learning models and systems that can simultaneously perform text detection and recognition and geometric layout analysis. We present details of the proposed competition organization, including tasks, datasets, evaluations, and schedule. During the competition period (from January 2nd, 2023 to April 1st, 2023), at least 50 submissions from more than 30 teams were made in the 2 proposed tasks. Considering the number of teams and submissions, we conclude that the HierText competition has been successfully held. In this report, we also present the competition results and insights from them.
Towards End-to-End Unified Scene Text Detection and Layout Analysis
Dmitry Panteleev
CVPR 2022 (2022)
Scene text detection and document layout analysis have long been treated as two separate tasks in different image domains. In this paper, we bring them together and introduce the task of unified scene text detection and layout analysis. The first hierarchical scene text dataset is introduced to enable this novel research task. We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way. Comprehensive experiments show that our unified model achieves better performance than multiple well-designed baseline methods. Additionally, this model achieves state-of-the-art results on multiple scene text detection datasets without the need for complex post-processing. Dataset and code: https://github.com/google-research-datasets/hiertext.
We propose an end-to-end trainable network that can simultaneously detect and recognize text of arbitrary curved path, making substantial progress on the open problem of reading scene text of irregular shape. We formulate arbitrary shape text detection as an instance segmentation problem; an attention model is then used to decode the textual content of each irregularly shaped text region without rectification. To extract useful irregularly shaped text instance features from image scale features, we propose a simple yet effective RoI masking step. Finally, we show that predictions from an existing multi-step OCR engine can be leveraged as partially labeled training data, which leads to significant improvements in both the detection and recognition accuracy of our model. Our method surpasses the state-of-the-art for end-to-end recognition tasks on the ICDAR15 (straight) benchmark by 4.6%, and on the Total-Text (curved) benchmark by more than 16%.
Reading Digits in Natural Images with Unsupervised Feature Learning
Yuval Netzer
Tao Wang
Adam Coates
Bo Wu
Andrew Y. Ng
NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011
Detecting and reading text from natural images is a hard computer vision task that is central to a variety of emerging applications. Related problems like document character recognition have been widely studied by computer vision and machine learning researchers and are virtually solved for practical applications like reading handwritten digits. Reliably recognizing characters in more complex scenes like photographs, however, is far more difficult: the best existing methods lag well behind human performance on the same tasks. In this paper we attack the problem of recognizing digits in a real application using unsupervised feature learning methods: reading house numbers from street level photos. To this end, we introduce a new benchmark dataset for research use containing over 600,000 labeled digits cropped from Street View images. We then demonstrate the difficulty of recognizing these digits when the problem is approached with hand-designed features. Finally, we employ variants of two recently proposed unsupervised feature learning methods and find that they are convincingly superior on our benchmarks.
Tour the World: building a web-scale landmark recognition engine
Yantao Zheng
Ulrich Buddemeier
Fernando Brucher
Tat-Seng Chua
International Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Tour the world: a technical demonstration of a web-scale landmark recognition engine
Yan-Tao Zheng
Ulrich Buddemeier
Fernando Brucher
Tat-Seng Chua
MM '09: Proceedings of the seventeenth ACM international conference on Multimedia, ACM, New York, NY, USA (2009), pp. 961-962
Large-scale Privacy Protection in Google Street View
Andrea Frome
German Cheung
Ahmad Abdulkader
Marco Zennaro
Bo Wu
Luc Vincent
IEEE International Conference on Computer Vision (2009)
The last two years have witnessed the introduction and rapid expansion of products based upon large, systematically-gathered, street-level image collections, such as Google Street View, EveryScape, and Mapjack. In the process of gathering images of public spaces, these projects also capture license plates, faces, and other information considered sensitive from a privacy standpoint. In this work, we present a system that addresses the challenge of automatically detecting and blurring faces and license plates for the purpose of privacy protection in Google Street View. Though some in the field would claim face detection is "solved", we show that state-of-the-art face detectors alone are not sufficient to achieve the recall desired for large-scale privacy protection. In this paper we present a system that combines a standard sliding-window detector tuned for a high-recall, low-precision operating point with a fast post-processing stage that is able to remove additional false positives by incorporating domain-specific information not available to the sliding-window detector. Using a completely automatic system, we are able to sufficiently blur more than 89% of faces and 94-96% of license plates in evaluation sets sampled from Google Street View imagery.
The full paper will appear from IEEE.