Alessandro Bissacco

Alessandro Bissacco received his B.S. in Computer Engineering from the University of Padua, Italy, in 1997, and his Ph.D. in Computer Science from the University of California, Los Angeles (UCLA), in 2006. He has been a software engineer at Google since January 2007. He has worked on projects involving image matching, landmark recognition, object detection, text detection, and OCR. His contributions are in use in several Google services, such as Street View, Google Goggles, and Image Search. He currently leads Google's efforts to develop new technology for reading text from camera images in unconstrained environments, for products such as Google Goggles and Street View. His research interests include image segmentation, object recognition, video segmentation, deep learning, and the application of Bayesian probabilistic models in computer vision.
Authored Publications
    We propose Hierarchical Text Spotter (HTS), the first method for the joint task of word-level text spotting and geometric layout analysis. HTS can annotate text in images with a hierarchical representation of 4 levels: character, word, line, and paragraph. The proposed HTS is characterized by two novel components: (1) a Unified-Detector-Polygon (UDP) that produces Bezier Curve polygons of text lines and an affinity matrix for paragraph grouping between detected lines; (2) a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and further merges them back into words. HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks. Code will be released upon acceptance.
    We organize a competition on hierarchical text detection and recognition. The competition aims to promote research into deep learning models and systems that can simultaneously perform text detection and recognition and geometric layout analysis. We present details of the proposed competition organization, including tasks, datasets, evaluations, and schedule. During the competition period (from January 2nd 2023 to April 1st 2023), at least 50 submissions from more than 30 teams were made in the 2 proposed tasks. Considering the number of teams and submissions, we conclude that the HierText competition has been successfully held. In this report, we also present the competition results and the insights drawn from them.
    Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation
    International Conference on Document Analysis and Recognition (ICDAR) (2023) (to appear)
    Text reading order is a crucial aspect in the output of an OCR engine, with a large impact on downstream tasks. Its difficulty lies in the large variation of domain specific layout structures, and is further exacerbated by real-world image degradations such as perspective distortions. We propose a lightweight, scalable and generalizable approach to identify text reading order with a multi-modal, multi-task graph convolutional network (GCN) running on a sparse layout based graph. Predictions from the model provide hints of bidimensional relations among text lines and layout region structures, upon which a post-processing cluster-and-sort algorithm generates an ordered sequence of all the text lines. The model is language-agnostic and runs effectively across multi-language datasets that contain various types of images taken in uncontrolled conditions, and it is small enough to be deployed on virtually any platform including mobile devices.
    Scene text detection and document layout analysis have long been treated as two separate tasks in different image domains. In this paper, we bring them together and introduce the task of unified scene text detection and layout analysis. The first hierarchical scene text dataset is introduced to enable this novel research task. We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way. Comprehensive experiments show that our unified model achieves better performance than multiple well-designed baseline methods. Additionally, this model achieves state-of-the-art results on multiple scene text detection datasets without the need for complex post-processing. Dataset and code: https://github.com/google-research-datasets/hiertext.
    We propose an end-to-end trainable network that can simultaneously detect and recognize text of arbitrary curved path, making substantial progress on the open problem of reading scene text of irregular shape. We formulate arbitrary shape text detection as an instance segmentation problem; an attention model is then used to decode the textual content of each irregularly shaped text region without rectification. To extract useful irregularly shaped text instance features from image scale features, we propose a simple yet effective RoI masking step. Finally, we show that predictions from an existing multi-step OCR engine can be leveraged as partially labeled training data, which leads to significant improvements in both the detection and recognition accuracy of our model. Our method surpasses the state-of-the-art for end-to-end recognition tasks on the ICDAR15 (straight) benchmark by 4.6%, and on the Total-Text (curved) benchmark by more than 16%.
    Reading Digits in Natural Images with Unsupervised Feature Learning
    Yuval Netzer
    Tao Wang
    Adam Coates
    Bo Wu
    Andrew Y. Ng
    NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011
    Detecting and reading text from natural images is a hard computer vision task that is central to a variety of emerging applications. Related problems like document character recognition have been widely studied by computer vision and machine learning researchers and are virtually solved for practical applications like reading handwritten digits. Reliably recognizing characters in more complex scenes like photographs, however, is far more difficult: the best existing methods lag well behind human performance on the same tasks. In this paper we attack the problem of recognizing digits in a real application using unsupervised feature learning methods: reading house numbers from street level photos. To this end, we introduce a new benchmark dataset for research use containing over 600,000 labeled digits cropped from Street View images. We then demonstrate the difficulty of recognizing these digits when the problem is approached with hand-designed features. Finally, we employ variants of two recently proposed unsupervised feature learning methods and find that they are convincingly superior on our benchmarks.
    Tour the World: building a web-scale landmark recognition engine
    Yantao Zheng
    Ulrich Buddemeier
    Fernando Brucher
    Tat-Seng Chua
    International Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
    Large-scale Privacy Protection in Google Street View
    Andrea Frome
    German Cheung
    Ahmad Abdulkader
    Marco Zennaro
    Bo Wu
    Luc Vincent
    IEEE International Conference on Computer Vision (2009)
    The last two years have witnessed the introduction and rapid expansion of products based upon large, systematically-gathered, street-level image collections, such as Google Street View, EveryScape, and Mapjack. In the process of gathering images of public spaces, these projects also capture license plates, faces, and other information considered sensitive from a privacy standpoint. In this work, we present a system that addresses the challenge of automatically detecting and blurring faces and license plates for the purpose of privacy protection in Google Street View. Though some in the field would claim face detection is "solved", we show that state-of-the-art face detectors alone are not sufficient to achieve the recall desired for large-scale privacy protection. In this paper we present a system that combines a standard sliding-window detector tuned for a high-recall, low-precision operating point with a fast post-processing stage that is able to remove additional false positives by incorporating domain-specific information not available to the sliding-window detector. Using a completely automatic system, we are able to sufficiently blur more than 89% of faces and 94-96% of license plates in evaluation sets sampled from Google Street View imagery. The full paper will appear from IEEE.
    Tour the world: a technical demonstration of a web-scale landmark recognition engine
    Yan-Tao Zheng
    Ulrich Buddemeier
    Fernando Brucher
    Tat-Seng Chua
    MM '09: Proceedings of the 17th ACM international conference on Multimedia, ACM, New York, NY, USA (2009), pp. 961-962