Guided Attention for Large Scale Scene Text Verification
Abstract
Many tasks involve determining whether a particular text string appears in an image. In this work, we propose a model called Guided Attention that learns this task end-to-end. The model takes an image and a text string as input and outputs the probability that the text string is present in the image. This is the first end-to-end model that learns such relationships between text and images without requiring explicit scene text detection or recognition. Such a model can be applied to a variety of tasks that require knowing whether a named entity is present in an image. Furthermore, the model does not need any bounding box annotation, and it is the first work in the scene text area to tackle such a problem. We show that our method outperforms several state-of-the-art methods on a challenging Street View Business Matching dataset, which contains millions of images. In addition, we demonstrate the uniqueness of our task by comparing it with a typical Visual Question Answering (VQA) problem, which also takes an image and a sequence of words as input. This new real-world task offers a new perspective for research that combines images and text.