Guided Attention for Large Scale Scene Text Verification
Abstract
Many tasks involve determining whether a particular text string appears in an image. In this work, we propose a model called Guided Attention that learns this task end-to-end. The model takes an image and a text string as input and outputs the probability that the text string is present in the image. This is the first end-to-end model that learns such relationships between text and images without requiring explicit scene text detection or recognition. Such a model can be applied to a variety of tasks that require knowing whether a named entity is present in an image. Furthermore, the model does not need any bounding box annotation, and it is the first work in the scene text area to tackle such a problem. We show that our method outperforms several state-of-the-art methods on a challenging Street View Business Matching dataset, which contains millions of images. In addition, we demonstrate the uniqueness of our task by comparing our problem with a typical Visual Question Answering (VQA) problem, which also takes an image and a sequence of words as input. This new real-world task provides a new perspective for research that combines images and text.