DeViSE: A Deep Visual-Semantic Embedding Model

Andrea Frome; Greg Corrado; Jonathon Shlens; Samy Bengio; Jeffrey Dean; Marc’Aurelio Ranzato; Tomas Mikolov

Neural Information Processing Systems (NIPS) (2013)

Abstract

Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions, achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.
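The abstract does not spell out how a zero-shot prediction is made, so the following is only a minimal sketch under stated assumptions: that images and labels share one embedding space, that a label is chosen by cosine similarity to the image's vector, and that a "hit rate" means the correct label appearing among the top k candidates. The function names (`rank_labels`, `hit_at_k`) and all vector values are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_labels(image_vec, label_vecs):
    """Return label names sorted by decreasing cosine similarity to image_vec."""
    q = normalize(image_vec)
    scores = {
        name: sum(a * b for a, b in zip(q, normalize(vec)))
        for name, vec in label_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def hit_at_k(ranked, true_label, k):
    """True if the correct label is among the top k ranked labels."""
    return true_label in ranked[:k]

# Toy, hand-made 3-d "semantic" vectors (assumed values, not real data).
label_vecs = {
    "tabby cat": [0.9, 0.1, 0.0],
    "lynx": [0.8, 0.3, 0.1],        # imagined as unseen during visual training
    "school bus": [0.0, 0.1, 0.9],
}

# Hypothetical output of a visual model projected into the same space.
image_vec = [0.75, 0.35, 0.05]

ranked = rank_labels(image_vec, label_vecs)
print(ranked)
print(hit_at_k(ranked, "lynx", k=1))
```

In this toy setup the label vectors come only from the (assumed) text-derived space, so a label such as "lynx" can be ranked even though no image of it was ever used to train the visual side, which is the intuition behind the zero-shot claim.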
