Large Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings

Jason Weston
Samy Bengio
Nicolas Usunier
European Conference on Machine Learning (2010)

Abstract

Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method outperforms several baseline methods and is also faster and consumes less memory than they do. We also demonstrate that our method learns an interpretable model, in which annotations with alternate spellings, or even in different languages, are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify with the newly introduced "sibling" precision metric, on which our method also obtains excellent results.
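
To make the two ingredients named in the abstract concrete, the following is a minimal sketch of ranking annotations in a joint embedding space and scoring the ranked list with precision at k. It is an illustration under stated assumptions, not the authors' implementation: the linear image map `V`, the annotation embedding table `W`, the dot-product score, and all toy values are hypothetical choices made here, and no training procedure is shown.

```python
from typing import List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def embed_image(x: Sequence[float], V: Sequence[Sequence[float]]) -> List[float]:
    """Map image features x into the joint space.

    Assumption: a linear map, with one row of V per embedding dimension.
    """
    return [dot(row, x) for row in V]


def rank_annotations(
    x: Sequence[float],
    V: Sequence[Sequence[float]],
    W: Sequence[Sequence[float]],
) -> List[int]:
    """Return annotation indices sorted by decreasing similarity to the image.

    Assumption: W holds one embedding vector per annotation, and similarity
    is the dot product in the joint space.
    """
    z = embed_image(x, V)
    scores = [dot(w, z) for w in W]
    return sorted(range(len(W)), key=lambda i: scores[i], reverse=True)


def precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k ranked annotations that are relevant."""
    return sum(1 for i in ranked[:k] if i in relevant) / k


# Toy example (all values hypothetical): 3 image features, a 2-dimensional
# joint space, and 4 candidate annotations.
V = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, 0.0],
     [1.0, 1.0]]
x = [1.0, 0.5, 2.0]

ranked = rank_annotations(x, V, W)
p_at_2 = precision_at_k(ranked, relevant={0, 2}, k=2)
```

In this sketch, the low-dimensional space is what keeps memory use small: each image and each annotation is stored as a short vector, and annotating an image reduces to comparing one embedded vector against the annotation table.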

Research Areas