Illustrative Language Understanding: Large-Scale Visual Grounding with Google Image Search
Abstract
We introduce Picturebook, a large-scale lookup operation that grounds language in ‘snapshots’ of our physical world accessed through image search. For each word in a vocabulary, we retrieve the top-k images returned by Google Image Search and feed them through a convolutional network to extract a word embedding. We introduce a multimodal gating function to fuse our Picturebook embeddings with other word representations. We also introduce Inverse Picturebook, a mechanism to map a Picturebook embedding back into words. We report results across a wide range of tasks: word similarity, natural language inference, semantic relatedness, sentiment/topic classification, image-sentence ranking, and machine translation. We also show that gate activations corresponding to Picturebook embeddings are strongly correlated with human judgments of word concreteness.
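As a rough sketch of the lookup itself: for each vocabulary word, the top-k search results are pushed through a convolutional network and the resulting feature vectors are combined into a single embedding. Both helper functions below (`search_top_k_images`, `convnet_features`) are hypothetical placeholders, since the abstract does not specify the retrieval pipeline or the network; concatenating the k feature vectors is one plausible way to combine them.

```python
import numpy as np

rng = np.random.default_rng(2)

def search_top_k_images(word, k=10):
    """Hypothetical stand-in for a Google Image Search query.

    Returns k fake 'images' as pixel arrays; a real pipeline would
    download the top-k search results for `word`.
    """
    return [rng.random((224, 224, 3)) for _ in range(k)]

def convnet_features(image):
    """Hypothetical stand-in for a pretrained convolutional network.

    A real pipeline would return, e.g., the penultimate-layer
    activations of an ImageNet-trained model.
    """
    return rng.normal(size=256)

def picturebook_embedding(word, k=10):
    """Build one Picturebook embedding: concatenate k image feature vectors."""
    images = search_top_k_images(word, k)
    feats = [convnet_features(img) for img in images]
    return np.concatenate(feats)  # shape: (k * 256,)

print(picturebook_embedding("bicycle").shape)  # (2560,)
```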
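The multimodal gating function can likewise be summarized in a short sketch. The following is an illustration, not necessarily the authors' exact formulation: it assumes a learned sigmoid gate computed from the concatenation of a distributional (e.g. GloVe-style) embedding and a Picturebook embedding, which then mixes the two projected representations elementwise. All names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_text, d_img, d_out = 300, 2560, 300  # placeholder dimensions

# Hypothetical learned parameters (randomly initialized here).
W_t = rng.normal(scale=0.01, size=(d_out, d_text))          # projects the text embedding
W_p = rng.normal(scale=0.01, size=(d_out, d_img))           # projects the Picturebook embedding
W_g = rng.normal(scale=0.01, size=(d_out, d_text + d_img))  # gate parameters
b_g = np.zeros(d_out)

def fuse_embeddings(e_text, e_pic):
    """Gate between a distributional and a Picturebook embedding.

    A sigmoid gate (one value per output dimension) decides how much
    of each modality to keep; concrete words would be expected to
    push the gate toward the visual channel.
    """
    g = sigmoid(W_g @ np.concatenate([e_text, e_pic]) + b_g)
    return g * np.tanh(W_t @ e_text) + (1.0 - g) * np.tanh(W_p @ e_pic)

# Toy usage with random stand-ins for real embeddings.
e_text = rng.normal(size=d_text)  # e.g. a GloVe vector
e_pic = rng.normal(size=d_img)    # e.g. concatenated convnet features
print(fuse_embeddings(e_text, e_pic).shape)  # (300,)
```

Inspecting the gate values `g` per word is also what makes the reported correlation with human concreteness ratings measurable.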
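Inverse Picturebook is described only as a mapping from a Picturebook embedding back into words. One natural reading, sketched below under that assumption, is a similarity lookup against the table of per-word Picturebook embeddings (a softmax over the same scores would yield a distribution over the vocabulary instead of a hard ranking). The vocabulary and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Picturebook table: one row of convnet features per word,
# L2-normalized so that dot products are cosine similarities.
vocab = ["dog", "cat", "justice", "bicycle"]
picturebook = rng.normal(size=(len(vocab), 2560))
picturebook /= np.linalg.norm(picturebook, axis=1, keepdims=True)

def inverse_picturebook(query, top_n=3):
    """Map an embedding back into words via cosine similarity."""
    q = query / np.linalg.norm(query)
    scores = picturebook @ q
    order = np.argsort(-scores)[:top_n]
    return [(vocab[i], float(scores[i])) for i in order]

# Querying with a noisy copy of "dog" should rank "dog" first.
query = picturebook[0] + 0.1 * rng.normal(size=2560)
print(inverse_picturebook(query))
```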