DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections

Yury Zemlyanskiy; Sudeep Gandhe; Ruining He; Bhargav Kanagal; Anirudh Ravula; Juro Gottweis; Fei Sha; Ilya Eckstein

Proceedings of EACL (2021) (to appear)

Abstract

This paper explores learning rich self-supervised entity representations from large amounts of associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked retrieval for search, knowledge base completion, question answering and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include any available text related to an entity. With the breadth and depth of textual content available on the web, this approach enables a new class of powerful, high-capacity representations that can ultimately "remember" any useful information about an entity, without the need for human annotations.

We present several training strategies that jointly learn to predict words and entities, and we compare them experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. The results show that our models outperform competitive baselines, sometimes with little or no fine-tuning, and are also able to scale to very large corpora.

Finally, we make our datasets and pre-trained models publicly available (to be released after the review period). This includes Reviews2Movielens, which maps the 1B-word corpus of Amazon movie reviews to MovieLens tags, as well as Reddit Movie Suggestions, which contains natural language queries and corresponding community recommendations.

Research Areas

Information retrieval
