NewsEmbed: Modeling News through Pretrained Document Representations

Cong Yu; Jialu Liu; Tianqi Liu

KDD 2021

Abstract

Effectively modeling text-rich fresh content such as news articles
and blog posts is a challenging problem. To ensure that a content-based
model generalizes well to a broad range of applications, it is critical
to have a training dataset that is larger than what human labeling can
provide while still achieving the desired quality. In this work, we
address those two challenges by proposing a novel approach to mine
semantically relevant fresh documents, and their topic labels, with
little human supervision. Specifically, we design a multitask model
that alternates between training on a contrastive learning task and a
multi-label classification task to derive a universal document encoder.
We show that this approach can provide billions of high-quality organic
training examples and can be naturally extended to a multilingual setting
where texts in different languages are encoded in the same semantic
space. We experimentally demonstrate NewsEmbed’s competitive
performance across multiple natural language understanding tasks,
both supervised and unsupervised.
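
The alternating multitask training described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss functions or the training schedule, so everything below is an assumption made for illustration: an in-batch softmax contrastive loss, a binary cross-entropy loss over independent topic labels, and a strict one-step-each alternation. The names `contrastive_loss`, `multilabel_loss`, and `task_schedule` are hypothetical and not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid dividing by zero
    return [a / norm for a in v]


def contrastive_loss(anchors, positives, temperature=0.1):
    """In-batch softmax contrastive loss (an assumed form).

    anchors[i] should match positives[i]; every other positive in the
    batch is treated as a negative for anchor i.
    """
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def multilabel_loss(logits, labels):
    """Mean binary cross-entropy over independent topic labels (assumed form)."""
    total, count = 0.0, 0
    for row_logits, row_labels in zip(logits, labels):
        for z, y in zip(row_logits, row_labels):
            # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


def task_schedule(num_steps):
    """Alternate the two tasks step by step (a hypothetical schedule)."""
    return ["contrastive" if step % 2 == 0 else "classification"
            for step in range(num_steps)]
```

In a real training loop, both losses would be computed on the output of one shared document encoder, so that gradient updates from either task shape the same representation. The encoder itself is omitted here because the abstract does not describe its architecture.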

Research Areas

Natural language processing
