NewsEmbed: Modeling News through Pretrained Document Representations

Cong Yu

Effectively modeling text-rich fresh content such as news articles and blog posts is a challenging problem. To ensure that a content-based model generalizes well to a broad range of applications, it is critical to have a training dataset that is large beyond the scale of human labels while achieving the desired quality. In this work, we address those two challenges by proposing a novel approach to mine semantically relevant fresh documents, and their topic labels, with little human supervision. Specifically, we design a multitask model that alternately trains a contrastive learning objective with a multi-label classification objective to derive a universal document encoder. We show that this approach can provide billions of high-quality organic training examples and can be naturally extended to a multilingual setting where texts in different languages are encoded in the same semantic space. We experimentally demonstrate NewsEmbed’s competitive performance across multiple natural language understanding tasks, both supervised and unsupervised.
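
The abstract does not spell out the loss functions or the training schedule, so the following is only a minimal sketch of how a contrastive objective and a multi-label classification objective might alternate over a shared encoder's outputs. It assumes an in-batch softmax contrastive loss and a sigmoid cross-entropy loss; the function names, the temperature value, and the even/odd step schedule are all hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def contrastive_loss(anchors, positives, temperature=0.05):
    """In-batch softmax contrastive loss (an assumed choice of objective).

    Row i of `positives` is the relevant document for row i of `anchors`;
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def multilabel_loss(logits, labels):
    """Mean sigmoid cross-entropy over all (document, topic) pairs.

    `labels` is a multi-hot array: a document may carry several topics.
    """
    per_entry = (np.maximum(logits, 0) - logits * labels
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(per_entry.mean())

def task_for_step(step):
    """Hypothetical schedule: even steps contrastive, odd steps classification."""
    return "contrastive" if step % 2 == 0 else "classification"
```

In a real training loop, both losses would backpropagate into the same encoder parameters, which is what would let the two tasks shape a single document representation.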