Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

Dara Bahri; Yi Tay; Che Zheng; Don Metzler; Cliff Brunk; Andrew Tomkins

WSDM 2021 (2021)

Abstract

Large generative language models such as GPT-2 are well known not only for their ability to generate highly realistic text but also for their utility in common downstream tasks. However, how and in what settings one can best leverage these powerful language models is still a nascent research question. In this work, we explore their use in predicting "language quality", a notion of the coherence and understandability of text. Our key finding is that, when trained in a self-discriminating fashion, large language models emerge as unsupervised predictors of such language quality. This enables fast bootstrapping of quality indicators in a low-resource setting. We conduct extensive qualitative and quantitative analysis over 500 million web articles, the largest-scale study conducted on this topic.
