Deduplicating Training Data Makes Language Models Better

Andrew Nystrom; Chiyuan Zhang; Chris Callison-Burch; Daphne Ippolito; Douglas Eck; Katherine Lee; Nicholas Carlini

(2022) (to appear)

Abstract

As large language models scale up, researchers and engineers have chosen to use larger datasets of loosely-filtered internet text instead of curated texts.
We find that existing NLP datasets are highly repetitive and contain duplicated examples.
For instance, a single example in the C4 training dataset has over 200,000 near-duplicates.
Overall, we find that 1.68% of C4 consists of near-duplicates.
Worse, we find a 1% overlap between the training and testing sets in these datasets.
Duplicate examples in training data inappropriately bias the distribution of rare and common sequences.
Models trained on non-deduplicated datasets are more likely to generate "memorized" examples.
Additionally, when those models are used in downstream applications, such as scoring the likelihood of given sequences, we find that models trained on non-deduplicated and deduplicated datasets differ in accuracy by, on average, TODO.
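
To make the notions of "near-duplicate" and train/test overlap concrete, the following is a minimal sketch and not necessarily the procedure used by the authors. It assumes near-duplicates are identified by Jaccard similarity over word 3-grams with a cutoff of 0.8; the function names, the n-gram size, and the threshold are all illustrative assumptions.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in `text` (n=3 is an assumed default)."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two texts."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


def overlap_fraction(train, test, threshold=0.8):
    """Fraction of test examples that have a near-duplicate in train.

    The 0.8 threshold is an assumption chosen for illustration.
    """
    if not test:
        return 0.0
    hits = sum(
        any(jaccard(t, s) >= threshold for s in train) for t in test
    )
    return hits / len(test)


if __name__ == "__main__":
    train = [
        "the cat sat on the mat today",
        "completely different sentence about language models here",
    ]
    test = [
        "the cat sat on the mat today",
        "nothing in common with the training set at all",
    ]
    print(overlap_fraction(train, test))
```

This brute-force comparison is quadratic in the number of examples, so it is only practical for small samples; at the scale of a dataset like C4, an approximate technique such as MinHash would typically be used instead.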
