Google Research



Neural language models have developed rapidly in recent years and play a fundamental role in the success of natural language processing (NLP). Many studies have demonstrated that incorporating pre-trained neural language models (PreLMs) into task-specific models can dramatically improve performance on the target task. In other words, PreLMs learned from large-scale text datasets can effectively serve as universal features for a variety of NLP tasks.

We focus on the training data of PreLMs and explore a subset of C4 (Colossal Clean Crawled Corpus) that can be used to train a language model whose performance equals or exceeds that of a large-scale PreLM. We refer to this representative subset of the original full C4 training dataset as the "representative dataset", or "RepSet" for short. If such a representative subset can be extracted, research on PreLMs becomes possible with more practical computational resources and research budgets.

We provide a list of URLs extracted from the C4 data. A naive and straightforward way to use this dataset is to download the URL list and extract the corresponding documents from the original CommonCrawl data on which C4 is defined.
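
The naive approach described above can be sketched as a simple URL filter. This is a minimal illustration, not the authors' extraction pipeline. It assumes that the URL list is a plain-text file with one URL per line, that the C4 documents are available locally as gzipped JSON Lines shards, and that each record carries a `url` field. The function names `load_url_set`, `read_jsonl_gz`, and `filter_records` are hypothetical helpers introduced only for this example.

```python
import gzip
import json
from typing import Iterable, Iterator, Set


def load_url_set(path: str) -> Set[str]:
    """Read the URL list into a set (assumed format: one URL per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzipped JSON Lines shard."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def filter_records(records: Iterable[dict], urls: Set[str]) -> Iterator[dict]:
    """Keep only records whose 'url' field (assumed name) is in the URL set."""
    for record in records:
        if record.get("url") in urls:
            yield record
```

With these helpers, a shard could be filtered with `filter_records(read_jsonl_gz(shard_path), load_url_set(url_list_path))`, where both paths are placeholders for local files. Holding the URLs in a set keeps each membership check constant-time on average, which matters when the shards are streamed rather than loaded into memory.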