Extracting Representative Subset from Massive Raw Texts for Training Pre-trained Neural Language Models

Jun Suzuki
Information Processing & Management Conference, 60(2023) (to appear)


This paper investigates whether training neural language models on a small subset of representative data selected from a large training dataset can achieve the same level of performance as training on all the original data. We explore likelihood-based scoring for obtaining such representative subsets, which we call RepSet. Our experiments confirm that a representative subset obtained with a likelihood difference-based score can achieve the 90% performance level even when the dataset is reduced to about one-thousandth of the original data. We also show that the performance of the random selection method deteriorates significantly when the amount of data is reduced.
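
To make the idea of a likelihood difference-based score concrete, the following is a minimal sketch under assumptions that the abstract does not state. It assumes the score of a sentence is its per-token log-likelihood under a model fit to a small reference sample minus its per-token log-likelihood under a model fit to the whole pool, and that the highest-scoring sentences are kept. The toy unigram models, the `reference` set, and the function names `train_unigram`, `avg_log_likelihood`, and `select_subset` are all hypothetical illustrations and may differ from the paper's actual scoring.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return token counts and the total token count for a toy unigram model."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def avg_log_likelihood(sentence, model, vocab_size):
    """Per-token log-likelihood under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0  # empty sentences get a neutral score
    log_probs = (
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return sum(log_probs) / len(tokens)


def select_subset(pool, reference, k):
    """Keep the k pool sentences with the largest likelihood difference.

    Assumption: score = loglik(reference model) - loglik(pool model).
    """
    vocab = {tok for s in pool + reference for tok in s.split()}
    v = len(vocab) + 1  # +1 reserves probability mass for unseen tokens
    ref_model = train_unigram(reference)
    pool_model = train_unigram(pool)
    scored = [
        (
            avg_log_likelihood(s, ref_model, v)
            - avg_log_likelihood(s, pool_model, v),
            s,
        )
        for s in pool
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


pool = [
    "the cat sat on the mat",
    "buy cheap pills now",
    "the dog sat on the rug",
    "click here to win",
]
reference = ["the cat sat", "the dog sat on the mat"]
subset = select_subset(pool, reference, k=2)
```

In this toy setting, sentences that resemble the reference sample score higher than the rest of the pool. A random-selection baseline of the kind the abstract compares against would instead draw `k` sentences uniformly from `pool`.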