Acquiring a Representative Data Set for Efficient Training of Neural Language Models

鈴木潤
言語処理学会 (2022)

Abstract

This paper explores the research question of whether training neural language models on a small subset of representative data selected from a large training dataset can achieve the same level of performance as that obtained using all of the original training data.
In our experiments, we confirm that the representative subset obtained by the likelihood-difference-based method maintains the same performance level even when the dataset is reduced to about one-tenth or one-hundredth of the original data.
We also show that the performance of the random selection method deteriorates significantly when the amount of data is reduced.
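The abstract does not spell out the selection procedure itself. As a rough, hypothetical sketch of how a likelihood-difference-based selection could be implemented, the Python snippet below scores each sentence by the length-normalized difference between two log-likelihood functions and keeps the top-scoring fraction; the function names, the normalization, and the keep ratio are illustrative assumptions, not the paper's published method.

```python
# Hypothetical sketch of likelihood-difference data selection.
# The two scoring models, the length normalization, and keep_ratio
# are assumptions for illustration, not the paper's exact procedure.
import math
from typing import Callable, List, Tuple


def likelihood_difference_selection(
    corpus: List[str],
    log_likelihood_a: Callable[[str], float],  # e.g., model favoring target-like data
    log_likelihood_b: Callable[[str], float],  # e.g., background/reference model
    keep_ratio: float = 0.1,                   # keep roughly one-tenth of the data
) -> List[str]:
    """Rank sentences by the difference of two log-likelihoods
    (normalized by token count) and keep the highest-scoring fraction."""
    scored: List[Tuple[float, str]] = []
    for sentence in corpus:
        length = max(len(sentence.split()), 1)
        score = (log_likelihood_a(sentence) - log_likelihood_b(sentence)) / length
        scored.append((score, sentence))
    # Higher score = larger likelihood gap in favor of model A.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep_n = max(1, math.ceil(len(corpus) * keep_ratio))
    return [sentence for _, sentence in scored[:keep_n]]
```

A baseline for comparison, as in the abstract's random-selection result, would simply be `random.sample(corpus, keep_n)` with the same `keep_n`.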