Acquiring a Representative Data Subset for Efficient Training of Neural Language Models

鈴木潤
The Association for Natural Language Processing (2022)

Abstract

This paper explores the research question of whether training neural language models on a small subset of representative data, selected from a large training dataset, can achieve the same level of performance as that obtained using all of the original training data. In our experiments, we confirm that the representative subset obtained by the likelihood-difference-based method maintains the same performance level even when the dataset is reduced to about one-tenth or one-hundredth of the original data. We also show that the performance of the random selection baseline deteriorates significantly as the amount of data is reduced.
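
The abstract does not spell out how the likelihood-difference criterion is computed. The following is a minimal, hypothetical sketch, under the assumption that each example is scored by the difference between its log-likelihood under a reference language model and under a proxy model, with the highest-scoring examples kept; the function names and the scoring rule are illustrative assumptions, not the paper's exact procedure. The random-selection baseline from the abstract is included for comparison.

```python
# Hypothetical sketch of likelihood-difference-based subset selection.
# Assumes per-example log-likelihoods under two models are precomputed;
# the scoring rule and model pair are assumptions for illustration only.
import numpy as np


def select_representative_subset(loglik_reference, loglik_proxy, keep_ratio=0.1):
    """Return indices of the examples with the largest likelihood difference.

    loglik_reference: (N,) log-likelihoods under a reference LM
    loglik_proxy:     (N,) log-likelihoods under a proxy (e.g. smaller) LM
    keep_ratio:       fraction of the original data to keep (e.g. 0.1 or 0.01)
    """
    diff = np.asarray(loglik_reference) - np.asarray(loglik_proxy)
    n_keep = max(1, int(len(diff) * keep_ratio))
    # Keep the examples on which the two models disagree the most.
    return np.argsort(-diff)[:n_keep]


def select_random_subset(n_examples, keep_ratio=0.1, seed=0):
    """Random-selection baseline, as compared against in the abstract."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(n_examples * keep_ratio))
    return rng.choice(n_examples, size=n_keep, replace=False)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ll_ref = rng.normal(-50.0, 5.0, size=10_000)          # toy log-likelihoods
    ll_proxy = ll_ref + rng.normal(0.0, 1.0, size=10_000)
    idx = select_representative_subset(ll_ref, ll_proxy, keep_ratio=0.01)
    print(f"kept {len(idx)} of 10000 examples")
```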