Extracting Representative Subset from Massive Raw Texts for Training Pre-trained Neural Language Models

Jun Suzuki

Heiga Zen

Hideto Kazawa

Information Processing & Management Conference, 60(2023) (to appear)


Abstract

This paper asks whether training neural language models on a small subset of representative data, selected from a large training dataset, can match the performance obtained with all of the original training data. We explore likelihood-based scoring for obtaining representative subsets, which we call RepSet. Our experiments confirm that a representative subset obtained with a likelihood difference-based score can reach the 90% performance level even when the dataset is reduced to about one-thousandth of the original data. We also show that the performance of random selection deteriorates significantly as the amount of data is reduced.
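The abstract does not spell out the scoring formula, so the following is only a minimal sketch of what selection by a likelihood-difference score could look like, not the paper's method. It swaps in a Moore-Lewis-style difference of average log-likelihoods computed with add-alpha smoothed unigram models, where the paper would presumably use neural language models. The function names, the `reference` sample, and the `alpha` parameter are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def fit_unigram(corpus, vocab, alpha=1.0):
    """Fit an add-alpha smoothed unigram model (an illustrative stand-in for
    a neural language model) and return a function that gives the average
    per-token log-probability of a non-empty token list."""
    counts = Counter(tok for text in corpus for tok in text.split())
    denom = sum(counts.values()) + alpha * len(vocab)

    def avg_logprob(tokens):
        return sum(math.log((counts[t] + alpha) / denom) for t in tokens) / len(tokens)

    return avg_logprob


def select_subset(pool, reference, k, alpha=1.0):
    """Return the k texts in `pool` with the highest likelihood-difference score.

    Assumed score: average log-probability under a model fit on `reference`
    minus average log-probability under a model fit on `pool` itself.
    `reference` is a hypothetical small sample of the text we want to resemble.
    """
    vocab = {tok for text in pool + reference for tok in text.split()}
    ref_lp = fit_unigram(reference, vocab, alpha)
    pool_lp = fit_unigram(pool, vocab, alpha)

    def score(text):
        tokens = text.split()
        if not tokens:  # empty texts can never be selected ahead of real ones
            return float("-inf")
        return ref_lp(tokens) - pool_lp(tokens)

    return sorted(pool, key=score, reverse=True)[:k]


if __name__ == "__main__":
    pool = ["the cat sat", "stock prices fell", "the dog ran", "bond yields rose"]
    reference = ["the cat ran", "the dog sat"]
    print(select_subset(pool, reference, k=2))
```

Under these assumptions, texts that the reference model finds more likely than the pool model does are ranked first, so the selected subset leans toward the reference distribution. A random-selection baseline, as compared against in the abstract, would instead draw `k` texts uniformly from `pool`.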

Research Areas

Natural Language Processing
