Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models

Felix Stahlberg

Shankar Kumar

Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications (2021)

Abstract

Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to realistically generate the broad range of grammatical errors made by human writers in practice. In this work, we use explicit error-type tags from automatic annotation tools like ERRANT to guide synthetic data generation. We compare several models that can produce ungrammatical sentences given a clean sentence and an error type tag, and use these models to build a new large synthetic pre-training set that matches the tag frequency distributions in a development set. Our synthetic data set yields large and consistent gains, leading to state-of-the-art performance on the BEA-test and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system that has been trained on mixed native and non-native English to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.
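The tag-matching step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it only shows one plausible way to estimate a tag frequency distribution from a development set and to draw an error-type tag for each clean sentence in proportion to it. The function names, the example tags (written in ERRANT's style), and the convention of prepending the tag to the clean sentence as input to a corruption model are all assumptions made for illustration; the corruption model itself is out of scope here.

```python
import random
from collections import Counter


def tag_distribution(dev_tags):
    """Relative frequency of each error-type tag observed in a development set."""
    counts = Counter(dev_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def assign_tags(clean_sentences, distribution, seed=0):
    """Draw one tag per clean sentence, weighted by its dev-set frequency."""
    rng = random.Random(seed)
    tags = list(distribution)
    weights = [distribution[t] for t in tags]
    drawn = rng.choices(tags, weights=weights, k=len(clean_sentences))
    return list(zip(drawn, clean_sentences))


def build_corruption_input(tag, sentence):
    """Hypothetical input format: the tag is prepended to the clean sentence.

    A trained corruption model (not shown) would map this input to an
    ungrammatical sentence containing an error of the requested type.
    """
    return f"<{tag}> {sentence}"


# Toy development-set annotations; real tags would come from a tool like ERRANT.
dev_tags = ["M:DET", "M:DET", "R:VERB:SFORM", "U:PUNCT"]
dist = tag_distribution(dev_tags)

clean = ["She has a dog.", "He went home.", "They are late."]
pairs = assign_tags(clean, dist)
model_inputs = [build_corruption_input(tag, s) for tag, s in pairs]
```

With a large pool of clean sentences, the drawn tags approach the development-set proportions, which is the property the abstract attributes to the synthetic pre-training set.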

Research Areas

Natural Language Processing
