Google Research

C4_200M Synthetic Dataset for Grammatical Error Correction


Grammatical Error Correction (GEC) is the task of automatically correcting grammatical errors in written text. GEC quality has improved substantially over the past 10 years, for two reasons: (a) the application of newer machine-learning techniques and (b) the availability of public-domain datasets for training and evaluation. Nevertheless, the amount of high-quality public-domain training data for grammar correction is substantially smaller than for similar problems such as language translation, where several million examples are available. This limits the variety of machine learning algorithms that the academic NLP community can develop for grammar correction.

This dataset contains no human annotations. Instead, we used a tagged corruption model to generate diverse ungrammatical sentences from clean text. This let us scale up to 200M sentences and provide a valuable pre-training dataset for GEC researchers. See our related blog post for more information.
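
To make the idea of tagged corruption concrete, here is a toy sketch in Python. It is not the model used to build this dataset: the tag names (DROP_ARTICLE, DUPLICATE_WORD), the functions corrupt and make_pair, and the hand-written rules are all hypothetical and chosen only for illustration. The sketch shows the general shape of the data: a clean sentence plus an error tag produces a corrupted sentence, and the (corrupted, clean) pair becomes one training example.

```python
from typing import Tuple

ARTICLES = {"a", "an", "the"}


def corrupt(sentence: str, tag: str) -> str:
    """Return a corrupted copy of `sentence` for a toy error tag.

    The tags here are hypothetical; they only stand in for the idea of
    conditioning a corruption on an error type.
    """
    words = sentence.split()
    if tag == "DROP_ARTICLE":
        # Remove the first article found; leave the sentence unchanged if none.
        for i, word in enumerate(words):
            if word.lower() in ARTICLES:
                return " ".join(words[:i] + words[i + 1:])
        return sentence
    if tag == "DUPLICATE_WORD":
        # Repeat the last word; leave an empty sentence unchanged.
        if not words:
            return sentence
        return " ".join(words + [words[-1]])
    raise ValueError(f"unknown tag: {tag}")


def make_pair(clean: str, tag: str) -> Tuple[str, str]:
    """Build one (ungrammatical source, grammatical target) training pair."""
    return corrupt(clean, tag), clean


if __name__ == "__main__":
    print(make_pair("She bought a new car yesterday", "DROP_ARTICLE"))
    print(make_pair("I like tea", "DUPLICATE_WORD"))
```

Because the corruption is applied to text that is already clean, no human annotation is needed, which is what makes this style of data generation easy to scale.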