Google Research

Data Strategies for Low-Resource Grammatical Error Correction

Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, ACL (2021)


Grammatical Error Correction (GEC) has been extensively investigated for English. However, for low-resource languages, best practices for training GEC systems have not yet been systematically determined. We investigate how best to use existing data sources to improve GEC systems for languages with limited quantities of high-quality training data. In particular, we compare methods for generating artificial error data to train GEC systems, and show that these methods can benefit from including morphological errors. We then examine the usefulness of noisy error correction data gathered from Wikipedia and the language learning website Lang8, and demonstrate that, despite their inherent noise, these are valuable data sources. Finally, we show that GEC systems pre-trained on the noisy data sources can be fine-tuned effectively using small amounts of high-quality, human-annotated data.
