Data Strategies for Low-Resource Grammatical Error Correction

Simon Flachs
Felix Stahlberg
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, ACL


Grammatical Error Correction (GEC) has been extensively investigated for English. For low-resource languages, however, best practices for training GEC systems have not yet been systematically determined. We investigate how best to exploit existing data sources to improve GEC systems for languages with limited quantities of high-quality training data. In particular, we compare methods for generating artificial error data to train GEC systems, and show that these methods can benefit from including morphological errors. We then examine the usefulness of noisy error correction data gathered from Wikipedia and the language learning website Lang8, and demonstrate that, despite their inherent noise, these are valuable data sources. Finally, we show that GEC systems pre-trained on the noisy data sources can be fine-tuned effectively using small amounts of high-quality, human-annotated data.
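
To make the idea of artificial error data concrete, the following is a minimal sketch of how clean sentences might be corrupted into (noisy source, clean target) training pairs. It is not the procedure evaluated in this paper: the suffix table `SUFFIX_SWAPS`, the helper names `morph_error`, `char_swap`, and `make_pair`, and the error probability are all hypothetical choices made for illustration, and the suffix rules are English-like stand-ins for the language-specific morphological errors the abstract refers to.

```python
import random

# Hypothetical suffix confusions standing in for morphological errors;
# a real system would likely derive these from a morphological lexicon.
SUFFIX_SWAPS = [("ing", "ed"), ("ed", "ing"), ("es", ""), ("s", "")]


def morph_error(token):
    """Replace the first matching suffix; return the token unchanged if none match."""
    for old, new in SUFFIX_SWAPS:
        # Skip very short tokens so that words like "is" are left alone.
        if token.endswith(old) and len(token) > len(old) + 2:
            return token[: -len(old)] + new
    return token


def char_swap(token, rng):
    """Transpose two adjacent characters to imitate a spelling error."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]


def make_pair(sentence, rng, p_error=0.15):
    """Return a (noisy source, clean target) pair for one whitespace-tokenized sentence."""
    noisy = []
    for token in sentence.split():
        if rng.random() < p_error:
            # Choose between the two error types with equal probability.
            if rng.random() < 0.5:
                token = morph_error(token)
            else:
                token = char_swap(token, rng)
        noisy.append(token)
    return " ".join(noisy), sentence
```

Calling `make_pair("She walks home", random.Random(0))` returns one such pair; passing a seeded `random.Random` keeps the corruption reproducible. A GEC model would then be trained to map the noisy side back to the clean side.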