WikiAtomicEdits: A Multilingual Corpus of Atomic Edits for Modeling Inference and Discourse

Manaal Faruqui
Proc. of EMNLP (2018)

Abstract

We release a corpus of atomic insertion edits: instances in which a human editor has inserted a single contiguous span of text into an existing sentence. Our corpus is derived from Wikipedia edit history and contains 43 million sentences across 8 different languages. We argue that the signal contained in these edits is valuable for research in semantics and discourse, and that such signal differs from that found in conventional language modeling corpora. We provide experimental evidence from both a corpus linguistics and a language modeling perspective to support these claims.
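
To make the definition of an atomic insertion edit concrete, the following is a minimal sketch of how one might check whether an edited sentence differs from a base sentence by exactly one contiguous inserted span. It is an illustration only and not the extraction procedure used to build the corpus: the function name `find_atomic_insertion`, the whitespace tokenization, and the example sentences are all assumptions made for this sketch.

```python
from typing import Optional, Tuple


def find_atomic_insertion(base: str, edited: str) -> Optional[Tuple[int, str]]:
    """Return (token_index, inserted_phrase) if `edited` equals `base` plus one
    contiguous inserted span, otherwise None.

    Hypothetical helper: tokens are split on whitespace, which is an
    assumption of this sketch and not a detail taken from the corpus.
    """
    b = base.split()
    e = edited.split()

    # An insertion must make the sentence strictly longer.
    if len(e) <= len(b):
        return None

    # Length of the longest common prefix.
    p = 0
    while p < len(b) and b[p] == e[p]:
        p += 1

    # Length of the longest common suffix over the base tokens not in the prefix.
    s = 0
    while s < len(b) - p and b[len(b) - 1 - s] == e[len(e) - 1 - s]:
        s += 1

    # Every base token must be covered by the shared prefix or suffix;
    # otherwise the edit changed more than one contiguous span.
    if p + s != len(b):
        return None

    return p, " ".join(e[p:len(e) - s])


if __name__ == "__main__":
    base = "She moved to Paris in 1990 ."
    edited = "She moved to Paris , France , in 1990 ."
    print(find_atomic_insertion(base, edited))
```

When the inserted phrase repeats a neighboring token, more than one insertion position is consistent with the pair; this sketch reports the rightmost one because it matches the prefix first.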