Google Research

WikiAtomicEdits: A Multilingual Corpus of Atomic Edits for Modeling Inference and Discourse

Abstract

We release a corpus of atomic insertion edits: instances in which a human editor has inserted a single contiguous span of text into an existing sentence. Our corpus is derived from Wikipedia edit history and contains 43 million sentences across 8 different languages. We argue that the signal contained in these edits is valuable for research in semantics and discourse, and that such signal differs from that found in conventional language modeling corpora. We provide experimental evidence from both a corpus linguistics and a language modeling perspective to support these claims.
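
To make the definition of an atomic insertion concrete, the following is a minimal sketch of how one might test whether an edited sentence differs from its original by exactly one contiguous inserted span. It is not the extraction pipeline used to build the corpus; the function name `find_atomic_insertion` and the whitespace tokenization are assumptions made for illustration only.

```python
from typing import Optional, Tuple


def find_atomic_insertion(before: str, after: str) -> Optional[Tuple[int, str]]:
    """Return (token_index, inserted_phrase) if `after` is `before` plus one
    contiguous span of tokens, otherwise None.

    Hypothetical helper: tokens are split on whitespace, which is an
    illustrative simplification and not a claim about the corpus format.
    """
    old, new = before.split(), after.split()
    if len(new) <= len(old):
        return None  # an insertion must make the sentence longer

    # Length of the longest common prefix.
    i = 0
    while i < len(old) and old[i] == new[i]:
        i += 1

    # Length of the longest common suffix that does not overlap the prefix.
    j = 0
    while j < len(old) - i and old[len(old) - 1 - j] == new[len(new) - 1 - j]:
        j += 1

    # Prefix and suffix must together account for every original token.
    if i + j != len(old):
        return None
    return i, " ".join(new[i:len(new) - j])


if __name__ == "__main__":
    print(find_atomic_insertion("He was born in Ohio", "He was born in rural Ohio"))
    # → (4, 'rural')
```

Under this sketch, an edit that changes two separate places in the sentence, or that deletes or replaces text, returns `None` and would not count as atomic.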
