Google Research

Wikipedia Translated Clusters


Wikipedia Translated Clusters is a collection of 5K introductions to popular English Wikipedia articles, together with their parallel versions in 10 other languages and machine translations of those versions into English. It also includes a synthetically corrupted dataset in which one sentence of the English Wikipedia introduction is modified; the task is to use the multilingual documents to identify the outlier with natural language inference (NLI).

The synthetic corruptions to the English Wikipedia introductions are created by replacing one sentence with an alternative version (based on edits from the VitaminC dataset). The goal in this setting is to automatically identify which sentence was modified by using the information from the other articles in the cluster (i.e., Wikipedia versions in other languages, translated to English).
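The detection setting described above can be illustrated with a minimal Python sketch. It is not the method from the paper: the names `overlap_score` and `find_outlier_sentence`, the toy data, and the aggregation (best-supporting sentence per document, averaged over documents) are all assumptions made for illustration, and the token-overlap scorer is only a stand-in for a real sentence-pair NLI model.

```python
import re
from typing import Callable, List, Sequence


def overlap_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI entailment score (assumption, not a real NLI model).

    Returns the fraction of hypothesis tokens that also appear in the premise.
    """
    premise_tokens = set(re.findall(r"\w+", premise.lower()))
    hypothesis_tokens = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_tokens:
        return 0.0
    hits = sum(1 for tok in hypothesis_tokens if tok in premise_tokens)
    return hits / len(hypothesis_tokens)


def find_outlier_sentence(
    english_sentences: Sequence[str],
    cluster_documents: Sequence[Sequence[str]],
    score_fn: Callable[[str, str], float] = overlap_score,
) -> int:
    """Return the index of the English sentence least supported by the cluster.

    For each English sentence, take the best-supporting sentence in each
    cluster document, then average those per-document scores.
    """
    support: List[float] = []
    for sentence in english_sentences:
        per_doc = [
            max((score_fn(premise, sentence) for premise in doc), default=0.0)
            for doc in cluster_documents
        ]
        support.append(sum(per_doc) / len(per_doc) if per_doc else 0.0)
    # The lowest-supported sentence is flagged as the likely modification.
    return min(range(len(support)), key=support.__getitem__)


# Toy example (invented data): the second English sentence has been corrupted.
english_intro = [
    "Paris is the capital of France.",
    "It has a population of ten people.",
    "The Seine flows through the city.",
]
translated_cluster = [
    [
        "Paris is the capital and largest city of France.",
        "The city has a population of about two million people.",
        "The river Seine flows through the city.",
    ],
    [
        "The capital of France is Paris.",
        "The Seine river flows through the city of Paris.",
    ],
]

outlier_index = find_outlier_sentence(english_intro, translated_cluster)
print(outlier_index)
```

On this toy cluster the sketch flags the second sentence (index 1). To use an actual NLI model, pass a different `score_fn` that returns, for example, an entailment probability for a (premise, hypothesis) pair.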

For more details, see the Findings of EMNLP 2022 paper "Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters".