Description
We introduce the FRMT dataset for evaluating the quality of few-shot region-aware machine translation. The dataset covers two regions each for Portuguese (Brazil and Portugal) and Mandarin (Mainland and Taiwan). To create it, we sampled English sentences from Wikipedia and acquired professional human translations into the target regional varieties. FRMT seeks to capture region-specific linguistic differences, as well as potential distractors in the form of entities that are strongly associated with one region (e.g., Lisbon vs. São Paulo). To this end, the dataset is split into three buckets (lexical, entity, random), each containing human translations of sentences extracted from different sets of English Wikipedia articles.
- Lexical: From various web sources, we manually collected English lexical items whose best translation into the target language differs depending on the target region, and had native speakers validate that each term is or is not acceptable in their variety. We then extracted up to 100 sentences from the beginning of each English Wikipedia article that has one of these terms as its title.
- Entities: We manually selected a balanced number of entities per language region that are strongly associated with one region, relying primarily on world knowledge, token frequency in mC4, and Wikipedia hyperlinks. Selection was done within each of a few broad entity types defined at the outset: people, locations, organizations, attractions/infrastructure, and other. Again, we extracted up to 100 source sentences from the beginning of the English Wikipedia article about each selected entity.
- Random: We also sampled 100 articles at random from the combined set of 28k articles appearing in Wikipedia’s collections of “featured” or “good” articles. Here, we extracted less text from a larger pool of articles, taking up to 20 contiguous sentences from the start of a randomly chosen section within each article. Unlike the other two buckets, this one features a single common set of sentences translated into all four target varieties.
Each sentence was translated by a single translator. For each bucket, we split the data into exemplar, development (dev), and test sets. The exemplars are intended to be the only pairs whose region label is shown to the model, for example via few-shot or in-context learning. All sentences from a given document appear in only a single split; this ensures that a system cannot “cheat” by memorizing word-region associations from the exemplars, or by overfitting to specific words and entities while hill-climbing on the dev set.
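For concreteness, here is one way the exemplars could feed a few-shot setup: a minimal sketch that turns exemplar (source, translation) pairs into an in-context prompt. The prompt wording, region label, and function name are our own illustrative choices, not part of the dataset.

```python
def build_few_shot_prompt(exemplars, source, region_label="Brazilian Portuguese", k=3):
    """Assemble a simple in-context prompt from k exemplar (source, translation) pairs.

    The prompt format and region label are illustrative assumptions, not prescribed by FRMT.
    """
    lines = [f"Translate the following English sentences into {region_label}."]
    for english, translation in exemplars[:k]:
        lines.append(f"English: {english}\nTranslation: {translation}")
    # The new source sentence goes last, with the translation left for the model to fill in.
    lines.append(f"English: {source}\nTranslation:")
    return "\n\n".join(lines)
```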
Each bucket is in a separate directory, with one .tsv file for each (split, regional variety) combination. The first column contains the English source, and the second has the professional translation.
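For example, a single (split, regional variety) file can be read with the standard csv module; the directory and file names below are hypothetical placeholders, and the actual naming in the release may differ.

```python
import csv
from pathlib import Path

def load_bucket_split(path):
    """Read one (split, variety) TSV into a list of (English source, translation) pairs."""
    pairs = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:  # first column: English source, second column: translation
                pairs.append((row[0], row[1]))
    return pairs

# Hypothetical path for illustration only.
dev_pairs = load_bucket_split(Path("lexical") / "pt-BR_dev.tsv")
print(len(dev_pairs), dev_pairs[0])
```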