Markus Freitag

Authored Publications
    There's no Data Like Better Data: Using QE Metrics for MT Data Filtering
    Jan-Thorsten Peter
    Mara Finkelstein
    Jurik Juraska
    Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 561-577
    Preview abstract Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has improved substantially in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics for filtering out low-quality sentence pairs in the training data of neural machine translation (NMT) systems. While most corpus filtering methods are focused on detecting noisy examples in collections of texts, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between both approaches. View details
    Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph-Level
    Jurik Juraska
    Mara Finkelstein
    Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 996-1013
    Preview abstract As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics as well as train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed to work at the paragraph level. We speculate this result can be attributed to properties of the task of reference-based evaluation as well as limitations of our datasets with respect to capturing all types of phenomena that occur in paragraph-level translations. View details
    Prompting PaLM for Translation: Assessing Strategies and Performance
    Jiaming Luo
    Viresh Ratnakar
    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada (2023), pp. 15406-15427
    Preview abstract Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the Pathways Language Model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly-trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM’s MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM’s MT output which reveals some interesting properties and prospects for future work. View details
    Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation
    Behrooz Ghorbani
    Patrick Fernandes
    Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 9198-9209
    Preview abstract Recent advances in machine translation (MT) have shown that Minimum Bayes Risk (MBR) decoding can be a powerful alternative to beam search decoding, especially when combined with neural-based utility functions. However, the performance of MBR decoding depends heavily on how and how many candidates are sampled from the model. In this paper, we explore how different sampling approaches for generating candidate lists for MBR decoding affect performance. We evaluate popular sampling approaches, such as ancestral, nucleus, and top-k sampling. Based on our insights into their limitations, we experiment with the recently proposed epsilon-sampling approach, which prunes away all tokens with a probability smaller than epsilon, ensuring that each token in a sample receives a fair probability mass. Through extensive human evaluations, we demonstrate that MBR decoding based on epsilon-sampling significantly outperforms not only beam search decoding, but also MBR decoding with all other tested sampling methods across four language pairs. View details
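    The pruning step described in the abstract above can be illustrated with a minimal sketch. It assumes a next-token distribution given as a plain list of probabilities; the function names `prune_below_epsilon` and `epsilon_sample`, and the fallback to the most likely token when nothing survives pruning, are illustrative assumptions and not taken from the paper.

```python
import random


def prune_below_epsilon(probs, epsilon):
    """Zero out tokens with probability below epsilon, then renormalize.

    Assumption (not from the abstract): if no token reaches epsilon,
    fall back to a one-hot distribution on the most likely token.
    """
    kept = [p if p >= epsilon else 0.0 for p in probs]
    total = sum(kept)
    if total == 0.0:
        best = max(range(len(probs)), key=lambda i: probs[i])
        return [1.0 if i == best else 0.0 for i in range(len(probs))]
    return [p / total for p in kept]


def epsilon_sample(probs, epsilon, rng=None):
    """Draw one token index from the epsilon-pruned distribution."""
    rng = rng or random.Random()
    pruned = prune_below_epsilon(probs, epsilon)
    return rng.choices(range(len(pruned)), weights=pruned, k=1)[0]


if __name__ == "__main__":
    # Toy distribution over four tokens; token 3 falls below epsilon = 0.1.
    toy_probs = [0.5, 0.3, 0.15, 0.05]
    print(prune_below_epsilon(toy_probs, 0.1))
    print(epsilon_sample(toy_probs, 0.1, random.Random(0)))
```

    In a real decoder this step would be applied at every position to the model's softmax output; the toy list here only stands in for that distribution.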
    Results of WMT23 Metrics Shared Task: Metrics might be Guilty but References are not Innocent
    Nitika Mathur
    Chi-kiu Lo
    Eleftherios Avramidis
    Ricardo Rei
    Brian Thompson
    Tom Kocmi
    Frédéric Blain
    Craig Stewart
    Chrysoula Zerva
    Sheila Castilho
    Alon Lavie
    Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 576-626
    Preview abstract This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). Following last year's success, we also included a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics' ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score by weighted averaging across the various tasks. We present an extensive analysis on how well metrics perform on three language pairs: Chinese-English, Hebrew-English on the sentence-level and English-German on the paragraph-level. The results strongly confirm the results reported last year, that neural-based metrics are significantly better than non-neural metrics in their levels of correlation with human judgments. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate bad reference issues we observed this year for some language pairs. Finally, we also study the connections between the magnitude of metric differences and their expected significance in human evaluation, which should help the community to better understand and adopt new metrics. View details
    Preview abstract Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on direct estimation of quality scores, the resulting metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations. View details
    WMT23 Metrics shared task Submission: Quality Estimation using Minimum Bayes Risk
    Subhajit Naskar
    Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 806-811
    Preview abstract This report describes the Minimum Bayes Risk Quality Estimation (MBR-QE) submission to the Workshop on Machine Translation's 2023 Metrics Shared Task. MBR decoding with neural utility metrics (BLEURT) is known to be very effective in generating high quality machine translations. We use the underlying assumption of MBR decoding and develop an MBR-based reference-free quality estimation metric. Our method uses an evaluator machine translation system and a reference-based utility metric (BLEURT, MetricX) to calculate a quality estimation score of a model. We report results related to comparing different MBR configurations and utility metrics. View details
    INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback
    Wenda Xu
    Danqing Wang
    Liangming Pan
    Zhenqiao Song
    William Wang
    Lei Li
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, pp. 5967-5994
    Preview abstract Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics do not provide explicit explanation of their verdict, nor associate the scores with defects in the generated text. To address this limitation, we present INSTRUCTSCORE, a fine-grained explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA, producing both a score for generated text and a human readable diagnostic report. We evaluate INSTRUCTSCORE on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4. Surprisingly, our INSTRUCTSCORE, even without direct supervision from human-rated data, achieves performance levels on par with state-of-the-art metrics like COMET22, which were fine-tuned on human ratings. View details
    MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task
    Jurik Juraska
    Mara Finkelstein
    Mahdi Mirzazadeh
    Conference on Machine Translation (2023)
    Preview abstract This report details the MetricX-23 submission to the Workshop on Machine Translation's 2023 Metrics Shared Task and provides an overview of the experiments that informed which metrics were submitted. Our three submissions, each with a quality estimation (or reference-free) version, are all learned regression-based metrics that vary in the data used for training and which pretrained language model was used for initialization. We report results related to understanding (1) which supervised training data to use, (2) the impact of how the training labels are normalized, (3) the amount of synthetic training data to use, (4) how metric performance is related to model size, and (5) the effect of initializing the metrics with different pretrained language models. The training recipes that we found to be most successful are detailed in this report. View details
    Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, pp. 12914-12929
    Preview abstract Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations. Its focus on pairwise score comparisons is intuitive but raises the question of how ties should be handled, a gray area that has motivated different variants in the literature. We demonstrate that, in settings like modern MT meta-evaluation, existing variants have weaknesses arising from their handling of ties, and in some situations can even be gamed. We propose instead to meta-evaluate metrics with a version of pairwise accuracy that gives metrics credit for correctly predicting ties, in combination with a tie calibration procedure that automatically introduces ties into metric scores, enabling fair comparison between metrics that do and do not predict ties. We argue and provide experimental evidence that these modifications lead to fairer ranking-based assessments of metric performance. View details
    Toward More Effective Human Evaluation for Machine Translation
    Belén Saldías-Fuentes
    Qijun Tan
    ACL2022 Workshop on Human Evaluation of NLP Systems
    Preview abstract Improvements in text generation technologies such as machine translation have necessitated more costly and time-consuming human evaluation procedures to ensure an accurate signal. We investigate a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set. Using a sampling approach, we demonstrate that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline. We achieve gains of up to 20% in average absolute error by leveraging stratified sampling and control variates. Our techniques can improve estimates made from a fixed annotation budget, are easy to implement, and can be applied to any problem with structure similar to the one we study. View details
    A Natural Diet: Towards Improving Naturalness of Machine Translation Output
    David Grangier
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online (2022)
    Preview abstract Machine translation (MT) evaluation often focuses on accuracy and fluency, without paying much attention to translation style. This means that, even when considered accurate and fluent, MT output can still sound less natural than high quality human translations or text originally written in the target language. Machine translation output notably exhibits lower lexical diversity, and employs constructs that mirror those in the source sentence. In this work we propose a method for training MT systems to achieve a more natural style, i.e. mirroring the style of text originally written in the target language. Our method tags parallel training data according to the naturalness of the target side by contrasting language models trained on natural and translated data. Tagging data allows us to put greater emphasis on target sentences originally written in the target language. Automatic metrics show that the resulting models achieve lexical richness on par with human translations, mimicking a style much closer to sentences originally written in the target language. Furthermore, we find that their output is preferred by human experts when compared to the baseline translations. View details
    Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance
    Jingwei Ni
    Zhijing Jin
    Mrinmaya Sachan
    Bernhard Scholkopf
    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, pp. 5303-5320
    Preview abstract Human-translated text displays distinct features from naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors and the conclusions are mostly correlational but not causal. In this work, we collect CAUSALMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors, the train-test alignment (whether the human translation directions in the training and test sets are aligned), and data-model alignment (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on the MT performance, in addition to the test-model misalignment highlighted by existing work on the impact of translationese in the test set. In light of our findings, we provide a set of suggestions for MT training and evaluation. View details
    Results of WMT22 Metrics Shared Task: Stop Using BLEU - Neural Metrics Are Better and More Robust
    Ricardo Rei
    Nitika Mathur
    Chi-kiu Lo
    Craig Stewart
    Eleftherios Avramidis
    Tom Kocmi
    Alon Lavie
    André Martins
    Proceedings of the Seventh Conference on Machine Translation, Association for Computational Linguistics, Abu Dhabi (2022), pp. 46-68
    Preview abstract This paper presents the results of the WMT22 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT22 News Translation Task on four different domains: news, social, ecommerce, and chat. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages, among other things: (i) expert-based evaluation is more reliable, (ii) we extended the pool of translations by 5 additional translations based on MBR decoding or rescoring which are challenging for current metrics. In addition, we initiated a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics' ability to capture and penalise specific types of translation errors. Finally, we present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. The results demonstrate the superiority of neural-based learned metrics and demonstrate again that overlap metrics like Bleu, spBleu or chrf correlate poorly with human ratings. The results also reveal that neural-based metrics are significantly better than non-neural metrics across different domains and challenges. View details
    On Systematic Style Differences between Unsupervised and Supervised MT and an Application for High-Resource Machine Translation
    Kelly Venning Marchisio
    David Grangier
    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2214-2225
    Preview abstract Modern unsupervised machine translation (MT) systems reach reasonable translation quality under clean and controlled data conditions. As the performance gap between supervised and unsupervised MT narrows, it is interesting to ask whether the different training methods result in systematically different output beyond what is visible via quality metrics like adequacy or BLEU. We compare translations from supervised and unsupervised MT systems of similar quality, finding that unsupervised output is more fluent and more structurally different in comparison to human translation than is supervised MT. We then demonstrate a way to combine the benefits of both methods into a single system which results in improved adequacy and fluency as rated by human evaluators. Our results open the door to interesting discussions about how supervised and unsupervised MT might be different yet mutually-beneficial. View details
    Findings of the WMT 2022 Shared Task on Automatic Post-Editing
    Pushpak Bhattacharyya
    Rajen Chatterjee
    Diptesh Kanojia
    Matteo Negri
    Marco Turchi
    Proceedings of the Seventh Conference on Machine Translation, Association for Computational Linguistics, Abu Dhabi (2022), pp. 109-117
    Preview abstract We present the results from the 8th round of the WMT shared task on MT Automatic Post-Editing, which consists in automatically correcting the output of a “black-box” machine translation system by learning from human corrections. This year, the task focused on a new language pair (English→Marathi) and on data coming from multiple domains (healthcare, tourism, and general/news). Although according to several indicators this round was of medium-high difficulty compared to the past, the best submission from the three participating teams managed to significantly improve (with an error reduction of 3.49 TER points) the original translations produced by a generic neural MT system. View details
    High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics
    David Grangier
    Qijun Tan
    Bowen Liang
    Transactions of the Association for Computational Linguistics, vol. 10 (2022), pp. 811-825
    Preview abstract In Neural Machine Translation, it is typically assumed that the sentence with the highest estimated probability should also be the translation with the highest quality as measured by humans. In this work, we question this assumption and show that model estimates and translation quality only vaguely correlate. We apply Minimum Bayes Risk (MBR) decoding on unbiased samples to optimize diverse automated metrics of translation quality as an alternative inference strategy to beam search. Instead of targeting the hypotheses with the highest model probability, MBR decoding extracts the hypotheses with the highest estimated quality. Our experiments show that the combination of a neural translation model with a neural reference-based metric, Bleurt, results in significant improvement in human evaluations. This improvement is obtained with translations different from classical beam-search output: These translations have much lower model likelihood and are less favored by surface metrics like Bleu. View details
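    The selection rule described in the abstract above (pick the hypothesis with the highest estimated quality rather than the highest model probability) can be sketched as follows. This is a toy illustration under stated assumptions: the candidates are hard-coded strings rather than model samples, the same list serves as both hypotheses and pseudo-references, and `jaccard_utility` is a stand-in word-overlap score, not BLEURT or any metric used in the paper. The names `mbr_decode` and `jaccard_utility` are hypothetical.

```python
def jaccard_utility(hypothesis, reference):
    """Toy utility: word-set overlap. A stand-in for a neural metric."""
    hyp_words = set(hypothesis.split())
    ref_words = set(reference.split())
    if not hyp_words and not ref_words:
        return 1.0
    return len(hyp_words & ref_words) / len(hyp_words | ref_words)


def mbr_decode(candidates, utility, pseudo_references=None):
    """Return the candidate with the highest average utility.

    Each candidate is scored against every pseudo-reference; by default
    the candidate list itself is reused as the pseudo-reference set
    (an assumption made here for simplicity).
    """
    refs = candidates if pseudo_references is None else pseudo_references

    def expected_utility(hypothesis):
        return sum(utility(hypothesis, ref) for ref in refs) / len(refs)

    return max(candidates, key=expected_utility)


if __name__ == "__main__":
    # Hard-coded strings standing in for samples drawn from an MT model.
    samples = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
    print(mbr_decode(samples, jaccard_utility))
```

    Note that the cost is quadratic in the number of candidates, since every hypothesis is compared with every pseudo-reference.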
    Using Machine Translation to Localize Task Oriented NLG Output
    Scott Roy
    Cliff Brunk
    Kyu-Young Kim
    Justin Xu Zhao
    Sidharth Mudgal
    Chris Varano
    CoRR, vol. abs/2107.04512 (2021)
    Preview abstract One of the challenges for a task oriented NLG system like the Google Assistant is to internationalize the output to many languages. This paper explores doing this by applying machine translation to the English output. Using machine translation is very scalable, as it can work with any English output and can handle dynamic text, but it is difficult to meet the required quality bar: machine translation is good, but for a commercial NLG application it often needs to be nearly perfect. Fortunately, in task oriented NLG the quality only needs to reach this bar for the narrow range of sentences that the NLG system can actually produce. We are able to reach this quality using a combination of semantic annotations, fine tuning on in-domain translations, automatic error detection, and sentences from the Web. This paper shares our approach and results, together with a distillation model to serve the NMT models at scale. View details
    Preview abstract Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has been recently shown that the probabilities given by a large, multilingual model can achieve state of the art results when used as a reference-free metric. We experiment with various modifications to this model, and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach, and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities. View details
    Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
    David Grangier
    Viresh Ratnakar
    Qijun Tan
    Transactions of the Association for Computational Linguistics, vol. 9, pp. 1460-1474
    Preview abstract Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research. View details
    Findings of the 2021 Conference on Machine Translation (WMT21)
    Farhad Akhbardeh
    Arkady Arkhangorodsky
    Magdalena Biesialska
    Ondrej Bojar
    Rajen Chatterjee
    Vishrav Chaudhary
    Marta R. Costa-jussà
    Cristina España-Bonet
    Angela Fan
    Christian Federman
    Yvette Graham
    Roman Grundkiewicz
    Barry Haddow
    Leonie Harter
    Kenneth Heafield
    Christopher M. Homan
    Matthias Huck
    Kwabena Amponsah-Kaakyire
    Jungo Kasai
    Daniel Khashabi
    Kevin Knight
    Tom Kocmi
    Philipp Koehn
    Nicholas Lourie
    Christof Monz
    Makoto Morishita
    Masaaki Nagata
    Ajay Nagesh
    Toshiaki Nakazawa
    Matteo Negri
    Santanu Pal
    Allahsera Tapo
    Marco Turchi
    Valentin Vydrin
    Marcos Zampieri
    Proceedings of the Sixth Conference on Machine Translation, Association for Computational Linguistics, Online (2021), pp. 1-88
    Preview abstract This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation. In the Similar Language Translation (SLT) task, participants were asked to develop systems to translate between pairs of similar languages from the Dravidian and Romance family as well as French to two similar low-resource Manding languages (Bambara and Maninka). In the Triangular MT translation task, participants were asked to build a Russian to Chinese translator, given parallel data in Russian-Chinese, Russian-English and English-Chinese. In the multilingual low-resource translation for Indo-European languages task, participants built multilingual systems to translate among Romance and North-Germanic languages. The task was designed to deal with the translation of documents in the cultural heritage domain for relatively low-resourced languages. In the automatic post-editing (APE) task, participants were asked to develop systems capable of correcting the errors made by an unknown machine translation system. View details
    Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain
    Ricardo Rei
    Nitika Mathur
    Chi-kiu Lo
    Craig Stewart
    Alon Lavie
    Ondrej Bojar
    Proceedings of the Sixth Conference on Machine Translation, Association for Computational Linguistics, Online (2021), pp. 733-774
    Preview abstract This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation Task with automatic metrics on two different domains: news and TED talks. All metrics were evaluated on how well they correlate at the system- and segment-level with human ratings. Contrary to previous years' editions, this year we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages: (i) expert-based evaluation has been shown to be more reliable, (ii) we were able to evaluate all metrics on two different domains using translations of the same MT systems, (iii) we added 5 additional translations coming from the same system during system development. In addition, we designed three challenge sets that evaluate the robustness of all automatic metrics. We present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. We further show the impact of different reference translations on reference-based metrics and compare our expert-based MQM annotation with the DA scores acquired by WMT. View details
    BLEU might be Guilty but References are not Innocent
    David Grangier
    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 61-71
    Preview abstract The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that multi-reference BLEU does not improve the correlation for high quality output, and present an alternative multi-reference formulation that is more effective. View details
    Complete Multilingual Neural Machine Translation
    Proceedings of the Fifth Conference on Machine Translation (Volume 1: Research Papers) (2020)
    Preview abstract Multilingual Neural Machine Translation (MNMT) models are commonly trained on a joint set of bilingual corpora which is acutely English-centric (i.e. English either as the source or target language). While direct data between two languages that are non-English is explicitly available at times, its use is not common. In this paper, we first take a step back and look at the commonly used bilingual corpora (WMT), and resurface the existence and importance of an implicit structure that exists in them: multi-way alignment across examples (the same sentence in more than two languages). We set out to study the use of multi-way aligned examples to enrich the original English-centric parallel corpora. We reintroduce this direct parallel data from multi-way aligned corpora between all source and target languages. By doing so, the English-centric graph expands into a complete graph, every language pair being connected. We call MNMT with such a connectivity pattern complete Multilingual Neural Machine Translation (cMNMT) and demonstrate its utility and efficacy with a series of experiments and analysis. In combination with a novel training data sampling strategy that is conditioned on the target language only, cMNMT yields competitive translation quality for all language pairs. We further study the size effect of multi-way aligned data, its transfer learning capabilities and how it eases adding a new language in MNMT. Finally, we stress test cMNMT at scale and demonstrate that we can train a cMNMT model with up to 111*112=12,432 language pairs that provides competitive translation quality for all language pairs. View details
    Preview abstract We propose a simple and effective method for machine translation evaluation which does not require reference translations. Our approach is based on (1) grounding the entity mentions found in each source sentence and candidate translation against a large-scale multilingual knowledge base, and (2) measuring the recall of the grounded entities found in the candidate vs. those found in the source. Our approach achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references, which is the largest number of wins for a single evaluation method on this task. On 4 language pairs, we also achieve higher correlation with human judgements than BLEU. To foster further research, we release a dataset containing 1.8 million grounded entity mentions across 18 language pairs from the WMT19 metrics track data. View details
    Human-Paraphrased References Improve Neural Machine Translation
    David Grangier
    Proceedings of the Fifth Conference on Machine Translation (Volume 1: Research Papers) (2020)
    Preview abstract Automatic evaluation comparing candidate translations to human-generated paraphrases of reference translations has recently been proposed by Freitag et al. (2020). When used in place of original references, the paraphrased versions produce metric scores that correlate better with human judgment. This effect holds for a variety of different automatic metrics, and tends to favor natural formulations over more literal (translationese) ones. In this paper we compare the results of performing end-to-end system development using standard and paraphrased references. With state-of-the-art English-German NMT components, we show that tuning to paraphrased references produces a system that is significantly better according to human judgment, but 5 BLEU points worse when tested on standard references. Our work confirms the finding that paraphrased references yield metric scores that correlate better with human judgment, and demonstrates for the first time that using these scores for system development can lead to significant improvements. View details
    Translationese as a Language in “Multilingual” NMT
    Parker Riley
    David Grangier
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online (2020), pp. 7737-7746
    Preview abstract Machine translation has an undesirable propensity to produce “translationese” artifacts, which can lead to higher BLEU scores while being liked less by human raters. Motivated by this, we model translationese and original (i.e. natural) text as separate languages in a multilingual model, and pose the question: can we perform zero-shot translation between original source text and original target text? There is no data with original source and original target, so we train a sentence-level classifier to distinguish translationese from original target text, and use this classifier to tag the training data for an NMT model. Using this technique we bias the model to produce more natural outputs at test time, yielding gains in human evaluation scores on both adequacy and fluency. Additionally, we demonstrate that it is possible to bias the model to produce translationese and game the BLEU score, increasing it while decreasing human-rated quality. We analyze these outputs using metrics measuring the degree of translationese, and present an analysis of the volatility of heuristic-based train-data tagging. View details
    APE at Scale and its Implications on MT Evaluation Biases
    Scott Roy
    Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), Association for Computational Linguistics, Florence, Italy (2019), pp. 34-44
    Preview abstract In this work, we train an Automatic Post-Editing (APE) model and use it to reveal biases in standard MT evaluation procedures. The goal of our APE model is to correct typical errors introduced by the translation process, and convert the “translationese” output into natural text. Our APE model is trained entirely on monolingual data that has been round-trip translated through English, to mimic errors that are similar to the ones introduced by NMT. We apply our model to the output of existing NMT systems, and demonstrate that, while the human-judged quality improves in all cases, BLEU scores drop with forward-translated test sets. We verify these results for the WMT18 English to German, WMT15 English to French, and WMT16 English to Romanian tasks. Furthermore, we selectively apply our APE model on the output of the top submissions of the most recent WMT evaluation campaigns. We see quality improvements on all tasks of up to 2.5 BLEU points. View details
    Unsupervised Natural Language Generation with Denoising Autoencoders
    Scott Roy
    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018), pp. 3922-3929
    Preview abstract Generating text from structured data is important for various tasks such as question answering and dialog systems. The task of Natural Language Generation (NLG) is to generate fluent sentences including all of the information given by some structured data. We show that without any supervision and only based on unlabeled text, we are able to build an NLG system with similar performance compared to supervised approaches. In our approach, we treat the structured data as a corrupt representation of the desired output and use a denoising auto-encoder to reconstruct the sentence. We show how to introduce noise into the training data to build a denoising auto-encoder that is able to generate correct sentences out of structured data. Further, by using bilingual out-of-domain data, we show how to train an unsupervised NLG system that can generate sentences in different languages within one network. View details