Daniel Deutsch
Daniel is a Research Scientist on the Google Translate Research team. His research interests include automatic and human evaluation of text generation.
Authored Publications
MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task
Jurik Juraska
Mara Finkelstein
Mahdi Mirzazadeh
Conference on Machine Translation (2023)
This report details the MetricX-23 submission to the Workshop on Machine Translation's 2023 Metrics Shared Task and provides an overview of the experiments that informed which metrics were submitted. Our three submissions, each with a quality estimation (or reference-free) version, are all learned regression-based metrics that vary in the data used for training and which pretrained language model was used for initialization. We report results related to understanding (1) which supervised training data to use, (2) the impact of how the training labels are normalized, (3) the amount of synthetic training data to use, (4) how metric performance is related to model size, and (5) the effect of initializing the metrics with different pretrained language models. The training recipes that we found to be most successful are detailed in this report.
WMT23 Metrics shared task Submission: Quality Estimation using Minimum Bayes Risk
Subhajit Naskar
Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 806-811
This report describes the Minimum Bayes Risk Quality Estimation (MBR-QE) submission to the Workshop on Machine Translation's 2023 Metrics Shared Task. MBR decoding with neural utility metrics (BLEURT) is known to be very effective in generating high-quality machine translations. We use the underlying assumption of MBR decoding and develop an MBR-based reference-free quality estimation metric. Our method uses an evaluator machine translation system and a reference-based utility metric (BLEURT, MetricX) to calculate a quality estimation score of a model. We report results related to comparing different MBR configurations and utility metrics.
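The idea in this abstract can be illustrated with a minimal sketch. It assumes that the evaluator MT system supplies a handful of pseudo-references for the source and that the quality estimate is the mean utility of the hypothesis against them; that averaging step is an assumption drawn from how MBR decoding computes expected utility, not a detail stated above. The names `toy_utility` and `mbr_qe_score` are hypothetical, and the token-overlap F1 is only a stand-in for a learned metric such as BLEURT or MetricX.

```python
from collections import Counter


def toy_utility(hypothesis: str, reference: str) -> float:
    """Token-overlap F1; a hypothetical stand-in for a learned utility metric."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_qe_score(hypothesis, pseudo_references, utility=toy_utility):
    """Mean utility of the hypothesis against evaluator-system outputs.

    Assumption: the pseudo-references come from a separate evaluator MT
    system, so no human reference is needed.
    """
    if not pseudo_references:
        raise ValueError("need at least one pseudo-reference")
    total = sum(utility(hypothesis, ref) for ref in pseudo_references)
    return total / len(pseudo_references)


score = mbr_qe_score("the cat sat", ["the cat sat", "a dog ran"])
```

Swapping `toy_utility` for a stronger reference-based metric changes only the `utility` argument; the scoring loop stays the same.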
Results of WMT23 Metrics Shared Task: Metrics might be Guilty but References are not Innocent
Nitika Mathur
Chi-kiu Lo
Eleftherios Avramidis
Ricardo Rei
Brian Thompson
Tom Kocmi
Frédéric Blain
Craig Stewart
Chrysoula Zerva
Sheila Castilho
Alon Lavie
George Foster
Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 576-626
This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). Following last year's success, we also included a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics' ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score by weighted averaging across the various tasks.
We present an extensive analysis of how well metrics perform on three language pairs: Chinese-English and Hebrew-English at the sentence level, and English-German at the paragraph level. The results strongly confirm last year's finding that neural-based metrics are significantly better than non-neural metrics in their levels of correlation with human judgments. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate the bad reference issues we observed this year for some language pairs. Finally, we also study the connections between the magnitude of metric differences and their expected significance in human evaluation, which should help the community to better understand and adopt new metrics.
There's no Data Like Better Data: Using QE Metrics for MT Data Filtering
Jan-Thorsten Peter
Mara Finkelstein
Jurik Juraska
Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 561-577
Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen substantial improvements in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics for filtering out low-quality sentence pairs in the training data of neural machine translation (NMT) systems. While most corpus filtering methods are focused on detecting noisy examples in collections of texts, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between both approaches.
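The selection step described in this abstract can be sketched as follows. This is a simplified illustration, assuming each sentence pair already has a QE score where higher means better; the function name `filter_by_qe` and the example scores are hypothetical, and the paper's actual pipeline may differ.

```python
def filter_by_qe(pairs, qe_scores, keep_fraction=0.5):
    """Keep the highest-scoring fraction of sentence pairs.

    Assumption: qe_scores[i] is a reference-free quality score for
    pairs[i], and a higher score means a better translation.
    """
    if len(pairs) != len(qe_scores):
        raise ValueError("pairs and qe_scores must have the same length")
    ranked = sorted(zip(qe_scores, pairs), key=lambda item: item[0], reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return [pair for _, pair in ranked[:keep]]


corpus = [("src a", "tgt a"), ("src b", "tgt b"), ("src c", "tgt c"), ("src d", "tgt d")]
scores = [0.9, 0.1, 0.7, 0.3]  # made-up QE scores for illustration only
kept = filter_by_qe(corpus, scores)
```

With `keep_fraction=0.5`, the sketch halves the corpus, mirroring the reduction in training size mentioned above.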
Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph-Level
Jurik Juraska
Mara Finkelstein
Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 996-1013
As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics as well as train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed to work at the paragraph level. We speculate this result can be attributed to properties of the task of reference-based evaluation as well as limitations of our datasets with respect to capturing all types of phenomena that occur in paragraph-level translations.
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
Patrick Fernandes
Mara Finkelstein
André Martins
Graham Neubig
Ankush Garg
Conference on Machine Translation (2023)
Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on direct estimation of quality scores, the resulting metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.
Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration
George Foster
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, pp. 12914-12929
Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations. Its focus on pairwise score comparisons is intuitive but raises the question of how ties should be handled, a gray area that has motivated different variants in the literature. We demonstrate that, in settings like modern MT meta-evaluation, existing variants have weaknesses arising from their handling of ties, and in some situations can even be gamed. We propose instead to meta-evaluate metrics with a version of pairwise accuracy that gives metrics credit for correctly predicting ties, in combination with a tie calibration procedure that automatically introduces ties into metric scores, enabling fair comparison between metrics that do and do not predict ties. We argue and provide experimental evidence that these modifications lead to fairer ranking-based assessments of metric performance.
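A small sketch may help make the two ideas in this abstract concrete. It assumes that pairwise accuracy counts a pair as correct when the metric's relation (lower, tied, higher) matches the human one, and that tie calibration searches for a threshold `eps` below which metric score differences are treated as ties. The function names and the candidate-threshold search over observed score gaps are illustrative assumptions, not the paper's reference implementation.

```python
from itertools import combinations


def sign(diff, eps=0.0):
    """Return -1, 0, or 1; differences within eps count as a tie (0)."""
    if abs(diff) <= eps:
        return 0
    return 1 if diff > 0 else -1


def pairwise_accuracy(human, metric, eps=0.0):
    """Fraction of pairs where the metric's ordering or tie matches the human's."""
    pairs = list(combinations(range(len(human)), 2))
    correct = sum(
        sign(human[i] - human[j]) == sign(metric[i] - metric[j], eps)
        for i, j in pairs
    )
    return correct / len(pairs)


def calibrate_ties(human, metric):
    """Pick the eps, among observed metric score gaps, that maximizes accuracy."""
    gaps = {abs(metric[i] - metric[j]) for i, j in combinations(range(len(metric)), 2)}
    candidates = sorted(gaps | {0.0})
    return max(candidates, key=lambda eps: pairwise_accuracy(human, metric, eps))


human_scores = [1, 1, 2]            # the first two translations are tied
metric_scores = [0.50, 0.52, 0.90]  # the metric never predicts an exact tie
best_eps = calibrate_ties(human_scores, metric_scores)
```

In this toy example the uncalibrated metric gets two of three pairs right because it cannot predict the human tie, while the calibrated threshold turns the small gap between the first two scores into a tie and recovers all three pairs.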