Results of WMT23 Metrics Shared Task: Metrics might be Guilty but References are not Innocent

Markus Freitag; Nitika Mathur; Chi-kiu Lo; Eleftherios Avramidis; Ricardo Rei; Brian Thompson; Tom Kocmi; Frédéric Blain; Dan Deutsch; Craig Stewart; Chrysoula Zerva; Sheila Castilho; Alon Lavie; George Foster

Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 576-626

Abstract

This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment levels. As in last year's edition, we acquired our own human ratings through expert-based evaluation using Multidimensional Quality Metrics (MQM). Following last year's success, we again included a challenge set subtask, in which participants created contrastive test suites for evaluating metrics' ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score as a weighted average across the tasks.
We present an extensive analysis of how well metrics perform on three language pairs: Chinese-English and Hebrew-English at the sentence level, and English-German at the paragraph level. The results strongly confirm last year's finding that neural-based metrics correlate significantly better with human judgments than non-neural metrics. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate the bad reference issues we observed this year for some language pairs. Finally, we also study the connection between the magnitude of metric differences and their expected significance in human evaluation, which should help the community better understand and adopt new metrics.
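
As a rough illustration of the meta-evaluation idea described above (correlating metric scores with human ratings, then combining per-task results into one global score by weighted averaging), the following is a minimal sketch. It is not the shared task's actual procedure: the abstract does not specify the tasks, weights, or correlation statistics, so Pearson correlation, the function names, and all numbers used with them are assumptions chosen only for illustration. Human scores are assumed to be oriented so that higher means better.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def system_scores(segment_scores):
    """Average each system's segment scores into one system-level score.

    segment_scores maps a (hypothetical) system name to its list of
    per-segment scores.
    """
    return {name: sum(vals) / len(vals) for name, vals in segment_scores.items()}


def weighted_global_score(task_scores, weights):
    """Weighted average of per-task results.

    task_scores and weights are dicts keyed by hypothetical task names;
    the weights here are illustrative, not those used in the shared task.
    """
    total_weight = sum(weights[task] for task in task_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(task_scores[task] * weights[task] for task in task_scores)
    return weighted_sum / total_weight
```

In such a setup, a segment-level result would come from calling `pearson` on a metric's per-segment scores and the corresponding human scores, a system-level result from calling it on the values produced by `system_scores` for the metric and for the human ratings, and `weighted_global_score` would then fold the per-task results into a single number for ranking metrics.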

Research Areas

Machine intelligence
