BLEURT: Learning Robust Metrics for Text Generation
Abstract
Despite significant advances in text generation in recent years, evaluation metrics have lagged behind, and n-gram overlap metrics such as BLEU and ROUGE remain popular. In this work, we introduce BLEURT, a learnt evaluation metric based on BERT that achieves state-of-the-art performance on recent years of the WMT Metrics Shared Task and on the WebNLG challenge. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetically constructed examples to improve generalization. We show that, in contrast to a vanilla BERT fine-tuning approach, BLEURT yields superior results even when the training data is scarce, skewed, or out-of-domain.