Abstract
Despite significant advances in text generation in recent years, evaluation metrics have lagged behind, and n-gram overlap metrics such as BLEU and ROUGE remain popular. In this work, we introduce BLEURT, a learnt evaluation metric based on BERT that achieves state-of-the-art performance on recent years of the WMT Metrics Shared Task and on the WebNLG challenge. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetically constructed examples to improve generalization. We show that, in contrast to a vanilla BERT fine-tuning approach, BLEURT yields superior results even when training data is scarce, skewed, or out-of-domain.