Assessing The Factual Accuracy of Text Generation

Ben Goodrich
Vinay Rao
The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'19) (2019) (to appear)


We propose an automatic metric to reflect the factual accuracy of generated text as an alternative to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We consider models that can extract fact triplets from text and then use them to define a metric that compares triplets extracted from generated summaries and reference texts. We show that this metric correlates with human evaluation of factual accuracy better than ROUGE does. To build these models, we introduce a new Wikidata-based dataset for fact extraction, and show that a transformer-based attention model can learn to predict structured fact triplets as well as perform favorably compared to more traditional two-stage approaches (entity recognition and relationship classification).
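To illustrate the triplet-comparison idea, here is a minimal sketch of scoring a generated summary against a reference via overlap of (subject, relation, object) triplets. This is an illustration only: the paper's actual extractor is a learned transformer model, and the exact scoring formula may differ; the triplets and the F1 formulation below are assumptions for the example.

```python
# Hypothetical sketch: factual-accuracy scoring as overlap between fact
# triplets extracted from a generated summary and from the reference text.
# In the paper the triplets come from a learned model; here they are
# hardcoded for illustration.

def triplet_f1(generated, reference):
    """F1 overlap between two collections of (subject, relation, object) triplets."""
    gen, ref = set(generated), set(reference)
    if not gen or not ref:
        return 0.0
    overlap = len(gen & ref)
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the generated summary repeats one correct fact and
# hallucinates another, so it is penalized relative to the reference.
reference_facts = [
    ("Barack Obama", "born_in", "Hawaii"),
    ("Barack Obama", "occupation", "politician"),
]
generated_facts = [
    ("Barack Obama", "born_in", "Hawaii"),
    ("Barack Obama", "born_in", "Kenya"),  # hallucinated fact
]

print(triplet_f1(generated_facts, reference_facts))  # 0.5
```

Unlike n-gram metrics such as ROUGE, this score is unaffected by paraphrasing as long as the same facts are extracted, and it directly penalizes hallucinated facts.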