Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

Or Honovich; Leshem Choshen; Roee Aharoni; Ella Neeman; Idan Szpektor; Omri Abend

Empirical Methods in Natural Language Processing (EMNLP) (2021)

Abstract

Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability.
Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted $Q^2$, compares answer spans using natural language inference, which enables better factual comparison than in previous token-based metrics. To foster proper evaluation, we curate a novel dataset of state-of-the-art dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of $Q^2$ against other metrics using the new dataset and two others, where it shows higher correlation with human judgements.
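
The abstract describes the metric only at a high level, so the following is a minimal sketch of how a question-generation, question-answering and NLI pipeline of this kind might be wired together, not the authors' implementation. Everything named here is an assumption made for illustration: the function `consistency_score`, the three pluggable components, the way the question is prepended to each answer to form the NLI premise and hypothesis, the averaging over questions, and the fallback of 0.0 when no questions are generated. The `toy_*` functions are trivial stand-ins so the sketch runs without any models.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical component signatures (assumptions, not the authors' API).
# response -> (question, answer span taken from the response)
QuestionGenerator = Callable[[str], Iterable[Tuple[str, str]]]
# (question, knowledge) -> answer span taken from the knowledge
QuestionAnswerer = Callable[[str, str], str]
# (premise, hypothesis) -> agreement score in [0, 1]
NliScorer = Callable[[str, str], float]


def consistency_score(
    response: str,
    knowledge: str,
    generate_questions: QuestionGenerator,
    answer_question: QuestionAnswerer,
    nli_score: NliScorer,
) -> float:
    """Average NLI agreement between response answers and knowledge answers."""
    scores = []
    for question, response_answer in generate_questions(response):
        # Answer the same question again, this time from the grounding knowledge.
        knowledge_answer = answer_question(question, knowledge)
        # Assumption: the question gives both answer spans a shared context.
        premise = f"{question} {knowledge_answer}"
        hypothesis = f"{question} {response_answer}"
        scores.append(nli_score(premise, hypothesis))
    # Assumption: a response that yields no questions scores 0.0.
    return sum(scores) / len(scores) if scores else 0.0


# Toy stand-ins so the sketch runs end to end; real systems would use models.
def toy_generate_questions(response: str):
    # Pretend the last word of the response is the answer span.
    answer = response.rstrip(".").split()[-1]
    return [("What is the capital of France?", answer)]


def toy_answer_question(question: str, knowledge: str) -> str:
    # Pretend the first word of the knowledge is the answer span.
    return knowledge.split()[0]


def toy_nli_score(premise: str, hypothesis: str) -> float:
    # Pretend entailment means a case-insensitive exact match.
    return 1.0 if premise.lower() == hypothesis.lower() else 0.0


knowledge = "Paris is the capital and largest city of France."
consistent = consistency_score(
    "The capital of France is Paris.", knowledge,
    toy_generate_questions, toy_answer_question, toy_nli_score,
)
inconsistent = consistency_score(
    "The capital of France is Lyon.", knowledge,
    toy_generate_questions, toy_answer_question, toy_nli_score,
)
```

Passing the three components as callables keeps the scoring logic independent of any particular model, so each stand-in could be swapped for a trained question generator, question answerer, or NLI classifier without changing `consistency_score`.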

Research Areas

Natural language processing
