q2d: Automatic Dialog Generation to Improve Models' Query Generation
Abstract
We propose q2d: an automatic data generation pipeline that generates information-seeking dialogues based on questions. We apply our method to create conversational versions of question answering datasets, which we release as a new dataset. We use this data to improve query generation models, which communicate with external search APIs to generate factual responses. Unlike previous approaches, which relied on human annotators, our method allows us to automatically generate labeled dialogues with better control and scale.
In experiments, we demonstrate that: (1) Models trained on our synthetic data produce results comparable to those trained on natural data; (2) Our generated datasets are effective as a benchmark and as a training signal that generalizes to human-annotated test sets.
We also provide an extensive analysis of the quality and factuality of the generated datasets. Our studies indicate that our automatic dialogue generation pipeline is effective at improving query generation and factuality.