Hurdles to Progress in Long-form Question Answering
Abstract
There has been remarkable recent progress in factoid open-domain question answering (QA), where a short phrase or entity is sufficient to answer the question. Much less work has been done on the more challenging task of long-form QA, where the goal is to generate elaborate, paragraph-long answers to more open-ended questions. In this work, we present a new system based on sparse attention and
contrastive retriever learning, which achieves state-of-the-art performance on ELI5, a popular long-form QA dataset in the KILT benchmark (Petroni et al., 2020).
However, a detailed analysis of our system reveals several concerning trends that are hampering progress in this important area: (1) little to no evidence that our model's generations are actually grounded in the retrieved documents, a desirable property that is not captured by the metrics in the KILT benchmark; (2) significant train / validation / test overlap in ELI5, with at least 75\% of validation questions having a paraphrased counterpart in the training data; (3) significant issues with the popular evaluation metric ROUGE-L, which leaves only a narrow margin (2-5 ROUGE-L) between trivial lower-bound baselines (such as input copying) and upper-bound reference baselines; (4) the inherent difficulty of human evaluation for this task, owing to the length of generated answers and annotators' unfamiliarity with the topics.
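To make the ROUGE-L margin in point (3) concrete, the following is a minimal sketch (not taken from the paper) of how a trivial input-copying lower bound and a held-out-reference upper bound can be scored against gold answers. It assumes the rouge_score Python package and hypothetical example strings; ELI5 provides multiple reference answers per question, so each candidate is scored against every reference and the maximum F1 is kept.

```python
# Minimal sketch of the lower-/upper-bound ROUGE-L comparison described above.
# Assumes the `rouge_score` package (pip install rouge-score) and made-up strings.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

question = "Why does the sky look blue during the day?"
gold_answers = [
    "Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most ...",
    "The atmosphere scatters blue light more strongly than red light, so the sky appears blue ...",
]
generated = "Blue light is scattered more than other colors by the gases in the atmosphere ..."

def best_rouge_l(prediction, references):
    # Score the prediction against each reference answer and keep the best F1.
    return max(scorer.score(ref, prediction)["rougeL"].fmeasure for ref in references)

# Lower bound: trivially copy the input question as the "answer".
print("input-copy baseline :", best_rouge_l(question, gold_answers))
# Upper bound: score one human reference against the remaining references.
print("reference baseline  :", best_rouge_l(gold_answers[0], gold_answers[1:]))
# System output, for comparison against both bounds.
print("model generation    :", best_rouge_l(generated, gold_answers))
```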