Mor Geva
Authored Publications
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Alon Jacovi
Or Honovich
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024), pp. 4615–4634
Prompting language models to provide step-by-step answers (e.g., “Chain-of-Thought”) is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods for verifying reasoning chains, in order to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings. REVEAL includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model’s answer, across a variety of datasets and state-of-the-art language models. Evaluation on REVEAL shows that verifiers struggle at verifying reasoning chains, in particular at verifying logical correctness and detecting contradictions. Available at https://reveal-dataset.github.io/.
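To make the labeling scheme concrete, here is a minimal Python sketch of a per-step record and the chain-level decision the paper's title implies (a chain is as strong as its weakest link). The class name, field names, and the reduction of each label to a boolean are illustrative assumptions, not REVEAL's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningStep:
    """Hypothetical record for one step of a Chain-of-Thought answer,
    mirroring the per-step labels REVEAL describes. Field names are
    illustrative; the dataset's real labels are more fine-grained."""
    text: str
    relevant: bool           # is the step relevant to answering the question?
    attributed: bool         # is it supported by the evidence passages?
    logically_correct: bool  # does it follow from the question and prior steps?

def chain_is_correct(steps: List[ReasoningStep]) -> bool:
    # Weakest-link aggregation: every step must pass every check.
    return all(s.relevant and s.attributed and s.logically_correct for s in steps)
```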
A Comprehensive Evaluation of Tool-Assisted Generation Strategies
Alon Jacovi
Findings of EMNLP (2023)
A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed, but there is no systematic and fair comparison across strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis and find that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during* generation; (3) tool-assisted strategies are expensive in the number of tokens they require, incurring additional costs by orders of magnitude that do not translate into significant performance improvements. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.
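To make the strategy families concrete, here is a minimal Python sketch of the three points at which a tool can enter generation: ahead of it, not at all, or as post-hoc refinement. `call_model` and `search` are hypothetical stand-ins, not any real model or search API, and the prompt formats are invented for illustration.

```python
def call_model(prompt: str) -> str:
    # Stand-in for a few-shot LLM call (hypothetical).
    return f"<answer to: {prompt[:40]}...>"

def search(query: str) -> str:
    # Stand-in for a search-engine tool call (hypothetical).
    return f"<evidence for: {query[:40]}...>"

def answer_no_tool(question: str) -> str:
    # Strong no-tool baseline: just prompt the model.
    return call_model(question)

def answer_retrieve_then_generate(question: str) -> str:
    # Tool used *ahead of* generation: retrieve, then condition on evidence.
    evidence = search(question)
    return call_model(f"{evidence}\n{question}")

def answer_refine(question: str) -> str:
    # Tool used to *refine*: draft first, then check the draft against evidence.
    draft = call_model(question)
    evidence = search(draft)
    return call_model(f"Draft: {draft}\nEvidence: {evidence}\nRevise the draft.")

if __name__ == "__main__":
    print(answer_refine("Which strategy wins on knowledge-retrieval tasks?"))
```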
DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion
Proceedings of NAACL-HLT (2019)
Sentence fusion is the task of joining several independent sentences into a single coherent text. Current datasets for sentence fusion are small and insufficient for training modern neural models. In this paper, we propose a method for automatically generating fusion examples from raw text and present DISCOFUSE, a large-scale dataset for discourse-based sentence fusion. We author a set of rules for identifying a diverse set of discourse phenomena in raw text, and decomposing the text into two independent sentences. We apply our approach on two document collections, Wikipedia and Sports articles, yielding 60 million fusion examples annotated with the discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DISCOFUSE and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic and human evaluation. Finally, we conduct transfer learning experiments with WEBSPLIT, a recent dataset for text simplification. We show that pretraining on DISCOFUSE substantially improves performance on WEBSPLIT when viewed as a sentence fusion task.
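To illustrate the kind of rule-based decomposition described above, here is a toy Python sketch that handles a single discourse phenomenon (a "However" connective). The function, regex, and example-record fields are illustrative assumptions, not the actual DISCOFUSE pipeline, which covers many more phenomena.

```python
import re

def decompose_however(fused: str):
    """Split 'A. However, B.' style text into its two source sentences,
    producing a fusion example annotated with the discourse connective
    needed to reconstruct the fused text."""
    match = re.match(r"(?P<first>[^.]+\.)\s+However,\s+(?P<second>.+)", fused)
    if not match:
        return None  # rule does not apply to this text
    first = match.group("first").strip()
    second = match.group("second").strip()
    second = second[0].upper() + second[1:]  # re-capitalize the second sentence
    # Training example: input = the two independent sentences, target = fused text.
    return {"sentences": (first, second), "fused": fused, "connective": "however"}

print(decompose_however("The team played well. However, they lost the final."))
```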