Jonathan Herzig
Authored Publications
Inside-Out: Hidden Factual Knowledge in LLMs
Eyal Ben David
Eran Ofek
Hadas Orgad
Zorik Gekhman
Roi Reichart
Yonatan Belinkov
2025
Abstract
This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express
in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model’s observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
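To make the definition above concrete, the per-question knowledge score can be read as a pairwise ranking accuracy over correct and incorrect answer candidates. The sketch below is a minimal Python rendering of that definition; the scoring function passed in (e.g. a token-level log-probability for external knowledge, or a probe over hidden states for internal knowledge) is a hypothetical placeholder, not the paper's exact implementation.

```python
from itertools import product
from typing import Callable, Sequence

def knowledge_score(
    correct: Sequence[str],
    incorrect: Sequence[str],
    score: Callable[[str], float],
) -> float:
    """Fraction of (correct, incorrect) answer pairs where the correct answer scores higher."""
    pairs = list(product(correct, incorrect))
    if not pairs:
        return 0.0
    wins = sum(1 for c, i in pairs if score(c) > score(i))
    return wins / len(pairs)

# Hidden knowledge for a question is the positive gap between the two variants,
# e.g. with a hypothetical internal probe and the model's token-level probability:
# hidden = max(0.0, knowledge_score(correct, incorrect, score_internal)
#                  - knowledge_score(correct, incorrect, score_external))
```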
(D)RAGged Into a Conflict: Detecting and Addressing Conflicting Sources in Retrieval-Augmented LLMs
Arie Cattan
Alon Jacovi
Ori Ram
Eran Ofek
2025
Abstract
Retrieval-Augmented Generation (RAG) is a commonly used approach for enhancing LLMs with relevant and up-to-date information. However, the retrieved sources often contain conflicting information, and it is not clear how models resolve such discrepancies. In this work, we first point out that knowledge conflicts arise for various reasons and thus require tailored solutions in order to better align model responses with human preferences. To that end, we introduce a novel taxonomy of knowledge conflicts in RAG and define the desired model behavior for each category. Additionally, we construct a high-quality benchmark by asking two expert annotators to identify the conflict type in realistic RAG instances, each comprising a query and its associated search results. Finally, we conduct extensive experiments and show that explicitly informing LLMs about the potential conflict category significantly improves the quality and appropriateness of their responses. Yet there is still vast room for improvement. Taken together, our work highlights the importance of evaluating RAG systems not only on factual accuracy but also on their ability to manage and resolve knowledge conflicts effectively.
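As an illustration of the "inform the model about the conflict" idea, a minimal sketch follows that prepends a conflict-type hint to a RAG prompt. The category names and prompt wording are assumptions for illustration, not the paper's actual taxonomy or prompts.

```python
# Illustrative conflict categories and hints; these are assumptions, not the paper's taxonomy.
CONFLICT_HINTS = {
    "no_conflict": "The sources agree; answer directly.",
    "complementary": "The sources cover different aspects; combine them.",
    "contradictory": "The sources disagree; present the competing answers and attribute each to its source.",
    "outdated": "Some sources may be stale; prefer the most recent information.",
}

def build_rag_prompt(query: str, passages: list[str], conflict_type: str) -> str:
    """Assemble a RAG prompt that explicitly tells the model how to handle the conflict."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    hint = CONFLICT_HINTS.get(conflict_type, "")
    return (
        f"Search results:\n{context}\n\n"
        f"Note on the sources: {hint}\n\n"
        f"Question: {query}\nAnswer:"
    )
```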
Abstract
As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across the languages in its pre-training corpus. We first show that, for many languages, even monolingual tuning transfers some instruction-following capability to other languages. Furthermore, we find that as few as 40 multilingual examples integrated into an English tuning set substantially improve multilingual instruction-following, in languages both seen and unseen during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in multiple languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that diversifying the instruction tuning set with even just 2-4 languages significantly improves cross-lingual generalization. Our results suggest that massively multilingual instruction-tuned models can be built with only a very small set of multilingual instruction-response pairs.
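A minimal sketch of the data-mixing recipe these findings suggest, assuming simple in-memory lists of examples; the default of 40 multilingual examples mirrors the abstract's number, while the sampling scheme itself is an illustrative assumption.

```python
import random

def build_tuning_mix(english_examples, multilingual_examples, n_multilingual=40, seed=0):
    """Return a mostly-English instruction-tuning set with a small multilingual slice."""
    rng = random.Random(seed)
    k = min(n_multilingual, len(multilingual_examples))
    mix = list(english_examples) + rng.sample(list(multilingual_examples), k)
    rng.shuffle(mix)
    return mix
```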
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Alon Jacovi
Or Honovich
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024), pp. 4615–4634
Abstract
Prompting language models to provide step-by-step answers (e.g., “Chain-of-Thought”) is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods for verifying reasoning chains in order to evaluate and improve their correctness. However, no fine-grained, step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce REVEAL: Reasoning Verification Evaluation, a dataset for benchmarking automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings. REVEAL includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model’s answer, across a variety of datasets and state-of-the-art language models. Evaluation on REVEAL shows that verifiers struggle at verifying reasoning chains, in particular at verifying logical correctness and detecting contradictions. Available at https://reveal-dataset.github.io/.
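To make the labeling scheme concrete, the sketch below shows one possible record structure for a step-level annotation with relevance, attribution, and logical-correctness fields. The field names and label values are illustrative assumptions; the dataset's actual schema is documented at https://reveal-dataset.github.io/.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    text: str
    relevant: bool           # is this step relevant to answering the question?
    attribution: str         # e.g. "supported", "contradicted", "not_enough_info"
    logically_correct: bool  # does the step follow from the evidence and prior steps?

@dataclass
class ChainAnnotation:
    question: str
    answer: str
    steps: list[ReasoningStep] = field(default_factory=list)

    def chain_is_correct(self) -> bool:
        # "As strong as its weakest link": every step must pass every check.
        return all(
            s.relevant and s.attribution == "supported" and s.logically_correct
            for s in self.steps
        )
```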
Systematization, Analysis, and Mitigation of LLMs Hallucinations
Fazl Barez
Zorik Gekhman
Gabriel Stanovsky
Itay Itzhak
Roi Reichart
Yonatan Belinkov
Dana Arad
Adi Simhi
arXiv (2024)
Abstract
Hallucinations in large language models represent a critical barrier to reliable usage. However, existing research tends to focus on categorizing error types by their manifestations rather than by their underlying knowledge-related causes. We propose a novel framework for categorizing hallucinations along two dimensions that are critical for effective mitigation: knowledge and certainty. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge (HK−) and those occurring despite the model having the correct knowledge (HK+). Through model-specific dataset construction and comprehensive experiments across multiple models and datasets, we show that HK+ and HK− hallucinations can be distinguished. Furthermore, HK+ and HK− hallucinations exhibit different characteristics and respond differently to mitigation strategies, with activation steering proving effective only for HK+ hallucinations. We then turn to the certainty axis, identifying a particularly concerning subset of HK+ hallucinations that occur with high certainty, which we refer to as Certainty Misalignment (CC): cases where models hallucinate with certainty despite having the correct knowledge. To address this, we introduce a new evaluation metric (CC-Score), which reveals significant blind spots in existing mitigation methods: they may perform well on average but fail disproportionately on these critical cases. Our targeted probe-based mitigation approach, designed specifically for CC instances, outperforms existing methods (such as internal probing-based and prompting-based approaches). These findings highlight the importance of considering both knowledge and certainty in hallucination analysis, and call for more targeted detection and mitigation approaches that consider the underlying causes.
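A minimal sketch of the two-axis categorization described above, assuming hypothetical estimates of whether the model knows the answer (e.g. from an internal probe) and of its certainty; the threshold and labels are illustrative, not the paper's exact procedure.

```python
def categorize_hallucination(knows_answer: bool, certainty: float, threshold: float = 0.8) -> str:
    """Coarse label for a hallucinated answer along the knowledge and certainty axes."""
    if not knows_answer:
        return "HK-"                          # hallucination caused by missing knowledge
    if certainty >= threshold:
        return "HK+ (certainty misaligned)"   # correct knowledge present, yet a confident hallucination
    return "HK+"                              # correct knowledge present, low-certainty hallucination
```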
A Comprehensive Evaluation of Tool-Assisted Generation Strategies
Alon Jacovi
Findings of EMNLP (2023)
Abstract
A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed, but there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult, unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during* generation; and (3) tool-assisted strategies are expensive in the number of tokens they require, incurring additional costs by orders of magnitude that do not translate into significant improvements in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies that accurately assess their *benefits* and *costs*.
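As a concrete illustration of the *refine*-style strategy highlighted in finding (2), a minimal sketch follows that drafts an answer first and then revises it against retrieved evidence. The `llm` and `search` callables and the prompt wording are assumptions, not the exact implementations compared in the study.

```python
from typing import Callable

def refine_with_tool(
    question: str,
    llm: Callable[[str], str],
    search: Callable[[str], str],
) -> str:
    """Draft an answer, then revise it against retrieved evidence."""
    draft = llm(f"Question: {question}\nAnswer:")
    evidence = search(question)
    revise_prompt = (
        f"Question: {question}\n"
        f"Draft answer: {draft}\n"
        f"Search results: {evidence}\n"
        "Revise the draft answer so that it is consistent with the search results:"
    )
    return llm(revise_prompt)
```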
Abstract
Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models show limited success at evaluating summaries. Previous work improved such models with synthetic training data. However, that data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and cover only a limited range of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries with an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained on our data substantially outperforms both a state-of-the-art model of similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. We also show that our method generalizes to multilingual scenarios using the mFACE dataset. Finally, we release a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher.
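A minimal sketch of the data-generation loop described above, assuming generic `summarizers` and `llm_judge` callables; the prompt and label format are illustrative, not TrueTeacher's exact recipe.

```python
def generate_synthetic_data(documents, summarizers, llm_judge):
    """Label diverse model-generated summaries for factual consistency with an LLM."""
    data = []
    for doc in documents:
        for summarize in summarizers:  # diverse summarization models
            summary = summarize(doc)
            verdict = llm_judge(
                f"Document:\n{doc}\n\nSummary:\n{summary}\n\n"
                "Is the summary factually consistent with the document? Answer yes or no."
            )
            label = 1 if verdict.strip().lower().startswith("yes") else 0
            data.append({"premise": doc, "hypothesis": summary, "label": label})
    return data  # training data for a smaller NLI-style student model
```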
What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom
Eran Ofek
arXiv (2023)
Abstract
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic image-text alignment evaluation. We first introduce a comprehensive evaluation set spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach based on synthetic data generation. Both methods surpass prior approaches in various text-image alignment tasks, with our analysis showing significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
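A minimal sketch of the first method (question generation plus visual question answering), assuming hypothetical `generate_questions` and `vqa` callables; the exact models and answer-matching used in the paper may differ.

```python
def qg_vqa_alignment_score(text, image, generate_questions, vqa) -> float:
    """Score text-image alignment as the fraction of generated questions the image answers consistently."""
    qa_pairs = generate_questions(text)  # [(question, expected_answer), ...]
    if not qa_pairs:
        return 0.0
    consistent = sum(
        1 for question, expected in qa_pairs
        if vqa(image, question).strip().lower() == expected.strip().lower()
    )
    return consistent / len(qa_pairs)
```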
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Pat Verga
Jianmo Ni
arXiv (2022)
Abstract
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give concrete answers to two key questions (How do we measure attribution? How well do current state-of-the-art methods perform on attribution?) and hints as to how to address a third (How do we build LLMs with attribution?).
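As an illustration of what an attributed answer and a correlated automatic metric might look like in this setting, the sketch below pairs an answer with its cited passage and checks entailment with an NLI model. The record fields and the `nli_entails` callable are assumptions, not the paper's evaluation framework.

```python
from dataclasses import dataclass

@dataclass
class AttributedAnswer:
    question: str
    answer: str
    attribution: str  # the passage cited as support for the answer

def auto_attribution_score(example: AttributedAnswer, nli_entails) -> bool:
    """Judge attribution automatically: does the cited passage entail the answer claim?"""
    hypothesis = f"The answer to the question '{example.question}' is {example.answer}."
    return nli_entails(premise=example.attribution, hypothesis=hypothesis)
```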
TRUE: Re-evaluating Factual Consistency Evaluation
Or Honovich
Hagai Taitelbaum
Vered Cohen
Thomas Scialom
NAACL 2022, The Association for Computational Linguistics (2022)
Abstract
Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs, and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in isolation for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear.
In this work, we introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI- and question generation-and-answering-based approaches achieve strong and complementary results. We recommend these methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better methods.
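A minimal sketch of an example-level meta-evaluation in the spirit described above, assuming each example carries a binary consistency label and each metric returns a score; ranking metrics by ROC AUC is one natural choice, not necessarily TRUE's exact protocol.

```python
from sklearn.metrics import roc_auc_score

def example_level_meta_eval(examples, metric) -> float:
    """examples: dicts with 'grounding', 'generated_text' and a binary 'label' (1 = consistent)."""
    labels = [ex["label"] for ex in examples]
    scores = [metric(ex["grounding"], ex["generated_text"]) for ex in examples]
    return roc_auc_score(labels, scores)  # how well the metric separates consistent from inconsistent examples
```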