Language Models Know More Than They Show: Exploring Hallucinations From the Model’s Viewpoint

Hadas Orgad; Michael Toker; Zorik Gekhman; Roi Reichart; Idan Szpektor; Hadas Kotek; Yonatan Belinkov

2025

Abstract

We introduce a model-centric approach to investigate hallucinations and other errors generated by large language models (LLMs).
We begin by developing an enhanced error detection method: a linear classifier that leverages the intermediate representations of exact answer tokens and outperforms existing techniques.
Our findings confirm that LLMs encode information on the truthfulness of their outputs, yet they also challenge the existence of universal truthfulness features by showing that generalization is skill-specific.
Next, we propose a new error categorization based on analyzing the distribution of the models' responses, which uncovers distinct patterns in error types.
We discover that these types are also predictable from internal model states, revealing that internal representations encode more than truthfulness.
Finally, we find that a trained probe can effectively identify the correct answer among multiple generated samples, outperforming other baselines for selecting which answer to return.
This exposes a critical disconnect between the external behavior of LLMs and their internal state.
Our results suggest new directions for understanding and mitigating errors in LLMs.
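
The probing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the random vectors below are a synthetic stand-in for intermediate representations taken at exact answer tokens, the "truthfulness" direction is an assumption made so the toy data is separable, and the names `train_probe`, `probe_scores`, and `select_answer` are hypothetical. The sketch fits a linear (logistic-regression) probe to predict whether an answer is correct, then uses the probe to pick one answer from several sampled candidates.

```python
import numpy as np


def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(answer is correct)
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_scores(X, w, b):
    """Probe's estimated probability that each representation belongs to a correct answer."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def select_answer(candidates, features, w, b):
    """Return the sampled answer whose representation the probe scores highest."""
    return candidates[int(np.argmax(probe_scores(features, w, b)))]


# Synthetic stand-in for hidden states: NOT real model activations.
rng = np.random.default_rng(0)
dim = 16
direction = rng.normal(size=dim)                 # assumed direction separating correct from incorrect
y = rng.integers(0, 2, size=400).astype(float)   # 1 = correct answer, 0 = error
X = rng.normal(size=(400, dim)) + np.outer(2 * y - 1, direction)

# Train on the first 300 examples, evaluate error detection on the rest.
w, b = train_probe(X[:300], y[:300])
accuracy = ((probe_scores(X[300:], w, b) > 0.5) == y[300:]).mean()
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real setting, each row of `X` would be a hidden-state vector extracted from a chosen layer at the answer token, and `select_answer` would be called with the representations of several sampled responses to the same question.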
