Abstract
We introduce a model-centric approach to investigate hallucinations and other errors generated by large language models (LLMs).
We begin by developing an enhanced error detection method: a linear classifier trained on the intermediate representations of exact answer tokens, which outperforms existing techniques.
Our findings confirm that LLMs encode information on the truthfulness of their outputs, yet they also challenge the existence of universal truthfulness features by showing that generalization is skill-specific.
Next, we propose a new error categorization by analyzing the distribution of LLM responses, which uncovers distinct patterns in error types.
We discover that these types are also predictable from internal model states, revealing that internal representations encode more than truthfulness.
Finally, we find that a trained probe can effectively identify correct answers among multiple generated samples, outperforming other answer-selection baselines.
This exposes a critical \textit{disconnect} between the external behavior of LLMs and their internal state.
Our results suggest new directions for understanding and mitigating errors in LLMs.
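
As an illustration of the probing idea summarized above, the following is a minimal sketch, not the authors' implementation. It assumes that each generated answer comes with a fixed-size vector standing in for the intermediate representation at its exact answer token; here those vectors are synthetic, and the names `fake_hidden_state`, `train_probe`, `predict_correct_prob`, and `select_answer` are hypothetical. A logistic-regression probe is fitted to predict correctness, and the same probe is then used to pick the highest-scoring answer among several candidates.

```python
import math
import random


def predict_correct_prob(w, b, x):
    """Probe output: estimated probability that the answer behind x is correct."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_probe(features, labels, lr=0.5, epochs=300):
    """Fit a linear (logistic-regression) probe with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict_correct_prob(w, b, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def select_answer(w, b, candidates):
    """Return the answer text whose representation the probe scores highest."""
    return max(candidates, key=lambda c: predict_correct_prob(w, b, c[1]))[0]


# ASSUMPTION: synthetic 4-dimensional vectors replace real hidden states.
# Correct answers are shifted along the first dimension so a linear probe can
# separate them; real representations would come from the model itself.
rng = random.Random(0)


def fake_hidden_state(is_correct):
    shift = 1.0 if is_correct else -1.0
    return [rng.gauss(shift, 0.5)] + [rng.gauss(0.0, 0.5) for _ in range(3)]


labels = [i % 2 for i in range(200)]
features = [fake_hidden_state(y == 1) for y in labels]
w, b = train_probe(features, labels)

# Training accuracy of the probe as an error detector.
accuracy = sum(
    (predict_correct_prob(w, b, x) > 0.5) == (y == 1)
    for x, y in zip(features, labels)
) / len(labels)

# Answer selection: choose among several sampled answers to one question.
candidates = [
    ("answer_A", [-2.0, 0.0, 0.0, 0.0]),
    ("answer_B", [2.0, 0.0, 0.0, 0.0]),
]
chosen = select_answer(w, b, candidates)
```

In this toy setup the probe serves two roles: thresholding its output gives an error detector, and ranking candidates by its output gives an answer selector. Whether a real model's representations are linearly separable in this way is an empirical question that the sketch does not address.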