Abstract
Hallucinations in large language models represent a critical barrier to reliable use. However, existing research tends to categorize error types by their manifestations rather than by their underlying knowledge-related causes. We propose a novel framework for categorizing hallucinations along two dimensions critical for effective mitigation: knowledge and certainty. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge (HK−) and those occurring despite the model having the correct knowledge (HK+). Through model-specific dataset construction and comprehensive experiments across multiple models and datasets, we show that HK+ and HK− hallucinations can be distinguished. Furthermore, HK+ and HK− hallucinations exhibit different characteristics and respond differently to mitigation strategies, with activation steering proving effective only for HK+ hallucinations. We then turn to the certainty axis, identifying a particularly concerning subset of HK+ hallucinations that occur with high certainty, which we refer to as Certainty Misalignment (CC): cases where models hallucinate with certainty despite having the correct knowledge. To address this, we introduce a new evaluation metric, the CC-Score, which reveals significant blind spots in existing mitigation methods: they may perform well on average yet fail disproportionately on these critical cases. Our targeted probe-based mitigation approach, designed specifically for CC instances, outperforms existing methods, including internal probing-based and prompting-based approaches. These findings highlight the importance of considering both knowledge and certainty in hallucination analysis and call for more targeted detection and mitigation approaches that address the underlying causes of hallucinations.