Thinking to recall: How reasoning unlocks parametric knowledge in LLMs

It is well-established that allowing large language models (LLMs) to generate step-by-step reasoning traces, commonly known as chain-of-thought (CoT), enhances performance on complex tasks. When a model solves difficult math equations, writes software, or answers multi-hop factual questions, breaking the problem down into manageable logical steps is highly effective.

However, the utility of this approach remains unclear for simple, single-hop factual questions. For instance, consider a query like: "What year was Mary Engle Pennington inducted into the National Inventors Hall of Fame?" An LLM either has the fact stored in its parametric memory (knowledge encoded directly into its weights) or it doesn't; no complex arithmetic or logical deduction is required. So why would a reasoning trace help?

In "Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs", to be presented at COLM 2026, we investigate this phenomenon. We demonstrate that allowing a model to generate a reasoning trace unlocks correct answers that are otherwise effectively unreachable. To understand why reasoning aids parametric knowledge recall when there are no complex reasoning steps to execute, we conduct a series of hypothesis-driven controlled experiments. Our findings reveal two complementary mechanisms driving this effect: a computational buffer effect and factual priming.

Probing the knowledge boundary

We first measure the parametric recall capability boundary using the pass@k metric. Instead of only checking one model-generated answer, pass@k checks if the correct fact exists within multiple generated attempts. By evaluating the presence of successful reasoning paths in the model’s output distribution while being less sensitive to their exact ranking, pass@k helps us estimate the potential of reasoning for factual recall, rather than only looking at the current model’s top-1 behavior. To assess the impact of reasoning while controlling for parametric knowledge, we focus on reasoning LLMs (R-LLMs) where reasoning can be enabled or disabled (toggled on or off), and compare pass@k between these two modes. We focus on the Gemini-2.5 (Flash and Pro) and Qwen3-32B models, using two challenging closed-book QA datasets: SimpleQA Verified and EntityQuestions.
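As a rough illustration of the metric, the sketch below uses the widely used unbiased pass@k estimator: given n sampled answers of which c are correct, it computes the probability that a random subset of k samples contains at least one correct answer. This post does not state which estimator was used, so the function name `pass_at_k` and the choice of estimator are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled answers, c of which are correct.

    Assumption: this is the standard unbiased estimator, not necessarily
    the exact procedure used in the experiments described in this post.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the samples contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 4 samples and 1 correct answer, `pass_at_k(4, 1, 2)` evaluates to 0.5: half of the possible pairs of samples include the correct one.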

The results are surprisingly consistent. When reasoning is enabled, the models successfully recall answers that are virtually unrecoverable when reasoning is off. Importantly, this improvement isn't just because the model is decomposing complex questions: we deliberately focus on datasets containing predominantly simple, single-hop questions, which leave little to decompose.

Pass@𝑘 curves across two closed-book QA datasets and three LLMs, comparing the same models with reasoning enabled (ON) vs reasoning disabled (OFF).

These results raise the question: if the effect does not come from step-by-step reasoning, what reasoning patterns enable the model to retrieve the correct answer?

Mechanism 1: The computational buffer

Our first hypothesis focuses on the mechanics of generation. We take the long-standing hypothesis that generating extra tokens acts as extended computation time by providing additional forward passes, and test it in the new setting of parametric knowledge recall in R-LLMs. Specifically, we hypothesize that models implicitly use these reasoning tokens as a computational buffer to perform latent processing, independent of the actual semantic content being generated.

To test this, we design an experiment that removes all meaningful content from the reasoning trace. We intercept the model's reasoning process and replace its generated trace with the meaningless string "Let me think", repeated until it matches the length of the original reasoning trace. We then let the model predict the final answer conditioned on this dummy text.
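The dummy-trace construction can be sketched as follows. This is a minimal illustration, not the experimental code: the helper name `build_dummy_trace` is hypothetical, and whitespace splitting stands in for the model's real tokenizer, which an actual experiment would use to match token length.

```python
def build_dummy_trace(original_trace: str, filler: str = "Let me think") -> str:
    """Repeat a content-free filler until it matches the original trace length.

    Assumption: length is counted in whitespace-separated words, as a
    stand-in for the model tokenizer's token count.
    """
    filler_tokens = filler.split()
    if not filler_tokens:
        raise ValueError("filler must contain at least one token")
    target_len = len(original_trace.split())
    # Ceiling division: enough repetitions to cover the target length.
    repeats = -(-target_len // len(filler_tokens))
    # Truncate so the dummy trace has exactly target_len tokens.
    return " ".join((filler_tokens * repeats)[:target_len])
```

For a five-word original trace, the result is "Let me think Let me": the filler is cut mid-phrase so that the two traces have equal length.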

Remarkably, conditioning the model on this meaningless trace substantially improves its ability to recall the correct answer compared to the baseline where reasoning is completely turned off. This provides strong evidence that simply giving the model more computational runway helps it refine its internal state and fetch hard-to-reach facts.

Computational buffer effect on Gemini-2.5-Flash. ON Dummy overrides the thinking trace with a short sequence without factual content that is repeated to match the token length of the original trace.

However, this compute-buffer effect has its limits. Pushing the dummy text to longer lengths eventually offers diminishing returns, and it never fully matches the performance of the model's natural reasoning traces. This means that while extra computation helps, the actual content of the thoughts still matters.

Reasoning effectiveness as a function of the input length in tokens when conditioning on dummy reasoning traces. ON Dummy X overrides the reasoning trace with a short dummy sequence that is repeated until the input length reaches X tokens. The reasoning effectiveness metric (Ω) summarizes the pass@k gains across all k values. We define it as a weighted average of the relative difference in pass@𝑘 between the reasoning ON and OFF modes.
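The caption above describes Ω only in words, so the following is one possible reading rather than the exact definition. It assumes the relative difference is measured against the OFF value, (ON - OFF) / OFF, and that the weights default to uniform; the actual weighting scheme is not given here. The function name `reasoning_effectiveness` is hypothetical.

```python
from typing import Dict, Optional


def reasoning_effectiveness(
    pass_on: Dict[int, float],
    pass_off: Dict[int, float],
    weights: Optional[Dict[int, float]] = None,
) -> float:
    """Weighted average over k of the relative pass@k gain of ON over OFF.

    Assumptions (not taken from the source): the relative difference is
    (on - off) / off, and weights are uniform unless provided.
    """
    ks = sorted(pass_on)
    if ks != sorted(pass_off) or not ks:
        raise ValueError("pass_on and pass_off must share the same non-empty k values")
    if weights is None:
        weights = {k: 1.0 for k in ks}
    total_weight = sum(weights[k] for k in ks)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = 0.0
    for k in ks:
        if pass_off[k] <= 0:
            raise ValueError(f"pass@{k} with reasoning OFF must be positive")
        weighted_sum += weights[k] * (pass_on[k] - pass_off[k]) / pass_off[k]
    return weighted_sum / total_weight
```

Under these assumptions, a positive Ω means reasoning improves pass@k on average, and Ω = 0 means the ON and OFF curves coincide.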

Mechanism 2: Factual priming

When we analyze the natural reasoning traces generated for simple factual questions, we notice a common pattern. The models aren't writing out logical proofs; they are surfacing related facts.

In human cognition, there is a concept known as spreading activation, where processing a specific concept primes related concepts in semantic memory, making them easier to retrieve. We hypothesize that language models exhibit a similar generative self-retrieval mechanism, which we call factual priming. By generating facts topically related to the question, the model builds a contextual bridge that facilitates the retrieval of the correct answer.

To test this hypothesis, we extract just the concrete facts from the model’s reasoning traces, applying strict filtering to strip away any filler text, search plans, or explicit mentions of the final target answer. We then isolate the effect of the recalled facts, and show that conditioning on a short list of recalled facts recovers most of reasoning’s gains and helps even when reasoning is OFF.
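The filtering step can be pictured with the simplified sketch below. It assumes the facts have already been extracted from a trace as a list of strings, and it only covers two simple checks: dropping facts that mention the target answer and removing duplicates. The real pipeline is likely more involved, and both helper names and the prompt layout are hypothetical.

```python
from typing import List


def filter_facts(facts: List[str], target_answer: str) -> List[str]:
    """Keep non-empty, de-duplicated facts that do not leak the target answer.

    Assumption: leakage is detected by case-insensitive substring match,
    which is a crude stand-in for the strict filtering described above.
    """
    target = target_answer.casefold().strip()
    seen = set()
    kept = []
    for fact in facts:
        normalized = " ".join(fact.split())  # collapse stray whitespace
        key = normalized.casefold()
        if not normalized or key in seen:
            continue
        if target and target in key:
            continue  # the fact explicitly mentions the final answer
        seen.add(key)
        kept.append(normalized)
    return kept


def build_off_facts_prompt(question: str, facts: List[str]) -> str:
    """Hypothetical prompt layout for supplying facts as extra input context."""
    fact_lines = "\n".join(f"- {fact}" for fact in facts)
    return f"Relevant facts:\n{fact_lines}\n\nQuestion: {question}"
```

The second helper mirrors the idea of passing the fact list as additional context while reasoning is disabled; the exact prompt wording is an assumption.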

Factual priming effect on Gemini-2.5-Flash. We first extract the facts mentioned during reasoning. ON Facts overrides the model’s original reasoning trace with this short fact list and regenerates the final answer, while OFF Facts runs the model with reasoning disabled and the fact list provided as additional input context in the prompt.

For example, if asked for the name of the 10th King of Nepal, a reasoning model might first list the previous nine kings. Recalling those first nine acts as a semantic warm-up, priming the network to successfully recall the 10th. The facts themselves are the stepping stones.

An illustration of "factual priming" in action, where intermediate factual retrieval (listing the previous nine kings) primes the model to successfully recall the 10th King of Nepal. The model answers correctly with reasoning enabled (ON) but fails without it. It also succeeds when the prediction is conditioned only on a short list of facts recalled during reasoning (ON Facts).

The hallucination trap

While generative self-retrieval is a powerful mechanism, it introduces a fundamental risk. Because the model generates these intermediate facts itself, they might be hallucinated. How do such reasoning-stage errors affect the final answer? To find out, we build a large-scale auditing pipeline that uses a search-enabled verifier to independently check the correctness of every intermediate fact generated across hundreds of thousands of reasoning traces.

The audit reveals a distinct pattern. If a reasoning trace contains even a single hallucinated intermediate fact, the model is significantly less likely to arrive at the correct final answer. This suggests that, while effective, the factual priming mechanism might be fragile.
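The comparison behind this finding can be sketched as a simple aggregation. The example assumes each audited trace already carries per-fact verdicts from a verifier plus a flag for whether the final answer was correct; the `AuditedTrace` type and the function name are hypothetical, and the verifier itself is out of scope here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AuditedTrace:
    fact_verdicts: List[bool]  # True = the intermediate fact was verified as correct
    answer_correct: bool       # whether the final answer matched the gold answer


def accuracy_by_trace_quality(traces: List[AuditedTrace]) -> Dict[str, Optional[float]]:
    """Compare final-answer accuracy for clean vs. hallucinated traces.

    Assumption: a trace is 'hallucinated' if at least one intermediate fact
    failed verification; a trace with no facts is counted as 'clean'.
    """
    buckets: Dict[str, List[bool]] = {"clean": [], "hallucinated": []}
    for trace in traces:
        key = "clean" if all(trace.fact_verdicts) else "hallucinated"
        buckets[key].append(trace.answer_correct)
    # None marks an empty bucket, where accuracy is undefined.
    return {
        name: (sum(outcomes) / len(outcomes) if outcomes else None)
        for name, outcomes in buckets.items()
    }
```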

Ratio of correct answers when reasoning traces contain hallucinations (hallucinated) compared to those that do not contain hallucinations (clean).

Building more reliable models

Understanding these mechanisms provides practical avenues for improving model reliability. Because factual priming is effective and hallucinated intermediate facts degrade performance, we can leverage both insights to improve model accuracy.

To evaluate the potential of these insights, we use a test-time selection strategy that generates multiple reasoning trajectories for a single question, retaining only those that contain verifiable, hallucination-free facts. Prioritizing these trajectories considerably improves accuracy. In practice, this prioritization could be implemented during training via process rewards that encourage factually supported intermediate steps.
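A minimal sketch of such a selection strategy is shown below. It assumes each sampled trajectory exposes its extracted facts and final answer, and that a verifier is available as a callable. The `Trajectory` type, the function names, and the fallback to the full pool when nothing passes are illustrative assumptions rather than details from the experiments.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    facts: List[str]  # intermediate facts extracted from the reasoning trace
    answer: str       # final answer produced by this trajectory


def select_trajectories(
    trajectories: List[Trajectory],
    is_fact_supported: Callable[[str], bool],
) -> List[Trajectory]:
    """Keep trajectories that recall at least one fact and contain no unsupported fact."""
    kept = [
        t for t in trajectories
        if t.facts and all(is_fact_supported(fact) for fact in t.facts)
    ]
    # Assumption: if no trajectory passes, fall back to the unfiltered pool.
    return kept if kept else list(trajectories)


def expected_accuracy(pool: List[Trajectory], gold_answer: str) -> float:
    """Probability that a trajectory drawn uniformly from the pool is correct."""
    if not pool:
        return 0.0
    return sum(t.answer == gold_answer for t in pool) / len(pool)
```

In this sketch the two filter conditions correspond to factual recall (the trace contains facts) and factual correctness (every fact is supported), and expected accuracy is compared before and after filtering.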

Expected accuracy under test-time selection criteria based on factual recall and factual correctness.

Conclusion

Our findings highlight that reasoning in language models serves a much broader purpose than just task decomposition or mathematical logic. It acts as a fundamental mechanism for exposing a model's internal memory and expanding its parametric knowledge boundary. These insights open up exciting directions for future research. Knowing that factually accurate reasoning traces yield better answers suggests that training recipes can be further optimized. By utilizing process rewards that specifically encourage factually supported intermediate steps, we might be able to train models that are inherently more reliable and less prone to hallucination. We look forward to seeing how the research community continues to explore the intersections of reasoning, memory, and retrieval.

Acknowledgements

This research was conducted by Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart and Jonathan Herzig. We thank Eyal Ben-David and Avinatan Hassidim for reviewing the work and their valuable suggestions.