Retrieval and generation are two separate steps with two separate failure modes. Retrieving the right document means your retrieval system is working. Producing the wrong answer from that document means your generation step is failing. These problems require different diagnoses and different fixes.
Analysis Briefing
- Topic: RAG generation failure modes after successful retrieval
- Analyst: Mike D (@MrComputerScience)
- Context: Born from an exchange with Claude Sonnet 4.6 that refused to stay shallow
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: If the model has the right document in its context, what causes it to still produce the wrong answer?
The Four Most Common Generation Failures After Correct Retrieval
The context is too long. When multiple documents are retrieved and injected into the prompt, the relevant passage may be buried in the middle of a long context block. Research on the lost-in-the-middle problem shows that language models attend most strongly to content at the beginning and end of long contexts. A correct answer sitting in the middle of a 4,000-token retrieved context block may be effectively invisible when the model generates its answer.
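One way to keep the relevant passage out of that dead zone is to order passages by retrieval score before assembling the prompt, so the best match leads the context block. A minimal sketch, where the function name and the word-count token estimate are illustrative, not a specific library's API:

```python
def build_context(passages, max_tokens=4000):
    """Assemble a context block with the highest-scoring passage first.

    `passages` is a list of (score, text) pairs from the retriever.
    Sorting by score descending keeps the most relevant text at the
    start of the block, where attention is strongest, instead of
    buried mid-context.
    """
    ordered = sorted(passages, key=lambda p: p[0], reverse=True)
    block, used = [], 0
    for score, text in ordered:
        cost = len(text.split())  # crude token estimate; swap in a real tokenizer
        if used + cost > max_tokens:
            break
        block.append(text)
        used += cost
    return "\n\n".join(block)
```

The same scores the retriever already produced are enough; no extra model call is needed for this reordering.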
The question and the passage don’t share vocabulary. Retrieval found the right document because of semantic similarity in embedding space. But the model reading that document tries to match the user’s question to the passage text, and if the question uses different terminology than the passage, the model may not recognize that the passage answers the question. It reads the document correctly and fails to connect it to what was asked.
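A common mitigation is to retrieve with several phrasings of the question and merge the ranked lists, for example with reciprocal rank fusion. A sketch, assuming a `retrieve(query)` function that returns a ranked list of document ids (the function and the constant `k=60` are illustrative):

```python
from collections import defaultdict

def fused_retrieval(phrasings, retrieve, k=60):
    """Merge ranked result lists from several query phrasings.

    Reciprocal rank fusion: each document earns 1 / (k + rank) per
    list it appears in, so a passage that matches any phrasing of
    the question can still surface near the top of the merged list.
    """
    scores = defaultdict(float)
    for query in phrasings:
        for rank, doc_id in enumerate(retrieve(query), start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks moderately under two different phrasings will outscore one that ranks well under only one, which is the behavior you want when the user's vocabulary and the passage's vocabulary diverge.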
The model prefers its parametric memory. If the retrieved document contradicts something the model learned strongly during training, the model will sometimes ignore the document and produce its memorized answer instead. This is especially common for questions about facts that changed after the model’s training cutoff, where the retrieved document contains updated information and the model discards it in favor of what it “knows.”
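An explicit grounding instruction in the prompt makes the precedence rule unambiguous. The exact wording below is illustrative, not a canonical formula:

```python
GROUNDING_INSTRUCTION = (
    "Answer ONLY from the documents provided below. "
    "If the documents contradict what you believe you know, the documents win. "
    "If the documents do not contain the answer, say so explicitly."
)

def build_prompt(question, context):
    """Wrap retrieved context with a grounding instruction so the model
    defers to the documents over its parametric memory."""
    return f"{GROUNDING_INSTRUCTION}\n\nDocuments:\n{context}\n\nQuestion: {question}"
```

The "say so explicitly" clause matters: without an allowed escape hatch, a model that cannot find the answer in the documents is more likely to fall back on its memorized one.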
The retrieved passage requires inference the model skips. Some questions require the model to combine information from a retrieved passage with reasoning to reach the answer. If the passage says the policy changed on March 1 and the user asks whether the policy applies to an event from February, the model needs to do date arithmetic to answer correctly. Under prompt configurations that encourage the model to answer from the document directly, it may produce a summary of the document rather than performing the inference.
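The inference in that example is trivial to state in code, which is exactly why its absence is easy to miss in the model's output. A sketch of the date comparison the model is expected to perform (the dates are hypothetical):

```python
from datetime import date

def new_policy_applies(event_date, policy_effective):
    """The inference step the model must perform: an event before the
    policy's effective date is governed by the old policy, not the new one."""
    return event_date >= policy_effective

# The example from the text: policy changed March 1, event in February.
new_policy_applies(date(2024, 2, 15), date(2024, 3, 1))  # False: old policy governs
```

A model that merely summarizes the retrieved passage will report the March 1 change date without ever performing this comparison.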
How to Distinguish a Retrieval Problem From a Generation Problem
Log both the retrieved documents and the final answers for a sample of queries. For any wrong answer, manually check whether the retrieved documents contained the correct information. If they did, you have a generation problem. If they did not, you have a retrieval problem. Without this logging, both failure types look identical from the user’s perspective.
Stale index poisoning in RAG is a retrieval problem. The failures described in this article are generation problems. Conflating them leads to fixing the wrong layer.
What Actually Fixes Generation Failures
- For the lost-in-the-middle problem, rerank retrieved passages so the most relevant content appears at the beginning of the context block, not buried in the middle.
- For vocabulary mismatch, add query expansion that generates alternative phrasings of the user's question before retrieval.
- For parametric memory override, add an explicit instruction in the system prompt to treat the provided context as ground truth and to defer to it over prior knowledge.
- For inference-requiring passages, add chain-of-thought instructions that ask the model to reason step by step from the context before answering.
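The chain-of-thought fix can be as simple as a suffix appended to the prompt. The wording below is one illustrative option, not a tested formula:

```python
COT_SUFFIX = (
    "Before giving your final answer, reason step by step from the "
    "documents above: quote the relevant sentence, state any intermediate "
    "inference (such as comparing dates), then give the answer."
)

def with_reasoning(prompt):
    """Append a chain-of-thought instruction so the model performs the
    required inference instead of summarizing the document."""
    return prompt + "\n\n" + COT_SUFFIX
```

Asking the model to quote the relevant sentence first also makes its answers auditable: a wrong answer with a quoted source is far easier to diagnose than a bare wrong answer.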
What This Means For You
- Log retrieved documents and final answers separately so you can distinguish retrieval failures from generation failures when diagnosing wrong answers.
- Reorder retrieved context to place the most relevant passage first, because model attention to middle-of-context content is significantly weaker than attention to early content.
- Add an explicit instruction to defer to retrieved context over prior knowledge for any RAG application where the documents may contain information newer than or contradicting the model’s training data.
Enjoyed this? Subscribe for more clear thinking on AI:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
