Every major LLM in 2025-2026 has a documented attention failure in the middle of long context windows. Information you place at the start or end of a long prompt gets recalled accurately. Information buried in the middle gets ignored, distorted, or silently dropped. This is not a bug that will be patched. It is a structural property of how transformer attention works.
Pithy Cyborg | AI FAQs – The Details
Question: Why do LLMs forget the middle of long prompts, and how does that silently corrupt outputs when you feed large documents or long chat histories into ChatGPT, Claude, or Gemini?
Asked by: Claude Sonnet 4.6
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
The Lost-in-the-Middle Problem Transformer Makers Downplay
In 2023, researchers at Stanford and UC Berkeley published a paper called “Lost in the Middle” that quietly broke a lot of AI marketing copy. They tested how well LLMs retrieved information depending on where in a long document it appeared. Performance was highest at the beginning and end of the context window. The middle was where accuracy collapsed.
The result is a U-shaped performance curve, and it shows up across GPT-4, Claude, and Gemini variants to varying degrees. The effect worsens as context length grows. A model with a 128,000-token context window does not give you 128,000 tokens of reliable recall. It gives you reliable recall near the edges and increasingly unreliable recall toward the center.
Nobody in the marketing materials for Claude’s 200k context window or Gemini 1.5’s million-token window leads with that caveat.
Why Transformer Attention Structurally Cannot Fix This
The root cause is how self-attention distributes weight as sequence length grows. When the model generates a token, it attends to every token in the context, but attention weights are far from uniform. Tokens at the start benefit from the model's learned tendency to anchor to context beginnings (the "attention sink" effect). Tokens at the end get elevated attention because they sit closest to the position where the model generates its next token.
Tokens in the middle compete for attention against a much larger pool of surrounding tokens with no structural positional advantage. They are not ignored by design. They lose a statistical competition that the architecture runs on every forward pass.
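That competition can be made concrete with a deterministic toy model. This is my own illustration, not real model internals: every token gets the same base score, a start-anchor bonus goes to the first token, a linear recency bonus favors late tokens, and a softmax turns scores into attention weights. The specific numbers (`sink_bonus`, `recency_scale`) are invented for illustration.

```python
import math

def toy_attention_weights(n=1000, sink_bonus=4.0, recency_scale=2.0):
    # Equal base scores for every token, plus the two biases described
    # above. Purely illustrative: real scores come from learned query/key
    # projections, but the softmax competition works the same way.
    scores = [recency_scale * i / n for i in range(n)]  # recency bias toward the end
    scores[0] += sink_bonus                             # anchor bias on the first token
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]                    # softmax: weights sum to 1

w = toy_attention_weights()
print(f"first token:  {w[0]:.6f}")    # well above the 1/1000 uniform baseline
print(f"middle token: {w[500]:.6f}")  # below the uniform baseline
print(f"last token:   {w[-1]:.6f}")   # higher than the middle token
```

With these toy numbers, the middle token ends up with less attention mass than either the anchored first token or the recency-favored last token, which is the per-token version of the competition described above.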
Techniques like Rotary Position Embeddings (RoPE) and ALiBi improve long-context handling but do not eliminate the U-curve. They adjust how position is encoded, not how attention competes. Researchers are working on approaches like DIFF Transformer (Microsoft, 2024) that try to cancel out irrelevant attention noise, but nothing in production has solved this cleanly.
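To see why a position-encoding change is not a cure, here is a minimal sketch of ALiBi's core idea (my own illustration of the linear-distance penalty from Press et al., not the paper's implementation, which uses per-head slopes inside the attention kernel):

```python
def alibi_scores(raw_scores, slope=0.0625):
    # Apply an ALiBi-style linear distance penalty to one query's raw
    # attention scores: one score per earlier token, query is the last
    # position. The slope value here is arbitrary, for illustration.
    n = len(raw_scores)
    return [s - slope * (n - 1 - k) for k, s in enumerate(raw_scores)]

# The penalty changes HOW position enters the scores, but every token
# still competes in the same softmax afterward -- a token in the middle
# of a long context gains no structural advantage from it.
biased = alibi_scores([0.0] * 9)
```

Running this on nine equal scores gives a strictly increasing ramp from the most distant token to the query, so the edges of the dead zone move, but the competition the article describes remains.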
How Middle-Context Failure Silently Corrupts Real Outputs
This is where it stops being academic. If you paste a 40-page contract into Claude and ask it to identify all liability clauses, the clauses on pages 3 and 37 will be found. The clauses on pages 18 through 22 have a materially higher chance of being missed or mischaracterized, with no warning in the output.
The model does not say “I may have missed something in section 4.7.” It produces a confident, well-formatted list that simply omits what it lost. The output looks complete. That is the actual danger.
The same failure mode applies to long chat histories, multi-document RAG pipelines where retrieved chunks land in the middle of a prompt, and any workflow where you are relying on an LLM to synthesize a document longer than roughly 20,000 tokens. The longer the context, the wider the dead zone.
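One way to see this failure in your own stack is a needle-at-depth probe: plant a known fact at a chosen depth in filler text and check whether the model retrieves it. The sketch below only builds the probe prompts; `ask_llm` is a placeholder for whatever client call you actually use, not a real API.

```python
def build_needle_prompt(needle, depth_fraction, filler_paragraph, n_paragraphs=200):
    # Plant a known fact (the "needle") at a chosen depth in otherwise
    # uniform filler. depth_fraction=0.5 lands it in the dead zone.
    insert_at = int(depth_fraction * n_paragraphs)
    paragraphs = [filler_paragraph] * n_paragraphs
    paragraphs.insert(insert_at, needle)
    doc = "\n\n".join(paragraphs)
    return f"{doc}\n\nBased only on the document above, what is the magic number?"

needle = "The magic number for this audit is 7421."
filler = "This paragraph is neutral filler text with no relevant facts."

# Probe the start, middle, and end of the context. `ask_llm` stands in
# for your real client (OpenAI, Anthropic, etc.) -- it is not a real API.
for depth in (0.0, 0.5, 1.0):
    prompt = build_needle_prompt(needle, depth, filler)
    # answer = ask_llm(prompt)
    # print(depth, "7421" in answer)
```

If the middle-depth probe fails while the edge probes succeed, you have measured your own dead zone rather than taken anyone's word for it.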
What This Means For You
- Never bury critical instructions in the middle of a long system prompt: put your most important constraints at the very beginning or the very end, not in the third paragraph of a 2,000-word prompt.
- Split long documents into sequential chunks rather than pasting them whole, and ask the model to process each chunk explicitly before synthesizing, forcing it to attend to the full content.
- Treat any LLM output from a long-context task as incomplete by default: if the document is over 20,000 tokens, budget time to manually verify that middle-section content was actually addressed.
- Test your own workflows now: take a document you have already run through a long-context model, ask it specifically about something you know is buried in the middle third, and check whether it was handled correctly.
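The chunk-then-synthesize advice above can be sketched as follows. The splitting heuristic is my own (paragraph-boundary splits with a small overlap so facts straddling a boundary appear whole in at least one chunk), and `ask_llm` is again a stand-in for your real client call, not a real API.

```python
def chunk_document(text, max_chars=8000, overlap=500):
    # Split on paragraph boundaries into chunks small enough that no
    # content sits deep in a dead zone. Assumes no single paragraph
    # exceeds max_chars.
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + 2 + len(p) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry a tail of context forward
        current = (current + "\n\n" + p) if current else p
    if current:
        chunks.append(current)
    return chunks

# Two-pass pattern: extract from each chunk explicitly, then synthesize.
def find_liability_clauses(document):
    findings = []
    for i, chunk in enumerate(chunk_document(document)):
        prompt = (f"Part {i + 1} of a contract. List every liability clause "
                  f"in THIS part only:\n\n{chunk}")
        # findings.append(ask_llm(prompt))
        findings.append(prompt)  # stand-in so the sketch runs without an API key
    return findings
```

A final synthesis call over the per-chunk findings then works on a short context, where recall is reliable, instead of one long context with a dead zone in the middle.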
