AI models degrade in quality as conversations grow longer because attention mechanisms spread unevenly across tokens, causing the model to underweight information buried in the middle of long inputs. This is called context rot. A July 2025 study across 18 models confirmed that every major LLM, including GPT-4.1, Claude 4, and Gemini 2.5, gets measurably worse as input length increases.
Pithy Cyborg | AI FAQs – The Details
Question: Why does AI get dumber the longer your chat gets?
Asked by: Grok 2
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
Why Transformer Attention Fails Across Very Long Conversations
The architecture behind every major AI model, the transformer, works by comparing each new token against every previous token to decide what to pay attention to.
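That token-vs-token comparison can be sketched in a few lines. This is a toy, minimal version of scaled dot-product attention with a causal mask (random matrices stand in for the learned projections a real model trains), just to show the mechanism the article is describing:

```python
import numpy as np

np.random.seed(0)

seq_len, d = 8, 16  # toy sequence: 8 tokens, 16-dim embeddings
x = np.random.randn(seq_len, d)

# In a real transformer, Q, K, V come from learned projections of x.
# Random matrices are stand-ins here.
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Each token's query is compared against every token's key.
scores = Q @ K.T / np.sqrt(d)

# Causal mask: a token may only attend to itself and earlier tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax turns scores into attention weights that sum to 1 per row.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V

# The last token's weight budget is spread across all 8 positions.
# Nothing in the math guarantees the middle gets its fair share.
print(weights[-1].round(3))
```

Note the constraint this makes visible: every row of `weights` sums to exactly 1, so the more tokens there are, the thinner each token's slice of attention can get.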
That sounds thorough. It is not.
In practice, attention is not distributed evenly. Research published in the Transactions of the Association for Computational Linguistics (Liu et al., 2024) documented what they called the “lost in the middle” effect: model accuracy is highest when relevant information appears at the very start or very end of a long input, and drops significantly for anything in the middle.
Think about that for a second. You paste a 10,000-word document into a chat. The instructions you gave the AI at the beginning of your conversation, and the last thing you said, get the most weight. Everything else in between gets increasingly ignored.
The model is not reading your conversation the way you read it. It is skimming it, with structural blind spots baked into the architecture.
The Context Rot Problem That Vendors Won’t Put in the Marketing Copy
AI companies advertise context windows in raw token counts: 128K, 200K, 1 million tokens. What they do not advertise is effective context, the portion of that window where performance actually holds up.
Chroma’s 2025 “Context Rot” study evaluated 18 models and found that performance degrades consistently as input length grows, regardless of claimed context window size.
Adobe Research’s NoLiMa benchmark found that 11 out of 12 models dropped below 50% of their baseline performance at just 32K tokens. Even GPT-4o fell from 99.3% to 69.7% accuracy.
The most uncomfortable finding came from an October 2025 arXiv paper: even with 100% perfect retrieval of relevant information, performance still degraded between 13.9% and 85% as input length increased.
The length itself is the problem. Not what is in it.
And the failure modes differ by model. Claude models tend toward conservative abstention while GPT models show higher hallucination rates when distractors are present. Neither failure mode is announced. Both look like normal output.
When a Long Context Window Actually Helps (And When It’s Just a Number)
Context windows are not useless. For simple retrieval tasks, “find the clause in this contract that mentions termination,” long context works well.
The failure shows up in reasoning tasks. Multi-hop questions, code review across a large file, summarizing a long conversation accurately. These require the model to integrate information from across the whole context, not just retrieve a single needle from a haystack.
Research found that popular LLMs effectively utilize only 10 to 20 percent of their context when complex reasoning is required across long documents.
The practical ceiling varies. Claude models generally hold up better at long context than GPT-4o due to architectural and training differences, but neither is immune once a complex reasoning task pushes past roughly 32K tokens.
The honest version of the marketing pitch would say: “200K token context window, effective for retrieval tasks, degrades significantly for multi-step reasoning past 32K tokens.” Nobody writes that on the product page.
What This Means For You
- Start a fresh conversation when a task changes significantly, because instructions buried in the middle of a long chat are the most likely to be ignored or misapplied by the model.
- Put your most important instructions and context at the very beginning or the very end of your prompt, since attention mechanisms overweight both positions relative to the middle.
- Avoid pasting massive documents into a single chat and asking multi-step questions about them — break the task into smaller focused conversations instead of expecting the model to reason across 50,000 tokens reliably.
- Treat confident answers from a very long conversation with extra skepticism, because the model has no signal that its attention has drifted and will not tell you when it is working from degraded context.
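The third point above, breaking a big document into smaller focused requests, can be sketched as a simple chunking helper. This is a rough sketch, not a library API: `ask_model` is a hypothetical callable wrapping whatever LLM API you use, and word counts are only a crude proxy for tokens.

```python
def chunk_text(text, max_words=2000, overlap=200):
    """Split a long document into overlapping word-based chunks so
    each request stays well inside the model's effective context."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # overlap so facts on a boundary survive
    return chunks

def summarize_long_document(text, ask_model):
    # ask_model(prompt) is a hypothetical stand-in for your API call.
    partials = [ask_model(f"Summarize:\n\n{c}") for c in chunk_text(text)]
    # A short second pass integrates the partial summaries, so the
    # model never has to reason across the full document at once.
    return ask_model("Combine these summaries:\n\n" + "\n\n".join(partials))
```

The overlap between chunks is the design choice that matters: without it, any fact that straddles a chunk boundary is cut in half and lost to both requests.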