Every heavy LLM user hits this wall eventually. The model that gave sharp, precise answers in the first ten messages starts hedging, contradicting itself, losing track of earlier context, and producing noticeably lower-quality responses by message thirty or forty. This is not your imagination and it is not random variation. It is a documented, mechanically explainable degradation that gets worse as conversations grow longer, affects every major LLM in production, and is almost never disclosed in the marketing materials for the context window sizes those products advertise.
Pithy Cyborg | AI FAQs – The Details
Question: Why do LLMs like Claude and ChatGPT get noticeably worse mid-conversation, and what causes quality degradation as context windows fill up during long chat sessions?
Asked by: Perplexity AI
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
Why a Larger Context Window Does Not Mean Uniform Quality Across It
The marketing for large context windows implies a simple relationship: more context means the model can handle more. Claude’s 200k token context window, Gemini’s million-token window, and GPT-4o’s 128k window are all advertised as capabilities without the accompanying disclosure that quality is not uniform across those windows. The model’s ability to attend to and accurately use information degrades as a function of where that information sits in the context and how much total context surrounds it.
The mechanism is the U-shaped attention curve documented in the Stanford and UC Berkeley “Lost in the Middle” research. Transformer attention is not uniformly distributed across the context window. Tokens near the beginning of the context and tokens near the current generation position receive the most reliable attention. Tokens in the middle of a long context compete for attention against a much larger pool of surrounding tokens with no structural positional advantage. In a long conversation, everything you discussed in the middle of the session sits in exactly the region where attention reliability is lowest.
As a conversation grows, the ratio of middle-context to edge-context content grows with it. A ten-message conversation has relatively little content in the degraded middle zone. A forty-message conversation has most of its content there. The model is not getting tired. Its architecture is processing an increasingly unfavorable distribution of information across a context geometry that was never designed for uniform recall across arbitrary lengths.
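The arithmetic behind that shift is easy to make concrete. A toy sketch, assuming (purely for illustration) a fixed "reliable edge" of five messages at each end of the context, with everything between counted as middle zone:

```python
# Toy illustration of how the middle-context share grows with conversation
# length. The five-message edge width is an assumption for illustration,
# not a measured property of any particular model.
EDGE = 5  # assumed number of well-attended messages at each end

def middle_fraction(num_messages: int, edge: int = EDGE) -> float:
    """Fraction of messages that fall in the low-attention middle zone."""
    middle = max(0, num_messages - 2 * edge)
    return middle / num_messages

for n in (10, 20, 40, 80):
    print(f"{n:3d} messages -> {middle_fraction(n):.0%} in the middle zone")
```

Under this assumption a ten-message conversation has nothing in the middle zone, while a forty-message conversation has three quarters of its content there — the geometry shift the paragraph above describes.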
The Three Specific Quality Failures That Appear as Conversations Lengthen
Context degradation does not produce uniform quality reduction. It produces three specific failure patterns that appear in a roughly predictable sequence as conversations extend.
Instruction drift is the first and earliest to appear. Instructions, constraints, and preferences established early in a conversation receive less attention as the conversation grows and those instructions move deeper into the middle context zone. A user who specified a preferred output format in message three, asked the model to avoid a certain approach in message five, or established a specific persona or tone at the conversation’s start will notice those specifications being partially ignored or forgotten by message twenty-five. The model did not decide to ignore the instructions. It is attending to them less reliably because they are now competing for attention against dozens of messages of subsequent context.
Contradiction accumulation is the second. As a model generates more responses in a long conversation, it accumulates a growing set of earlier claims that every later response should remain consistent with. The model’s ability to maintain consistency across the full conversation degrades as the distance between potentially contradictory statements grows. Deep into a long technical conversation, the model may confidently state something in message thirty-five that contradicts a position it took in message twelve, with no awareness of the inconsistency because message twelve is in the low-attention middle zone.
Contextual compression artifacts are the third and most subtle. When a conversation reaches the limits of the context window, the serving infrastructure either truncates earlier messages, applies summarization to compress older context, or uses a sliding window that drops the oldest messages entirely. Each approach introduces artifacts. Truncation removes context the model might need. Summarization compresses nuance and specific details into generalities that the model treats as equivalent to the original. Sliding window removal produces a model that has no memory of the conversation’s early stages and cannot be made aware of what it has lost.
Why Starting a New Conversation Is Often the Correct Technical Decision
The instinct to continue a long conversation rather than start a new one feels natural because it mirrors how human conversation works. Human memory does not reset between sessions and continuity feels like it preserves value. LLM context does not work like human memory and the analogy misleads users into patterns that actively degrade their results.
A fresh conversation gives the model a clean context geometry where everything is either near the beginning or near the current generation position. There is no accumulated middle-context degradation, no instruction drift from early specifications, and no contradiction accumulation from dozens of previous responses. For most tasks, a well-structured prompt in a new conversation that includes the relevant context from previous sessions outperforms continuing a degraded long conversation.
The practical technique is context distillation: at the point where a conversation is producing noticeably lower quality responses, summarize the key decisions, constraints, and conclusions from the conversation and open a new session with that summary as the opening context. The new session gets the benefit of the prior conversation’s substance without the attention degradation that accumulated during it. This is not a workaround. It is the architecturally correct approach to multi-session work with current LLMs.
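The distillation step can be sketched in a few lines. `call_llm` here is a hypothetical stand-in for whatever client function you use to send a message list and get text back; the prompt wording is one reasonable choice, not a prescribed formula:

```python
# Sketch of context distillation: ask the model to compress a degraded
# conversation into a brief, then seed a fresh session with that brief.
# `call_llm` is a hypothetical stand-in for your actual client call.
DISTILL_PROMPT = (
    "Summarize this conversation for a fresh session. List only: "
    "1) decisions made, 2) constraints and preferences established, "
    "3) open questions. Be specific; keep exact names and numbers."
)

def distill_and_restart(old_messages: list[dict], call_llm) -> list[dict]:
    """Return the opening messages for a new session, seeded with a
    distilled summary of the old one."""
    transcript = "\n".join(
        f"{m['role']}: {m['content']}" for m in old_messages
    )
    summary = call_llm(
        [{"role": "user", "content": f"{DISTILL_PROMPT}\n\n{transcript}"}]
    )
    return [{
        "role": "user",
        "content": (f"Context from a previous session:\n{summary}\n\n"
                    "Continue from here."),
    }]
```

The new session then starts with the distilled brief sitting at the very beginning of the context — exactly the high-attention position where instructions and constraints are most reliably honored.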
The context window sizes advertised by AI labs are real capabilities. They are not quality guarantees across their full length. A 200k token context window means the model can process 200k tokens. It does not mean the model processes all 200k tokens with equivalent accuracy. That distinction is buried in technical research papers and absent from every product page that advertises context window size as a headline feature.
What This Means For You
- Start a new conversation rather than extending a degraded one: when response quality drops noticeably mid-session, the fastest fix is context distillation into a fresh conversation rather than attempting to re-establish quality through additional prompting in the same degraded context.
- Front-load your most important instructions and constraints at the very beginning of every conversation rather than establishing them mid-session, because early-context instructions receive more reliable attention throughout the conversation’s lifetime than instructions added after significant context has accumulated.
- Treat long conversations as having a practical quality horizon of roughly 20 to 30 messages for most models at current capability levels, after which the cost of starting fresh with distilled context is lower than the quality cost of continuing in a degraded window.
- Test context degradation in your specific use case by running identical evaluation queries at message 5 and message 35 of a representative conversation, because the degradation rate varies by model, task type, and context content, and measuring it on your actual workflow gives you a practical quality horizon that generic benchmarks do not provide.
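The last bullet's measurement can be sketched as a small harness. `call_llm` and `score` are hypothetical stand-ins: your client function and a task-specific grading function (exact match, rubric score, whatever fits your workflow):

```python
# Sketch of the degradation measurement described above: run the same
# probe query at two depths of a representative conversation and compare
# scores. `call_llm` and `score` are hypothetical stand-ins for your
# client and your task-specific grader.
def probe_at_depth(history: list[dict], depth: int, probe: str,
                   call_llm, score) -> float:
    """Replay the first `depth` messages, append the probe as a user
    turn, and score the model's answer."""
    context = history[:depth] + [{"role": "user", "content": probe}]
    return score(call_llm(context))

def degradation(history: list[dict], probe: str, call_llm, score,
                early: int = 5, late: int = 35) -> float:
    """Positive result = quality lost between message `early` and `late`."""
    return (probe_at_depth(history, early, probe, call_llm, score)
            - probe_at_depth(history, late, probe, call_llm, score))
```

Run this across a handful of representative probes and the average gap gives you the practical quality horizon for your own workflow, rather than a generic 20-to-30-message rule of thumb.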
