Response length compresses in long conversations because the model attends less reliably to early instructions as those instructions sink deeper into the context. A request for thorough, detailed responses made at message two is fighting for attention against forty subsequent messages by message forty-two. The instruction did not disappear. It just stopped winning.
Analysis Briefing
- Topic: Response length compression in extended LLM conversations
- Analyst: Mike D (@MrComputerScience)
- Context: A back-and-forth with Claude Sonnet 4.6 that went deeper than expected
- Source: Pithy Cyborg
- Key Question: Why does the AI that wrote three paragraphs at the start of our chat now write three sentences?
Why Length Instructions Fade While Other Instructions Stay
Not all instructions degrade equally in long contexts. Instructions that are reinforced by each new exchange, like tone and topic focus, stay relatively stable because each message implicitly re-signals them. Length instructions are different. They are stated once at the start and never repeated. By the middle of a long session, the model is generating responses in a context where the length instruction is one data point among hundreds.
The U-shaped attention curve compounds this. Transformer models attend most reliably to tokens near the beginning and end of the context window. Length instructions placed at the start of a long conversation end up in the middle of the growing context, exactly where attention is weakest.
The model is not ignoring the length instruction. It is weighting it against everything else in the context, and everything else now outnumbers it substantially. The statistical pull of recent exchanges, most of which did not explicitly reinforce the length requirement, dominates.
The Recency Pull That Shortens Responses Over Time
There is a second mechanism beyond attention dilution. The model’s own prior responses become evidence of how it should respond.
If early responses were long and later responses drifted slightly shorter, those shorter responses become the apparent norm the model is now working from. Each new response is generated in a context where the most recent examples of the model’s behavior are the shorter ones. The model produces a response consistent with its recent behavior rather than with the instruction given forty messages ago.
This recency pull is self-reinforcing. Slightly shorter responses become the context for even shorter responses. The compression builds gradually rather than arriving all at once, which is why users often do not notice until the responses are dramatically shorter than they were at the start.
The effect is most pronounced in creative, analytical, and research-oriented conversations where length genuinely reflects quality. Conversational exchanges are less affected because short responses are appropriate there anyway.
How to Maintain Length Across a Long Session
Periodic re-injection of the length instruction is the most direct fix. Every fifteen to twenty exchanges, restate the length requirement explicitly. This does not need to be elaborate. A brief “continue giving thorough responses like you did earlier” reanchors the instruction near the current generation position where attention is strongest.
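The re-injection cadence can be automated rather than remembered. Here is a minimal sketch, assuming a chat-style message list of role/content dicts; the reminder wording, the fifteen-exchange cadence, and the function name are all illustrative, not any platform's actual API.

```python
# Illustrative sketch: prefix a length reminder onto every Nth user turn.
# The message schema ({"role", "content"} dicts) is a common convention,
# not tied to any specific provider.

LENGTH_REMINDER = "Continue giving thorough, detailed responses like you did earlier."
REINJECT_EVERY = 15  # user turns between reminders

def with_length_reminder(messages, next_user_text):
    """Append the next user turn, prefixing the length reminder
    every REINJECT_EVERY user turns so it sits near the current
    generation position, where attention is strongest."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    text = next_user_text
    if user_turns > 0 and user_turns % REINJECT_EVERY == 0:
        text = f"{LENGTH_REMINDER}\n\n{next_user_text}"
    return messages + [{"role": "user", "content": text}]
```

The point of the helper is that the reminder rides along inside a normal message rather than becoming a separate meta-exchange, which keeps the conversation's momentum intact.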
System prompts that establish length requirements hold better than user-turn instructions because system-prompt tokens receive more consistent attention weight than user-turn tokens across model versions. If you are using an API or a platform that supports system prompts, length requirements belong there rather than in the first user message.
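The two placements can be shown side by side. This is a hedged sketch using the same generic role/content message convention; the rule text and questions are placeholders, not a real prompt.

```python
# Illustrative contrast of where a length requirement can live.
# Only the placement differs; the schema is a common convention,
# not a specific provider's API.

LENGTH_RULE = "Always give thorough, multi-paragraph responses."

# Weaker: the rule is buried in the first user turn, where it will
# drift toward the middle of the context as the session grows.
user_turn_placement = [
    {"role": "user", "content": LENGTH_RULE + " First question: what is context rot?"},
]

# Stronger: the rule sits in the system prompt, which holds attention
# weight more consistently across a long conversation.
system_placement = [
    {"role": "system", "content": LENGTH_RULE},
    {"role": "user", "content": "First question: what is context rot?"},
]
```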
Starting a fresh session with a distilled context is the strongest fix for conversations that have already compressed significantly. Extract the key conclusions from the current session, open a new conversation with those conclusions plus an explicit length instruction, and the model generates from a clean context where the length requirement has not been buried.
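The distill-and-restart step is mechanical once the conclusions are extracted. A minimal sketch, assuming the key conclusions already exist as a list of strings; the function name and message schema are illustrative.

```python
# Illustrative sketch: seed a fresh conversation from distilled context.
# Assumes the conclusions were extracted by hand (or by asking the old
# session to summarize itself); the schema is a generic convention.

def fresh_session(conclusions, length_rule):
    """Build the opening messages for a new conversation: the length
    rule at system level, plus a compact summary of what the old
    session established, so nothing is buried under forty messages."""
    summary = "Conclusions carried over from the previous session:\n" + \
        "\n".join(f"- {c}" for c in conclusions)
    return [
        {"role": "system", "content": length_rule},
        {"role": "user", "content": summary},
    ]
```

Because the new context contains only the distilled conclusions and the explicit length rule, there is no accumulated run of short responses for the recency pull to latch onto.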
What This Means For You
- Restate length requirements every 15 to 20 exchanges in long sessions rather than assuming a single early instruction holds. Position the re-injection as part of your next message, not a separate meta-conversation.
- Put length instructions in the system prompt if your platform supports it. System prompt instructions maintain attention weight better than user-turn instructions across long conversations.
- Watch for gradual compression as an early signal of broader context rot. Length degradation appears before quality degradation. If responses are getting shorter without prompting, the session is accumulating context noise that will affect quality next.
- Start a fresh session when compression is severe. Distill your key conclusions, open a new conversation, and include an explicit length instruction at the top. Recovery in a fresh context is faster than fighting the recency pull in a polluted one.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
