Adding more few-shot examples to a prompt improves performance up to a point and then degrades it. Beyond roughly three to five examples on most tasks, additional examples begin constraining the model’s output distribution in ways that reduce quality on queries that do not closely match the examples. More examples is not always more guidance. Past a threshold, it is a cage.
Analysis Briefing
- Topic: Few-shot prompting degradation and optimal example count
- Analyst: Mike D (@MrComputerScience)
- Context: Born from an exchange with Claude Sonnet 4.6 that refused to stay shallow
- Source: Pithy Cyborg
- Key Question: How many few-shot examples is too many, and why does adding more hurt?
Why Few-Shot Examples Constrain the Output Distribution
Few-shot examples teach the model the desired input-output mapping by demonstration. The model observes the pattern across examples and generalizes it to new inputs. This works well when the new input closely resembles the examples. It works less well when it does not.
Each example narrows the model’s output distribution toward the demonstrated pattern. With one example, the model has a loose prior. With five examples, the prior is tighter. With fifteen examples, the prior can be tight enough that the model forces new inputs into the demonstrated pattern even when that pattern is not appropriate.
A model with fifteen examples of formal legal document summaries produces formal legal summaries on inputs that are not legal documents and do not need formal summaries. It has learned a strong pattern from the examples and applies it more aggressively than the task requires. The examples stopped being guidance and started being a constraint.
The degradation is most visible on diverse input sets. If your few-shot examples cover a narrow slice of the possible input space and your real queries are more varied, more examples make the mismatch worse. The model becomes better at handling inputs that look like your examples and worse at handling everything else.
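One way to operationalize diversity over volume is greedy farthest-point selection over your candidate examples. A minimal sketch, assuming crude token-overlap (Jaccard) similarity as a stand-in for whatever semantic similarity measure you actually use; `select_diverse` and the candidate pool are hypothetical:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings (crude proxy for semantics)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_diverse(candidates: list[str], k: int = 3) -> list[str]:
    """Greedily pick k examples that are maximally dissimilar to each other,
    so a small set still covers the edges of the input space."""
    chosen = [candidates[0]]  # seed with the first candidate
    while len(chosen) < min(k, len(candidates)):
        # pick the candidate least similar to everything already chosen
        best = max(
            (c for c in candidates if c not in chosen),
            key=lambda c: min(1 - jaccard(c, s) for s in chosen),
        )
        chosen.append(best)
    return chosen

pool = [
    "Summarize this contract clause.",
    "Summarize this contract amendment.",
    "Summarize this casual email thread.",
    "Summarize this bug report.",
]
# The near-duplicate contract example is skipped in favor of varied ones.
print(select_diverse(pool, k=3))
```

The greedy loop is the point, not the similarity function: swap in embeddings and the selection logic stays the same.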
The Recency Bias That Makes Later Examples Dominate
Context position affects how strongly each example influences the model’s output. Examples near the end of the few-shot block sit closest to the query and the generation position, so they receive more reliable attention than examples earlier in the block, a consequence of the same U-shaped attention curve that drives context window degradation.
In a fifteen-example block, the last three examples exert disproportionate influence on the model’s behavior relative to the first twelve. If your examples are not perfectly representative of the full input distribution, and they rarely are, the three examples the model attends to most heavily determine the output pattern for the entire prompt.
Adding more examples to fix a quality problem often makes this worse by pushing the earlier examples further from the generation position. The most recently added examples dominate; the earliest recede. A prompt built iteratively, with a new example appended each time an output looks wrong, ends up governed by examples chosen to patch the last failure case rather than to represent the overall task.
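If recency makes the final examples dominate, example order is a free lever. A minimal sketch, assuming a caller-supplied representativeness score (the `scores` dict here is hypothetical), that places the most representative examples last, nearest the query:

```python
def order_for_recency(examples, score):
    """Sort ascending by representativeness: least typical examples first,
    most typical last, immediately before the query where attention is strongest."""
    return sorted(examples, key=score)

def build_prompt(examples, query):
    """Assemble a standard few-shot block followed by the new query."""
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{shots}\n\nInput: {query}\nOutput:"

examples = [("rare edge case", "..."), ("typical case", "...")]
# Hypothetical scores: higher = more representative of real inputs.
scores = {"rare edge case": 0.2, "typical case": 0.9}

ordered = order_for_recency(examples, score=lambda ex: scores[ex[0]])
print(build_prompt(ordered, "new query"))
```

Any measure of how typical an example is of the real input distribution works as the score; the only commitment is that the best examples end up closest to the generation position.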
The Optimal Few-Shot Count by Task Type
Work on few-shot scaling points toward task-type-dependent optima that differ sharply from the intuitive “more is better” assumption.
Classification tasks with clear categories peak at three to five examples per class. Beyond that, example diversity within a class matters more than example count. Adding more examples of the same category produces diminishing returns faster than adding examples that cover edge cases within each category.
Format demonstration tasks, where the examples teach output structure rather than domain knowledge, typically need one to three examples. Once the format is clear from two or three demonstrations, additional format examples do not add information and begin constraining content generation in ways that reduce quality on varied inputs.
Complex reasoning tasks are the exception. Chain-of-thought prompting benefits from more examples up to roughly eight because each example demonstrates a reasoning pattern rather than just an input-output mapping. The reasoning demonstrations are more diverse and less constraining than format demonstrations, which keeps the per-example marginal value positive for longer.
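The ranges above can be captured as a starting-point lookup. A sketch, not a prescription: the low end for reasoning tasks is an assumption, since the text gives only the rough upper bound of eight, and all of these should yield to measurement on your own input distribution:

```python
# Rough per-task-type example-count ranges from the analysis above.
OPTIMAL_SHOT_RANGE = {
    "classification": (3, 5),  # per class; within-class diversity matters more past this
    "format": (1, 3),          # structure is learned fast; more begins to constrain content
    "reasoning": (4, 8),       # chain-of-thought tolerates more demos; low end is an assumption
}

def suggested_shots(task_type: str) -> int:
    """Return the low end of the range: start small, add only if
    measured quality on representative inputs actually improves."""
    lo, hi = OPTIMAL_SHOT_RANGE[task_type]
    return lo

print(suggested_shots("format"))  # → 1
```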
What This Means For You
- Start with three examples and test before adding more. Measure quality on a representative sample of your actual input distribution before concluding that more examples will help. The degradation threshold is often lower than expected.
- Prioritize example diversity over example count. Three examples that cover the edges of your input distribution outperform ten examples that cluster around the common case. Coverage matters more than quantity.
- Put your most representative examples last in the few-shot block, not first. The recency bias in transformer attention means the final examples exert the strongest influence on output behavior. Your best examples belong nearest the query.
- Use explicit instructions rather than stacks of format examples for format tasks. A clear instruction plus at most one or two demonstrations outperforms ten format examples that constrain content generation without adding information.
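The last recommendation can be sketched concretely. The JSON schema, field names, and demonstration below are hypothetical placeholders; the point is one explicit instruction plus a single demonstration standing in for a stack of format examples:

```python
# One explicit format instruction replaces repeated format demonstrations.
FORMAT_INSTRUCTION = (
    "Respond with a JSON object containing exactly two keys: "
    '"summary" (one sentence) and "sentiment" ("positive", "negative", or "neutral").'
)

# A single demonstration anchors the structure without caging the content.
SINGLE_DEMO = (
    "Input: The release fixed the crash but the UI is still sluggish.\n"
    'Output: {"summary": "The release fixed the crash but left the UI sluggish.", '
    '"sentiment": "negative"}'
)

def format_task_prompt(query: str) -> str:
    """Instruction + one demo + query: the structure is fully specified,
    so extra demonstrations would add constraint, not information."""
    return f"{FORMAT_INSTRUCTION}\n\n{SINGLE_DEMO}\n\nInput: {query}\nOutput:"

print(format_task_prompt("The keynote ran long but the demos landed well."))
```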
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
