Chain-of-thought prompting works by asking the model to reason step by step before producing an answer. For large models, this reliably improves performance on multi-step reasoning tasks. For small models, it can actively hurt performance. The model generates a reasoning chain it cannot actually follow, and the errors in the chain corrupt the final answer.
Analysis Briefing
- Topic: Chain-of-thought prompting, model scale, and reasoning capability limits
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: Why does the same prompting technique that helps a big model hurt a small one?
What Chain-of-Thought Actually Requires From the Model
Chain-of-thought prompting works on a specific assumption: that the model is capable of generating a reasoning chain and that following that chain improves its answer. Both parts of this assumption have to be true for the technique to help.
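To make the technique concrete, here is a minimal sketch of the same question phrased as a direct prompt and as a chain-of-thought prompt. The wording is illustrative, not a canonical template; the chain-of-thought variant only helps if the model can actually execute the reasoning it writes out.

```python
# Illustrative only: one question, two prompt styles.
question = "A train travels 60 miles in 1.5 hours. What is its average speed?"

# Direct prompting: ask for the answer immediately.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompting: ask the model to reason before answering.
cot_prompt = (
    f"{question}\n"
    "Let's think step by step, then state the final answer."
)
```

The only difference is the instruction to reason first; everything that follows in this briefing is about when that instruction helps and when it hurts.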
Large models have enough capacity to generate accurate intermediate reasoning steps. When they write “first, I’ll identify the relevant variables” and then list them, the list is usually correct. The chain scaffolds genuine reasoning that the model can execute.
Small models often cannot generate accurate intermediate steps. They produce plausible-looking reasoning text because that is what chain-of-thought examples in their training data looked like. But the reasoning steps are pattern-matched from training, not derived from the problem at hand. When those steps are wrong, the model then conditions its final answer on a false premise and produces a result that would have been better without the chain.
The Research Finding and What It Means
Research on chain-of-thought (including work from Google Brain that popularized the technique) found consistent improvements primarily for models above roughly 100 billion parameters. Below that threshold, results were mixed to negative on complex reasoning tasks.
The threshold has shifted somewhat as training methods have improved. Modern smaller models are better than older large models at following reasoning chains. But the pattern holds: chain-of-thought is not a free improvement. It is a technique that amplifies reasoning capability when reasoning capability exists, and amplifies errors when it does not.
For models that are genuinely too small for a task, no prompting technique produces reliable results. The question of why chain-of-thought prompts sometimes make small models worse points to a broader pattern: prompting techniques that adjust output style without changing underlying capability can make problems worse by adding a layer of confident-sounding incorrectness.
Better Approaches for Small Models on Reasoning Tasks
Decompose the task externally rather than asking the model to decompose it internally. Break a multi-step reasoning problem into individual single-step questions, route each through the model separately, and combine the answers programmatically. This gives you the benefits of step-by-step reasoning without relying on the model to generate a correct chain.
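A minimal sketch of that pattern is below. Each single-step question goes through the model separately, and earlier answers are fed into later prompts in code rather than inside one reasoning chain. `decompose_and_solve` and `fake_model` are hypothetical names, and the canned answers stand in for a real completion API so the sketch runs offline.

```python
def decompose_and_solve(steps, call_model):
    """Run each single-step prompt in order, feeding earlier answers
    into later prompt templates via str.format placeholders."""
    answers = []
    for template in steps:
        prompt = template.format(*answers)  # fill in prior answers
        answers.append(call_model(prompt).strip())
    return answers[-1]

# Stub model with canned answers, standing in for a real small model.
# The point: each of these single-step questions is far easier for a
# small model than one combined multi-step reasoning chain.
def fake_model(prompt):
    canned = {
        "A shirt costs $20. What is 25% of $20? Answer with a number only.": "5",
        "What is $20 minus $5? Answer with a number only.": "15",
    }
    return canned[prompt]

steps = [
    "A shirt costs $20. What is 25% of $20? Answer with a number only.",
    "What is $20 minus ${0}? Answer with a number only.",
]

final = decompose_and_solve(steps, fake_model)  # "15"
```

In a real pipeline, `call_model` would wrap your model's API, and you would add answer validation between steps, which is exactly the kind of error-checking an internal chain cannot get.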
Use few-shot examples that demonstrate correct reasoning for your specific task type, but keep the examples aligned with tasks the model can actually handle. Showing a small model examples of complex reasoning it cannot replicate reliably may produce worse outputs than simpler direct prompting.
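A few-shot prompt of this kind can be assembled with a small helper like the one below. The helper name and example questions are illustrative; the design point is that the examples stay at a difficulty the model can reliably imitate.

```python
def build_few_shot_prompt(examples, question):
    """examples: list of (question, answer) pairs at the target
    difficulty; question: the new question to answer."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    shots.append(f"Q: {question}\nA:")  # leave the answer for the model
    return "\n\n".join(shots)

# Simple single-step examples, matched to what a small model can handle.
prompt = build_few_shot_prompt(
    [("What is 12 + 7?", "19"), ("What is 30 - 4?", "26")],
    "What is 15 + 9?",
)
```

Swapping in worked multi-step chains here is where the technique backfires on small models: the format gets imitated, the reasoning does not.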
What This Means For You
- Test chain-of-thought on your specific model before assuming it helps. Measure performance with and without it using a small eval set.
- For small models, externalize the reasoning chain by breaking problems into sequential single-step prompts rather than asking for a full internal chain.
- Use chain-of-thought for large frontier models on complex tasks where it consistently adds value, and reserve simpler direct prompting for smaller or specialized models.
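The first bullet above, measuring with and without chain-of-thought, can be as simple as the harness below. `compare_cot` and `fake_small_model` are hypothetical names; the stub simulates the failure mode described earlier, where a small model answers direct questions correctly but derails when told to reason step by step.

```python
COT_SUFFIX = "\nThink step by step, then give only the final answer."

def accuracy(call_model, eval_set):
    """Exact-match accuracy over (question, gold_answer) pairs."""
    hits = sum(call_model(q).strip() == gold for q, gold in eval_set)
    return hits / len(eval_set)

def compare_cot(call_model, eval_set):
    """Score the same eval set with and without the CoT instruction."""
    return {
        "plain": accuracy(call_model, eval_set),
        "cot": accuracy(lambda q: call_model(q + COT_SUFFIX), eval_set),
    }

eval_set = [("What is 6 * 7?", "42"), ("What is 10 - 3?", "7")]

def fake_small_model(prompt):
    # Simulated small-model behavior: correct on direct questions, but
    # the CoT suffix triggers a rambling chain with no clean answer.
    if prompt.endswith(COT_SUFFIX):
        return "First I consider the numbers... the answer might be 40."
    return {"What is 6 * 7?": "42", "What is 10 - 3?": "7"}[prompt]

scores = compare_cot(fake_small_model, eval_set)
```

With a real model behind `call_model`, a comparison like this on even a few dozen held-out questions tells you whether chain-of-thought is paying for itself before you ship it.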
Enjoyed this? Subscribe for more clear thinking on AI:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
