Asking Claude, ChatGPT, or Gemini the same question twice and getting different answers is not a bug, a server error, or evidence that the model is confused. It is the intended behavior of a system deliberately engineered to be non-deterministic. Every major LLM ships with randomness built into its output generation by default, and that randomness is a design choice with a specific technical justification that the products never explain to the people using them.
Pithy Cyborg | AI FAQs – The Details
Question: Why does Claude give different answers to the same question asked twice, and what is the temperature and sampling mechanism that makes LLM outputs non-deterministic by design?
Asked by: GPT-4o
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
The Temperature Parameter That Makes Every LLM Response a Controlled Roll of the Dice
Every time an LLM generates a response, it does not select the single most probable next token at each step. It samples from a probability distribution over its entire vocabulary. At each token position, the model assigns a probability to every possible next token based on the context. A parameter called temperature controls how that probability distribution is shaped before sampling occurs.
At temperature zero, the model always selects the highest-probability token at each step, a strategy known as greedy decoding. The output is deterministic in principle: ask the same question twice with temperature set to zero and you get the same answer both times, word for word. This sounds like the obviously correct setting for accuracy-critical applications, and it often is, though with a significant tradeoff.
At temperature above zero, token selection becomes a weighted random draw. Temperature divides the model's raw scores before they are converted into probabilities: values between zero and one sharpen the distribution toward the highest-probability tokens, temperature one samples proportionally to the raw probabilities, and values above one flatten the distribution so lower-probability tokens become relatively more likely, making the output more varied and creative but less reliable. The randomness is not noise added to the output. It is randomness in the selection process itself, applied at every single token position across the entire response.
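The mechanics can be sketched in a few lines. This is a toy illustration, not any vendor's implementation: temperature divides the logits before the softmax, and the next token is drawn from the resulting weights.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample one token index from raw logits after temperature scaling.

    Toy sketch of the mechanism: real models do this over a vocabulary
    of roughly 100k tokens at every position in the response.
    """
    if temperature == 0.0:
        # Temperature zero degenerates to greedy decoding: always argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]   # T < 1 sharpens, T > 1 flattens
    m = max(scaled)                              # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.1]  # a three-token "vocabulary"
greedy = [sample_token(toy_logits, 0.0, rng) for _ in range(5)]  # always token 0
varied = [sample_token(toy_logits, 1.5, rng) for _ in range(5)]  # weighted draws
```

The greedy run returns the same token on every call; the temperature-1.5 run draws from all three tokens in proportion to their flattened weights, which is exactly the per-token variation the article describes.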
Claude’s default temperature in the claude.ai interface is set above zero. So is ChatGPT’s. So is Gemini’s. The products ship with randomness enabled because deterministic outputs at temperature zero are repetitive, less engaging, and worse at creative and open-ended tasks. The tradeoff is that factual queries get slightly different answers on repeated runs, which most users experience as inconsistency, and which none of the products explains proactively.
Why Non-Determinism Is a Feature for Some Tasks and a Liability for Others
The temperature design choice reflects a genuine tension in what people use LLMs for, and understanding which side of that tension your use case sits on tells you whether the non-determinism is helping or hurting you.
Creative tasks, brainstorming, writing assistance, and open-ended exploration all benefit from temperature above zero. A deterministic model asked to generate ten tagline options would generate the same ten options every time, ranked identically, with no variation across runs. The temperature parameter is what makes the second run produce genuinely different options rather than reshuffling the same list. For these use cases, non-determinism is not a flaw. It is the mechanism that makes the tool useful.
Factual queries, code generation, data extraction, and any task with a single correct answer are harmed by temperature above zero. A model asked to extract the date from a document should return the same date every time the document does not change. A model asked whether a code function contains a bug should give a consistent answer across runs. For these use cases, the randomness introduced by temperature above zero produces variation where consistency is the correct behavior. The same query returning different answers is not creative exploration. It is unreliable output on a task that has a right answer.
The practical implication is that the default temperature settings in consumer LLM interfaces are optimized for the average use case across all users, not for your specific use case. If your use case requires consistency, the default settings are working against you and the products are not telling you that.
What Top-P Sampling and Other Parameters Add to the Non-Determinism Stack
Temperature is the most commonly discussed source of LLM non-determinism but it is not the only one. Two additional sampling parameters compound the effect in ways that even technically sophisticated users rarely account for.
Top-p sampling, also called nucleus sampling, adds a second layer of randomness by restricting the sampling pool to the smallest set of tokens whose cumulative probability exceeds a threshold p. At top-p of 0.9, the model samples only from tokens that together account for 90 percent of the probability mass, discarding the long tail of low-probability tokens before sampling occurs. Because temperature and top-p interact, two runs with identical temperature but different top-p settings produce different output distributions, and the combined effect of both parameters operating simultaneously is not intuitively predictable from either one in isolation.
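As an illustrative sketch of the filter just described (not any serving framework's actual code), the nucleus step keeps the smallest high-probability set and renormalizes before sampling:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize over the surviving nucleus.

    probs: list of (token, probability) pairs; illustrative sketch only.
    """
    ranked = sorted(probs, key=lambda tp: tp[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # nucleus complete; discard the remaining tail
    total = sum(prob for _, prob in kept)
    return [(token, prob / total) for token, prob in kept]

dist = [("the", 0.5), ("a", 0.3), ("an", 0.15), ("zebra", 0.05)]
nucleus = top_p_filter(dist, 0.9)  # the 5% tail token is discarded
```

With this toy distribution and p = 0.9, the three most likely tokens cover 95 percent of the mass, so "zebra" is removed from the pool entirely before any random draw happens.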
Top-k sampling adds a third layer by capping the sampling pool at the k highest-probability tokens regardless of their cumulative probability. Some models and serving configurations apply all three parameters simultaneously. The resulting output distribution is the product of three stacked sampling restrictions, each introducing its own source of variance.
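A matching sketch of the top-k cap, under the same illustrative assumptions; when the parameters are stacked, these filters are typically applied in sequence over the temperature-scaled distribution, though the exact order varies by implementation:

```python
def top_k_filter(probs, k):
    """Cap the sampling pool at the k highest-probability tokens,
    regardless of how much probability mass they cover (toy sketch)."""
    ranked = sorted(probs, key=lambda tp: tp[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return [(token, prob / total) for token, prob in ranked]

dist = [("the", 0.5), ("a", 0.3), ("an", 0.15), ("zebra", 0.05)]
pool = top_k_filter(dist, 2)  # only the top two tokens survive
```

Note the contrast with top-p: top-k keeps exactly two tokens here even though they cover only 80 percent of the mass, while top-p keeps however many tokens the threshold demands.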
The practical consequence for anyone building applications on LLM APIs is that reproducing a specific output requires fixing all three parameters simultaneously, not just temperature. Even then, infrastructure-level sources of non-determinism (floating-point rounding differences across GPU hardware, batching effects in serving frameworks, and model version updates that change the underlying weights) can produce different outputs on identical inputs with identical sampling parameters. Truly deterministic LLM output is harder to guarantee than the temperature-zero setting implies.
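Pinning all three parameters in a request might look like the sketch below. The parameter names (`temperature`, `top_p`, `top_k`) follow the Anthropic Messages API; the model id and prompt are placeholder assumptions, and, per the caveats above, even this configuration does not guarantee bit-identical outputs across hardware or model versions.

```python
# Illustrative request payload with every sampling parameter pinned.
# Parameter names follow the Anthropic Messages API; the model id and
# prompt are placeholder assumptions, not real values.
payload = {
    "model": "claude-example-model",  # hypothetical model id
    "max_tokens": 256,
    "temperature": 0.0,  # greedy decoding: argmax at every step
    "top_p": 1.0,        # no nucleus truncation
    "top_k": 1,          # redundant at temperature 0, but explicit
    "messages": [
        {"role": "user", "content": "Extract the invoice date from this document."}
    ],
}
```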
What This Means For You
- Set temperature to zero for any task with a single correct answer: code generation, data extraction, factual lookup, and classification tasks all benefit from deterministic outputs, and both the Anthropic and OpenAI APIs expose temperature as a configurable parameter that defaults above zero.
- Use the API rather than the chat interface if output consistency across runs matters for your use case, because the API exposes temperature, top-p, and top-k as explicit parameters while consumer chat interfaces apply platform defaults you cannot inspect or override.
- Treat repeated runs as a cheap consistency check on high-stakes outputs: ask the same factual question three times at default temperature and compare the answers. High variance across runs is a more reliable signal of a weak training signal on that topic than any single response’s hedging language.
- Do not interpret different answers to the same question as evidence of model error or confusion: non-determinism at default temperature settings is intended behavior, and the variation you observe across runs reflects the sampling mechanism rather than instability in the model’s underlying knowledge representation.
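The repeated-run consistency check suggested above can be sketched with a hypothetical `ask` callable standing in for a real API call at default temperature:

```python
import itertools
from collections import Counter

def consistency_check(ask, question, runs=3):
    """Ask the same question several times and report agreement.

    `ask` is any callable taking a question string and returning an
    answer string; in practice it would wrap an LLM API call at default
    temperature. Returns (most_common_answer, agreement_fraction).
    """
    answers = [ask(question) for _ in range(runs)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / runs

# Stub standing in for a real model call (assumption for illustration):
# the model answers the same question inconsistently across runs.
replies = itertools.cycle(["March 3", "March 3", "March 5"])
answer, agreement = consistency_check(lambda q: next(replies),
                                      "What date is on the invoice?")
```

Here two of three runs agree on "March 3", giving an agreement fraction of about 0.67; anything well below 1.0 on a factual question is the low-confidence signal the third point describes.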
