Getting refused by an AI model on a request it handled without friction last week is one of the most consistently maddening experiences in the current generation of LLM products. The request did not change. The user did not change. The model apparently did. The explanation is not that the model is being arbitrarily inconsistent or that you did something wrong. It is that every major LLM runs multiple classifier systems in parallel with the language model itself, and those classifiers make probabilistic decisions that vary across sessions, model versions, and prompt phrasings in ways that are neither transparent nor consistent by design.
Pithy Cyborg | AI FAQs – The Details
Question: Why do AI models like Claude and ChatGPT refuse requests they previously helped with, and what is the safety classifier mechanism that makes refusal behavior inconsistent across sessions and model versions?
Asked by: Gemini 2.0 Flash
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
Why Safety Classifiers Are a Separate System From the Language Model You Think You Are Talking To
The mental model most users have of an LLM is a single system that reads their prompt and generates a response. The actual architecture of every major production LLM is more complex, and the additional components are precisely the ones responsible for inconsistent refusals.
Production LLMs run input classifiers that evaluate prompts before the language model processes them, output classifiers that evaluate generated responses before they are returned to the user, and in some deployments intermediate classifiers that monitor reasoning chains during generation. These classifiers are separate models, often smaller and faster than the main language model, trained specifically to identify content categories that violate the platform’s usage policies. They operate on probability thresholds, not binary rules, and they produce scores rather than definitive classifications.
The language model you interact with is the component that generates the actual response text. The classifiers are the components that determine whether that response gets returned to you or replaced with a refusal message. A request that the language model would handle competently and helpfully can be refused before the language model fully processes it if an input classifier scores it above the refusal threshold for a sensitive content category. A response the language model generated completely can be replaced with a refusal message if an output classifier scores it above threshold after generation.
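The pipeline described above can be sketched in a few lines. This is a conceptual illustration only: the function names, thresholds, and toy scoring logic are invented for this sketch and do not reflect any lab's actual implementation.

```python
# Conceptual sketch of the moderation pipeline described above.
# All names, scores, and thresholds here are hypothetical.

REFUSAL_MESSAGE = "I can't help with that request."

def input_classifier_score(prompt: str) -> float:
    # Stand-in for a small, fast classifier model that returns a
    # probability the prompt falls into a restricted category.
    return 0.2 if "benign" in prompt else 0.8

def output_classifier_score(response: str) -> float:
    # Stand-in for the post-generation classifier.
    return 0.1

def language_model(prompt: str) -> str:
    # Stand-in for the main language model.
    return f"Here is a helpful answer to: {prompt}"

def handle_request(prompt: str, threshold: float = 0.75) -> str:
    # 1. The input classifier runs BEFORE the language model sees the prompt.
    if input_classifier_score(prompt) >= threshold:
        return REFUSAL_MESSAGE
    # 2. The language model generates a complete response.
    response = language_model(prompt)
    # 3. The output classifier can still replace that finished response.
    if output_classifier_score(response) >= threshold:
        return REFUSAL_MESSAGE
    return response
```

The key structural point the sketch captures is that a refusal can be produced at step 1 or step 3 without the language model's competence ever being the deciding factor.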
The classifiers and the language model were trained separately, updated on different schedules, and tuned against different objectives. Their interaction produces emergent behavior that neither system’s developers fully predicted, including the inconsistent refusals that users experience as arbitrary or capricious.
The Four Mechanisms That Make Refusal Behavior Inconsistent Across Sessions
Inconsistent refusals are not random. They have specific causes that cluster into four mechanisms, each of which produces a different pattern of inconsistency.
Probabilistic classifier thresholds are the first mechanism. Safety classifiers produce probability scores, not binary decisions. A prompt that scores 0.73 on a sensitive content classifier with a refusal threshold of 0.75 gets through. The same prompt rephrased slightly, processed in a different session context, or evaluated by a classifier that received a minor update scores 0.76 and gets refused. The 0.03 difference is invisible to the user and produces responses that feel like arbitrary inconsistency. Both decisions are correct by the classifier's probabilistic logic. Neither decision feels consistent from the user's perspective.
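The threshold behavior reduces to a single comparison, which is worth seeing in code because it makes the cliff-edge nature of the decision obvious. The numbers match the example in the text; the cutoff value is hypothetical.

```python
# Toy illustration of hard-threshold decisions: two near-identical
# scores on opposite sides of a fixed cutoff produce opposite outcomes.
THRESHOLD = 0.75  # hypothetical refusal cutoff

def decide(score: float) -> str:
    return "refuse" if score >= THRESHOLD else "allow"

print(decide(0.73))  # allow  -- the original phrasing
print(decide(0.76))  # refuse -- a minor rephrasing of the same request
```

Nothing in the user-visible behavior signals how close a score was to the cutoff, which is why an allowed request and a refused one can be separated by noise.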
Model version updates are the second. AI labs update their models continuously, sometimes with changes to the language model weights, sometimes with changes to the classifier thresholds or training data, and sometimes with both simultaneously. A request that worked last week may be refused this week because a classifier update shifted the threshold on a content category that the request touches. The labs do not publish classifier update logs. Users have no mechanism to know whether a new refusal reflects a changed model or an edge case in an unchanged one.
Session context contamination is the third. Classifiers evaluate prompts not just in isolation but in the context of the conversation they appear in. A benign request that follows a series of messages the classifier scored as borderline inherits elevated risk scores from the conversation context. The same request made as the opening message of a fresh conversation scores lower because the context is clean. This is why the same request sometimes works in a new conversation after being refused in an existing one, a behavior users discover empirically and find baffling because the request text is identical.
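One way to picture context contamination is a classifier that blends the current message's score with a running risk estimate from earlier turns. This is a hypothetical model of the behavior, not a documented mechanism; the blend weight and all scores are invented for illustration.

```python
# Hypothetical sketch: an identical message scores higher inside a
# "contaminated" session than as the opener of a fresh conversation,
# because the classifier blends in risk carried over from prior turns.

def contextual_score(message_score: float, prior_scores: list[float],
                     context_weight: float = 0.4) -> float:
    if not prior_scores:
        # Fresh conversation: the message is judged on its own.
        return message_score
    # Carry forward the riskiest earlier turn (invented blending rule).
    context_risk = max(prior_scores)
    return (1 - context_weight) * message_score + context_weight * context_risk

fresh = contextual_score(0.70, [])                  # 0.70 -> under a 0.75 cutoff
contaminated = contextual_score(0.70, [0.85, 0.90]) # 0.78 -> over the same cutoff
```

Under this toy model the same 0.70-scoring request is allowed in a clean session and refused after two borderline turns, which is exactly the fresh-conversation effect users discover empirically.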
Phrasing sensitivity is the fourth and most exploitable mechanism, which is precisely why the labs do not document it clearly. Classifier models are sensitive to surface-level phrasing features that are semantically irrelevant to the underlying request. A request phrased in passive voice may score differently from the same request in active voice. A request that includes certain trigger words scores higher than a semantically equivalent request that avoids them. A request framed as a hypothetical or fictional scenario may score differently from the same request framed as a direct query. The semantic content is identical. The classifier scores differ. The user experiences inconsistency.
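Phrasing sensitivity can be caricatured with a bag-of-words scorer. Real classifiers are neural models, not keyword lists, but the cartoon below captures the observable effect: specific surface tokens shift the score on semantically equivalent requests. Every weight and word here is invented for illustration.

```python
# Cartoon of surface-level phrasing sensitivity (hypothetical weights;
# production classifiers are neural models, not keyword matchers).
TRIGGER_WEIGHTS = {"exploit": 0.4, "attack": 0.3, "bypass": 0.3}

def surface_score(prompt: str, base: float = 0.4) -> float:
    words = prompt.lower().split()
    return min(1.0, base + sum(TRIGGER_WEIGHTS.get(w, 0.0) for w in words))

direct   = surface_score("how do I exploit this bug")  # 0.8 -> refused at 0.75
reworded = surface_score("how do I trigger this bug")  # 0.4 -> allowed
```

The two prompts ask for the same thing; only the token choice differs, and that difference alone moves the score across the threshold.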
Why the Labs Cannot Simply Fix This and What They Are Actually Doing About It
The inconsistent refusal problem is not an engineering oversight. It is a fundamental tension between two objectives that cannot be simultaneously optimized: catching genuinely harmful requests reliably and avoiding false positives on legitimate requests. Every adjustment that makes classifiers more sensitive to harmful content increases false positive rates on legitimate content. Every adjustment that reduces false positives on legitimate content reduces sensitivity to genuinely harmful requests.
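The tradeoff is easy to demonstrate numerically: sweep the refusal threshold over a labeled set of classifier scores and watch false negatives and false positives move in opposite directions. The scores and labels below are fabricated for illustration; real calibration uses large evaluation sets.

```python
# Toy threshold sweep showing the false-negative / false-positive tradeoff.
# Scores and labels are fabricated for illustration.
scores_harmful = [0.60, 0.80, 0.90, 0.95]  # requests that should be refused
scores_legit   = [0.10, 0.30, 0.70, 0.78]  # requests that should be allowed

def rates(threshold: float) -> tuple[int, int]:
    false_negatives = sum(s < threshold for s in scores_harmful)  # harmful allowed
    false_positives = sum(s >= threshold for s in scores_legit)   # legit refused
    return false_negatives, false_positives

for t in (0.5, 0.75, 0.85):
    fn, fp = rates(t)
    print(f"threshold={t}: {fn} harmful slip through, {fp} legitimate refused")
```

At 0.5 no harmful request slips through but two legitimate ones are refused; at 0.85 no legitimate request is refused but two harmful ones slip through. No threshold in between achieves zero of both, because the score distributions overlap.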
The labs are aware of this tension and address it through continuous threshold tuning, classifier retraining on new examples, and in Anthropic’s case through Constitutional AI approaches that attempt to incorporate safety reasoning into the language model itself rather than relying entirely on external classifiers. None of these approaches eliminates the inconsistency. They shift the tradeoff curve between false positive and false negative rates without resolving the underlying tension.
What the labs will not tell you directly is that the refusal threshold calibration reflects commercial and reputational risk assessment as much as harm prevention logic. A false negative, where a genuinely harmful request gets through, creates reputational and potentially legal liability. A false positive, where a legitimate request gets refused, creates user frustration. The calibration that minimizes false negatives at the cost of elevated false positives is a rational commercial decision that happens to produce the inconsistent refusal experience users find frustrating. The refusal behavior is not miscalibrated relative to the lab’s actual objective. It is miscalibrated relative to what users assume the objective is.
What This Means For You
- Open a new conversation before concluding a request is permanently refused: session context contamination means the same request that was refused in a conversation with borderline earlier messages may succeed as the opening message of a fresh session with no prior context to inflate the classifier score.
- Rephrase refused requests by changing framing rather than intent: passive versus active voice, hypothetical versus direct framing, and avoidance of specific trigger vocabulary all affect classifier scores on semantically identical requests. Rephrasing a legitimate request is not circumventing safety systems; it is navigating classifier surface sensitivity.
- Expect increased refusal rates after model updates even on requests that previously succeeded, because classifier threshold adjustments are deployed continuously without user notification, and a changed refusal on an unchanged request more often reflects an updated classifier than a changed underlying policy.
- Use the feedback mechanism on false positive refusals by clicking the thumbs down on refused responses that are clearly legitimate, because classifier retraining depends on labeled examples of false positives and user feedback is one of the primary sources of that training signal for every major LLM platform.
