Llama 4 Scout running at Q4_K_M quantization on Ollama produces token repetition loops that the same model at Q6_K or Q8_0 does not produce on identical prompts. This is not a configuration error, a corrupted download, or an Ollama bug. It is a predictable consequence of how aggressive quantization degrades the precision of the weight matrices the repetition penalty implicitly depends on, and it is amplified by Scout's MoE architecture in ways that standard quantization guidance for dense models does not anticipate.
Pithy Cyborg | AI FAQs – The Details
Question: Why does Llama 4 Scout repeat tokens and phrases after Q4_K_M quantization on Ollama, and what is the mechanism that makes MoE architecture models more vulnerable to repetition loops at aggressive quantization levels?
Asked by: Claude Sonnet 4.6
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
Why Repetition Penalty Enforcement Degrades Under Aggressive Quantization
Repetition penalty is a sampling-time constraint that reduces the probability of tokens that have already appeared in the generated sequence. Before the softmax and sampling steps, the logits of previously seen tokens are rescaled by the penalty; in the convention most samplers inherit from the CTRL paper, positive logits are divided by the penalty and negative logits are multiplied by it, so a penalized token always becomes less likely. Higher penalty values make repetition less likely; lower values allow it.
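As a concrete sketch, here is that logit rescaling in the CTRL-paper convention that llama.cpp and most samplers use (the function name and values are illustrative):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Rescale the logits of tokens already present in the output.

    Positive logits are divided by the penalty and negative logits
    multiplied by it, so the penalized token always becomes less likely.
    """
    penalized = logits.copy()
    for tok in set(generated_ids):
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized

logits = np.array([2.0, -1.0, 0.5])
out = apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=2.0)
# token 0: 2.0 -> 1.0, token 1: -1.0 -> -2.0, token 2 untouched
```

Note that simply multiplying every seen token's logit by a factor would make negative logits *more* likely, which is why the sign-dependent convention exists.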
The penalty itself is a simple scalar operation on logits. What is not simple is the underlying model state that determines which tokens are candidates for repetition in the first place. That determination depends on the attention mechanism’s ability to accurately track which tokens appeared earlier in the sequence and weight them appropriately in the current generation step. The attention computation is where quantization error enters the repetition loop problem.
Q4_K_M quantization compresses 16-bit floating point weight values to approximately 4-bit integer representations using a mixed-precision scheme that applies different quantization levels to different weight matrices based on their estimated sensitivity. The key and value projection matrices in the attention layers, the weights most directly involved in tracking token history across sequence positions, are among the matrices where Q4_K_M introduces the most precision loss relative to Q6_K and Q8_0.
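A toy version of block quantization makes the precision loss concrete. This is a deliberately simplified symmetric per-block scheme, not llama.cpp's actual k-quant superblock format, but it shows how reconstruction error grows as the bit width shrinks:

```python
import numpy as np

def block_quantize(weights, bits, block_size=32):
    """Toy symmetric per-block quantize/dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 127 for 8-bit
    w = weights.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(w / scales), -qmax - 1, qmax)
    return (q * scales).reshape(weights.shape)   # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
err4 = np.abs(block_quantize(w, bits=4) - w).mean()
err8 = np.abs(block_quantize(w, bits=8) - w).mean()
# The 4-bit round trip loses roughly an order of magnitude more
# precision than the 8-bit one, since the rounding step size scales
# inversely with the integer range.
```

The real Q4_K_M scheme adds per-superblock scale hierarchies and mixed per-tensor bit widths on top of this idea, but the error-versus-bits relationship is the same.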
When those matrices accumulate quantization error, the model’s attention over its own generation history becomes less precise. Tokens that appeared earlier in the sequence receive attention weights that are noisier and less reliably suppressed. The repetition penalty fires on the logit values, but the logit values themselves reflect an imprecise attention computation over the token history. The result is a penalty that is nominally applied but operating on a corrupted representation of what has already been generated. At specific sequence lengths and token distributions, this produces the repetition loops that Q4_K_M users experience and Q6_K users do not.
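The attention side of this can be simulated directly: perturbing cached keys with small noise, a stand-in for 4-bit rounding error in the key projection output, measurably shifts attention mass between history positions. All values here are synthetic:

```python
import numpy as np

def attention_over_history(query, keys):
    """Softmax attention of the current query over cached keys."""
    scores = keys @ query / np.sqrt(len(query))
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(1)
query = rng.standard_normal(16)
keys = rng.standard_normal((8, 16))      # 8 earlier sequence positions
# Stand-in for quantization error accumulated in the key projections:
keys_noisy = keys + 0.1 * rng.standard_normal(keys.shape)

clean = attention_over_history(query, keys)
noisy = attention_over_history(query, keys_noisy)
drift = np.abs(clean - noisy).sum()      # attention mass moved off target
```

The drift is the quantity that matters for repetition: attention mass that leaks away from the positions holding the just-generated phrase is mass the model no longer uses to recognize that it is repeating itself.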
Why Llama 4 Scout’s MoE Architecture Makes This Worse Than Dense Models
Llama 4 Scout is a mixture-of-experts model with 16 experts and 17 billion active parameters out of 109 billion total. At each token generation step, a routing mechanism selects which subset of experts processes that token. The routing decision is itself a learned function computed from the current hidden state.
Quantization error in the routing mechanism produces a specific failure mode that dense models do not have: inconsistent expert selection across similar token contexts. A dense model with quantization error produces noisy logits but applies those noisy logits through the same weight matrices every time. An MoE model with quantization error in the routing layer may select different experts for tokens that a full-precision model would route identically, and may select the same experts repeatedly for tokens that should trigger expert diversity.
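A hand-constructed example shows why routing is fragile. When two experts' router logits are nearly tied, a perturbation on the scale of 4-bit rounding error is enough to flip which expert wins top-1 selection (the numbers below are illustrative, not taken from Scout's actual router):

```python
import numpy as np

def top1_expert(router_logits):
    """Top-1 expert selection from router logits (gating softmax omitted)."""
    return int(np.argmax(router_logits))

# Near-tie: experts 0 and 1 score almost identically, as routinely
# happens for tokens near an expert boundary.
full_precision = np.array([0.50, 0.49, -1.2, -0.8])
# A small perturbation, standing in for rounding error in the
# quantized router weights, flips the winner from expert 0 to 1:
perturbed = full_precision + np.array([-0.02, 0.02, 0.0, 0.0])
```

Dense models have no analogous discrete decision: their noisy logits degrade output smoothly, while a flipped argmax swaps in an entirely different set of expert weights.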
The combination of degraded attention history tracking and inconsistent expert routing under Q4_K_M quantization creates a compounding failure path. The attention mechanism imprecisely represents the token history. The routing mechanism imprecisely selects which experts process the current token. The expert that gets selected may have been the same expert that processed the previous several tokens due to routing degradation. That expert produces output distributions that are locally coherent but globally repetitive because it is being applied in a context where the model’s representation of what has already been said is degraded and the diversity mechanism that should vary the computation is misfiring.
This is why the same quantization level that works adequately for dense models like Mistral 7B produces more severe repetition artifacts on Scout. The MoE routing layer adds a second quantization-sensitive component to the repetition failure path that dense model quantization guidance does not account for.
The Fixes That Work and the Ones That Do Not
The fix most users try first is increasing the repetition penalty parameter in their Ollama Modelfile or API call. Setting repeat_penalty to 1.2 or 1.3 instead of the default 1.1 reduces the frequency of repetition loops by increasing the logit suppression applied to previously seen tokens. This works as a partial mitigation. It does not fix the underlying attention and routing degradation. It compensates for imprecise attention history tracking by applying a heavier penalty across the board, which reduces output diversity as a side effect and can produce clipped, terse responses on prompts that benefit from elaboration.
The fix that actually addresses the mechanism is quantization level selection. Q6_K retains significantly more precision in the key and value projection matrices and the routing layer than Q4_K_M. On Scout’s 109B total parameter MoE architecture, the VRAM cost of moving from Q4_K_M to Q6_K is approximately 30 to 35 percent higher. On a 24GB card, Q4_K_M Scout fits with room for context. Q6_K Scout requires offloading layers to CPU or a second GPU, which introduces latency. The repetition loop tradeoff is real and hardware-constrained.
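The 30 to 35 percent figure follows from the effective bits-per-weight of each scheme. The values below are approximate llama.cpp figures and vary slightly by architecture:

```python
# Approximate effective bits-per-weight of llama.cpp k-quant schemes
# (exact values differ slightly per model):
bpw = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.59, "Q8_0": 8.50}

growth = bpw["Q6_K"] / bpw["Q4_K_M"] - 1.0
print(f"Q6_K weights are ~{growth:.0%} larger than Q4_K_M")
```

Weight-file size scales linearly with bits-per-weight, so the same ratio applies to VRAM consumed by the weights (context cache is extra on top of either).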
The practical recommendation for users on 24GB single-GPU setups is Q5_K_M, which sits between Q4_K_M and Q6_K on the precision-VRAM tradeoff and eliminates most repetition artifacts on Scout without requiring CPU offloading. Users on 48GB dual-GPU setups or Mac hardware with unified memory above 48GB should run Q6_K or Q8_0 and avoid Q4_K_M on Scout entirely.
Increasing the context length parameter beyond what is needed for the current task also worsens repetition artifacts at Q4_K_M by giving the degraded attention mechanism more history to mistrack. Setting num_ctx to match the actual prompt and expected response length rather than a large default reduces the severity of quantization-induced repetition on long generation tasks.
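Both parameter-level mitigations can be set in a single Ollama Modelfile. The model tag and num_ctx value below are illustrative; check `ollama list` for the exact name in your local library and size the context to your task:

```
FROM llama4:scout
PARAMETER repeat_penalty 1.15
PARAMETER num_ctx 8192
```

Build the variant with `ollama create scout-tuned -f Modelfile` and run it in place of the stock tag; the same two options can also be passed per-request in the `options` object of the Ollama API.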
What This Means For You
- Switch from Q4_K_M to Q5_K_M on Llama 4 Scout if you are on a 24GB single GPU and experiencing repetition loops, because Q5_K_M fits within 24GB VRAM on Scout with minimal CPU offloading while eliminating most of the attention precision degradation that causes repetition artifacts at Q4_K_M.
- Do not increase repeat_penalty above 1.15 as a primary fix for quantization-induced repetition loops, because the penalty increase compensates for degraded attention history tracking by suppressing token diversity globally, which reduces response quality on elaborative tasks in ways that are harder to notice than the repetition loops it partially fixes.
- Set num_ctx to match your actual use case rather than maximizing it to the model’s theoretical context window, because longer context windows require the degraded attention mechanism to track more token history and worsen quantization-induced repetition artifacts at Q4_K_M on Scout’s MoE architecture.
- Treat Q4_K_M as a VRAM emergency quantization level for Scout specifically, not a quality-neutral size reduction: the same quantization level that runs adequately on dense models introduces MoE routing degradation on Scout that produces output quality issues beyond repetition, including inconsistent reasoning depth across similar prompts, that Q5_K_M and above avoid.
