Fine-tuning Llama 4 Scout with a LoRA adapter and then quantizing the base model to Q4_K_M or Q5_K_M produces degraded outputs that look like adapter failure but are not. The adapter is intact. The base model weights are intact. The problem is that quantization shifts the activation distributions the adapter was trained to modify: the adapter applies deltas calibrated for full-precision activations to quantized activations that no longer match the distribution it learned. The result is an adapter that is technically correct and functionally broken, and the standard debugging steps (redownloading the adapter, checking merge settings, retrospectively adjusting the learning rate) address none of the actual cause.
Pithy Cyborg | AI FAQs – The Details
Question: Why do LoRA adapters trained on full-precision Llama 4 base models produce degraded outputs after post-training quantization, and what is the activation distribution shift mechanism that breaks adapter compatibility?
Asked by: Claude Sonnet 4.6
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
What LoRA Adapters Actually Do and Why Quantization Breaks the Assumption They Depend On
LoRA (low-rank adaptation) fine-tunes a model by adding small trainable weight matrices alongside the frozen base model weights rather than updating the base weights directly. During training, the adapter matrices learn to produce activation deltas that shift the base model’s behavior toward the fine-tuning objective. Those deltas are calibrated against the activation values the full-precision base model produces at each layer.
The critical assumption LoRA training makes is that the base model activations it saw during fine-tuning are the activations the adapter will see during inference. That assumption holds when the base model runs at the same precision during inference as during training. It breaks when the base model is quantized after the adapter is trained.
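The mechanism above can be sketched in a few lines of numpy. This is a minimal illustration, not an implementation of any specific framework: the dimensions, rank, and scaling factor `alpha` are illustrative values, and the adapted forward pass follows the standard LoRA formulation of base projection plus a scaled low-rank correction.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                       # hidden size and LoRA rank (illustrative)
W = rng.normal(size=(d, d))       # frozen base weight, full precision
A = rng.normal(size=(r, d)) * 0.1 # trainable down-projection
B = rng.normal(size=(d, r)) * 0.1 # trainable up-projection
alpha = 2.0                       # LoRA scaling factor (illustrative)

def lora_forward(W, x):
    """Base projection plus the low-rank correction (alpha/r) * B @ A @ x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
h_base = W @ x                # the activation baseline seen during training
h_adapted = lora_forward(W, x)

# The adapter contributes only the delta on top of the base activation;
# that base activation is the reference point the delta was trained against.
delta = h_adapted - h_base
```

Note that the adapter's contribution is purely additive: if the base activation `h_base` shifts, the same `delta` lands in a different place, which is exactly the failure mode described next.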
Quantization replaces the full-precision floating point weights with lower-precision integer representations. The quantized weights produce activation values that differ from full-precision activations in ways that are small on average but systematically biased at specific layers and specific input distributions. Those biases are not random noise. They are structured errors that reflect the quantization scheme’s specific precision tradeoffs across different weight matrices.
The adapter’s delta matrices were trained to produce corrections relative to the full-precision activation baseline. When the base model is quantized, the baseline shifts. The adapter applies corrections calibrated for a baseline that no longer exists. The resulting outputs are the sum of the quantized base model’s activations and corrections designed for a different activation regime. The mismatch produces outputs that are degraded in ways that correlate with the adapter’s training domain: the fine-tuning target behavior is partially lost, partially distorted, and partially replaced by artifacts from the baseline mismatch.
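The baseline shift is easy to demonstrate with a toy quantizer. The sketch below uses symmetric per-tensor round-to-nearest quantization as a simplified stand-in for block-wise schemes like Q4_K_M; the point it shows, that quantized weights produce a shifted activation baseline that the adapter's fixed correction is then added on top of, carries over.

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize_dequantize(W, bits=4):
    """Symmetric per-tensor round-to-nearest quantization. A toy stand-in
    for schemes like Q4_K_M, which are block-wise and more sophisticated."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale) * scale

d = 64
W = rng.normal(size=(d, d)) / np.sqrt(d)
W_q = quantize_dequantize(W, bits=4)

x = rng.normal(size=d)
h_fp = W @ x    # the baseline the adapter was calibrated against
h_q = W_q @ x   # the baseline it actually sees after quantization

# An adapter delta tuned for h_fp is now added on top of h_q, so the final
# activation is off by exactly this shift, and the error compounds per layer.
baseline_shift = h_q - h_fp
```

Running this shows a nonzero, structured shift at every layer the quantizer touches, which is why the adapter's corrections land in the wrong place even though both the adapter and the weights are individually intact.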
Why Llama 4 Scout’s MoE Architecture Makes This Worse
Llama 4 Scout’s mixture-of-experts architecture adds a dimension to the LoRA quantization compatibility problem that dense model guidance does not address. LoRA adapters trained on Scout apply their delta matrices to the activations produced by whichever experts the routing mechanism selected for each token. The adapter learns to produce corrections that assume specific expert routing patterns across the training distribution.
Quantizing the base model after LoRA training introduces quantization error into the routing mechanism as well as the expert weight matrices. The routing errors described in the previous piece on token repetition apply here with equal force: the quantized routing mechanism selects experts inconsistently relative to the full-precision routing the adapter was trained against. The adapter encounters activations from expert combinations it never saw during training, applies corrections calibrated for different expert activations, and produces outputs that reflect both the activation baseline mismatch and the routing mismatch simultaneously.
The compounding effect is that LoRA adapters on quantized Scout underperform not just in the adapter’s fine-tuning domain but across general tasks, because the routing mismatch affects all token processing rather than only the tokens most relevant to the fine-tuning objective. An adapter trained to improve Scout’s performance on legal document summarization will exhibit routing mismatch degradation on general conversation tasks as well, because the routing errors affect every inference step regardless of the task domain.
This is why LoRA adapters on quantized Scout frequently produce outputs that seem inconsistently degraded across tasks rather than specifically degraded in the fine-tuning domain. The inconsistency is not random. It tracks the routing mismatch distribution, which varies by token context rather than by task category.
The Three Fixes and Their Real Tradeoffs
Quantization-aware fine-tuning is the architecturally correct fix. Rather than training the LoRA adapter on the full-precision base model and quantizing afterward, QAT trains the adapter while the base model runs in its target quantized precision. The adapter learns delta matrices calibrated against the quantized activation baseline rather than the full-precision baseline. The resulting adapter is compatible with the quantized base model by construction rather than by approximation.
QAT requires more setup than standard LoRA fine-tuning and is not the default workflow in most fine-tuning frameworks. bitsandbytes supports QAT through its QLoRA implementation, which trains adapters against a 4-bit quantized base model using double quantization for the quantization constants. Unsloth’s QLoRA implementation adds optimized CUDA kernels that reduce QAT training time significantly on consumer hardware. If your fine-tuning objective is important enough to invest in a custom adapter, QAT is the approach that eliminates the post-quantization compatibility problem rather than mitigating it.
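A typical QLoRA setup with transformers, peft, and bitsandbytes looks roughly like the config fragment below. The model ID, rank, alpha, and target module names are illustrative assumptions, not values prescribed by the text; check your framework's documentation for the module names Scout actually exposes.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NF4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,   # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                              # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The adapter now trains against 4-bit base activations, so the baseline it
# learns to correct is the same one it will see at quantized inference.
```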
Post-hoc adapter recalibration is the partial fix for teams that have already trained adapters on full-precision models and cannot retrain. Running a short recalibration fine-tuning pass of the existing adapter against the quantized base model, using a small representative sample of the original fine-tuning data, allows the adapter matrices to adjust toward the quantized activation baseline without full retraining. The recalibration pass requires a fraction of the original training compute and recovers a significant portion of the post-quantization degradation. It does not fully recover the full-precision adapter performance because the adapter’s rank and capacity constrain how completely it can adjust to the new baseline.
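The recalibration idea, and its rank-imposed limit, can be shown in closed form. The sketch below substitutes an SVD projection for the short fine-tuning pass described above, under the assumption that recalibration amounts to fitting a new rank-r delta toward the quantized baseline; a real recalibration run would use gradient descent on actual training data instead.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 64, 4

W = rng.normal(size=(d, d)) / np.sqrt(d)  # full-precision base
B = rng.normal(size=(d, r)) * 0.1         # adapter trained against W
A = rng.normal(size=(r, d)) * 0.1

def quantize_dequantize(W, bits=4):
    # Toy symmetric round-to-nearest quantizer.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale) * scale

W_q = quantize_dequantize(W)
target = W + B @ A   # the effective weights the fine-tune established

# Stale adapter: old delta applied on top of the quantized base.
err_stale = np.linalg.norm(W_q + B @ A - target)

# Recalibration: best rank-r delta toward the quantized baseline, via SVD
# (a closed-form stand-in for the short fine-tuning pass in the text).
D = target - W_q
U, S, Vt = np.linalg.svd(D, full_matrices=False)
B_new = U[:, :r] * S[:r]
A_new = Vt[:r]
err_recal = np.linalg.norm(W_q + B_new @ A_new - target)
# err_recal < err_stale, but err_recal > 0: the rank-r capacity constraint
# caps how completely the adapter can adjust to the new baseline.
```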
Higher quantization levels reduce the degradation without eliminating it. Q5_K_M and Q6_K produce activation distributions closer to full-precision FP16 than Q4_K_M, which reduces the baseline mismatch the adapter must bridge. For adapters where full retraining is not feasible and recalibration is not practical, moving from Q4_K_M to Q6_K recovers meaningful adapter performance at the cost of additional VRAM. On Scout’s MoE architecture, Q6_K is the minimum quantization level that produces activation distributions close enough to FP16 that well-trained adapters function with acceptable degradation.
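The precision-versus-mismatch tradeoff follows directly from the quantization step size. Using the same toy uniform quantizer as before (again standing in for the block-wise K-quants), the RMS activation error against the full-precision baseline shrinks as the bit width grows:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 128
W = rng.normal(size=(d, d)) / np.sqrt(d)
X = rng.normal(size=(d, 256))   # a batch of input activations

def quantize_dequantize(W, bits):
    # Toy symmetric round-to-nearest quantizer.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale) * scale

def baseline_rmse(bits):
    """RMS activation error versus full precision at a given bit width."""
    return np.sqrt(np.mean((quantize_dequantize(W, bits) @ X - W @ X) ** 2))

# Each extra bit roughly halves the quantization step, so the activation
# baseline the adapter must bridge shrinks monotonically with precision.
errs = {bits: baseline_rmse(bits) for bits in (4, 5, 6)}
```

This is the mechanism behind the Q4_K_M → Q6_K recommendation: a smaller baseline shift leaves less mismatch for a full-precision-trained adapter to absorb.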
What This Means For You
- Use QLoRA rather than standard LoRA for any fine-tuning task where the resulting adapter will be deployed against a quantized base model, because QLoRA trains the adapter against the quantized activation baseline by construction and eliminates the post-quantization compatibility problem that standard LoRA fine-tuning creates.
- Run a recalibration pass on existing full-precision adapters before deploying them against a quantized Scout base model by fine-tuning for 50 to 100 steps against the quantized model using a representative subset of your original training data, because even a short recalibration pass recovers significant adapter performance degraded by activation baseline mismatch.
- Do not diagnose post-quantization adapter degradation as adapter corruption or merge failure before testing the adapter against the full-precision base model, because an adapter that performs correctly against FP16 and degrades against Q4_K_M has a quantization compatibility problem rather than an adapter integrity problem and the debugging steps for each cause are completely different.
- Set Q6_K as the minimum quantization floor for production deployments of quantized Scout with LoRA adapters, treating Q4_K_M and Q5_K_M as development and testing quantization levels where adapter performance degradation is acceptable and Q6_K as the level where adapter compatibility becomes reliable enough for production use cases.
