Quantizing a model requires holding the full-precision weights in memory alongside the quantization machinery itself. Peak VRAM during quantization can run two to three times what the finished quantized model needs for inference. If you sized your hardware for inference, you sized it wrong for quantization.
Analysis Briefing
- Topic: Quantization VRAM requirements vs. inference
- Analyst: Mike D (@MrComputerScience)
- Context: A technical briefing developed with Claude Sonnet 4.6
- Source: Pithy Cyborg
- Key Question: Why does quantizing a model OOM when running it doesn’t?
Why Quantization Loads the Full Model Before Compressing It
Quantization is not a streaming operation. It does not process the model layer by layer while keeping memory usage flat.
Most quantization tools, including llama.cpp's quantize binary and bitsandbytes, load the entire set of full-precision FP16 or BF16 weights into memory first. Only then does compression begin. During that window, the device doing the work (GPU VRAM, or system RAM for CPU-based tools) holds both the original weights and the quantization infrastructure simultaneously.
For Llama 4 Scout at FP16, that means approximately 218GB of weight data before the quantization process adds its own overhead. A single 24GB card cannot hold that. Neither can two. The finished Q4_K_M file is around 55GB. The gap between 55GB and 218GB is where the OOM lives.
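The 218GB figure is just parameter count times bytes per parameter. A minimal sketch of that arithmetic (the 109B parameter count comes from the article; decimal gigabytes for simplicity):

```python
# Back-of-the-envelope weight-size math for full-precision checkpoints.
def fp16_weight_gb(params_billion: float) -> float:
    """FP16/BF16 stores 2 bytes per parameter, so each billion
    parameters costs 2 GB (decimal) before any overhead."""
    return params_billion * 2.0

# Llama 4 Scout: ~109B total parameters at FP16.
print(fp16_weight_gb(109))  # -> 218.0 (GB of raw weights alone)
```

The finished Q4_K_M file at ~55GB works out to roughly 4 bits per weight, which is why the inference footprint is a quarter of what quantization needs.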
The Hidden Overhead Nobody Puts in the Setup Guide
The raw weight size is not the only VRAM consumer during quantization. Three additional cost centers appear simultaneously.
Calibration data processing requires holding activation samples in memory to compute quantization scales and zero points. This is especially true for GPTQ and AWQ quantization methods, which need representative input samples to calibrate the compression accurately. Those samples and their intermediate activations consume VRAM alongside the weights.
The output buffer for the quantized weights accumulates in memory before being written to disk. During the write window, both the original and the compressed representations coexist in memory. On large models this overlap is significant.
Quantization kernels themselves consume CUDA context memory, typically 500MB to 1.5GB depending on the method and framework. Small relative to the weights, but enough to push an already-tight setup over the limit.
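The three cost centers above can be combined into a rough peak-memory model. The weight figure comes from the article; the calibration size, output-buffer overlap fraction, and CUDA-context cost below are illustrative assumptions, not measurements:

```python
# Rough peak-memory model for a quantization run. Only the FP16 weight
# size is a hard number; the other terms are plausible placeholders.
def peak_quant_gb(fp16_gb: float,
                  calibration_gb: float = 0.0,
                  overlap_frac: float = 0.25,  # share of output buffered before flush
                  cuda_ctx_gb: float = 1.0) -> float:
    q4_output_gb = fp16_gb * 0.25  # a 4-bit output is ~1/4 the FP16 size
    return fp16_gb + calibration_gb + q4_output_gb * overlap_frac + cuda_ctx_gb

# Scout at FP16 (218 GB) with an assumed ~20 GB of calibration activations:
print(round(peak_quant_gb(218, calibration_gb=20), 1))  # -> 252.6
```

Even with hand-wavy inputs, the result lands in the 240 to 260GB range discussed below, which is the point: the weights dominate, and the overheads stack on top of them rather than replacing them.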
When You Can Quantize Locally and When You Cannot
The math is straightforward once you know what to measure. For Llama 4 Scout FP16, you need approximately 240 to 260GB of accessible memory to quantize without offloading. That means a multi-GPU server, a Mac with enough unified memory (a 192GB configuration still falls short for Scout), or a CPU quantization run that accepts significantly longer processing time.
CPU quantization via llama.cpp’s quantize tool is the practical escape valve for single-consumer-GPU setups. It uses system RAM instead of VRAM. A machine with 256GB of DDR5 RAM can quantize Scout where a 24GB GPU cannot. The tradeoff is time. CPU quantization of a 109B MoE model runs hours rather than minutes.
The fastest path for most self-hosters is skipping local quantization entirely. Download pre-quantized GGUF files from verified sources, verify the SHA256 checksum against the publisher’s stated hash, and run inference directly. Reserve local quantization for cases where no trusted pre-quantized version exists.
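Checksum verification should read the file in chunks, since a 55GB GGUF will not fit in memory any more comfortably than the model itself. A minimal sketch (the file path and published hash are placeholders you would fill in from the publisher's page):

```python
# Chunked SHA256 verification for a downloaded GGUF. Reading in 1 MiB
# blocks keeps memory flat regardless of file size.
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_download(path: str, published_hash: str) -> bool:
    # Normalize case/whitespace so a copy-pasted hash still matches.
    return sha256_of(path) == published_hash.strip().lower()
```

Run it against the publisher's stated hash before loading the file into any runtime.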
What This Means For You
- Check peak VRAM requirements, not inference VRAM, before attempting local quantization. Multiply the FP16 model size by 1.2 to estimate minimum quantization memory.
- Use CPU quantization via llama.cpp if your GPU VRAM is insufficient. It uses system RAM instead of VRAM; it runs slower but completes without OOM errors on machines with enough RAM (256GB for a Scout-class model).
- Download pre-quantized GGUFs from verified publishers when available. Verify SHA256 checksums against the publisher’s stated hash before running anything.
- Avoid GPTQ and AWQ on consumer hardware for models above 30B parameters. Both methods require calibration data that adds significant VRAM overhead beyond the base weight cost.
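The 1.2× rule of thumb from the first bullet can be turned into a quick triage check. The thresholds follow the bullets above; the return labels are illustrative, not a definitive policy:

```python
# Triage sketch using the article's 1.2x peak-memory rule of thumb.
def quantization_path(fp16_gb: float, vram_gb: float, ram_gb: float) -> str:
    needed = fp16_gb * 1.2  # estimated peak memory for the quantization run
    if vram_gb >= needed:
        return "quantize on GPU"
    if ram_gb >= needed:
        return "quantize on CPU via llama.cpp (slower)"
    return "download a pre-quantized GGUF"

# A 70B-class model (~140 GB at FP16) on a 24 GB GPU with 192 GB of RAM:
print(quantization_path(140, vram_gb=24, ram_gb=192))
# -> quantize on CPU via llama.cpp (slower)
```

For most single-GPU setups the function lands on the last two branches, which matches the article's bottom line: CPU quantization or a verified pre-quantized download.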
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
