Quantization reduces the number of bits used to represent each model weight. A model quantized to 4-bit uses roughly half the memory of its 8-bit version and an eighth of its 32-bit version. But memory usage at runtime is not determined by weights alone. The KV cache, activations, the runtime framework, and the context window all consume RAM independently of quantization, and they do not shrink when you reduce weight precision.
Analysis Briefing
- Topic: Local LLM memory usage, KV cache overhead, and quantization limits
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: If you quantize a model down to 4-bit, why does it still need so much memory to run?
What Quantization Actually Reduces (and What It Doesn’t)
Quantization compresses the stored model weights. A 7-billion-parameter model at 16-bit floating point uses roughly 14GB. At 4-bit, that same model uses roughly 3.5GB. That is a real and significant reduction in the memory needed to load the model.
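The arithmetic here is just parameter count times bits per weight. A minimal sketch (the function name is illustrative, not from any library):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate memory for the stored weights alone.

    n_params: parameter count (e.g. 7e9 for a 7B model)
    bits: bits per weight after quantization
    """
    return n_params * bits / 8 / 1e9  # bytes -> GB (decimal)

print(weight_memory_gb(7e9, 16))  # 14.0
print(weight_memory_gb(7e9, 4))   # 3.5
```

Real quantized files run slightly larger than this because some tensors (embeddings, scales) are kept at higher precision, but the estimate is close enough for capacity planning.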
What quantization does not touch is the KV cache. During inference, the model stores Key and Value tensors for every token in the current context window. These are stored at the model’s native activation precision, typically 16-bit, regardless of weight quantization. With a 32K context window and a 7B model, the KV cache alone can require 2 to 4GB of additional memory on top of the quantized weights.
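The KV cache size follows directly from the model's shape: two tensors (K and V) per layer, per token, per KV head. A sketch, assuming a Mistral-7B-like shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) for the 2 to 4GB figure above; exact shapes vary by model:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size: K and V tensors for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Mistral-7B-like shape (assumed) at a 32K context, 16-bit cache:
print(kv_cache_gb(32, 8, 128, 32_768))   # ~4.3 GB
# An older 7B without grouped-query attention (32 KV heads) at the same context:
print(kv_cache_gb(32, 32, 128, 32_768))  # ~17 GB
```

The second number is why grouped-query attention matters so much for long-context local use: models without it pay for a full set of KV heads per layer.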
Activations during the forward pass also require temporary memory proportional to the context length and batch size, not to the quantization level. The runtime framework (llama.cpp, Ollama, vLLM) adds its own overhead. The combination means a 4-bit quantized 7B model that looks like it should need 3.5GB may consume 8 to 10GB at runtime with a reasonable context window.
The Context Window Multiplier
The KV cache scales linearly with context length. Doubling the context window roughly doubles the KV cache memory requirement. This is why vLLM memory overhead surprises teams that plan capacity based on model size alone.
For local deployments with limited RAM, the practical choice is often between a larger model with a short context and a smaller model with a longer context. A 13B model at 4-bit with a 4K context window may fit in the same memory as a 7B model at 4-bit with a 32K context window.
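That tradeoff can be put in numbers. A rough comparison, assuming a Llama-2-13B-like shape (40 layers, 40 KV heads, head dimension 128, no grouped-query attention) and a Mistral-7B-like shape (32 layers, 8 KV heads); the shapes are illustrative assumptions, not measurements:

```python
def weights_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

def kv_gb(n_layers: int, n_kv_heads: int, head_dim: int,
          ctx: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# Larger model, short context (13B-like shape, assumed):
big_short = weights_gb(13e9, 4) + kv_gb(40, 40, 128, 4_096)
# Smaller model, long context (7B-like GQA shape, assumed):
small_long = weights_gb(7e9, 4) + kv_gb(32, 8, 128, 32_768)

print(f"13B @ 4-bit, 4K ctx:  {big_short:.1f} GB")   # ~9.9 GB
print(f"7B  @ 4-bit, 32K ctx: {small_long:.1f} GB")  # ~7.8 GB
```

Both configurations land in the same high-single-digit GB range before framework overhead, which is the point: context length buys capability out of the same memory budget that model size does.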
KV Cache Quantization as a Further Reduction
Some inference frameworks (llama.cpp with the --cache-type-k and --cache-type-v flags, and certain vLLM configurations) support quantizing the KV cache itself to 8-bit or even 4-bit. This reduces KV cache memory by 50 to 75% at the cost of a small quality degradation, primarily visible in very long context tasks where cache precision matters most.
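In llama.cpp this is a pair of launch flags. A sketch of the invocation, using the flags named above; the cache type values (such as q8_0) vary by build, so check your version's --help:

```shell
# Run a server with the KV cache quantized to 8-bit (values assumed;
# supported cache types depend on your llama.cpp version)
llama-server -m model.gguf -c 32768 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```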
For most local use cases (coding assistance, document Q&A, conversation), KV cache quantization produces negligible quality impact and meaningful memory savings. It is one of the least-publicized optimizations available to local LLM users.
What This Means For You
- Plan memory capacity based on weights plus KV cache, not weights alone, because the KV cache for long context windows can equal or exceed the quantized weight size.
- Reduce your context window setting if you are hitting memory limits before dropping to a smaller model, because halving the context window halves the KV cache without changing model quality on short inputs.
- Enable KV cache quantization in your inference framework if it is supported, because it reduces KV cache memory by roughly 50% at 8-bit (75% at 4-bit) with minimal quality impact for most use cases.
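The three takeaways above combine into one back-of-envelope planner. A sketch, assuming a Mistral-7B-like shape and a guessed fixed allowance for activations and framework buffers (the overhead figure is an assumption, not a measurement):

```python
def runtime_budget_gb(n_params: float, weight_bits: int, n_layers: int,
                      n_kv_heads: int, head_dim: int, ctx: int,
                      kv_bits: int = 16, overhead_gb: float = 1.5) -> float:
    """Back-of-envelope runtime memory: weights + KV cache + fixed overhead.

    overhead_gb is a rough allowance for activations and framework buffers.
    """
    weights = n_params * weight_bits / 8 / 1e9
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * (kv_bits / 8) / 1e9
    return weights + kv + overhead_gb

# 7B-like GQA shape (assumed), 4-bit weights, 32K context:
print(runtime_budget_gb(7e9, 4, 32, 8, 128, 32_768))             # ~9.3 GB
print(runtime_budget_gb(7e9, 4, 32, 8, 128, 32_768, kv_bits=8))  # ~7.1 GB
```

If the first number exceeds your RAM, the knobs are, in order of cheapness: quantize the KV cache, shrink the context window, then drop to a smaller model.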
Enjoyed this? Subscribe for more clear thinking on AI:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
