Slow local LLM inference is almost never the model’s fault. It is almost always a configuration problem sitting between your hardware and the model that nobody in the setup tutorials mentions. The four most common culprits are CPU offloading you did not know was happening, VRAM that is technically available but fragmented, a context length set higher than your hardware can serve efficiently, and a serving framework running on the wrong backend for your GPU architecture.
Pithy Cyborg | AI FAQs – The Details
Question: Why is my local LLM inference so much slower than benchmarks suggest it should be, and how do I diagnose the real bottleneck between my GPU, RAM, and serving framework?
Asked by: Mistral
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
The CPU Offload Problem That Tanks Inference Speed Silently
The single most common cause of unexpectedly slow local LLM inference is partial CPU offloading that is happening without any visible warning. When a model does not fit entirely in VRAM, frameworks like Ollama and llama.cpp automatically offload layers to system RAM and run them on the CPU. The model loads. Inference starts. Nothing tells you that 30 percent of your layers are running at one-tenth the speed.
GPU inference on a modern RTX 4090 runs at roughly 100 to 120 tokens per second on a well-fitted 7B model. CPU inference on the same layers runs at 3 to 8 tokens per second. When your setup is split, every generation step pays for both portions, so the slow CPU layers dominate the per-token time. A setup with 70 percent GPU layers and 30 percent CPU layers does not run at 70 percent of full GPU speed. It runs far slower.
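The split math can be sketched with a back-of-envelope calculation. The tokens-per-second figures below are illustrative, not measured, and the model assumes uniform per-layer cost:

```shell
# Effective tokens/sec when layers are split between GPU and CPU.
# Each token pays the GPU share AND the CPU share of the forward pass,
# so the combined rate is a harmonic-style blend, not a simple average.
gpu_tps=110   # full-GPU speed for this model (illustrative)
cpu_tps=5     # full-CPU speed for the same model (illustrative)
gpu_frac=0.7  # fraction of layers resident on the GPU

awk -v g="$gpu_tps" -v c="$cpu_tps" -v f="$gpu_frac" 'BEGIN {
  t = f / g + (1 - f) / c        # seconds per token across both portions
  printf "effective: %.1f tokens/sec\n", 1 / t
}'
# prints: effective: 15.1 tokens/sec  -- not 77 (70% of 110)
```

The 30 percent CPU portion contributes 0.06 seconds per token versus 0.006 for the GPU portion, which is why the blend lands near CPU speed.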
The diagnostic is straightforward. In Ollama, run ollama ps while the model is active and check the GPU memory usage against the model’s expected footprint. In llama.cpp, the --n-gpu-layers flag controls offloading explicitly. If you did not set it, the default may not fit the full model on your GPU. Set it to 999 to push everything possible onto the GPU and see whether your tokens per second jump immediately.
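A minimal diagnostic sequence, assuming an Ollama install and a llama.cpp build with the llama-cli binary (model.gguf is a placeholder path):

```shell
# 1. While the model is loaded, check how much of it is on the GPU.
#    The PROCESSOR column reads e.g. "100% GPU" or "48%/52% CPU/GPU" --
#    anything other than 100% GPU means silent partial offloading.
ollama ps

# 2. In llama.cpp, force every layer onto the GPU and watch the load log.
#    If the log reports fewer layers offloaded than the model has,
#    you are VRAM-limited at the current context length.
./llama-cli -m model.gguf --n-gpu-layers 999 -p "test" -n 64
```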
Why VRAM Fragmentation and Context Length Kill Throughput
Even when the full model fits in VRAM, two configuration choices silently destroy inference speed that most tutorials never address.
VRAM fragmentation happens when your GPU has enough total memory for the model but that memory is not contiguous. Other processes, your display driver, other applications, and background GPU tasks all consume VRAM in chunks. A model that needs 10GB of contiguous VRAM cannot load correctly into a GPU with 12GB total but only 9.5GB free in any single contiguous block. The framework falls back to partial CPU offloading silently. The fix is to close every GPU-consuming application before loading your model and check actual free VRAM rather than total VRAM before diagnosing a hardware limitation.
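On NVIDIA hardware, actual free VRAM and the processes consuming the rest can be checked before loading the model (these are standard nvidia-smi query flags; output columns are comma-separated CSV):

```shell
# Total vs. actually-free VRAM -- compare the free figure, not the total,
# against the model's expected footprint.
nvidia-smi --query-gpu=memory.total,memory.free --format=csv

# List every process holding VRAM so you know what to close first.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```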
Context length is the less obvious killer. KV cache, the memory structure that stores attention state for your context window, grows linearly with context length and consumes VRAM that competes directly with model weights. Setting a 32k context length on a GPU that can barely fit the model weights at 4k context forces aggressive KV cache offloading that tanks generation speed dramatically. Most Ollama model files default to 2048 or 4096 context. If you changed that number upward without understanding the VRAM math, that change is likely responsible for your slowdown. Set your context length to the minimum your actual use case requires, not the maximum the model supports.
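The VRAM math is easy to run yourself. A sketch for a Llama-2-7B-style model (32 layers, 32 KV heads, head dimension 128, fp16 cache); substitute your own model's numbers, and note that GQA models have far fewer KV heads and a correspondingly smaller cache:

```shell
# KV cache size: 2 tensors (K and V) per layer, per token,
# each kv_heads * head_dim values at 2 bytes for fp16.
awk -v layers=32 -v kv_heads=32 -v head_dim=128 -v bytes=2 -v ctx=32768 'BEGIN {
  per_token = 2 * layers * kv_heads * head_dim * bytes
  printf "KV cache at %d ctx: %.1f GB\n", ctx, per_token * ctx / 1024^3
}'
# prints: KV cache at 32768 ctx: 16.0 GB
# the same model at 4096 ctx needs only 2.0 GB
```

At 32k context this hypothetical cache alone is larger than the 4-bit quantized weights of the model it serves, which is exactly why raising the context number without doing this arithmetic produces the silent offloading described above.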
When the Serving Framework Itself Is the Bottleneck
Ollama is the easiest local LLM setup, but it is not the fastest. That tradeoff is worth understanding before you spend time optimizing hardware configuration on a framework that has a performance ceiling below what your GPU can actually deliver.
Ollama uses llama.cpp under the hood with a convenience layer on top. For most consumer GPU setups, this is fine. For RTX 4090, A100, or multi-GPU configurations, vLLM or llama.cpp compiled directly with optimal CUDA flags will deliver meaningfully higher throughput than Ollama’s default build. vLLM’s PagedAttention architecture specifically eliminates KV cache fragmentation at the framework level, which is the single biggest source of throughput loss in high-context serving scenarios.
The backend compilation matters too. Ollama ships with precompiled CUDA binaries that target a broad range of GPU architectures. Compiling llama.cpp from source with your specific GPU’s compute capability flag (sm_89 for RTX 4090, sm_90 for H100) produces binaries optimized for your exact hardware rather than a general compatibility target. Self-hosters who have done this report 15 to 25 percent throughput improvements on identical hardware without changing any other configuration.
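A build sketch, assuming a CUDA toolkit and CMake are installed and targeting an RTX 4090 (compute capability 8.9; substitute your own architecture number from NVIDIA's compute capability table):

```shell
# Build llama.cpp for a single GPU architecture instead of the generic
# multi-arch binaries that prebuilt packages ship.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89   # 89 = sm_89, RTX 4090
cmake --build build --config Release -j
```

Benchmark the resulting binary against your existing setup on the same model and context length before committing to it; the gain varies with model size and quantization.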
What This Means For You
- Run ollama ps or check --n-gpu-layers output right now to confirm your model is fully GPU-resident and not silently offloading layers to CPU, because partial offloading is responsible for more slow inference reports than any other single cause.
- Close every GPU-consuming application before loading your model and verify actual free contiguous VRAM rather than total VRAM before concluding your hardware is insufficient for a given model size.
- Set context length to your actual use case minimum, not the model maximum: every 4k tokens of context you add above what you need consumes VRAM that your model weights need and forces KV cache offloading that directly reduces tokens per second.
- Benchmark llama.cpp compiled from source against your Ollama setup if you have an RTX 4090 or better, because the precompiled binaries Ollama ships are not optimized for your specific GPU architecture and the throughput gap on high-end hardware is measurable and consistent.
