vLLM’s benchmarks are real. Its memory requirements in production are significantly higher than those benchmarks imply, and the gap is not disclosed prominently anywhere in the official documentation. PagedAttention, the architecture that makes vLLM fast, pre-reserves a large block of your GPU memory at startup before a single request arrives. That reservation is configurable, but the default settings will consume memory you were planning to use for your model, on hardware you already sized carefully, the first time you run it.
Pithy Cyborg | AI FAQs – The Details
Question: What does vLLM not tell you about memory overhead, and why does PagedAttention’s KV cache reservation cause out-of-memory errors on hardware that should be sufficient for your model?
Asked by: DeepSeek V3
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
How PagedAttention Pre-Reserves Memory Before Your First Request
PagedAttention is genuinely the right architecture for high-throughput LLM serving. It eliminates KV cache fragmentation by managing attention memory in fixed-size pages, the same way an operating system manages virtual memory. The throughput gains over naive serving are real and well-documented. The memory cost of that architecture is not explained upfront anywhere that a first-time vLLM user is likely to read.
When vLLM starts, it runs a memory profiling pass to determine how much VRAM is available after loading the model weights. It then reserves a large fraction of that remaining VRAM as a pre-allocated KV cache block pool. The default gpu_memory_utilization parameter is set to 0.90, meaning vLLM will attempt to consume 90 percent of your total GPU memory between model weights and KV cache reservation combined.
On a 24GB GPU running a model that consumes 18GB of weights, vLLM does not leave you 6GB for KV cache. It targets 21.6GB total utilization, allocates 18GB for weights, and then tries to pre-allocate 3.6GB of KV cache blocks at startup. If other processes, your display driver included, already hold more than the remaining 2.4GB, that pre-allocation fails with an out-of-memory error before you serve a single token. The model itself loads fine; vLLM fails anyway because the KV cache reservation cannot complete.
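The arithmetic above can be sketched directly. This is a simplified model of vLLM's startup budget, not its actual profiling code; it deliberately ignores CUDA context and activation overhead, which are covered below:

```python
def kv_cache_reservation_gb(total_vram_gb: float, weight_gb: float,
                            gpu_memory_utilization: float = 0.90) -> float:
    """Approximate the KV cache block pool vLLM pre-allocates at startup.

    Simplified model: vLLM targets total * gpu_memory_utilization for
    weights plus KV cache combined, and reserves whatever the weights
    leave over as paged KV cache blocks.
    """
    target_gb = total_vram_gb * gpu_memory_utilization
    reservation_gb = target_gb - weight_gb
    if reservation_gb <= 0:
        raise MemoryError("model weights alone exceed the utilization target")
    return reservation_gb

# The worked example from the text: 24 GB card, 18 GB of weights.
print(round(kv_cache_reservation_gb(24, 18), 1))  # → 3.6
```

Note that the reservation shrinks linearly as you lower the utilization target, which is why that parameter is the main lever discussed later.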
The Four Memory Costs vLLM Documentation Buries
The model weights are the number everyone calculates. Four other memory consumers sit underneath PagedAttention’s reservation and compound the problem on real hardware.
CUDA context overhead consumes 500MB to 1.5GB of VRAM on initialization depending on your driver version and GPU model. This is GPU memory consumed before vLLM loads a single weight. It is not included in model footprint calculations and is not mentioned in vLLM’s memory planning documentation. On a 24GB card running a model sized to fit with 1GB to spare, CUDA context overhead alone can cause the startup OOM.
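One way to see why a card "sized to fit" still fails at startup: vLLM budgets total VRAM times the utilization target for itself, so everything outside that budget must fit in the remaining slack. A rough back-of-envelope check (the overhead figures below are illustrative, drawn from the ranges in the text):

```python
def startup_ooms(total_vram_gb: float, gpu_memory_utilization: float,
                 outside_overhead_gb: float) -> bool:
    """Rough model of the startup failure mode described above.

    vLLM claims total * utilization for weights plus KV cache, so
    everything outside that budget (CUDA context, display driver,
    other processes) must fit in the slack: total * (1 - utilization).
    """
    slack_gb = total_vram_gb * (1.0 - gpu_memory_utilization)
    return outside_overhead_gb > slack_gb

# 24 GB card at the 0.90 default leaves only 2.4 GB of slack; a 1.5 GB
# CUDA context plus ~1 GB of desktop compositor VRAM exceeds it.
print(startup_ooms(24, 0.90, 2.5))  # → True
# Dropping to 0.80 leaves 4.8 GB of slack; the same overhead now fits.
print(startup_ooms(24, 0.80, 2.5))  # → False
```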
Activation memory is the second hidden cost. During inference, intermediate layer activations consume VRAM proportional to batch size and sequence length. vLLM’s memory profiling pass estimates this, but the estimate uses a synthetic profiling batch that may not represent your actual request distribution. High-concurrency workloads with long prompts generate activation memory spikes that the profiling pass underestimates.
Tensor parallelism communication buffers are the third. If you are running vLLM across multiple GPUs using tensor parallelism, each GPU allocates communication buffers for the all-reduce operations between layers. These buffers scale with model size and parallelism degree and are allocated on top of weights, KV cache, and activations. Multi-GPU setups that sized each card for the model shard often hit OOM on the communication buffer allocation.
The fourth is framework overhead from PyTorch’s caching allocator, which holds freed memory in a pool rather than returning it to the GPU immediately. Memory that appears freed in your monitoring tool may not be available for reallocation for several seconds, creating transient OOM conditions under load that are nearly impossible to reproduce in isolation.
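You can observe the allocated-versus-reserved gap directly with PyTorch's own counters. A minimal probe, guarded so it degrades gracefully when no GPU (or no torch install) is present:

```python
try:
    import torch
except ImportError:  # torch not installed; treat the same as "no GPU"
    torch = None

def allocator_gap_mb(device: int = 0):
    """Report memory PyTorch has handed out (allocated) versus memory its
    caching allocator is holding (reserved). The difference is VRAM that
    looks freed to your code but is unavailable to other consumers until
    torch.cuda.empty_cache() returns it to the driver.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated(device) / 2**20
    reserved = torch.cuda.memory_reserved(device) / 2**20
    return {"allocated_mb": allocated, "reserved_mb": reserved,
            "held_by_cache_mb": reserved - allocated}

print(allocator_gap_mb())
```

A monitoring tool that reports only the allocated figure will show memory as free several seconds before the allocator actually returns it, which is exactly the transient-OOM window described above.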
The Configuration Changes That Actually Fix vLLM Memory Problems
The good news is that every one of these problems is addressable with explicit configuration. The bad news is that none of the fixes are prominently documented and most require understanding the problem before you know what to search for.
Dropping gpu_memory_utilization from 0.90 to 0.75 or 0.80 is the first and most impactful change for hardware that is hitting startup OOM errors. This reduces the KV cache pre-allocation proportionally, leaving headroom for CUDA context overhead and activation spikes. You trade some maximum throughput for stability. On a 24GB card that is borderline for your model size, this single parameter change frequently resolves OOM errors without any other modification.
The --max-num-batched-tokens and --max-num-seqs parameters directly control peak activation memory by limiting the maximum batch vLLM will attempt to process simultaneously. Reducing these below vLLM’s auto-selected defaults constrains activation memory spikes at the cost of maximum concurrent request handling. For single-user local deployments this tradeoff is almost always worth making.
For multi-GPU setups hitting tensor parallelism OOM errors, --gpu-memory-utilization needs to be set lower than you would on a single-GPU setup specifically to account for communication buffer allocation. A value of 0.70 is a reasonable starting point for 2-GPU tensor parallel configurations on 24GB cards.
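Putting these fixes together, a conservative launch for a borderline 24GB card might look like the following. The flags are real vLLM server options; the model name and exact values are illustrative starting points, not tuned recommendations:

```shell
# Single 24 GB card that OOMs at the defaults: drop utilization from
# 0.90 to 0.80 and cap the batch vLLM will build, which constrains
# peak activation memory. (Model and values are illustrative.)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 4096

# Two-GPU tensor parallel variant: go lower still, leaving room for the
# all-reduce communication buffers each card allocates on top of its shard.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.70
```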
What This Means For You
- Set --gpu-memory-utilization 0.80 as your starting point rather than accepting vLLM’s 0.90 default, especially on any GPU where the model weights consume more than 70 percent of total VRAM, because the default leaves insufficient headroom for CUDA context and activation overhead.
- Check actual free VRAM after CUDA context initialization before sizing your vLLM deployment by running a minimal CUDA program and measuring the baseline consumption, since 500MB to 1.5GB of invisible overhead changes the math on borderline hardware configurations significantly.
- Monitor GPU memory during peak load rather than idle: vLLM’s memory usage at rest after startup does not reflect its peak consumption under concurrent requests, and sizing based on idle memory will produce OOM errors under real workloads that benchmarks never surface.
- Read the vLLM GitHub issues tracker before the official documentation for memory-related problems, because the actual configuration knowledge for edge cases lives in issue threads and community PRs rather than in docs that are written for happy-path deployments on well-provisioned hardware.
