Local LLM inference is sequential by default. When one user sends a request, the model generates tokens one at a time until the response is complete. When three users send requests simultaneously, they queue. Each user waits for the previous user’s full response before their generation begins. What felt like a fast personal tool becomes a slow shared service because the throughput model was never designed for concurrency.
Analysis Briefing
- Topic: Self-hosted LLM concurrency, batching, and throughput limits
- Analyst: Mike D (@MrComputerScience)
- Context: A research sprint initiated by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: What changes architecturally between a single-user local model and a multi-user shared deployment?
Why Sequential Inference Breaks Under Shared Load
LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens one at a time). The decode phase is memory-bandwidth-bound: the model must load all weights from GPU memory for each generated token. On a single GPU, this process cannot be parallelized across independent requests without batching.
A single prompt that generates a 500-token response ties up the GPU for the entire generation before the next request can start. With three concurrent 500-token requests handled sequentially, user two waits roughly twice the single-response time and user three roughly three times.
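The queueing arithmetic can be sketched in a few lines. This is a toy model, not a benchmark: the 100 tokens-per-second decode speed is an assumed single-user figure in the range discussed later.

```python
# Sequential serving: each request waits for all earlier requests
# to finish before its own generation starts (FIFO queue).
# Numbers are illustrative, based on the 500-token example above.

TOKENS_PER_RESPONSE = 500
TOKENS_PER_SECOND = 100  # assumed single-user decode speed

def sequential_completion_times(num_users: int) -> list[float]:
    """Seconds until each user's response finishes under FIFO serving."""
    per_response = TOKENS_PER_RESPONSE / TOKENS_PER_SECOND
    return [per_response * (i + 1) for i in range(num_users)]

print(sequential_completion_times(3))  # [5.0, 10.0, 15.0]
```

User one is done in 5 seconds; user three waits 15, even though the model itself never got slower.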
Raw local inference speed explains the single-user performance story. The multi-user collapse happens because most local inference setups (Ollama's default configuration, a basic llama.cpp server) handle requests sequentially, without continuous batching.
Continuous Batching: The Fix That Changes the Math
Continuous batching allows the inference server to group decode steps from multiple concurrent requests into a single GPU operation. Instead of processing User 1’s full response before starting User 2, the server processes one decode step for User 1, one for User 2, and one for User 3 simultaneously in each GPU pass.
This does not make each user’s response three times faster. It keeps the GPU roughly three times better utilized, so total throughput rises proportionally. Each user’s generation is slightly slower because the GPU’s compute is shared, but the wait before generation begins drops from “a full previous response” to “one decode step.”
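A toy scheduler makes the difference concrete. The functions below are a sketch of the scheduling behavior, not of any real server: each “step” decodes one token, and with continuous batching one GPU pass advances every active request by one token.

```python
# Contrast when each request's FIRST token arrives under
# sequential serving vs continuous batching.

def first_token_waits_sequential(response_lens: list[int]) -> list[int]:
    """Steps each request waits before its first token (FIFO queue)."""
    waits, elapsed = [], 0
    for length in response_lens:
        waits.append(elapsed)   # must wait for all earlier responses
        elapsed += length
    return waits

def first_token_waits_batched(response_lens: list[int]) -> list[int]:
    """With continuous batching, every request starts decoding at step 0."""
    return [0 for _ in response_lens]

print(first_token_waits_sequential([500, 500, 500]))  # [0, 500, 1000]
print(first_token_waits_batched([500, 500, 500]))     # [0, 0, 0]
```

The batched users still share decode throughput, but nobody sits behind someone else’s entire response.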
vLLM implements continuous batching and PagedAttention (efficient KV cache management across concurrent requests). Ollama as of 2026 has basic parallel request support but with limitations. For a team of three to ten users, vLLM on a single GPU is the architectural step that transforms a personal local model into a shared service.
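For illustration, a vLLM deployment for a small team might be launched like this. The model name and flag values are assumptions to be tuned for your hardware, not recommendations; `--max-num-seqs` caps how many requests are batched concurrently, and `--gpu-memory-utilization` sets the fraction of VRAM reserved for weights plus KV cache.

```shell
# Illustrative single-GPU vLLM launch for a 3-10 user team.
# Model name and flag values are assumptions; tune for your GPU.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.90
```

This exposes an OpenAI-compatible HTTP endpoint, so existing client tooling can point at it without code changes.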
The Capacity Planning Calculation
A 7B model at 4-bit on an RTX 4090 generates roughly 80 to 120 tokens per second for a single user. With continuous batching and three concurrent users, total throughput is roughly 150 to 200 tokens per second, but each user sees 50 to 70 tokens per second instead of 80 to 120. That is acceptable for most interactive use cases.
At ten concurrent users, the math becomes challenging. Each user sees 15 to 25 tokens per second, which starts to feel slow for interactive chat. At that point, either a second GPU or a larger model with better throughput characteristics is necessary.
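The per-user arithmetic is simple enough to write down. The total-throughput figures below are illustrative values taken from the RTX 4090 / 7B ranges above; real numbers depend on model, quantization, and prompt lengths.

```python
# Capacity planning rule of thumb: batching raises TOTAL throughput
# sub-linearly, so per-user speed falls as concurrency grows.

def per_user_tokens_per_sec(total_throughput: float, users: int) -> float:
    """Average decode speed each user sees when throughput is shared."""
    return total_throughput / users

# Illustrative totals from the article's single-GPU example.
print(per_user_tokens_per_sec(175, 3))   # ~58 tok/s: fine for chat
print(per_user_tokens_per_sec(200, 10))  # 20 tok/s: starts to feel slow
```

The useful habit is to plan backwards: decide the minimum per-user speed you will accept, multiply by expected concurrent users, and compare that to your hardware’s batched total.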
What This Means For You
- Switch from Ollama to vLLM if you are serving more than two concurrent users, because vLLM’s continuous batching produces dramatically better multi-user throughput than sequential request handling.
- Plan for throughput per user rather than total throughput, because 200 tokens per second distributed across 10 users produces an experience that feels slow even though the number sounds fast.
- Set user expectations about latency before sharing a local deployment, because a model that feels instant for the developer running it alone will feel frustratingly slow for three colleagues using it simultaneously without architectural changes.
Enjoyed this? Subscribe for more clear thinking on AI:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
