Prompt caching stores the processed key-value representations of repeated context prefixes so Claude does not recompute them on every request. When your system prompt, documents, or conversation history are identical across requests, the cached computation is reused. First request pays full cost. Subsequent requests are faster and cheaper.
Analysis Briefing
- Topic: Anthropic prompt caching mechanics and cost implications
- Analyst: Mike D (@MrComputerScience)
- Context: A technical briefing developed with Claude Sonnet 4.6
- Source: Pithy Cyborg
- Key Question: Why do some Claude API calls return faster, and how do you make that happen deliberately?
What Gets Cached and What the KV Cache Actually Stores
Every transformer model processes input by computing attention over key and value representations of each input token. These computations are expensive relative to the final output generation step. Prompt caching stores those key-value representations for reuse rather than recomputing them on every request.
For caching to work, the beginning of the prompt must be identical across requests. Anthropic’s implementation caches prefixes: contiguous blocks of tokens at the start of the prompt that match a previously processed sequence. A system prompt that is identical across all your requests is cached after the first call. Every subsequent call that starts with that system prompt skips the KV computation for those tokens and pays only for the new tokens being added.
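As a concrete sketch: the Anthropic Messages API marks a cacheable prefix with a `cache_control` block on the system prompt. The model id and prompt text below are placeholders; treat this as an illustration of the request shape, not a definitive integration.

```python
# Build a Messages API request whose system prompt is marked cacheable.
# cache_control marks the end of the prefix to cache; everything before
# it (here, the whole system block) becomes the cached prefix.
def build_request(system_text: str, user_text: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }
```

The dict is passed straight to `client.messages.create(**build_request(...))`; sending a byte-identical `system_text` on every call is what makes subsequent requests hit the cache.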
The cache is stored server-side by Anthropic’s infrastructure with a five-minute TTL by default. A request that arrives more than five minutes after the last cache hit triggers a cache miss and recomputes the prefix from scratch. For high-frequency applications this is not a constraint. For applications with lower request rates, the TTL means caching benefits are inconsistent unless the cache is actively maintained through periodic requests.
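The TTL rule above can be modeled in a few lines. This is a toy illustration of refresh-on-use behavior, not Anthropic's implementation: each request within the window is a hit and resets the clock, and any request after the window expires is a miss that rewrites the cache.

```python
TTL_SECONDS = 5 * 60  # five-minute default TTL

class CacheClock:
    """Toy model of a TTL cache that refreshes on every use."""

    def __init__(self):
        self.last_use = None

    def request(self, now: float) -> bool:
        """Return True for a cache hit, False for a miss (prefix recomputed)."""
        hit = self.last_use is not None and (now - self.last_use) <= TTL_SECONDS
        self.last_use = now  # both hits and rewrites reset the clock
        return hit
```

A request stream with gaps under five minutes stays warm indefinitely; one gap over five minutes pays the full prefix computation again.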
The Cost and Latency Numbers That Make Caching Worth Configuring
Prompt caching for Claude Sonnet 4.6 writes cached tokens at 1.25x the standard input token price on the first request. Subsequent cache hits read those tokens at 0.1x the standard input token price. The arithmetic breaks even on the very first hit: the write costs an extra 0.25x, while each hit saves 0.9x, so one hit more than recovers the premium and every hit after that is nearly pure savings.
For applications with large system prompts or repeated document context, the savings compound quickly. A RAG pipeline that includes 50,000 tokens of retrieved documents in every request pays full input token pricing on every call without caching. With caching, the document context is computed once and the per-request cost drops by up to 90 percent on the cached portion.
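The arithmetic for that 50,000-token example can be written out directly. The multipliers come from the pricing above; costs are expressed in relative token-units because absolute dollar figures depend on the model's base rate.

```python
def cached_units(prompt_tokens: int, n_requests: int,
                 write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Relative input-token cost with caching: one write, then reads."""
    return prompt_tokens * (write_mult + read_mult * (n_requests - 1))

def uncached_units(prompt_tokens: int, n_requests: int) -> float:
    """Relative input-token cost without caching: full price every time."""
    return prompt_tokens * n_requests

# 50,000 tokens of document context over 100 requests:
#   cached   = 50_000 * (1.25 + 0.10 * 99) = 557_500 token-units
#   uncached = 50_000 * 100                = 5_000_000 token-units
# roughly an 89 percent reduction on the cached portion
```

Even at only two requests the cached path wins (1.35x vs 2.0x), which is why the breakeven arrives on the first hit.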
Latency improvements are proportional to the cached token count. Skipping KV computation on 50,000 cached tokens produces a measurably faster time-to-first-token than recomputing those tokens on every request. For interactive applications where response start time matters, caching large stable context is as much a latency optimization as a cost optimization.
The Use Cases Where Caching Produces the Biggest Impact
Long system prompts are the highest-impact caching target. A system prompt that runs to several thousand tokens and is identical across all requests produces cache hits on every call after the first. The cost reduction on the system prompt portion is immediate and consistent.
Document-grounded applications are the second highest-impact target. If your application always includes the same reference documents, the same code repository context, or the same knowledge base chunks in every request, those documents are ideal cache candidates. Without caching, the document tokens are recomputed on every request; with caching, they are computed once and read cheaply on every request after that.
Multi-turn conversations benefit less directly from prefix caching because the conversation history changes with every turn. However, keeping the system prompt and any stable document context at the beginning of the prompt and appending conversation history after it ensures the stable portion remains cache-eligible even as the conversation grows.
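One way to keep that ordering honest in code is to build every turn from a fixed prefix plus the growing history. The function below is illustrative, not a library API; it just enforces that the stable blocks come first and only the messages array grows.

```python
def build_turn(stable_system_blocks: list, history: list, user_text: str) -> dict:
    """Stable, cacheable content first; per-turn content appended after it."""
    return {
        "system": stable_system_blocks,  # identical every turn -> cache-eligible
        "messages": history + [{"role": "user", "content": user_text}],
    }
```

Because `stable_system_blocks` is byte-identical across turns, the cached prefix survives even as the conversation history behind it grows.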
What This Means For You
- Mark your system prompt for caching using the cache_control parameter in the Anthropic API if your system prompt meets the minimum cacheable length (1,024 tokens on Sonnet models) and is identical across requests. A single cache hit recovers the write premium, and most production applications exceed that immediately.
- Place stable content before dynamic content in every prompt. Caching works on prefixes. Documents, instructions, and context that are identical across requests belong at the top. User inputs and conversation history belong at the bottom.
- Monitor cache hit rates in production using the usage fields in API responses. Low hit rates on a prompt you believe should be caching usually mean one of two things: the five-minute TTL is expiring between requests, or the prefix is not byte-identical across calls (a timestamp, request ID, or other dynamic value near the top of the prompt breaks the match).
- Calculate your breakeven before implementing. Cache writes cost 1.25x standard input pricing; cache reads cost 0.1x. The raw-cost breakeven is a hit rate of roughly 22 percent: the 0.25x write premium divided by the 1.15x gap between a write and a read. If your cached prefix is short and your hit rate sits near that threshold, the cost reduction may not justify the implementation complexity.
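A minimal way to aggregate those usage fields into a hit rate, assuming the `cache_read_input_tokens` and `cache_creation_input_tokens` counters reported in the Messages API usage object (the aggregation itself is illustrative):

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of cache-eligible tokens served from cache across requests."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = read + written
    return read / total if total else 0.0
```

Tracked over a rolling window, a rate that sags below the ~22 percent breakeven is the signal to either raise request frequency, add a keep-alive ping, or drop caching for that prompt.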
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
