Jun 11, 2026 · 8 min · Dev Guides

Claude Prompt Caching Deep Dive: Cut Input Costs by Reusing Stable Prefixes

Claude is excellent at working with long context: large codebases, policy manuals, agent traces, tool schemas, retrieval bundles, and multi-step instructions. The downside is obvious to anyone running production workloads: if you resend the same 80,000-token prefix on every request, you pay for those input tokens again and again.

Prompt caching solves that problem.

Instead of charging full input price every time you send a stable prompt prefix, Claude can cache that prefix and let later requests reuse it at a much lower “cache read” price. Used well, prompt caching can dramatically reduce costs for coding agents, document chat, customer support copilots, research workflows, and any system where most of the prompt stays the same while the final user query changes.

This deep dive explains how Claude prompt caching works, what counts as a cache write versus a cache read, what breaks cache hits, and how to structure prompts for maximum savings.

What Claude Prompt Caching Actually Does

Prompt caching lets you mark parts of a request as reusable. Claude stores a tokenized representation of a stable prompt prefix. On later requests, if the beginning of the request matches a previously cached prefix, Claude can reuse it instead of processing those tokens from scratch.

The key phrase is stable prefix.

Prompt caching is not a semantic cache. Claude is not saying, “This looks similar enough.” It is matching the beginning of the request exactly enough to safely reuse the cached computation. If the reusable part changes, even slightly, you may get a cache miss.

Typical cacheable content includes:

Long system prompts
Tool definitions
Static application instructions
Product documentation
Codebase context
Style guides
Legal or compliance text
Few-shot examples
Long-running agent memory snapshots
Retrieval chunks that are reused across multiple turns

Typical non-cacheable or less useful content includes:

The user’s latest question
A changing timestamp
Request IDs
Randomized metadata
Dynamic retrieved snippets that differ every call
Conversation turns that are constantly appended before the cache boundary

The design goal is simple: put the expensive, stable stuff first; put the small, changing stuff last.

Cache Writes vs Cache Reads

Claude prompt caching has two relevant billing categories:

Billing category	What it means	Typical cost behavior
Cache write	Claude processes and stores a cacheable prefix for future reuse	More expensive than normal input tokens
Cache read	Claude reuses a previously cached prefix	Much cheaper than normal input tokens
Normal input	Tokens not cached or not matching a cache	Standard input price
Output	Claude’s generated response	Standard output price

The first time you send a cacheable prefix, you pay a cache write price for those tokens. On subsequent matching requests within the cache’s time-to-live, you pay the cheaper cache read price.

Exact pricing varies by model and provider, so always check the current rate card. In Anthropic-style pricing, cache writes are usually priced at a premium over normal input, while cache reads are a small fraction of normal input cost. That means caching is most valuable when a prefix is reused multiple times.

If you are accessing Claude through a gateway such as AI Prime Tech, which resells cheaper Claude API access alongside GPT-5.5 and Gemini 3, the same principle applies: cache writes cost more than reads, and the savings compound when you reuse stable prefixes often.

TTL: The Cache Is Temporary

Prompt caches are not permanent storage. They have a TTL, or time-to-live.

Claude commonly supports short-lived caching, often around five minutes by default, with longer TTL options available for some configurations. The exact TTL support can vary by model, API version, and provider.

This has important architectural consequences:

If users ask several follow-up questions within a few minutes, cache hit rates can be excellent.
If the same document is used once every few hours, a short TTL may not help much.
If a background agent loops through many tool calls quickly, caching can save a lot.
If traffic is sporadic, you may need to batch, prewarm, or accept occasional cache writes.

Think of prompt caching as a hot working-set optimization, not a replacement for a vector database, object store, or long-term memory system.

What Breaks a Cache Hit?

Cache hits are fragile by design. Claude can only reuse a cache when the request prefix matches the cached prefix. The most common cache breakers are surprisingly mundane.

1. Changing Content Before the Cache Boundary

If you put a timestamp near the top of the system prompt, every request becomes unique:

Current time: 2026-06-11T10:31:02Z
You are a helpful assistant...

That timestamp changes every call, so the prefix changes every call.

Better:

You are a helpful assistant...
[large stable instructions]

Then put dynamic values later:

Current time: 2026-06-11T10:31:02Z
User question: ...

2. Reordering Tools or Instructions

Tool definitions are often large, especially for agents. Reordering tools, changing JSON schema formatting, or injecting dynamic descriptions can invalidate a cache.

Keep tool definitions:

Deterministically ordered
Minified or consistently formatted
Versioned explicitly
Free of per-request metadata

3. Appending Conversation History Before Cached Content

A common mistake is building prompts like this:

System instructions
Conversation history
Large documentation bundle
Latest user message

The conversation history changes every turn, so the documentation bundle may no longer be part of a stable prefix.

A better structure:

System instructions
Large documentation bundle
Few-shot examples
Conversation history
Latest user message

Now the large reusable part is before the dynamic part.

4. Tiny Formatting Differences

Whitespace, serialization differences, changed key order in JSON, newline normalization, or template changes can all cause misses.

Use stable renderers:

Deterministic JSON serialization
Fixed section ordering
Stable markdown templates
No random IDs in cached sections
No “generated at” text in cacheable blocks

5. Switching Models

Caches are generally model-specific. A cache created for Claude Sonnet 4.6 should not be assumed to apply to Claude Opus 4.8, Haiku 4.5, or Fable 5. If you route requests dynamically across models, expect separate caches.

That does not mean you cannot use multiple models. It just means each model should have its own caching strategy.

Structuring Prompts for Maximum Hit Rate

The highest-leverage prompt caching trick is to design your prompt as a layered prefix.

Recommended Order

1. Tool definitions
2. System/developer instructions
3. Stable policy, documentation, code, or examples
4. Semi-stable context
5. Conversation history
6. Latest user input
7. Per-request metadata

The exact API representation depends on how you call Claude, but the conceptual order is what matters. Cache the largest stable prefix you can, and keep volatile content after it.

Use Versioned Stable Blocks

For long-lived applications, version your cacheable blocks:

<app_instructions version="2026-06-01">
...
</app_instructions>

<tool_contracts version="billing-tools-v17">
...
</tool_contracts>

<support_policy version="refund-policy-v9">
...
</support_policy>

This makes cache invalidation intentional. When the refund policy changes, the version changes. Until then, every request uses the same stable text.

Separate Stable Retrieval from Dynamic Retrieval

In retrieval-augmented generation, not all retrieved content is equally dynamic.

For example, a coding assistant may always include:

Repository architecture overview
Public API docs
Style guide
Testing conventions

Then it dynamically retrieves files relevant to the latest task.

Put the stable repository context in the cached prefix. Put task-specific retrieved snippets later. If a user is working in one area for several turns, you may also cache a semi-stable “working set” of files.

Cost Math: When Does Caching Pay Off?

Let’s use simple numbers. Suppose normal input costs 1 unit per token. Cache writes cost 1.25 units, and cache reads cost 0.10 units. These are illustrative ratios; check your actual model pricing.

Assume you have:

100,000-token stable prefix
2,000-token dynamic user/task section
1,000-token output
10 requests in the same cache window

Without caching, input cost is:

10 × (100,000 + 2,000) = 1,020,000 input-token units

With caching:

First request:
100,000 × 1.25 = 125,000 cache-write units
2,000 × 1.00 = 2,000 normal input units

Next 9 requests:
9 × 100,000 × 0.10 = 90,000 cache-read units
9 × 2,000 × 1.00 = 18,000 normal input units

Total input-equivalent units:
125,000 + 2,000 + 90,000 + 18,000 = 235,000

That is roughly a 77% input-side reduction in this simplified example.

The break-even point comes quickly when the cached prefix is large. With the ratios above, one write plus one read costs:

1.25 + 0.10 = 1.35

Two uncached sends would cost:

1.00 + 1.00 = 2.00

So even the second use can be profitable. The more requests hit the same cache within the TTL, the better the economics.

Combining Prompt Caching with Long Agents

Long-running agents are one of the best fits for Claude prompt caching.

Modern agents often include:

Large tool schemas
Planning instructions
Safety rules
Product documentation
Codebase maps
Prior task summaries
Execution traces
Intermediate observations

If every tool call resends all of that, costs balloon. With caching, you can keep the stable agent substrate hot while each step adds only the latest observation or instruction.

For example:

Cached prefix:
- Tool definitions
- Agent operating rules
- Repo map
- Coding conventions
- Test instructions
- Current task plan

Dynamic suffix:
- Latest tool result
- Next requested action

This is especially valuable with long-context models like Claude Fable 5 with 1M context, where the temptation is to include everything. Long context gives the model room to reason over large inputs; prompt caching makes repeated use of that context economically viable.

A practical pattern for agents:

Start with a large cached base prompt.
Keep frequently reused context before the cache boundary.
Summarize old volatile turns into a stable task memory.
Cache the updated task memory when it will be reused.
Put the newest tool result and next instruction at the end.

This avoids sending a constantly growing conversation as an uncached blob.

Operational Tips for Production

Track Cache Metrics

You should log:

Cache creation/write tokens
Cache read tokens
Normal input tokens
Output tokens
Cache hit rate by route
Cache hit rate by model
Prefix size
TTL expiry patterns

If your hit rate is low, inspect the rendered prompts. Usually something dynamic has crept into the prefix.

Prewarm When It Makes Sense

For high-traffic applications, you can deliberately create a cache before users need it. For example, when a customer opens a large document workspace, prewarm the cache with the document and instructions. Follow-up questions can then hit the cache.

Do this carefully: prewarming costs money, so it only pays off when reuse is likely.

Use Model-Specific Strategies

Claude Opus 4.8, Sonnet 4.6, Haiku 4.5, and Fable 5 have different cost/performance profiles. You may want:

Opus for hardest reasoning over cached expert context
Sonnet for balanced agent workloads
Haiku for fast, cheaper interactions
Fable for very large context windows
GPT-5.5 or Gemini 3 for fallback or comparative routing

If you use AI Prime Tech as a third-party gateway for cheaper Claude API access, model routing and cost monitoring become especially important. Prompt caching should be part of that routing strategy, not an afterthought.

Common Anti-Patterns

Avoid these:

Putting timestamps at the top of the prompt
Randomizing tool order
Injecting request IDs into system instructions
Re-rendering JSON with nondeterministic key order
Placing conversation history before stable docs
Caching tiny prefixes with low reuse
Assuming caches last forever
Switching models and expecting the same cache to hit
Treating prompt caching as semantic similarity caching

Prompt caching is powerful, but it rewards discipline.

Final Checklist

Before shipping Claude prompt caching, ask:

Is my largest stable content at the beginning of the request?
Are dynamic values after the cached prefix?
Are tools and JSON schemas rendered deterministically?
Do I understand the TTL?
Is reuse likely within that TTL?
Am I tracking cache writes, reads, and misses?
Have I calculated break-even for my actual model prices?
Does my agent summarize volatile history into reusable stable memory?

If the answer is yes, prompt caching can be one of the easiest ways to cut Claude input costs without reducing context quality. For long-context workflows and agents, it often turns “too expensive to run repeatedly” into “cheap enough to use continuously.”

Get cheaper Claude API access

One API key for Claude Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5, plus GPT & Gemini — up to 80% off official pricing, pay-as-you-go.

Get Your API Key →

AI Prime Tech is an independent third-party API gateway. Claude™ and Anthropic® are trademarks of Anthropic, PBC. No affiliation or endorsement is implied.