Jun 11, 2026 · 8 min · Dev Guides

Claude Prompt Caching Deep Dive: Cut Input Costs by Reusing Stable Prefixes

Claude Prompt Caching Deep Dive: Cut Input Costs by Reusing Stable Prefixes

Claude Prompt Caching Deep Dive: Cut Input Costs by Reusing Stable Prefixes

Claude is excellent at working with long context: large codebases, policy manuals, agent traces, tool schemas, retrieval bundles, and multi-step instructions. The downside is obvious to anyone running production workloads: if you resend the same 80,000-token prefix on every request, you pay for those input tokens again and again.

Prompt caching solves that problem.

Instead of charging full input price every time you send a stable prompt prefix, Claude can cache that prefix and let later requests reuse it at a much lower “cache read” price. Used well, prompt caching can dramatically reduce costs for coding agents, document chat, customer support copilots, research workflows, and any system where most of the prompt stays the same while the final user query changes.

This deep dive explains how Claude prompt caching works, what counts as a cache write versus a cache read, what breaks cache hits, and how to structure prompts for maximum savings.

What Claude Prompt Caching Actually Does

Prompt caching lets you mark parts of a request as reusable. Claude stores a tokenized representation of a stable prompt prefix. On later requests, if the beginning of the request matches a previously cached prefix, Claude can reuse it instead of processing those tokens from scratch.

The key phrase is stable prefix.

Prompt caching is not a semantic cache. Claude is not saying, “This looks similar enough.” It is matching the beginning of the request exactly enough to safely reuse the cached computation. If the reusable part changes, even slightly, you may get a cache miss.

Typical cacheable content includes:

Typical non-cacheable or less useful content includes:

The design goal is simple: put the expensive, stable stuff first; put the small, changing stuff last.

Cache Writes vs Cache Reads

Claude prompt caching has two relevant billing categories:

Billing categoryWhat it meansTypical cost behavior
Cache writeClaude processes and stores a cacheable prefix for future reuseMore expensive than normal input tokens
Cache readClaude reuses a previously cached prefixMuch cheaper than normal input tokens
Normal inputTokens not cached or not matching a cacheStandard input price
OutputClaude’s generated responseStandard output price

The first time you send a cacheable prefix, you pay a cache write price for those tokens. On subsequent matching requests within the cache’s time-to-live, you pay the cheaper cache read price.

Exact pricing varies by model and provider, so always check the current rate card. In Anthropic-style pricing, cache writes are usually priced at a premium over normal input, while cache reads are a small fraction of normal input cost. That means caching is most valuable when a prefix is reused multiple times.

If you are accessing Claude through a gateway such as AI Prime Tech, which resells cheaper Claude API access alongside GPT-5.5 and Gemini 3, the same principle applies: cache writes cost more than reads, and the savings compound when you reuse stable prefixes often.

TTL: The Cache Is Temporary

Prompt caches are not permanent storage. They have a TTL, or time-to-live.

Claude commonly supports short-lived caching, often around five minutes by default, with longer TTL options available for some configurations. The exact TTL support can vary by model, API version, and provider.

This has important architectural consequences:

Think of prompt caching as a hot working-set optimization, not a replacement for a vector database, object store, or long-term memory system.

What Breaks a Cache Hit?

Cache hits are fragile by design. Claude can only reuse a cache when the request prefix matches the cached prefix. The most common cache breakers are surprisingly mundane.

1. Changing Content Before the Cache Boundary

If you put a timestamp near the top of the system prompt, every request becomes unique:

Current time: 2026-06-11T10:31:02Z
You are a helpful assistant...

That timestamp changes every call, so the prefix changes every call.

Better:

You are a helpful assistant...
[large stable instructions]

Then put dynamic values later:

Current time: 2026-06-11T10:31:02Z
User question: ...

2. Reordering Tools or Instructions

Tool definitions are often large, especially for agents. Reordering tools, changing JSON schema formatting, or injecting dynamic descriptions can invalidate a cache.

Keep tool definitions:

3. Appending Conversation History Before Cached Content

A common mistake is building prompts like this:

System instructions
Conversation history
Large documentation bundle
Latest user message

The conversation history changes every turn, so the documentation bundle may no longer be part of a stable prefix.

A better structure:

System instructions
Large documentation bundle
Few-shot examples
Conversation history
Latest user message

Now the large reusable part is before the dynamic part.

4. Tiny Formatting Differences

Whitespace, serialization differences, changed key order in JSON, newline normalization, or template changes can all cause misses.

Use stable renderers:

5. Switching Models

Caches are generally model-specific. A cache created for Claude Sonnet 4.6 should not be assumed to apply to Claude Opus 4.8, Haiku 4.5, or Fable 5. If you route requests dynamically across models, expect separate caches.

That does not mean you cannot use multiple models. It just means each model should have its own caching strategy.

Structuring Prompts for Maximum Hit Rate

The highest-leverage prompt caching trick is to design your prompt as a layered prefix.

1. Tool definitions
2. System/developer instructions
3. Stable policy, documentation, code, or examples
4. Semi-stable context
5. Conversation history
6. Latest user input
7. Per-request metadata

The exact API representation depends on how you call Claude, but the conceptual order is what matters. Cache the largest stable prefix you can, and keep volatile content after it.

Use Versioned Stable Blocks

For long-lived applications, version your cacheable blocks:

<app_instructions version="2026-06-01">
...
</app_instructions>

<tool_contracts version="billing-tools-v17">
...
</tool_contracts>

<support_policy version="refund-policy-v9">
...
</support_policy>

This makes cache invalidation intentional. When the refund policy changes, the version changes. Until then, every request uses the same stable text.

Separate Stable Retrieval from Dynamic Retrieval

In retrieval-augmented generation, not all retrieved content is equally dynamic.

For example, a coding assistant may always include:

Then it dynamically retrieves files relevant to the latest task.

Put the stable repository context in the cached prefix. Put task-specific retrieved snippets later. If a user is working in one area for several turns, you may also cache a semi-stable “working set” of files.

Cost Math: When Does Caching Pay Off?

Let’s use simple numbers. Suppose normal input costs 1 unit per token. Cache writes cost 1.25 units, and cache reads cost 0.10 units. These are illustrative ratios; check your actual model pricing.

Assume you have:

Without caching, input cost is:

10 × (100,000 + 2,000) = 1,020,000 input-token units

With caching:

First request:
100,000 × 1.25 = 125,000 cache-write units
2,000 × 1.00 = 2,000 normal input units

Next 9 requests:
9 × 100,000 × 0.10 = 90,000 cache-read units
9 × 2,000 × 1.00 = 18,000 normal input units

Total input-equivalent units:
125,000 + 2,000 + 90,000 + 18,000 = 235,000

That is roughly a 77% input-side reduction in this simplified example.

The break-even point comes quickly when the cached prefix is large. With the ratios above, one write plus one read costs:

1.25 + 0.10 = 1.35

Two uncached sends would cost:

1.00 + 1.00 = 2.00

So even the second use can be profitable. The more requests hit the same cache within the TTL, the better the economics.

Combining Prompt Caching with Long Agents

Long-running agents are one of the best fits for Claude prompt caching.

Modern agents often include:

If every tool call resends all of that, costs balloon. With caching, you can keep the stable agent substrate hot while each step adds only the latest observation or instruction.

For example:

Cached prefix:
- Tool definitions
- Agent operating rules
- Repo map
- Coding conventions
- Test instructions
- Current task plan

Dynamic suffix:
- Latest tool result
- Next requested action

This is especially valuable with long-context models like Claude Fable 5 with 1M context, where the temptation is to include everything. Long context gives the model room to reason over large inputs; prompt caching makes repeated use of that context economically viable.

A practical pattern for agents:

  1. Start with a large cached base prompt.
  2. Keep frequently reused context before the cache boundary.
  3. Summarize old volatile turns into a stable task memory.
  4. Cache the updated task memory when it will be reused.
  5. Put the newest tool result and next instruction at the end.

This avoids sending a constantly growing conversation as an uncached blob.

Operational Tips for Production

Track Cache Metrics

You should log:

If your hit rate is low, inspect the rendered prompts. Usually something dynamic has crept into the prefix.

Prewarm When It Makes Sense

For high-traffic applications, you can deliberately create a cache before users need it. For example, when a customer opens a large document workspace, prewarm the cache with the document and instructions. Follow-up questions can then hit the cache.

Do this carefully: prewarming costs money, so it only pays off when reuse is likely.

Use Model-Specific Strategies

Claude Opus 4.8, Sonnet 4.6, Haiku 4.5, and Fable 5 have different cost/performance profiles. You may want:

If you use AI Prime Tech as a third-party gateway for cheaper Claude API access, model routing and cost monitoring become especially important. Prompt caching should be part of that routing strategy, not an afterthought.

Common Anti-Patterns

Avoid these:

Prompt caching is powerful, but it rewards discipline.

Final Checklist

Before shipping Claude prompt caching, ask:

If the answer is yes, prompt caching can be one of the easiest ways to cut Claude input costs without reducing context quality. For long-context workflows and agents, it often turns “too expensive to run repeatedly” into “cheap enough to use continuously.”

Get cheaper Claude API access

One API key for Claude Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5, plus GPT & Gemini — up to 80% off official pricing, pay-as-you-go.

Get Your API Key →
AI Prime Tech is an independent third-party API gateway. Claude™ and Anthropic® are trademarks of Anthropic, PBC. No affiliation or endorsement is implied.