Jun 11, 2026 · 11 min · Dev Guides

Managing Context in Long-Running Claude Agents: Tool Search, Context Editing & Compaction

Managing Context in Long-Running Claude Agents: Tool Search, Context Editing & Compaction

A Claude agent that runs for dozens of steps faces one inevitable enemy: its own context window. Every tool call adds a result, every turn adds history, and eventually the window fills with material that’s no longer relevant. Three techniques — tool search, context editing, and compaction — keep long agents running reliably and cheaply. Here’s how to use them, and how they combine with prompt caching.

Why long agents overflow

The window doesn’t fill with your prompt — it fills with accumulation:

Past a certain point you either truncate (losing information) or pay to carry dead weight every single step. The fix is to manage what’s in the window deliberately.

1. Tool search: load tools on demand

If your agent has 30 tools but uses 3 per task, defining all 30 on every request wastes input tokens and clutters the model’s choices. Tool search lets the agent discover and load tools when it needs them instead of carrying the full catalog upfront.

This alone can dramatically shrink the fixed overhead of a tool-heavy agent.

2. Context editing: remove stale tool results

Once a tool result has been used, it often doesn’t need to stay verbatim in the window. Context editing removes or condenses stale tool_result blocks while preserving the reasoning that depended on them.

The art is removing bytes without removing meaning — keep the conclusions, drop the raw data.

3. Compaction: summarize and continue

When the window is genuinely near full, compaction summarizes the conversation so far into a compact form and continues from there. Instead of hitting a hard wall, the agent carries forward a distilled memory.

Compaction is what lets an agent run effectively “forever” instead of dying when it hits the context ceiling.

Combining them with prompt caching

These techniques interact with caching, and the order matters:

Used together, you get a long agent that holds only what’s relevant, reads its stable prefix from cache cheaply, and never falls off the end of its window.

Putting it together

A robust long-agent loop looks like:

  1. Load a small core toolset; expose the rest via tool search.
  2. After each step, context-edit away the raw tool result you no longer need.
  3. Monitor window usage; when it crosses your threshold, compact.
  4. Keep the cacheable prefix immutable so cache hits stay high.

Cost angle

All three techniques reduce input tokens, which compounds with the rate you pay per token. Running the agent through a discounted gateway like AI Prime Tech — same Claude models, up to 80% off official pricing, one key across Opus 4.8, Sonnet 4.6 and Haiku 4.5 — means a well-managed long agent costs a small fraction of the naive version that carries its full history every step. Lean context plus a lean rate is how you run agents at scale without the bill scaling with them.

Get cheaper Claude API access

One API key for Claude Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5, plus GPT & Gemini — up to 80% off official pricing, pay-as-you-go.

Get Your API Key →
AI Prime Tech is an independent third-party API gateway. Claude™ and Anthropic® are trademarks of Anthropic, PBC. No affiliation or endorsement is implied.