In Part 1, we showed that Claude Code
sessions were burning through quota at 26-28% per turn — because the prompt
cache was being rebuilt from scratch on every API call. But why?
To answer that, we need to understand how Claude Code structures its API
requests and where prompt caching fits in.
The Cache Hierarchy
Every Claude Code API call sends a layered payload to the Messages API. The
layers, in order:
- Tools — the full schema of every available tool (Read, Write, Bash, Grep, MCP tools, etc.)
- System prompt — instructions, attribution header (including a `cc_version` fingerprint), environment info
- Messages — the full conversation history: user messages, assistant responses, tool calls, tool results
Anthropic's prompt caching works on prefix matching. The server compares
the incoming request against previously cached requests. If the beginning of
the request — the "prefix" — is byte-identical to a cached version, the server
can skip re-processing those tokens and charge a reduced rate.
The critical word is prefix. The match starts at the top (tools) and works
down. If the tools are identical, the server checks the system prompt. If that
matches too, it checks the messages. The moment any byte differs, everything
from that point forward is a cache miss.
This means changes cascade downward:
| What Changed | What Gets Cache-Busted |
|---|---|
| A tool definition | Tools + System + All messages |
| The system prompt | System + All messages |
| A message in the conversation | That message + all subsequent messages |
A single reordered tool definition at the top of the hierarchy invalidates the
cache for the entire request — system prompt, conversation history, everything.
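The cascade in the table above can be sketched as a top-down prefix walk. A minimal illustration in TypeScript (the layer contents are invented strings, not real Claude Code payloads; the real comparison happens server-side over tokens):

```typescript
// Sketch: how a change in one layer invalidates everything below it.
// Layer contents are illustrative strings, not real Claude Code payloads.
type CachedRequest = { tools: string; system: string; messages: string[] };

// Walk the layers top-down; return the layers that can be served from cache.
function cacheHits(cached: CachedRequest, incoming: CachedRequest): string[] {
  const hits: string[] = [];
  if (incoming.tools !== cached.tools) return hits;   // tools differ: total miss
  hits.push("tools");
  if (incoming.system !== cached.system) return hits; // system differs: all messages rebuilt
  hits.push("system");
  for (let i = 0; i < incoming.messages.length; i++) {
    if (cached.messages[i] !== incoming.messages[i]) break;
    hits.push(`messages[${i}]`);
  }
  return hits;
}

const cached = { tools: "Read,Write,Bash", system: "v1", messages: ["hi", "ok"] };

// Reordering the tool list busts every layer, not just tools:
cacheHits(cached, { ...cached, tools: "Write,Read,Bash" }); // → []

// Editing only the last message keeps everything above it cached:
cacheHits(cached, { ...cached, messages: ["hi", "edited"] }); // → ["tools", "system", "messages[0]"]
```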
Cache Breakpoints and TTL
Claude Code places cache_control markers at strategic points in the request
to tell the API "cache up to here." The API supports two TTL tiers:
- `ephemeral` — 5-minute TTL (the default)
- `ephemeral` with `ttl: "1h"` — 1-hour TTL (available on some plans)
The 1-hour TTL is significant for agentic workflows. A developer who pauses to
read docs, review a diff, or take a coffee break easily exceeds 5 minutes
between turns. With the default 5-minute TTL, every pause longer than 5 minutes
means a full cache rebuild on the next turn.
Claude Code requests ttl: "1h" for tools and system prompt blocks. Message
history uses the default 5-minute TTL. This is a reasonable design — the tools
and system prompt are stable across turns and worth caching longer, while
messages change every turn.
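A sketch of what that breakpoint placement looks like in a request body. The field shapes follow the public Messages API; the exact placement Claude Code uses is this article's reading of its traffic, not an official spec, and the model name and prompt text here are illustrative:

```typescript
// Sketch of a Messages API payload with cache breakpoints placed as
// described above: 1-hour TTL on the stable layers, default 5-minute TTL
// on message history.
const body = {
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  tools: [
    {
      name: "Read",
      description: "Read a file from disk",
      input_schema: { type: "object", properties: {} },
      // marker on the last tool: "cache everything up to here" for 1 hour
      cache_control: { type: "ephemeral", ttl: "1h" },
    },
  ],
  system: [
    {
      type: "text",
      text: "You are Claude Code...",
      cache_control: { type: "ephemeral", ttl: "1h" }, // stable across turns
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "fix the failing test",
          cache_control: { type: "ephemeral" }, // default 5-minute TTL
        },
      ],
    },
  ],
};
```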
The cache is also org-scoped, not connection-scoped. If two Claude Code
sessions in the same org send the same prefix within the TTL window, the second
one gets a cache hit. This means cross-process cache sharing works — if the
prefix is byte-identical.
That "if" is where things break.
Bug 1: Resume Block Scatter
When you start a fresh Claude Code session, the first user message
(messages[0]) contains a structured set of metadata blocks:
```javascript
messages[0].content = [
  { text: "<system-reminder>deferred tools...</system-reminder>" },
  { text: "<system-reminder># MCP Server Instructions...</system-reminder>" },
  { text: "<system-reminder>skills listing...</system-reminder>" },
  { text: "<system-reminder>hooks output...</system-reminder>" },
  { text: "the actual user prompt" }
]
```
These blocks — deferred tools, MCP configuration, skills, hooks — are part of
the cache prefix. On a fresh session, they're always in messages[0], always
in the same order. The cache sees the same prefix every turn. Everything works.
On --resume, the blocks scatter. Instead of landing in messages[0],
some or all of them end up in later user messages — sometimes the last user
message, sometimes spread across multiple messages. The exact placement depends
on the Claude Code version and the session state.
The result: messages[0] has a different structure than what's cached. Prefix
match fails. The entire conversation history — potentially hundreds of
thousands of tokens — gets rebuilt from scratch.
This was tracked in GitHub issue #34629.
@VictorSun92 identified the exact code path and proposed relocating the blocks
back to messages[0]. @jmarianski ran the Claude Code binary through Ghidra
to reverse-engineer additional caching mechanisms in the native Zig layer,
discovering a cch sentinel that added another dimension to the problem.
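The scatter is straightforward to detect from outside the binary. A diagnostic sketch (the message shapes are simplified, and the tag-prefix heuristic is ours, not Claude Code's):

```typescript
// Diagnostic sketch: find <system-reminder> metadata blocks that landed
// outside messages[0] after a resume.
type Block = { text?: string };
type Msg = { role: "user" | "assistant"; content: Block[] };

function scatteredReminderBlocks(messages: Msg[]): number[] {
  const scattered: number[] = [];
  messages.forEach((msg, i) => {
    if (i === 0 || msg.role !== "user") return; // blocks belong in messages[0]
    if (msg.content.some(b => b.text?.startsWith("<system-reminder>"))) {
      scattered.push(i);
    }
  });
  return scattered;
}
```

On a fresh session this returns an empty array; after a buggy resume, the indices it returns are exactly the messages whose changed structure breaks the prefix match.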
What the numbers look like
On our 605,800-token session:
- Before resume: `cache_read: ~605,000` / `cache_creation: ~0` — prefix matches, the cache serves the whole conversation
- After resume: `cache_read: ~14,500` / `cache_creation: ~605,000` — only the tools and system prompt survive; everything else rebuilds
That 14,500 tokens of cache read was the ceiling — it represented only the
tool schemas and system prompt, the layers above the broken messages prefix.
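To put a number on that asymmetry, a back-of-envelope sketch using the commonly published pricing multipliers (cache reads at roughly 0.1x the base input rate, 5-minute cache writes at roughly 1.25x; treat both as assumptions and check current pricing before relying on the result):

```typescript
// Back-of-envelope: the cost of one turn, in base-input-token equivalents.
// The multipliers are assumed, not taken from this article's billing data.
const READ_MULT = 0.1;   // assumed: cache read ≈ 0.1x base input rate
const WRITE_MULT = 1.25; // assumed: 5-minute cache write ≈ 1.25x base input rate

function turnCostUnits(cacheRead: number, cacheCreation: number): number {
  return cacheRead * READ_MULT + cacheCreation * WRITE_MULT;
}

const healthy = turnCostUnits(605_000, 0);      // 60,500 token-equivalents
const busted = turnCostUnits(14_500, 605_000);  // 757,700 token-equivalents
const ratio = busted / healthy;                 // roughly 12.5x per turn
```

Under these assumptions, a single resume-busted turn costs about 12.5 times a healthy cached turn, and the penalty repeats every turn until the cache is rebuilt.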
Bug 2: Fingerprint Instability
Claude Code embeds a version fingerprint in the system prompt's attribution
header:
```
x-anthropic-billing-header: cc_version=2.1.87.a3f
```
That a3f suffix is a 3-character hex hash computed by
extractFirstMessageText() in src/utils/fingerprint.ts. The function grabs
content from messages[0] and hashes it with a salt and specific character
indices:
```
SHA256(SALT + msg[4] + msg[7] + msg[20] + version)[:3]
```
(This algorithm is reconstructed from source analysis of the npm package.
The salt, character indices, and hash truncation length are from the code;
Anthropic could change any of these without notice.)
The problem: extractFirstMessageText() doesn't filter for real user text. It
grabs whatever is in messages[0] — including the synthetic metadata blocks
(skills, MCP, hooks). When those blocks change between turns — a tool gets
added, an MCP server reconnects, a skill reloads — the text at indices 4, 7,
and 20 changes. The hash changes. The cc_version string changes. The system
prompt changes. Cache bust.
This is particularly insidious because the system prompt sits in the middle
of the cache hierarchy. A fingerprint change doesn't just bust the system
prompt cache — it busts every message below it too.
This was tracked in GitHub issue #40524.
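The scheme can be reconstructed as a short function. The sampled indices (4, 7, 20) and the 3-character truncation come from the source analysis above; the `SALT` value here is a placeholder, not the real one:

```typescript
import { createHash } from "node:crypto";

// Sketch of the fingerprint computation described above.
// SALT is a placeholder; the real value lives in Claude Code's source.
const SALT = "placeholder-salt";
const INDICES = [4, 7, 20];

function fingerprint(firstMessageText: string, version: string): string {
  const sampled = INDICES.map(i => firstMessageText[i] ?? "").join("");
  return createHash("sha256")
    .update(SALT + sampled + version)
    .digest("hex")
    .slice(0, 3);
}

// Deterministic for a stable messages[0], but any metadata churn that
// alters the characters at the sampled indices yields a new suffix,
// a new cc_version string, and therefore a system-prompt cache bust.
const suffix = fingerprint("<system-reminder>deferred tools...", "2.1.87");
```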
Bug 3: Non-Deterministic Tool Ordering
Tool definitions are sent as an array at the top of every API request. If the
order of tools changes between calls — which happens when MCP servers
reconnect, tools are dynamically registered, or internal iteration order
varies — the tools layer changes. Since tools are at the top of the cache
hierarchy, this cascades and invalidates everything below: system prompt and
all messages.
Anthropic acknowledged tool schema instability as a bug in the v2.1.88
changelog. But the fix was incomplete — the ordering issue persisted in
subsequent versions.
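The fix this bug calls for is small. A sketch of deterministic ordering, applied before the request is serialized (the tool shape is simplified; this is the obvious remedy, not Anthropic's actual patch):

```typescript
// Sketch: sort the tool array by name before serializing, so MCP
// reconnects or registration order can't change the byte prefix.
type Tool = { name: string; description?: string };

function stableToolOrder(tools: Tool[]): Tool[] {
  // a fixed locale keeps the sort reproducible across environments
  return [...tools].sort((a, b) => a.name.localeCompare(b.name, "en"));
}

stableToolOrder([{ name: "Write" }, { name: "Bash" }, { name: "Read" }])
  .map(t => t.name); // → ["Bash", "Read", "Write"]
```

Because the sort copies the array rather than mutating it, it can be applied at the last moment before serialization without disturbing whatever order the runtime keeps internally.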
The Compound Effect
Each bug independently breaks the cache. But in practice, they often fire
together:
- You resume a session → blocks scatter (Bug 1)
- The scattered blocks change the fingerprint → system prompt changes (Bug 2)
- An MCP server reconnects during resume → tools reorder (Bug 3)
The result is a triple cache bust: tools, system, and messages all miss. On a
600K-token conversation, that's 600K tokens of cache_creation — charged at
the write rate — on a single turn. Every turn.
We measured a 30-minute window where a settings change triggered this cascade:
5.25 million tokens of cache creation at an 87% bust rate across 35+
consecutive API calls.
The Zig Layer Complication
One more wrinkle: the standalone Claude Code binary (the ELF download) has a
Zig/Bun native HTTP layer that bypasses Node.js entirely. @jmarianski's Ghidra
analysis revealed a cch sentinel replacement mechanism in this layer — an
additional caching behavior that isn't present in the npm package.
This matters because any fix that operates at the Node.js fetch level — which
is where community fixes naturally land — won't work on the standalone binary.
The npm package (@anthropic-ai/claude-code) avoids this issue because it runs
on standard Node.js without the native attestation layer.
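Concretely, "a fix at the Node.js fetch level" means wrapping `globalThis.fetch` so every outgoing Messages API body can be normalized before it leaves the process. A sketch (`normalizeBody` is a placeholder for whatever rewriting a fix applies; the actual interceptor is Part 3's topic):

```typescript
// Sketch: intercept outgoing Messages API requests at the fetch level.
// normalizeBody is a placeholder; a real fix would sort tools, relocate
// scattered blocks, and pin the fingerprint here.
function normalizeBody(body: unknown): unknown {
  return body;
}

const originalFetch = globalThis.fetch;

globalThis.fetch = async (
  input: Parameters<typeof fetch>[0],
  init?: Parameters<typeof fetch>[1],
) => {
  if (typeof init?.body === "string" && String(input).includes("/v1/messages")) {
    const parsed = JSON.parse(init.body);
    init = { ...init, body: JSON.stringify(normalizeBody(parsed)) };
  }
  return originalFetch(input, init);
};
```

The standalone binary never reaches this wrapper: its Zig/Bun HTTP layer issues requests below the JavaScript runtime, which is why a fetch-level fix only helps the npm package.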
What This Means
The prompt caching architecture is actually well-designed in principle. Prefix
matching with TTL-based invalidation is a sound approach for conversational AI.
The problems are all in the implementation details: metadata blocks that don't
stay put, a fingerprint computed from the wrong input, tool arrays that aren't
sorted.
These aren't deep architectural flaws. They're integration bugs — the kind that
fall through the cracks when features are developed independently and
integration testing doesn't cover the cross-cutting interactions.
In Part 3, we'll walk through the fetch interceptor we
built to fix all three bugs at the network layer — without modifying Claude
Code itself.
This is Part 2 of a six-part series on Claude Code's cache management. Previous:
Part 1 — The Problem. Next:
Part 3 — The Community Fix.
Published by Veritas Supera IT Solutions — we build AI-augmented systems for technical teams. Dealing with unexplained AI tooling costs? Let’s talk.