In Part 2, we mapped out three bugs that break
Claude Code's prompt caching: resume block scatter, fingerprint instability,
and non-deterministic tool ordering. Each one independently busts the cache.
Together, they made caching essentially non-functional.
We couldn't patch Claude Code directly — it's a distributed binary/npm package
that updates frequently. We needed a fix that sits outside the application,
survives upgrades, and can be iterated independently.
The answer: a Node.js fetch interceptor loaded via NODE_OPTIONS="--import".
The Architecture
Node.js supports preload modules — scripts that execute before the main
application starts. By setting NODE_OPTIONS="--import ~/.claude/cache-fix-preload.mjs",
we inject code that overrides globalThis.fetch before Claude Code makes its
first API call.
The interceptor watches for requests to Anthropic's /v1/messages endpoint.
When it sees one, it parses the JSON payload, applies fixes, re-serializes,
and passes the modified request to the original fetch. The response flows
back unmodified.
const _origFetch = globalThis.fetch;
globalThis.fetch = async function (url, options) {
  const urlStr = typeof url === "string" ? url : url?.url || String(url);
  const isMessagesEndpoint =
    urlStr.includes("/v1/messages") &&
    !urlStr.includes("batches") &&
    !urlStr.includes("count_tokens");
  if (isMessagesEndpoint && options?.body) {
    const payload = JSON.parse(options.body);
    // ... apply fixes ...
    options = { ...options, body: JSON.stringify(payload) };
  }
  return _origFetch.apply(this, [url, options]);
};
A wrapper script at ~/bin/claude handles the loading:
#!/bin/bash
NPM_ROOT="$(npm root -g 2>/dev/null)"
CLAUDE_NPM_CLI="$NPM_ROOT/@anthropic-ai/claude-code/cli.js"
CACHE_FIX="$NPM_ROOT/claude-code-cache-fix/preload.mjs"
export CACHE_FIX_DEBUG=1
export CACHE_FIX_IMAGE_KEEP_LAST=3
exec env NODE_OPTIONS="--import $CACHE_FIX" node "$CLAUDE_NPM_CLI" "$@"
This routes through the npm-installed Claude Code rather than the standalone
binary. That's deliberate — the standalone ELF binary has a Zig-level native
HTTP layer that bypasses globalThis.fetch entirely (see Part 2's discussion
of the cch sentinel). The npm package uses standard Node.js networking, so
our interceptor catches every API call.
Fix 1: Resume Block Relocation
The core fix for Bug 1. On every API call, the interceptor scans the entire
messages array backwards to find the latest instance of each relocatable block
type (skills, MCP, deferred tools, hooks). It removes them from wherever
they've scattered and prepends them to messages[0] in deterministic order.
// Scan ALL user messages in reverse to collect the LATEST
// version of each block type (firstUserIdx is the index of the
// first user message in the array)
const found = new Map();
for (let i = messages.length - 1; i >= firstUserIdx; i--) {
  const msg = messages[i];
  if (msg.role !== "user" || !Array.isArray(msg.content)) continue;
  for (let j = msg.content.length - 1; j >= 0; j--) {
    const block = msg.content[j];
    const text = block.text || "";
    if (!isRelocatableBlock(text)) continue;
    let blockType;
    if (isSkillsBlock(text)) blockType = "skills";
    else if (isMcpBlock(text)) blockType = "mcp";
    else if (isDeferredToolsBlock(text)) blockType = "deferred";
    else if (isHooksBlock(text)) blockType = "hooks";
    else continue;
    if (!found.has(blockType)) {
      found.set(blockType, { ...block });
    }
  }
}
The original community fix by @VictorSun92 only checked the last user message.
This worked on the first call after resume, but broke on subsequent turns
because in-memory state wasn't updated — the blocks would appear to be in a
middle message on the next call, and the prefix would change again.
Our version scans the full array every time, making it idempotent across calls.
The block order is fixed to match fresh session layout:
deferred → mcp → skills → hooks.
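The collection pass pairs with a reinsertion pass that enforces that order. A minimal sketch of that second half, assuming the `found` Map from the scan above and an `isRelocatableBlock` predicate (passed in here so the snippet is self-contained):

```javascript
// Fixed order matching a fresh session's layout.
const BLOCK_ORDER = ["deferred", "mcp", "skills", "hooks"];

// Remove every relocatable block from every user message, then
// prepend the latest version of each (collected in `found`) to
// messages[0] in deterministic order.
function relocateBlocks(messages, found, isRelocatableBlock) {
  for (const msg of messages) {
    if (msg.role !== "user" || !Array.isArray(msg.content)) continue;
    msg.content = msg.content.filter(
      (b) => !isRelocatableBlock(b.text || "")
    );
  }
  const ordered = BLOCK_ORDER
    .filter((type) => found.has(type))
    .map((type) => found.get(type));
  messages[0].content = [...ordered, ...messages[0].content];
}
```

Because the function rebuilds messages[0] from scratch on every call, running it twice produces the same prefix as running it once.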
The partial scatter problem
We discovered that v2.1.89+ introduced a subtler failure mode: blocks could
partially scatter, with some remaining in messages[0] while others drifted
to later messages. The original fix had an early bail-out — if it saw any
relocatable blocks in messages[0], it assumed everything was fine. Our
version checks whether blocks exist outside messages[0] before deciding
whether relocation is needed.
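That check can be expressed as a small predicate. A sketch, again taking the `isRelocatableBlock` helper as a parameter for self-containment:

```javascript
// Relocation is needed if ANY relocatable block lives outside
// messages[0] — the presence of some blocks in messages[0] alone
// is not enough to conclude the prefix is intact.
function needsRelocation(messages, isRelocatableBlock) {
  for (let i = 1; i < messages.length; i++) {
    const msg = messages[i];
    if (msg.role !== "user" || !Array.isArray(msg.content)) continue;
    if (msg.content.some((b) => isRelocatableBlock(b.text || ""))) {
      return true;
    }
  }
  return false;
}
```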
Fix 2: Fingerprint Stabilization
The cc_version fingerprint in the attribution header is computed from
character indices in messages[0] content. Since messages[0] contains
metadata blocks that change between turns, the fingerprint is unstable.
Our fix extracts the real user message text — the first text block that isn't a
<system-reminder> — and recomputes the hash from that stable input:
function stabilizeFingerprint(system, messages) {
  // Find attribution header in system blocks
  const attrBlock = system.find(b =>
    b.text?.includes("x-anthropic-billing-header:")
  );
  if (!attrBlock) return;

  // Extract cc_version=X.Y.Z.abc
  const fullVersion = attrBlock.text.match(/cc_version=([^;]+)/)[1];
  const baseVersion = fullVersion.split(".").slice(0, 3).join(".");

  // Recompute fingerprint from REAL user text, not meta blocks
  const realText = extractRealUserMessageText(messages);
  const stableFingerprint = computeFingerprint(realText, baseVersion);

  // Replace in attribution header
  attrBlock.text = attrBlock.text.replace(
    `cc_version=${fullVersion}`,
    `cc_version=${baseVersion}.${stableFingerprint}`
  );
}
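The `extractRealUserMessageText` helper referenced above is straightforward. A sketch of one plausible implementation — the real interceptor's matching logic may differ in details:

```javascript
// Return the first text block in the first user message that is
// NOT a <system-reminder> metadata block — i.e. the text the user
// actually typed. Returns "" if no such block exists.
function extractRealUserMessageText(messages) {
  const firstUser = messages.find((m) => m.role === "user");
  if (!firstUser || !Array.isArray(firstUser.content)) return "";
  const real = firstUser.content.find(
    (b) =>
      b.type === "text" && !(b.text || "").includes("<system-reminder>")
  );
  return real ? real.text : "";
}
```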
The fingerprint computation matches the original algorithm — SHA256(SALT + msg[4] + msg[7] + msg[20] + version)[:3] — but feeds it the right input.
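A sketch of that computation in Node.js. `SALT` here is a placeholder — the actual salt value comes from Claude Code's own source and is not reproduced here — and the index guards are our own defensive addition:

```javascript
import { createHash } from "node:crypto";

// Placeholder — the real salt must be extracted from Claude Code.
const SALT = "<salt-extracted-from-claude-code>";

// SHA256(SALT + chars at indices 4, 7, 20 of the stable input text
// + base version), truncated to the first 3 hex characters.
function computeFingerprint(text, version) {
  const input =
    SALT +
    (text[4] ?? "") +
    (text[7] ?? "") +
    (text[20] ?? "") +
    version;
  return createHash("sha256").update(input).digest("hex").slice(0, 3);
}
```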
Fix 3: Tool Schema Stabilization
A one-liner with outsized impact: sort tool definitions by name.
function stabilizeToolOrder(tools) {
  return [...tools].sort((a, b) => a.name.localeCompare(b.name));
}
Tools sit at the top of the cache hierarchy. A single reorder cascades and
invalidates everything below. Sorting guarantees deterministic ordering
regardless of MCP reconnection timing or internal iteration order.
Beyond the Original Bugs
Once the interceptor framework was in place, we added several enhancements
that address cache stability issues beyond the three core bugs:
Skills block sorting
Skill entries within the skills <system-reminder> block can arrive in
non-deterministic order. We split on entry boundaries, sort, and rejoin:
function sortSkillsBlock(text) {
  // Capture groups: (1) header up to blank line, (2) dash-prefixed entries, (3) closing tag
  const match =
    text.match(/^([\s\S]*?\n\n)(- [\s\S]+?)(\n<\/system-reminder>\s*)$/);
  if (!match) return text; // leave unrecognized layouts untouched
  const [, header, entriesText, footer] = match;
  const entries = entriesText.split(/\n(?=- )/);
  entries.sort();
  return header + entries.join("\n") + footer;
}
Session knowledge stripping
Hooks blocks can contain <session_knowledge> tags with ephemeral content
that differs between sessions. Since this content is inside a block that's
part of the cache prefix, it causes busts on resume. The fix strips it:
function stripSessionKnowledge(text) {
  return text.replace(
    /\n<session_knowledge[^>]*>[\s\S]*?<\/session_knowledge>/g,
    ""
  );
}
Prefix snapshot diffing
For diagnosing new cache busts, the interceptor captures a snapshot of the
first 5 messages plus hashes of the tools and system prompt on every call.
On the first call after a process restart, it diffs against the previous
snapshot and writes a report. This is how we identified several of the
subtler instability sources.
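A minimal sketch of the snapshot-and-diff mechanism. The digest length, field names, and diff format here are simplified illustrations, not the interceptor's exact report schema:

```javascript
import { createHash } from "node:crypto";

// Hash a JSON-serializable value to a short stable digest.
function digest(value) {
  return createHash("sha256")
    .update(JSON.stringify(value))
    .digest("hex")
    .slice(0, 12);
}

// Capture a compact snapshot of the cacheable prefix: hashes of the
// tools array, the system prompt, and the first 5 messages.
function captureSnapshot(payload) {
  return {
    tools: digest(payload.tools ?? []),
    system: digest(payload.system ?? []),
    messages: (payload.messages ?? []).slice(0, 5).map(digest),
  };
}

// Diff two snapshots; returns the names of the components that changed.
function diffSnapshots(prev, curr) {
  const changed = [];
  if (prev.tools !== curr.tools) changed.push("tools");
  if (prev.system !== curr.system) changed.push("system");
  curr.messages.forEach((h, i) => {
    if (prev.messages[i] !== h) changed.push(`messages[${i}]`);
  });
  return changed;
}
```

Hashing rather than storing the raw prefix keeps the snapshot file small while still pinpointing which level of the cache hierarchy moved.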
Quota monitoring
The interceptor reads anthropic-ratelimit-unified-5h-utilization from
response headers and writes it to ~/.claude/quota-status.json. This gives
hooks and external tools real-time visibility into quota state — which, as
we'll see in Part 4, turns out to control something important.
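A sketch of the header-extraction step. The header name is the one observed in real responses; the status-file schema is our own invention:

```javascript
import { writeFileSync } from "node:fs";

// Read the unified 5h utilization header from an API response and
// persist it for hooks and external tools to poll.
function recordQuota(response, statusPath) {
  const raw = response.headers.get(
    "anthropic-ratelimit-unified-5h-utilization"
  );
  if (raw === null) return null;
  const status = {
    utilization: parseFloat(raw),
    updatedAt: new Date().toISOString(),
  };
  writeFileSync(statusPath, JSON.stringify(status, null, 2));
  return status;
}
```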
Results
With all fixes active, our cache hit ratio across 7,094 API calls was
98.2%. Cache busts occurred only on TTL expiry (idle gaps exceeding the
cache lifetime) — exactly the behavior you'd expect from a working cache.
On a resumed 605K-token session, the per-turn numbers flipped:
| Metric | Without interceptor | With interceptor |
|---|---|---|
| Cache read | ~14,500 tokens | ~605,000 tokens |
| Cache creation | ~605,000 tokens | ~0 tokens |
| Quota per turn | ~26-28% | ~2-3% |
The 21,040-token baseline of cache reads (tools + system prompt on 1h TTL)
survived across process restarts even without our fix. With it, the full
conversation prefix also survived — giving us the cache read numbers the
system was designed to produce.
What We Learned Building This
Interceptors are underrated. The NODE_OPTIONS="--import" pattern gives
you a clean injection point into any Node.js application without forking or
patching. It survives application updates. It composes — you can chain
multiple interceptors. And it's debuggable: our interceptor writes a detailed
log of every fix it applies.
Fix at the boundary. We could have forked Claude Code and patched the
source. Instead, we fixed at the network boundary — the last point before
data leaves the process. This is more resilient to upstream changes and easier
to reason about: the interceptor sees exactly what the API sees.
Idempotency matters. The early versions of the community fix worked on the
first call but drifted on subsequent calls. Making every operation idempotent —
scanning the full array every time, sorting rather than checking order — was
the difference between a fix that worked once and one that held steady across
thousands of calls.
In Part 4, we'll cover a discovery that came directly
from the quota monitoring we built into the interceptor: an undocumented TTL
downgrade that turns the cache economics upside down.
This is Part 3 of a six-part series on Claude Code's cache management. Previous:
Part 2 — The Cache Architecture. Next:
Part 4 — The TTL Discovery.
Veritas Supera IT Solutions (VSITS LLC) builds AI-augmented systems for technical teams. If your organization is working with AI tooling and running into problems like these, let’s talk.