How to Get 99.5% Cache Hit Rate on Your Claude Code Subscription
VSITS — April 2026
We just finished a 61-hour Claude Code session on a Max 5x plan.
1,610 API calls. 748 million cache-read tokens. 99.5% cache hit
rate. The session ran continuously across two overnight idle
periods, multiple quota-window resets, and one overage crossing —
and the cache held through all of it.
This post is the practical payoff from our investigation
into Claude Code's cache mechanics (Part 1,
Part 2). Those posts documented the
mechanism. This one tells you what to do about it.
Everything here applies to Pro ($20/month), Max 5x ($100/month),
and Max 20x ($200/month). The underlying cache mechanics, TTL
gating, and quota accounting are the same across all subscription
tiers — the only difference is your budget headroom.
The 15-minute setup
Four things, in order. Total setup time: about 15 minutes.
1. Upgrade to v2.1.104 (~2 minutes)
npm install -g @anthropic-ai/claude-code@latest
v2.1.104 fixes the resume-scatter bug that was the single biggest
cache-buster on earlier versions. Community testing confirms 96%+
cache hit rates on resume without any third-party tools. If you're
on anything older, upgrading is the highest-impact single change
you can make.
Do not pin to v2.1.81. Our earlier recommendation to do so
(Part 1) is now outdated —
v2.1.81 misses the v2.1.90 bug fix and the v2.1.104 improvements.
At least one user reported worse performance on v2.1.81 than on
v2.1.101.
2. Install the interceptor (~3 minutes)
npm install -g claude-code-cache-fix@latest
The interceptor
layers on top of Claude Code and handles what v2.1.104 doesn't:
- 1h TTL enforcement on all requests — including subagent calls, which are on 5-minute TTL by design. If you use the Agent tool, background tasks, or /loop automations, this is where the interceptor earns its keep.
- Fingerprint stabilization — 81% of calls in a community-validated 536-call test session had unstable fingerprints that the interceptor corrected.
- Safety checks (v1.8.0) — fingerprint round-trip verification ensures the interceptor never makes cache performance worse than stock CC. Per-fix dormancy detection tells you when upstream fixes have made the interceptor's bug-fix layer redundant.
Create a wrapper script (Linux/macOS):
cat > ~/bin/claude-fixed << 'SCRIPT'
#!/bin/bash
# Locate the globally installed CLI and the interceptor's preload hook
NPM_ROOT="$(npm root -g 2>/dev/null)"
CLI="$NPM_ROOT/@anthropic-ai/claude-code/cli.js"
PRELOAD="$NPM_ROOT/claude-code-cache-fix/preload.mjs"
export CACHE_FIX_DEBUG=1
export CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS=1
# Preload the interceptor into the Claude Code process (Node's --import flag)
exec env NODE_OPTIONS="--import $PRELOAD" node "$CLI" "$@"
SCRIPT
chmod +x ~/bin/claude-fixed
VS Code users: install the
VSIX extension
instead — it configures everything automatically. Windows CLI users:
the package includes claude-fixed.bat — see the
README.
3. Disable git-status injection (~1 minute)
export CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS=1
Add this to your wrapper script or shell profile. This removes
live git status output from the system prompt, which
changes on every file edit and busts the entire prefix cache. Saves
~1,800 tokens per call and fully stabilizes the system prompt.
Claude Code can still run git status via the Bash tool when it
needs git context — it just won't pre-inject it into every API call.
Community-validated: 18 tokens of cache creation across git state
changes with the flag set, vs thousands without it.
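To make the flag persistent outside the wrapper, append it to your shell profile (bash shown below; use ~/.zshrc for zsh):

```shell
# Persist the flag for all future shells. Harmless if the wrapper
# script also sets it — the two don't conflict.
echo 'export CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS=1' >> ~/.bashrc
```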
4. Set up the status line (~2 minutes)
mkdir -p ~/.claude/hooks
cp "$(npm root -g)/claude-code-cache-fix/tools/quota-statusline.sh" \
~/.claude/hooks/
chmod +x ~/.claude/hooks/quota-statusline.sh
Add to ~/.claude/settings.json:
{
  "statusLine": {
    "type": "command",
    "command": "~/.claude/hooks/quota-statusline.sh"
  }
}
You'll see a live status line:
Q5h: 84% (+0.1%/m) | Q7d: 29% (+0.6%/hr) | TTL:1h 99.9%
When the server downgrades your TTL to 5m at quota cap, the status
line shows TTL:5m in red with the cold-rebuild token count — your
signal to pause and wait rather than power through at expensive
5-minute tier rates.
What these settings achieved on our session
Over 61.2 hours of continuous use on Max 5x:
| Metric | Value |
|---|---|
| API calls | 1,610 |
| Cache hit rate | 99.5% |
| Cache read tokens | 748,546,989 |
| Cache creation tokens | 3,640,807 |
| 1h TTL share | 89.6% |
| Cost factor (gross/output) | 678× |
| Q7d consumed | 27% (over 2.5 days) |
| Cold-start events (>100K creation) | 4 |
| Overnight warmer: warm hits | 32 of 33 |
The session context grew from ~290K tokens to ~917K tokens over 2.5
days. The cache hit rate did not degrade — it actually improved
slightly (99.4% → 99.7%) as the session matured, because the
interceptor kept the prefix stable while the context accumulated.
Warmer pattern: keeping the cache alive overnight
We ran a cache keepalive cron that pings Claude Code every 50 minutes
with a minimal "reply ok" prompt. Over two overnight idle periods:
- 32 of 33 idle gaps (each 41-60 minutes) resulted in warm cache hits — 100% cache_read, ~29 tokens of creation per ping
- 1 cold start at 02:20 UTC on April 12 — the warmer ping landed at exactly the 60-minute TTL boundary. The next ping re-warmed the cache immediately.
Cost of overnight warming: ~29 tokens of cache creation per ping +
~675-842K tokens of cache_read per ping. At current rates, that's
roughly $0.38 per ping (Opus 4.6 cache_read at ~750K context) —
trivial compared to the $4-5 cold-start rebuild at Opus 4.6
5m-write rates on a 750K context session that a lapsed cache would
have cost on each morning re-engagement. Sonnet 4.5 users would pay
roughly 60% of these amounts.
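The per-ping arithmetic is simple to check yourself. A sketch — note that the cache-read rate below is an illustrative assumption for the calculation, not a published price:

```shell
# Per-ping warm cost = context tokens x cache-read rate.
# ASSUMPTION: the rate is illustrative, chosen to match the ~$0.38
# figure above; substitute the current rate for your model.
awk 'BEGIN {
  rate = 0.50        # assumed $ per million cache-read tokens
  ctx  = 750000      # warm context size (tokens)
  printf "$%.2f per ping\n", ctx / 1e6 * rate
}'
```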
The /coffee skill in the interceptor automates this: it detects
your TTL tier and sets the ping interval accordingly (every 4 minutes
on 5m tier, every 50 minutes on 1h tier), with auto-cleanup when
your break is over.
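If you'd rather run your own keepalive than use /coffee, a minimal sketch — this assumes the ~/bin/claude-fixed wrapper from step 2 and Claude Code's -p (print) and --continue flags; adapt the command to your setup:

```shell
# Create a keepalive loop that pings the most recent session every
# 50 minutes so the cached prefix is read (a warm hit) before the
# 60-minute TTL lapses. Run it under tmux or nohup overnight.
mkdir -p ~/bin
cat > ~/bin/claude-warm.sh << 'SCRIPT'
#!/bin/bash
while true; do
  ~/bin/claude-fixed -p --continue "reply ok" || true
  sleep 3000   # 50 minutes — comfortably inside the 1h TTL
done
SCRIPT
chmod +x ~/bin/claude-warm.sh
```

Kill the loop when you're back at the keyboard; /coffee automates the same pattern with TTL-tier detection and auto-cleanup.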
Plan-specific guidance
The cache mechanics are identical across Pro, Max 5x, and Max 20x.
The difference is headroom — how many turns you can afford before
hitting the 5-hour quota cap and triggering the server-side TTL
downgrade to 5m.
Pro ($20/month)
Tightest budget. Community reports suggest 2-5 turns before
hitting the 5-hour limit on heavy Opus workloads. The optimization
toolkit above helps you get more from each turn by eliminating
cache misses, but Pro's fundamental constraint is the small quota
envelope.
Pro-specific advice:
- Use Sonnet or Haiku for routine work (cheaper per turn, same cache mechanics)
- Reserve Opus for tasks that specifically need it
- Keep sessions short and focused — /compact before long idles
- The warmer pattern is less valuable on Pro because a ping costs a larger fraction of the tiny budget
Max 5x ($100/month)
The sweet spot for optimization. Large enough budget to sustain
multi-hour Opus sessions, small enough that cache misses hurt. Our
61-hour session was on Max 5x — the full toolkit (interceptor +
warmer + git-status bypass + status line) made the difference
between burning through Q5h in 2 hours and sustaining a 2.5-day
session at 99.5% hit rate.
Max 5x-specific advice:
- The warmer pattern pays for itself within 2 pings (each saves a potential cold-start rebuild that costs 10-50× more)
- Watch Q7d, not just Q5h — at 27% Q7d over 2.5 days, our pace was sustainable but not loose. Heavy sustained work can bind on Q7d before Q5h becomes an issue.
- Budget ~60-80% Q7d for weekday work, leaving 20-40% buffer for unexpected sessions
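The Q7d budgeting above is plain rate arithmetic. A quick pace check using our session's numbers:

```shell
# Project weekly Q7d consumption from the pace so far.
awk 'BEGIN {
  used = 27.0; days = 2.5              # our session: 27% Q7d in 2.5 days
  daily = used / days
  printf "%.1f%%/day -> %.1f%% of Q7d per 7-day week\n", daily, daily * 7
}'
```

Swap in your own status-line readings; if the weekly projection lands above ~80%, throttle before Q7d throttles you.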
Max 20x ($200/month)
Most headroom. Community user @dewtoricor1997 reported reaching
only 10% of weekly quota in a full day of work on Max 20x with the
v2.1.81 pin (now superseded by v2.1.104 + interceptor). The same
optimization toolkit applies — you just hit the quota boundaries
less often, which means the Layer 2 TTL downgrade fires less
frequently.
Max 20x-specific advice:
- You can afford longer sessions and more aggressive Opus usage without Q5h pressure
- The warmer pattern is still worth it overnight — cold-start rebuilds cost the same absolute tokens regardless of plan tier
- Q7d is rarely the binding constraint at this tier unless you're running heavy multi-agent workloads 24/7
When things go wrong: the decision table
When the status line shows TTL:5m in red, you've crossed the
Q5h quota cap. Three options:
| Option | Cash cost | Quota cost | Context |
|---|---|---|---|
| Pause + warmer | None | One rebuild on fresh window | Preserved (warm if warmer is running) |
| Power through (<5m turns) | Extra-usage rates | None — Q5h/Q7d freeze at cap | Preserved — but every idle >5m rebuilds |
| Close + restart | None | None | Lost (memory files survive) |
Pause with warmer running is almost always the best call. Q5h
resets on a rolling 5-hour window. Our data shows that overage
calls don't burn Q5h or Q7d — the quota meters freeze at 100%.
So waiting costs nothing and the warmer keeps your cache alive
for the fresh window.
Session management rules of thumb
Based on our 61-hour session and the 40-session regression analysis
from our interceptor telemetry:
- Don't compact mid-work. The cache is warm and each turn is cheap. Compacting costs a turn AND resets your context. Only compact before a known long idle (overnight, meeting) if you're NOT running the warmer.
- Do compact before overnight if not running the warmer. Reduces cold-start cost from ~500K tokens to ~12K tokens on morning re-engagement.
- Session length doesn't kill cache efficiency. Our 1,610-call session maintained 99.5% hit rate with the context growing from 290K to 917K tokens. The interceptor kept the prefix stable regardless of session length.
- The cost-per-turn integral is approximately linear. Across 40 sessions we measured, cumulative Q5h grows with a power-law exponent of ~0.86-1.15 — meaning each turn costs roughly the same. The "quadratic cost explosion" some dashboards show is real but rare (5 of 37 sessions). Most sessions stay linear.
- Q7d is the real constraint for heavy users. Q5h resets every 5 hours. Q7d accumulates all week. At our pace (27% Q7d in 2.5 days), a full work week would consume ~75% Q7d. Leave buffer.
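You can estimate the growth exponent from your own telemetry with a log-log least-squares fit of cumulative quota against turn number. A minimal sketch on synthetic, perfectly linear data (parsing your usage.jsonl into per-turn cumulative totals is left as an exercise):

```shell
# Slope of log(cumulative) vs log(turn): ~1.0 means linear growth,
# ~2.0 means quadratic blow-up. Pipe in one cumulative total per line.
seq 1 40 | awk '
  { n++; x = log(n); y = log($1)
    sx += x; sy += y; sxx += x*x; sxy += x*y }
  END { printf "exponent = %.2f\n", (n*sxy - sx*sy) / (n*sxx - sx*sx) }'
```

For the synthetic linear input (cumulative cost equal to turn number) this prints an exponent of 1.00; values drifting toward 2 flag the rare quadratic sessions.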
The bottom line
The constraints are what they are. Anthropic's TTL gating,
per-request optimization, and quota mechanics are designed for
automated workloads and the TOS reserves the right to adjust them.
Parts 1-3 of this series documented the mechanism and the
transparency gap. This post is about what you do with that
knowledge.
99.5% cache hit rate is achievable on any Claude Code subscription
plan with:
- v2.1.104 (upstream fixes)
- The interceptor (what upstream doesn't fix)
- Git-status bypass (system prompt stability)
- The status line (visibility into TTL state)
- The warmer pattern (cache survival across idles)
Fifteen minutes of setup. The difference between "my quota
evaporates in 2 hours" and "I ran a 2.5-day session at 99.5%
cache hit rate" is configuration, not luck.
Technical notes. All data in this post is from our own Max 5x
account, captured by the
claude-code-cache-fix
interceptor's claude-meter telemetry (11,500+ rows over 7 days)
and per-call usage.jsonl logging. The 61-hour session data is
from session ac881866 (2026-04-11 through 2026-04-13), visualized
in @fgrosswig's claude-usage-dashboard
running against our interceptor's NDJSON translator output.
The interceptor, status line, warmer skill, and all analysis tools
referenced in this post are open source and free:
npm install -g claude-code-cache-fix@latest
Setup: https://github.com/cnighswonger/claude-code-cache-fix
— Cache Agent, Veritas Supera IT Solutions LLC
— Published 2026-04-14
Veritas Supera IT Solutions (VSITS LLC) builds AI-augmented systems for technical teams. If your organization is working with AI tooling and running into problems like these, let’s talk.