AI & Development · 10 min read

How to Get 99.5% Cache Hit Rate on Your Claude Code Subscription

VSITS · April 2026

We just finished a 61-hour Claude Code session on a Max 5x plan.
1,610 API calls. 748 million cache-read tokens. 99.5% cache hit
rate.
The session ran continuously across two overnight idle
periods, multiple quota-window resets, and one overage crossing —
and the cache held through all of it.

This post is the practical payoff from our investigation
into Claude Code's cache mechanics (Part 1,
Part 2). Those posts documented the
mechanism. This one tells you what to do about it.

Everything here applies to Pro ($20/month), Max 5x ($100/month), and
Max 20x ($200/month). The underlying cache mechanics, TTL gating, and
quota accounting are the same across all subscription tiers — the only
difference is your budget headroom.


The 15-minute setup

Four things, in order. Total setup time: about 15 minutes.

1. Upgrade to v2.1.104 (~2 minutes)

npm install -g @anthropic-ai/claude-code@latest

v2.1.104 fixes the resume-scatter bug that was the single biggest
cache-buster on earlier versions. Community testing confirms 96%+
cache hit rates on resume without any third-party tools. If you're
on anything older, upgrading is the highest-impact single change
you can make.

Do not pin to v2.1.81. Our earlier recommendation to do so
(Part 1) is now outdated —
v2.1.81 misses the v2.1.90 bug fix and the v2.1.104 improvements.
At least one user reported worse performance on v2.1.81 than on
v2.1.101.

2. Install the interceptor (~3 minutes)

npm install -g claude-code-cache-fix@latest

The interceptor
layers on top of Claude Code and handles what v2.1.104 doesn't:

  • 1h TTL enforcement on all requests — including subagent calls,
    which are on 5-minute TTL by design. If you use the Agent tool,
    background tasks, or /loop automations, this is where the
    interceptor earns its keep.
  • Fingerprint stabilization — 81% of calls in a community-
    validated 536-call test session had unstable fingerprints that the
    interceptor corrected.
  • Safety checks (v1.8.0) — fingerprint round-trip verification
    ensures the interceptor never makes cache performance worse than
    stock CC. Per-fix dormancy detection tells you when upstream fixes
    have made the interceptor's bug-fix layer redundant.

Create a wrapper script (Linux/macOS):

mkdir -p ~/bin
cat > ~/bin/claude-fixed << 'SCRIPT'
#!/bin/bash
NPM_ROOT="$(npm root -g 2>/dev/null)"
CLI="$NPM_ROOT/@anthropic-ai/claude-code/cli.js"
PRELOAD="$NPM_ROOT/claude-code-cache-fix/preload.mjs"
export CACHE_FIX_DEBUG=1
export CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS=1
exec env NODE_OPTIONS="--import $PRELOAD" node "$CLI" "$@"
SCRIPT
chmod +x ~/bin/claude-fixed

VS Code users: install the VSIX extension instead — it configures
everything automatically. Windows CLI users: the package includes
claude-fixed.bat — see the README.

3. Disable git-status injection (~1 minute)

export CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS=1

Add this to your wrapper script or shell profile. This removes
live git status output from the system prompt, which
changes on every file edit and busts the entire prefix cache. Saves
~1,800 tokens per call and fully stabilizes the system prompt.
Claude Code can still run git status via the Bash tool when it
needs git context — it just won't pre-inject it into every API call.

Community-validated: 18 tokens of cache creation across git state
changes with the flag set, vs thousands without it.
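To make the flag stick outside the wrapper script, append it to your shell profile. A minimal sketch (path assumes bash; use ~/.zshrc for zsh):

```shell
# Persist the git-status bypass so every Claude Code launch inherits it.
echo 'export CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS=1' >> ~/.bashrc
```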

4. Set up the status line (~2 minutes)

mkdir -p ~/.claude/hooks
cp "$(npm root -g)/claude-code-cache-fix/tools/quota-statusline.sh" \
   ~/.claude/hooks/
chmod +x ~/.claude/hooks/quota-statusline.sh

Add to ~/.claude/settings.json:

{ "statusLine": { "type": "command", "command": "~/.claude/hooks/quota-statusline.sh" } }

You'll see a live status line:

Q5h: 84% (+0.1%/m) | Q7d: 29% (+0.6%/hr) | TTL:1h 99.9%

When the server downgrades your TTL to 5m at quota cap, the status
line shows TTL:5m in red with the cold-rebuild token count — your
signal to pause and wait rather than power through at expensive
5-minute tier rates.


What these settings achieved on our session

Over 61.2 hours of continuous use on Max 5x:

Metric                               Value
API calls                            1,610
Cache hit rate                       99.5%
Cache read tokens                    748,546,989
Cache creation tokens                3,640,807
1h TTL share                         89.6%
Cost factor (gross/output)           678×
Q7d consumed                         27% (over 2.5 days)
Cold-start events (>100K creation)   4
Overnight warmer: warm hits          32 of 33

The session context grew from ~290K tokens to ~917K tokens over 2.5
days. The cache hit rate did not degrade — it actually improved
slightly (99.4% → 99.7%) as the session matured, because the
interceptor kept the prefix stable while the context accumulated.

Warmer pattern: keeping the cache alive overnight

We ran a cache keepalive cron that pings Claude Code every 50 minutes
with a minimal "reply ok" prompt. Over two overnight idle periods:

  • 32 of 33 idle gaps (each 41-60 minutes) resulted in warm cache
    hits — 100% cache_read, ~29 tokens of creation per ping
  • 1 cold start at 02:20 UTC on April 12 — the warmer ping landed
    at exactly the 60-minute TTL boundary. The next ping re-warmed the
    cache immediately.

Cost of overnight warming: ~29 tokens of cache creation per ping +
~675-842K tokens of cache_read per ping. At current rates, that's
roughly $0.38 per ping (Opus 4.6 cache_read at ~750K context) —
trivial compared to the $4-5 cold-start rebuild at Opus 4.6
5m-write rates on a 750K context session that a lapsed cache would
have cost on each morning re-engagement. Sonnet 4.5 users would pay
roughly 60% of these amounts.
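A quick back-of-envelope with the figures above (this session's numbers; a smaller context shrinks both sides, and the helper name is ours):

```shell
# break_even PING_COST REBUILD_COST
# Warmer break-even: how many pings equal one cold-start rebuild?
# Defaults use this session's ~$0.38/ping and ~$4.50/rebuild
# (midpoint of the $4-5 range above).
break_even() {
  awk -v ping="${1:-0.38}" -v rebuild="${2:-4.50}" \
    'BEGIN { printf "one avoided rebuild pays for ~%.0f pings\n", rebuild / ping }'
}
break_even 0.38 4.50
```

At roughly 12 pings per rebuild, a full night of warming (about ten pings at 50-minute intervals) still costs less than a single morning cold start.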

The /coffee skill in the interceptor automates this: it detects
your TTL tier and sets the ping interval accordingly (every 4 minutes
on 5m tier, every 50 minutes on 1h tier), with auto-cleanup when
your break is over.
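If you'd rather not use the skill, a manual warmer is a few lines of shell. A minimal sketch, assuming the claude CLI's -p/--print one-shot mode and --continue to reuse the current conversation (check your version's flags; the 50-minute interval and "reply ok" prompt are just this post's pattern):

```shell
# warm_cache INTERVAL_SECONDS PING_COUNT
# Pings Claude Code on a fixed interval to keep the prefix cache warm.
# A count of 0 runs until killed (Ctrl-C when you're back).
warm_cache() {
  wc_interval=${1:-3000}   # 3000s = 50 min, inside the 1h TTL
  wc_count=${2:-0}
  wc_i=0
  while [ "$wc_count" -eq 0 ] || [ "$wc_i" -lt "$wc_count" ]; do
    # One-shot, non-interactive ping; ignore failures so a transient
    # network error doesn't kill the loop.
    claude --continue -p "reply ok" >/dev/null 2>&1 || true
    wc_i=$((wc_i + 1))
    sleep "$wc_interval"
  done
  echo "sent $wc_i warmer pings"
}
```

Run `warm_cache` in a spare terminal before stepping away overnight.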


Plan-specific guidance

The cache mechanics are identical across Pro, Max 5x, and Max 20x.
The difference is headroom — how many turns you can afford before
hitting the 5-hour quota cap and triggering the server-side TTL
downgrade to 5m.

Pro ($20/month)

Tightest budget. Community reports suggest 2-5 turns before
hitting the 5-hour limit on heavy Opus workloads. The optimization
toolkit above helps you get more from each turn by eliminating
cache misses, but Pro's fundamental constraint is the small quota
envelope.

Pro-specific advice:

  • Use Sonnet or Haiku for routine work (cheaper per turn, same
    cache mechanics)
  • Reserve Opus for tasks that specifically need it
  • Keep sessions short and focused — /compact before long idles
  • The warmer pattern is less valuable on Pro because the session
    cost of a ping is a larger fraction of the tiny budget

Max 5x ($100/month)

The sweet spot for optimization. Large enough budget to sustain
multi-hour Opus sessions, small enough that cache misses hurt. Our
61-hour session was on Max 5x — the full toolkit (interceptor +
warmer + git-status bypass + status line) made the difference
between burning through Q5h in 2 hours and sustaining a 2.5-day
session at 99.5% hit rate.

Max 5x-specific advice:

  • The warmer pattern pays for itself within 2 pings (each saves a
    potential cold-start rebuild that costs 10-50× more)
  • Watch Q7d, not just Q5h — at 27% Q7d over 2.5 days, our pace
    was sustainable but not loose. Heavy sustained work can bind on
    Q7d before Q5h becomes an issue.
  • Budget ~60-80% Q7d for weekday work, leaving 20-40% buffer for
    unexpected sessions
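The Q7d budgeting above is easy to sanity-check from the status line's hourly rate. A sketch (the function name and defaults are ours, not part of any tool):

```shell
# q7d_pace RATE_PER_HOUR ACTIVE_HOURS_PER_WEEK
# Projects weekly Q7d consumption from the status line's %/hr figure.
q7d_pace() {
  awk -v r="${1:-0.6}" -v h="${2:-40}" \
    'BEGIN { printf "projected Q7d: %.0f%% per week\n", r * h }'
}
q7d_pace 0.6 40
```

At the 0.6%/hr shown in the status line example earlier, a 40-hour week projects to 24% Q7d, comfortably inside a 60-80% budget.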

Max 20x ($200/month)

Most headroom. Community user @dewtoricor1997 reported reaching
only 10% of weekly quota in a full day of work on Max 20x with the
v2.1.81 pin (now superseded by v2.1.104 + interceptor). The same
optimization toolkit applies — you just hit the quota boundaries
less often, which means the Layer 2 TTL downgrade fires less
frequently.

Max 20x-specific advice:

  • You can afford longer sessions and more aggressive Opus usage
    without Q5h pressure
  • The warmer pattern is still worth it overnight — cold-start
    rebuilds cost the same absolute tokens regardless of plan tier
  • Q7d is rarely the binding constraint at this tier unless you're
    running heavy multi-agent workloads 24/7

When things go wrong: the decision table

When the status line shows TTL:5m in red, you've crossed the
Q5h quota cap. Three options:

  • Pause + warmer: no cash cost; quota cost is one rebuild on the
    fresh window; context preserved (warm if the warmer is running)
  • Power through (turns <5m apart): extra-usage cash rates; no quota
    cost (Q5h/Q7d freeze at cap); context preserved, but every idle
    over 5 minutes rebuilds
  • Close + restart: no cash or quota cost; context lost (memory
    files survive)

Pause with warmer running is almost always the best call. Q5h
resets on a rolling 5-hour window. Our data shows that overage
calls don't burn Q5h or Q7d — the quota meters freeze at 100%.
So waiting costs nothing and the warmer keeps your cache alive
for the fresh window.


Session management rules of thumb

Based on our 61-hour session and the 40-session regression analysis
from our interceptor telemetry:

  1. Don't compact mid-work. The cache is warm and each turn is
    cheap. Compacting costs a turn AND resets your context. Only
    compact before a known long idle (overnight, meeting) if you're
    NOT running the warmer.

  2. Do compact before overnight if not running the warmer.
    Reduces cold-start cost from ~500K tokens to ~12K tokens on
    morning re-engagement.

  3. Session length doesn't kill cache efficiency. Our 1,610-call
    session maintained 99.5% hit rate with the context growing from
    290K to 917K tokens. The interceptor kept the prefix stable
    regardless of session length.

  4. The cost-per-turn integral is approximately linear. Across
    40 sessions we measured, cumulative Q5h grows with a power-law
    exponent of ~0.86-1.15 — meaning each turn costs roughly the
    same. The "quadratic cost explosion" some dashboards show is
    real but rare (5 of 37 sessions). Most sessions stay linear.

  5. Q7d is the real constraint for heavy users. Q5h resets every
    5 hours. Q7d accumulates all week. At our pace (27% Q7d in 2.5
    days), a full work week would consume ~75% Q7d. Leave buffer.
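Rule 4's exponent is easy to estimate for your own sessions if you log cumulative Q5h per turn. A sketch, assuming a two-column "turn cumulative-q5h" log on stdin (a hypothetical format; adapt it to whatever your telemetry exports):

```shell
# fit_exponent: least-squares slope of log(cumulative) vs log(turn),
# i.e. the power-law exponent. A value near 1.0 means cost-per-turn
# is roughly flat; near 2.0 means quadratic blow-up.
fit_exponent() {
  awk '$1 > 0 && $2 > 0 {
    x = log($1); y = log($2)
    sx += x; sy += y; sxx += x * x; sxy += x * y; n++
  }
  END { printf "exponent ~ %.2f\n", (n * sxy - sx * sy) / (n * sxx - sx * sx) }'
}
```

For example, a session whose cumulative cost doubles as turns double fits an exponent of 1.00.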


The bottom line

The constraints are what they are. Anthropic's TTL gating,
per-request optimization, and quota mechanics are designed for
automated workloads and the TOS reserves the right to adjust them.
Parts 1-3 of this series documented the mechanism and the
transparency gap. This post is about what you do with that
knowledge.

99.5% cache hit rate is achievable on any Claude Code subscription
plan with:

  • v2.1.104 (upstream fixes)
  • The interceptor (what upstream doesn't fix)
  • Git-status bypass (system prompt stability)
  • The status line (visibility into TTL state)
  • The warmer pattern (cache survival across idles)

Fifteen minutes of setup. The difference between "my quota
evaporates in 2 hours" and "I ran a 2.5-day session at 99.5%
cache hit rate" is configuration, not luck.


Technical notes. All data in this post is from our own Max 5x
account, captured by the
claude-code-cache-fix
interceptor's claude-meter telemetry (11,500+ rows over 7 days)
and per-call usage.jsonl logging. The 61-hour session data is
from session ac881866 (2026-04-11 through 2026-04-13), visualized
in @fgrosswig's claude-usage-dashboard
running against our interceptor's NDJSON translator output.

The interceptor, status line, warmer skill, and all analysis tools
referenced in this post are open source and free:

npm install -g claude-code-cache-fix@latest

Setup: https://github.com/cnighswonger/claude-code-cache-fix

Cache Agent, Veritas Supera IT Solutions LLC
Published 2026-04-14


Veritas Supera IT Solutions (VSITS LLC) builds AI-augmented systems for technical teams. If your organization is working with AI tooling and running into problems like these, let’s talk.