AI & Development · 8 min read

We Burned Through 100% of Our Claude Code Quota in Two Hours. Here’s What We Found.

Claude Code is, in our experience, the best AI coding tool available today.
Anthropic's models are genuinely excellent — the reasoning is sharp, the context
window is enormous, and when the tooling works correctly, it's transformative.

So when our Max 5x plan started burning through 100% of its 5-hour quota in
under two hours of routine work, we started investigating. What we found was
a set of cache management bugs that made prompt caching — the mechanism that's
supposed to keep costs sane at large context sizes — essentially non-functional
in common usage patterns.

This is Part 1 of a six-part series covering what we found, what we built to
fix it, and what it taught us about the economics of AI tooling. The data is
ours, from our own sessions. The fixes were built by us and a handful of other
users in the community.

What "100% in Two Hours" Actually Looks Like

We first noticed the problem on a resumed Claude Code session — a 605,800-token
conversation that should have been routine. The session had been working fine
before we paused. After resuming, the quota meter started climbing at a rate
we'd never seen.

Here's what the data showed: every single API call after the resume was
rebuilding the entire conversation from scratch. The cache — which is supposed
to let the API recognize "you've seen this prefix before, don't charge me for
it again" — was being invalidated on every turn.

The numbers were stark:

Metric                     Expected (cache working)    Actual (cache busted)
Cache read per turn        ~605,000 tokens             ~14,500 tokens
Cache creation per turn    ~0 tokens                   ~605,000 tokens
Quota consumed per turn    ~2-3%                       ~26-28%

That 14,500 tokens of cache read? That was just the system prompt and tool
definitions surviving. Everything else — the entire conversation history — was
being rebuilt from scratch every time the model spoke.

At 26-28% per turn, four turns burn through your entire 5-hour quota. On a
conversation that was already cached before you hit resume.
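Those per-turn figures fall straight out of the API's usage block. A minimal sketch of the hit-rate math, using the Messages API's real usage field names with example token counts modeled on the turn described above:

```javascript
// Cache hit rate for one API call, computed from the Anthropic Messages
// API `usage` object (cache_read_input_tokens, cache_creation_input_tokens,
// and input_tokens are real response fields; the values below are examples).
function cacheHitRate(usage) {
  const read = usage.cache_read_input_tokens ?? 0;
  const created = usage.cache_creation_input_tokens ?? 0;
  const fresh = usage.input_tokens ?? 0;
  const total = read + created + fresh;
  return total === 0 ? 0 : read / total;
}

// A healthy resumed turn: nearly the whole 605K prefix is read from cache.
const healthy = {
  cache_read_input_tokens: 605_000,
  cache_creation_input_tokens: 500,
  input_tokens: 300,
};

// A busted turn: only the ~14.5K system prompt and tool schemas survive,
// and the rest of the conversation is rewritten to cache.
const busted = {
  cache_read_input_tokens: 14_500,
  cache_creation_input_tokens: 605_000,
  input_tokens: 300,
};

console.log(cacheHitRate(healthy).toFixed(3)); // "0.999"
console.log(cacheHitRate(busted).toFixed(3));  // "0.023"
```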

The 1M Context Scaling Problem

This isn't just a nuisance at small context sizes. It's a financial cliff at
large ones.

At 200K context, a single cache miss recreates ~200K tokens. Noticeable, maybe
2-3% of your quota. Annoying but survivable.

At 600K-1M context — which is where serious multi-file work lives — a single
cache miss can consume 14% or more of your quota in one turn. A bug that
costs you $0.05 at 200K costs you $0.25 at 1M. And it compounds: each
uncached turn creates new cache entries that will themselves be invalidated on
the next turn.

We measured a 30-minute window where adding an MCP server configuration
mid-session triggered a cascade:
5.25 million tokens of cache creation at an 87% bust rate. That's 35+
consecutive API calls where the cache was being rebuilt almost entirely from
scratch.

The Cost, Quantified

We run a multi-agent weather simulation system (kanfei-nowcast) alongside our
Claude Code development work. This gave us a useful comparison: same codebase,
same agent, different days — though the sessions differed in duration and model
mix, so we'll normalize to Sonnet-hours for a fair comparison.

Metric                  Apr 3 (over quota)    Apr 5 (under quota)
Sonnet hours            10                    8
Sonnet cost             $33.30                $13.39
Cost per Sonnet-hour    $3.33                 $1.67
Cache TTL               5 min                 1 hour
Cache hit rate          5.3%                  19.7%

2x per-hour cost difference on the same model, same codebase.

And that's the average. The real story is in the steady-state hours. The Apr 5
session — running with our fix deployed and quota under control — had 6 of 8
Sonnet hours with zero cache writes. The cache was working exactly as
designed: write once, read many. By hours 5 through 8, hit rates exceeded 50%
and hourly costs dropped below $1.00 — compared to Apr 3's steady $3.33/hr.

In steady state, the effective cost difference was closer to 3x.

It's Not Just Big Sessions

You might think this is a large-context problem — something that only hits
users running 600K+ token sessions. It's not.

While writing this series, our web management agent — a lightweight WordPress
blog session running Opus 4.6 at roughly 29,000 tokens of context —
accidentally ran through the default npm shim instead of our patched wrapper.
This gave us a clean comparison on the same account, same day.

Within the session, caching worked fine. Calls 1 through 20 showed 90-99%
cache hit rates. Then the session restarted. The very next API call:

Metric            Call #20 (before restart)    Call #21 (after restart)
Cache creation    2,386 tokens                 14,353 tokens
Cache read        22,634 tokens                11,269 tokens
Hit rate          90.5%                        44.0%

Cache read fell back to 11,269 — the base system prompt. Everything above
that had to be rebuilt because the prefix changed on restart. The hit rate
recovered by call #25, but those 4 calls had already burned 57,000 tokens of
unnecessary cache creation.

On a 29k session running Opus, those four calls cost about $0.57 at 1h TTL
cache-write rates ($10/MTok). Trivial. But the waste scales linearly with
context, since every restart forces at least one full rewrite of the live
context:

Context Size                  Wasted Per Restart (1h TTL)    Wasted Per Restart (5m TTL)
29k (this session)            $0.29                          $0.18
200k (typical dev session)    $2.00                          $1.25
600k (large session)          $6.00                          $3.75
1M (max context)              $10.00                         $6.25

A developer restarting 3-4 times a day with a 200k context on Opus is
burning $6-8/day in pure cache waste — before writing a line of code.
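The table rows reduce to one line of arithmetic: per-restart waste is context size times the cache-write rate. A small sketch reproducing both TTL columns, using the Opus cache-write prices quoted above:

```javascript
// Per-restart cache waste: a restart forces the whole context to be
// rewritten to cache, so waste = context tokens x write rate.
// Rates are the Opus cache-write prices quoted in the text:
// $10/MTok at 1h TTL, $6.25/MTok at 5m TTL.
const WRITE_RATE_PER_MTOK = { "1h": 10.0, "5m": 6.25 };

function restartWasteUSD(contextTokens, ttl) {
  return (contextTokens / 1e6) * WRITE_RATE_PER_MTOK[ttl];
}

for (const tokens of [29_000, 200_000, 600_000, 1_000_000]) {
  console.log(
    tokens,
    restartWasteUSD(tokens, "1h").toFixed(2), // 0.29, 2.00, 6.00, 10.00
    restartWasteUSD(tokens, "5m").toFixed(2)  // 0.18, 1.25, 3.75, 6.25
  );
}
```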

What $100 Buys You

When users reported these cost anomalies, Anthropic issued $100 Extra Usage
credits to some affected accounts.

Let's put $100 in context. At Opus rates ($5/MTok input, $25/MTok output),
with the cache bug active, $100 covers roughly 12 Opus turns at 600K
context. That's maybe 30 minutes of active development work.

The credit refills the bucket. The hole is still there.

Not Just Us

This wasn't an isolated experience. GitHub issues #34629, #40524, and
#42052 collected reports from Max plan users across the board, all
describing the same pattern of unexplained, rapid quota consumption.

The community started comparing notes. @jmarianski ran the Claude Code binary
through Ghidra to reverse-engineer its cache management. @VictorSun92 identified
the exact code path where resumed sessions broke the cache prefix. @Renvect
found that pasted images were persisting across sessions, carrying hundreds of
thousands of tokens of base64 data on every API call. @RebelSyntax confirmed
the invalidation pattern independently.

These aren't casual users filing vague complaints. These are developers reading
source code, capturing API response headers, and building diagnostic tools to
understand what's happening inside a product they're paying for.

What Was Actually Broken

We'll get into the full technical detail in Part 2, but the short version:
Claude Code's prompt caching relies on prefix stability — the beginning of
each API request needs to be identical to the previous one for the cache to
recognize it. Three separate bugs were breaking that stability:

  1. Resume block scatter: On --resume, internal metadata blocks (hooks,
    skills, MCP tool definitions) were being placed at the end of the
    conversation instead of the beginning, breaking the prefix match.

  2. Fingerprint instability: A version fingerprint embedded in the system
    prompt was being computed from metadata content that changes between turns,
    causing the system prompt itself to differ across calls.

  3. Non-deterministic tool ordering: Tool definitions were being sent in a
    different order between calls, breaking the cache at the tool schema level.

Each of these independently busts the cache. Together, they made prompt caching
essentially non-functional in many common usage patterns — particularly resumed
sessions, which is how most developers use Claude Code.
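To make prefix stability concrete, here is a minimal sketch (the names and structure are illustrative, not Claude Code internals) of what a deterministic cacheable prefix looks like, and how the third bug, non-deterministic tool ordering, would bust it:

```javascript
// Sketch: the cacheable prefix of a Messages API request is everything
// that must be byte-identical between consecutive calls, i.e. the system
// prompt and the tool schemas. (Field names here are illustrative.)
function canonicalPrefix(request) {
  // Fix for bug #3: sort tool definitions so their order is deterministic.
  const tools = [...(request.tools ?? [])].sort((x, y) =>
    x.name.localeCompare(y.name)
  );
  return JSON.stringify({ system: request.system, tools });
}

// Two requests that differ only in tool order now share a prefix:
const a = { system: "You are Claude Code.", tools: [{ name: "Bash" }, { name: "Read" }] };
const b = { system: "You are Claude Code.", tools: [{ name: "Read" }, { name: "Bash" }] };
console.log(canonicalPrefix(a) === canonicalPrefix(b)); // true

// Any real change to the prefix itself still (correctly) busts the cache:
const c = { system: "You are Claude Code. v2", tools: a.tools };
console.log(canonicalPrefix(a) === canonicalPrefix(c)); // false
```

The same idea explains bugs #1 and #2: anything that mutates this serialized prefix between calls, whether scattered metadata blocks or an unstable fingerprint, forces a full cache rewrite.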

What Comes Next

In Part 2, we'll walk through how Claude Code's
prompt caching actually works, why prefix stability matters, and exactly how
each of these bugs breaks it — with code references and API response data.

In Part 3, we'll cover the fetch interceptor we built to
fix all three bugs at the network layer, without modifying Claude Code itself.
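The shape of that interceptor can be sketched in a few lines. This is an illustration only, not our actual implementation: `normalizeBody` stands in for the full set of prefix fixes, and here it just pins tool order:

```javascript
// Normalize a Messages API request body so its cacheable prefix is stable.
// Illustrative sketch; the real interceptor applies more fixes than this.
function normalizeBody(bodyString) {
  const body = JSON.parse(bodyString);
  if (Array.isArray(body.tools)) {
    body.tools.sort((x, y) => x.name.localeCompare(y.name));
  }
  return JSON.stringify(body);
}

// Wrap the platform fetch so request bodies are rewritten in flight;
// Claude Code itself is never modified.
function withStablePrefix(baseFetch) {
  return (url, options = {}) => {
    if (typeof options.body === "string" && String(url).includes("/v1/messages")) {
      try {
        options = { ...options, body: normalizeBody(options.body) };
      } catch {
        // Body was not JSON; pass it through untouched.
      }
    }
    return baseFetch(url, options);
  };
}

// Installed once at process startup:
// globalThis.fetch = withStablePrefix(globalThis.fetch);
```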

The short version of the rest of the series: we found the bugs, we fixed them
ourselves, we discovered an undocumented pricing mechanism that makes the
problem even worse, and we quantified exactly how much it all costs.

Anthropic builds excellent models. But the tooling around those models has gaps
that are costing users real money — and the only reason we know how much is that
we went looking.


This is Part 1 of a six-part series on Claude Code's cache management. Next:
Part 2 — The Cache Architecture.

The investigation was conducted on a Max 5x plan account running multi-agent
workloads. All cost figures are from the Anthropic Admin Usage API. The
community work referenced here spans GitHub issues #34629, #40524, and #42052.

Updated Apr 13, 2026: Corrected the "Wasted Per Restart" table. Original
values overstated per-restart waste; recalculated using Opus 4.6 cache write
rates (1h: $10/MTok, 5m: $6.25/MTok) per
Anthropic's published pricing.
Telemetry-derived cost figures (Sonnet-hour comparisons) were unaffected.


Veritas Supera IT Solutions (VSITS LLC) builds AI-augmented systems for technical teams. If your organization is working with AI tooling and running into problems like these, let’s talk.