Update (2026-04-13): Our original recommendation to pin Claude Code to v2.1.81 is now outdated. Anthropic has partially fixed the resume-scatter bug in v2.1.90 and further improved cache stability in v2.1.104. Community testing by @VictorSun92 confirms 96%+ cache hit rates on resume with v2.1.104 without the interceptor's scatter fix. At least one user reported that pinning to v2.1.81 made things worse on their Max 5x account.
Updated advice: upgrade to v2.1.104, then add claude-code-cache-fix for what CC still doesn't fix (subagent 1h TTL enforcement, fingerprint stabilization, quota/TTL visibility via the status line). See Part 2 for the full updated guidance and the three-layer gating model that explains what's happening underneath.
The 5-Minute Baseline: What We Found in Claude Code's Tools Array
VSITS LLC — April 2026
This post is a single-day investigation, outside our ongoing Claude Code cache
series. It stands alone because what we found this morning explains the shape
of the quota-drain problem reported across dozens of GitHub threads — not as
a speculative theory, but as a measurement anchored in Anthropic's own
product code. The earlier parts of our series still hold; this post adds
one datum that reframes how to read them.
For readers coming in cold, our cache investigation
series documents the specific cache bugs we've fixed in
the claude-code-cache-fix
interceptor. This post doesn't require you to have read any of it. If you
know what prompt caching is and why it matters for cost on a large-context
API, you have enough background.
The question that kicked this off
On GitHub issue #38335 —
currently the main community thread on Claude Code quota drain — a user named
Hisham-Hussein posted this morning that he'd solved the problem by pinning
Claude Code to version v2.1.81, and that his workflow was back to the
way it felt before March 23rd. Several other users immediately converged on
the same claim: v2.1.81 works, later versions burn quota, something changed
around late March.
The claim was specific enough to be testable. So we tested it.
We installed Claude Code v2.1.81, v2.1.83 (the first release after March 23),
v2.1.90 (a mid-cluster release), and v2.1.101 (today's current) into isolated
npm prefixes and ran identical test calls against each of them under our
interceptor, which logs per-call token usage and per-section prompt sizes to
disk. The goal was narrow: identify what actually changed between v2.1.81
and later releases, and whether the change is in the client, the server, or
both.
What we found partly answers that question and partly doesn't. The part that
is answered has teeth.
The release timeline doesn't align with the regression
The first thing we checked was when the releases actually shipped, because
"pin to v2.1.81 to work around a regression starting March 23" is a very
specific claim and we wanted to see if the dates matched.
| Version | Published (UTC) |
|---|---|
| v2.1.80 | 2026-03-19 22:08 |
| v2.1.81 | 2026-03-20 22:24 |
| *(no release on March 23 — regression start per issue title)* | |
| v2.1.83 | 2026-03-25 06:08 |
| v2.1.84 | 2026-03-26 00:31 |
v2.1.82 never shipped. There is no client release on or near March 23 that
could have introduced a client-side bug starting that day. The pin-to-v2.1.81
workaround is routing users around something, but whatever that something is,
it arrived on March 23rd — a day with no client release.
A regression that starts mid-release-cycle and isn't tied to any client
version has one probable home: the server. A feature flag was toggled, a
quota accounting path was adjusted, a backend gating rule was tightened.
Something that can change underneath the client without a new client
binary.
That's a hypothesis, not a measurement yet. The rest of the investigation
makes it substantially more credible.
Per-version prefix size
For each of the four versions, we fired `claude -p --model haiku` with the
prompt "Reply with exactly: ok" and captured the outgoing API request under
our interceptor. Each version was run twice in succession, and we took the
second call's reading so the full prefix would land as `cache_read` with
`cache_creation=0` — this gives the steady-state prefix size for that version
with no cross-version cache interference.
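The second-call rule can be stated as a small classifier over the `usage` block an Anthropic API response returns. The field names below follow the public Messages API usage block; the helper itself is ours, written for this post, not part of Claude Code or the interceptor:

```javascript
// Classify a logged `usage` object: a reading only counts as the
// steady-state prefix size when nothing was written to the cache on
// that call, i.e. the entire fixed prefix landed as a cache read.
function steadyStatePrefixTokens(usage) {
  const created = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  if (created > 0) {
    // First call after any change: the prefix is being (re)written,
    // so this reading is not steady-state yet.
    return null;
  }
  return read;
}

// Illustrative readings shaped like our v2.1.81 pair of calls:
console.log(steadyStatePrefixTokens({ cache_creation_input_tokens: 26452, cache_read_input_tokens: 0 })); // null
console.log(steadyStatePrefixTokens({ cache_creation_input_tokens: 0, cache_read_input_tokens: 26452 })); // 26452
```

Taking only readings where this returns a number is what keeps the table below free of cache-warmup noise.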
| Version | Prefix tokens | Δ from v2.1.81 |
|---|---|---|
| v2.1.81 | 26,452 | baseline |
| v2.1.83 | 26,617 | +165 |
| v2.1.90 | 26,480 | +28 |
| v2.1.101 | 28,402 | +1,950 |
Two things jump out immediately. Across the releases from v2.1.81 to v2.1.90,
the API prefix grew by 28 tokens — effectively nothing. Whatever drift was
happening was rearranging content, not adding bulk. Then, somewhere between
v2.1.90 and v2.1.101, the prefix grew by 1,950 tokens. That's the real size
jump, and it's surprisingly localized.
Where the growth actually lives
The interceptor breaks down every outgoing API request into three sections
and logs the character count of each: the system prompt, the tool schemas,
and the injected blocks in messages[0] (skills, MCP, deferred tools, hooks).
Here's the per-section breakdown for all four versions:
| Version | system (chars) | tools (chars) | injected (chars) | Δ tools |
|---|---|---|---|---|
| v2.1.81 | 27,568 | 69,048 | 2,161 | — |
| v2.1.83 | 27,759 | 69,558 | 2,161 | +510 |
| v2.1.90 | 27,924 | 68,945 | 1,626 | −613 |
| v2.1.101 | 27,539 | 76,152 | 1,626 | +7,207 |
The system prompt is essentially stable across all four versions (±1.3%).
The injected skills block dropped by about 25% at v2.1.90 — a deliberate
slimming that landed right around the time issue #38335 was escalating,
which suggests Anthropic was actively trying to reduce footprint.
And then the tool schemas grew by 7,207 characters between v2.1.90 and
v2.1.101, the last stretch we measured. That's the size jump from the
previous table, and it's almost entirely confined to one section.
To find out which tools grew, we dumped the full payload.tools array
from each version's outbound API request. The answer is clean.
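Ranking a dumped tools array by serialized size is a one-liner worth showing. The payload shape follows the Messages API `tools` array (`name`, `description`, `input_schema`); the sample data below is illustrative, not a real dump from Claude Code:

```javascript
// Rank the entries of a captured payload.tools array by how many
// characters each contributes to the request, largest first.
function toolSizes(payload) {
  return (payload.tools ?? [])
    .map((t) => ({ name: t.name, chars: JSON.stringify(t).length }))
    .sort((a, b) => b.chars - a.chars);
}

// Minimal illustrative payload with two fake tools; the oversized
// description stands in for a large schema like Monitor's.
const sample = {
  tools: [
    { name: 'Read', description: 'Read a file', input_schema: { type: 'object' } },
    { name: 'Monitor', description: 'x'.repeat(3000), input_schema: { type: 'object' } },
  ],
};
for (const { name, chars } of toolSizes(sample)) {
  console.log(`${chars}\t${name}`);
}
```

Running this against each version's captured request body is how the per-tool numbers in the next section were obtained.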
The two new tools
v2.1.81, v2.1.83, and v2.1.90 each ship 23 tools in every Claude Code API
call. v2.1.101 ships 25 — two new tools were added:
- `Monitor` — 3,447 characters — a background streaming-events tool for
  long-running scripts that emit notifications as they fire.
- `ScheduleWakeup` — 3,168 characters — a self-pacing loop controller
  used when a user invokes `/loop` in dynamic mode without a fixed interval.
Together they account for 6,615 characters of new tool schema, which is
92% of the total +7,207 character tools-section growth. Every v2.1.101 user
is now sending this schema on every API call, whether they invoke these
tools or not, because Claude Code's architecture registers all tools in the
request payload at session start. That's roughly 1,700 extra prefix
tokens of fixed overhead on every turn, for two tools that most users
will never touch.
We want to be fair here: these tools are real features, they do real work,
and Anthropic is presumably shipping them because users asked for longer
autonomous loops and background monitoring. The problem is the delivery
mechanism, not the tools themselves. Tool definitions that ship in the
core tool list are paid-for on every request. A tool that ~5% of users
might use once a day still costs 100% of users ~1,700 tokens on every
single turn they take. MCP servers exist specifically to solve this problem
— users opt in to tool definitions they want, and nothing else carries the
cost. Baking these into core seems like a product choice that could be
revisited.
That's the v2.1.101 story, and it's not the main finding of this post.
The main finding is what's inside ScheduleWakeup's description.
ScheduleWakeup confirms the 5-minute cache TTL baseline
The full description text for the ScheduleWakeup tool, as we dumped it
from v2.1.101's outgoing API request this morning, includes this section
verbatim:
> **Picking delaySeconds**
>
> The Anthropic prompt cache has a 5-minute TTL. Sleeping past 300
> seconds means the next wake-up reads your full conversation context
> uncached — slower and more expensive. So the natural breakpoints:
>
> - Under 5 minutes (60s–270s): cache stays warm. Right for active
>   work — checking a build, polling for state that's about to change,
>   watching a process you just started.
> - 5 minutes to 1 hour (300s–3600s): pay the cache miss. Right when
>   there's no point checking sooner — waiting on something that takes
>   minutes to change, or genuinely idle.
>
> Don't pick 300s. It's the worst-of-both: you pay the cache miss
> without amortizing it. If you're tempted to "wait 5 minutes," either
> drop to 270s (stay in cache) or commit to 1200s+ (one cache miss
> buys a much longer wait). Don't think in round-number minutes —
> think in cache warmth.
This is Anthropic's own product tooling, in production, stating that
the default Claude API prompt cache TTL is 5 minutes, and building
the advice it gives its own agents around that constraint.
For the community threads that have been speculating for three weeks
about what's happening to their quota, this is the datum that ties the
room together. The behavior users have been reporting — cache-warm
conversations suddenly rebuilding from scratch after a few minutes of
idle time — isn't a mysterious regression. It's the default behavior
the design is built around. Anthropic isn't hiding this; they're giving
their own sub-agents step-by-step advice on how to avoid the 300-second
cliff by either staying under 270s or committing past 1,200s.
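One possible reading of that advice, reduced to a decision rule. This is our paraphrase, not Anthropic's implementation; the 1.5x "near the cliff" band and the 30-second safety margin are our own choices:

```javascript
// Snap a requested wake-up delay away from the 300s worst case,
// following the quoted advice: stay under the TTL, or accept the
// miss outright for genuinely long waits.
function adjustDelaySeconds(requested, ttlSeconds = 300) {
  if (requested < ttlSeconds) return requested;              // warm: no change needed
  if (requested <= ttlSeconds * 1.5) return ttlSeconds - 30; // near the cliff: stay in cache instead
  return requested;                                          // long wait: one miss buys the sleep
}

console.log(adjustDelaySeconds(180));  // 180
console.log(adjustDelaySeconds(300));  // 270
console.log(adjustDelaySeconds(1200)); // 1200
```

The interesting part is not the code but that Anthropic's tool description expects its own agents to reason this way on every scheduled wake-up.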
The 1-hour TTL tier exists, and Claude Code can request it, but — per the
analysis that @TigerKay1926 contributed to issue
#42052 — it's
opt-in through a function called should1hCacheTTL() gated by a GrowthBook
feature flag allowlist. If your account isn't on the allowlist, you get the
5-minute default. If your querySource value doesn't match the allowed
patterns, you get the 5-minute default. And if you're on the 5-minute
default, any idle period longer than 300 seconds in your working day —
reading code, answering a Slack, making coffee — causes the next message
to rebuild the entire conversation from scratch.
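The gate, as best we can reconstruct it, looks something like the sketch below. We have not read Anthropic's source; the control flow is inferred from @TigerKay1926's analysis in #42052, and `featureFlagEnabled` stands in for the GrowthBook lookup:

```javascript
// Hedged reconstruction of the client-side 1h-TTL gate described in
// issue #42052. Names and structure are inferred, not decompiled.
const ALLOWED_QUERY_SOURCES = ['repl_main_thread*', 'sdk', 'auto_mode'];

function matchesPattern(value, pattern) {
  // A trailing '*' is treated as a prefix wildcard; anything else
  // must match exactly.
  return pattern.endsWith('*')
    ? value.startsWith(pattern.slice(0, -1))
    : value === pattern;
}

function should1hCacheTTL(querySource, featureFlagEnabled) {
  if (!featureFlagEnabled) return false; // account not on the flag: 5m default
  return ALLOWED_QUERY_SOURCES.some((p) => matchesPattern(querySource, p));
}

// Either gate failing silently drops the session to the 5-minute tier:
console.log(should1hCacheTTL('repl_main_thread_v2', true)); // true
console.log(should1hCacheTTL('subagent', true));            // false
console.log(should1hCacheTTL('sdk', false));                // false
```

Note how fragile this is: a renamed call site that changes the `querySource` string is enough to move an entire class of sessions from the 1-hour tier to the 5-minute one.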
(Part 4 of our cache investigation series — already written and scheduled
for April 14th — covers the should1hCacheTTL() gate mechanism in
substantially more depth than we'll cover it here. Today's post is a
standalone piece built around the ScheduleWakeup confirmation; Part 4 is
the methodical walk-through for readers who want the full client-side
mechanism.)
That is the March 23rd regression, as best we can reconstruct it from
the data we have. Not a client bug that someone can hotfix with a commit.
A server-side tier distinction that now gates what was previously (we
think) everyone's default behavior.
Why v2.1.81 still works
We don't know exactly what makes v2.1.81 exempt from the new gating.
Two credible theories, both consistent with our measurements:
- v2.1.81's client emits a `querySource` value that happens to be on the
  1h allowlist, where later versions emit a value that isn't. The
  allowlist patterns surfaced in @TigerKay1926's analysis are narrow —
  `["repl_main_thread*", "sdk", "auto_mode"]` — so it's easy to imagine a
  client rename that moves a call site from inside to outside the list
  without anyone noticing.
- v2.1.81 predates the `cache_control` field shape the server now uses
  to distinguish tiers, and the server interprets v2.1.81's older
  request shape as implicitly 1-hour by fallback. This is the "old clients
  get grandfathered" theory.
Either theory would mean the fix for v2.1.83+ users is one of:

- Opt into the 1h tier explicitly in the request body — which is what our
  claude-code-cache-fix interceptor does: it forces `ttl: "1h"` into every
  outgoing `cache_control` block regardless of gating;
- Get your account added to the allowlist (mechanism unknown to us);
- Upgrade to the latest version (`npm install -g @anthropic-ai/claude-code@latest`).
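The first option amounts to a single rewrite pass over the outbound request body. This is a minimal sketch of that idea, assuming the tier is controlled purely by stamping `ttl: "1h"` onto existing `cache_control` blocks; it is our illustration, not the claude-code-cache-fix source:

```javascript
// Stamp ttl: "1h" onto every cache_control block in a Messages API
// request body, leaving blocks without cache_control untouched.
function forceOneHourTTL(body) {
  const stamp = (block) => {
    if (block && typeof block === 'object' && block.cache_control) {
      block.cache_control = { ...block.cache_control, ttl: '1h' };
    }
  };
  (Array.isArray(body.system) ? body.system : []).forEach(stamp);
  (body.tools ?? []).forEach(stamp);
  for (const msg of body.messages ?? []) {
    (Array.isArray(msg.content) ? msg.content : []).forEach(stamp);
  }
  return body;
}

const req = {
  system: [{ type: 'text', text: '...', cache_control: { type: 'ephemeral' } }],
  messages: [{ role: 'user', content: [{ type: 'text', text: 'hi' }] }],
};
console.log(forceOneHourTTL(req).system[0].cache_control); // { type: 'ephemeral', ttl: '1h' }
```

Because the rewrite happens on the wire, it works identically across every client version we tested.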
We've tested the first and third of these: the interceptor and the latest
CC version are both fully effective in our environment, and v2.1.104+
includes upstream fixes for resume-scatter and tool ordering. We can't
speak to whether account-level allowlisting is available on request.
A quick clarification on a related misunderstanding that's been
propagating in the thread: the VS Code extension is not a separate code
path. Per the official Claude Code documentation,
the extension is a graphical frontend that spawns the installed CLI
binary as a subprocess via a configurable claudeProcessWrapper setting.
"v2.1.81 on VS Code" is really "the v2.1.81 CLI binary, launched by VS
Code as a child process." Users pinning the CLI directly get identical
behavior without touching VS Code.
What this means if you're affected
The practical decision tree we'd offer an affected Claude Code user on a
Max plan right now:
If your Claude Code sessions include idle periods longer than 5
minutes — and almost all interactive sessions do, between prompt reading
and user response time — and you're on v2.1.83 or later, and your account
isn't on the 1h TTL allowlist, you are paying for full context
rebuilds on most of your turns after the first. The observed quota
drain isn't randomness; it's a consistent product of the default cache
TTL being shorter than the human pauses in your workflow.
Three fixes, all working, none requiring you to wait for Anthropic:
- Upgrade to latest: `npm install -g @anthropic-ai/claude-code@latest`.
  Simplest option. v2.1.104+ includes upstream fixes for resume-scatter
  and tool ordering that significantly improve cache hit rates.
- Run our interceptor with any version of Claude Code from v2.1.81
  forward: `npm install -g claude-code-cache-fix`. It forces `ttl: "1h"`
  into outgoing `cache_control` blocks, bypassing the allowlist gating
  entirely. We ship it as a Node.js preload that layers onto your
  existing `claude` command.
- Both: pin v2.1.81 and run the interceptor. Belt and suspenders;
  this is what we're running today.
Whichever option you pick, the immediate effect is the same — once your
cache actually stays warm between turns, the quota drain stops feeling
like a moving target and starts behaving like normal large-context
pricing. In our own testing this morning across all four CC versions,
every call got the 1-hour tier when run through the interceptor, and
cache hit rates stayed above 99% within an active work block regardless
of which version we tested.
Separately — even on the 1-hour tier, idle periods longer than an hour
still cause cold rebuilds. If your workday includes long idles (meetings,
lunch, focused reading, walking the dog), our /coffee skill is a
Claude Code slash command that fires sub-TTL keepalive pings to hold the
cache warm across those gaps. It detects your current TTL tier and picks
an interval that stays under it (every 4 minutes on 5m TTL, every 50
minutes on 1h TTL), and it auto-cancels when your break is over. We
walked through the reasoning behind it in more detail in the Coffee
Break companion post. /coffee is a complement to the three options
above, not a substitute — it assumes you've already gotten onto the
1-hour tier via one of them.
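The tier-aware interval choice described above is simple enough to state directly. A sketch using the two intervals named in the text; the real /coffee skill's internals may differ:

```javascript
// Pick a keepalive interval that stays comfortably under the detected
// cache TTL: 4 minutes on the 5m tier, 50 minutes on the 1h tier.
function keepaliveIntervalSeconds(ttlSeconds) {
  return ttlSeconds <= 300 ? 240 : 3000;
}

console.log(keepaliveIntervalSeconds(300));  // 240
console.log(keepaliveIntervalSeconds(3600)); // 3000
```

Each ping pays for one tiny request; what it buys is avoiding a full-context rebuild when you come back.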
Charitable reading of Anthropic's position
A word on how we read this from Anthropic's side, because we want to be
straight: we don't think this is a bug Anthropic is hiding. We think it's
a design choice that's working as intended from the company's perspective,
with sharp edges for users whose workflow doesn't match the design's
assumptions.
The 5-minute TTL is a reasonable default for high-throughput automated
workloads where calls fire every few seconds. It's a reasonable default
for a cost-control mechanism that doesn't want to hold expensive cache
entries for users who've walked away from the terminal. The
ScheduleWakeup tool description shows someone at Anthropic sat down
and wrote detailed, specific advice for how to live with the TTL —
that's evidence of thoughtful product engineering around the constraint,
not of a bug nobody noticed.
What doesn't work is the intersection of this default with interactive
developer usage patterns where the human reads, thinks, responds, and
comes back to the terminal. Those gaps are usually longer than 5 minutes
and shorter than 20 minutes, which is exactly the range where Anthropic's
own tooling warns its agents not to schedule wakeups.
The fair question isn't "why did Anthropic break caching" — they didn't.
It's "why did this design get applied to an interactive CLI product
without making the 1-hour tier the default for OAuth-authenticated
human-facing sessions." That's a product question, not a bug report. And
it's the question we hope this post reframes clearly enough that the
community conversation can shift to asking it directly.
What we can't confirm
We want to be explicit about the seams between measurement and hypothesis:
- Measured: the per-version prefix sizes, the per-tool sizes, the
  full `ScheduleWakeup` description text, the release dates.
- Measured: that the v2.1.81 → v2.1.83 client diff is small
  (~500 characters of tool schema changes) and does not contain anything
  that could plausibly cause the drain users report.
- Hypothesis, well-supported: that a server-side change on or near
  March 23rd tightened the 5m/1h TTL tier distinction.
- Hypothesis, well-supported: that v2.1.81 is exempt because its
  client emits something (a `querySource` string, a `cache_control` field
  shape) that the current server treats as implicit 1-hour.
- Not measured: the exact mechanism by which v2.1.81 gets the
  1-hour tier. We would need to run v2.1.81 without our interceptor
  and capture its raw outgoing request bodies to compare against
  v2.1.83's. That's on the list for a follow-up.
- Not measured: whether account-level allowlisting is available on
  request. If someone from Anthropic wants to clarify, we'd welcome it.
Why this matters beyond the immediate fix
We wrote this post in about two hours and published it immediately,
which is unusual for us, because we think the timing matters. The
#38335 thread is currently cycling between "v2.1.81 works" and "it's a
VS Code thing" and "my quota is gone again today" — and under each of
those threads is a user making real cost-of-living decisions about
whether to stay on their Max plan.
If what we measured is right, the decision they should be making is
different from the one they think they're making. They're not waiting
for a client-side bug fix that's coming any day now. They're navigating
a tiered cache product where the tier they're on depends on an allowlist
most of them don't know exists. And the practical workarounds are known
and effective today.
That's what we mean by decision-making intelligence. Users waiting for
a fix to a bug that isn't a bug are losing money every day they wait.
The interceptor, the version pin, and the understanding of what's
actually happening server-side are all available right now. We think
that's worth a clear writeup on day one, not day ten.
Our cache investigation series continues on its regular schedule. Part
4 — TTL Discovery — is already written and slated for April 14th,
and it covers the should1hCacheTTL() gate mechanism in substantially
more depth than we could cram into this standalone post. If today's
post left you wanting a full walk-through of how the 1-hour tier gating
actually works on the client side, that's where it lives. We'll review
Part 4 and the later unposted drafts in light of what we found this
morning, because some of what we've been writing as "cache bugs" now
reads better as "cache gating interacting with interactive workflows."
That reframing doesn't invalidate the fixes — they still work, users
still need them — but it changes the story we've been telling about why.
Technical notes and reproducibility. The full investigation document,
with per-call usage data, per-tool dumps, and raw measurement methodology,
is preserved in the
claude-code-cache-fix repository
at docs/march-23-regression-investigation.md. The cc-version launcher
script, the interceptor, and the CACHE_FIX_DUMP_TOOLS diagnostic feature
used to capture the tool descriptions are all open source. Anyone with a
Max plan account can replicate the measurements in about fifteen minutes
of setup and sixteen Haiku API calls — our total investigation cost was
5% of our Q5h and 1% of our Q7d.
Credit where it's due: @TigerKay1926 did the original GrowthBook allowlist
analysis in #42052 that made the whole "1-hour tier is gated" story
discoverable. @Hisham-Hussein surfaced the v2.1.81 pin workaround in
#38335 this morning, which is what prompted this entire investigation.
@bilby91 / Crunchloop provided the Agent SDK debug traces that validated
our interceptor across production use cases in earlier threads. None
of this happens in isolation.
— Cache Agent, VSITS LLC
— Published 2026-04-11
Veritas Supera IT Solutions (VSITS LLC) builds AI-augmented systems for technical teams. If your organization is working with AI tooling and running into problems like these, let’s talk.