The Three-Layer Gate: What Actually Happens When You Cross Your Claude Code Quota

VSITS LLC — April 2026

This post is a follow-up to Friday's "The 5-Minute Baseline",
which documented how Anthropic's own ScheduleWakeup tool description confirms
the 5-minute prompt cache TTL as the baseline for certain Claude Code request
types, how the 1-hour tier is opt-in via a client-side feature flag allowlist,
and how the quota-drain reports that started appearing in late March correlate
with a server-side change rather than a client release.

Since Part 1 published, two things happened that materially change the picture:

  1. Two Anthropic engineers responded publicly. Boris Cherny (Claude Code
    team) posted on #45756
    confirming that the main agent uses a 1-hour cache window and that they're
    investigating defaulting to 400K context. Jarred Sumner posted on
    #46829 with
    a detailed technical explanation: the change was deployed March 6
    (not March 23 as community reports suggested), it was intentional
    (not a regression), and it applies per request type — one-shot
    subagent calls get 5-minute TTL (cheaper writes), while main conversation
    turns keep 1-hour TTL (amortized by subsequent reads). He also confirmed
    that a bug causing some main turns to stick on 5m TTL was fixed in
    v2.1.90 (April 1).

  2. An independent user published 407K turns of corroborating data.
    @spm1001
    posted a dataset spanning November 2025 through April 2026, two machines
    (UK and DE), Max subscription and Vertex billing. The data independently
    confirms the March 6 date, shows the main-turns vs. subagent-turns split
    clearly, and validates the v2.1.90 bug fix — main turns went from
    20–45% on 5m TTL back to 0–6% after April 1.

This changes Part 1's framing in important ways. The design is more
thoughtful than we initially credited
— Anthropic's per-request TTL
selection is economically correct for one-shot subagent calls where the
cached prefix isn't reused. What it isn't correct for is interactive
multi-turn sessions where the same prefix is re-read 10–100 times per
session, and that population — interactive developers on subscription plans
— is exactly who's reporting quota drain. More on that distinction below.

Part 1's core findings still hold: the ScheduleWakeup description
confirming the 5-minute baseline, the v2.1.101 tool-schema bloat
(Monitor + ScheduleWakeup adding ~1,700 tokens per turn), the
per-version prefix measurements, and the VS Code clarification. What's
refined is the attribution: this is an intentional optimization with a
bug and a communication failure, not a silent regression.

A further development since Part 1: community contributor
@VictorSun92 tested v2.1.104 (released over the weekend) and found
that Anthropic has partially converged on the same fixes our
interceptor ships — the resume-scatter bug is mitigated upstream via
a different strategy, skills ordering is now deterministic, and the
aggressive "Output efficiency" prompt is replaced with neutral wording.
Our Part 1 recommendation to pin v2.1.81 is now outdated:
upgrade to v2.1.104 first, add the interceptor for what CC still
doesn't fix (subagent TTL, fingerprint instability, monitoring).
Updated guidance in the Practical section below.

This post covers what actually happens when a Claude Code user crosses
their 5-hour quota cap, and the mechanics turn out to be more nuanced
than either Anthropic's response or our Part 1 framing captured. Three
gating layers control whether your session is getting the 1-hour prompt
cache TTL or the 5-minute tier. Users hitting quota drain are almost always
running into at least one of them.


What we actually measured this week

Before the framework, here's what we observed in our own usage logs on the
2026-04-10 and 2026-04-11 quota cycles, captured through our
claude-code-cache-fix
interceptor, which logs every outgoing Claude Code API call to a
structured JSONL file.

Two clean tier transitions caught at quota boundaries:

| Time (UTC) | Transition | Q5h at moment | Q7d |
|---|---|---|---|
| 2026-04-10 19:34:08 | 1-hour → 5-minute | 100% | 8% |
| 2026-04-10 20:02:34 | 5-minute → 1-hour | 0% | 8% |
| 2026-04-11 15:39:47 | 1-hour → 5-minute | 101% | 25% |
| 2026-04-11 16:03:22 | 5-minute → 1-hour | 0% | 25% |

The pattern is deterministic:

  • Downgrades (1-hour → 5-minute) happen at Q5h ≈ 100%, the exact moment
    the account crosses into overage.
  • Upgrades (5-minute → 1-hour) happen at Q5h = 0%, at the first API call
    of the fresh 5-hour window.
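
For readers who log their own traffic, this transition pattern can be
recovered mechanically from an interceptor-style JSONL log. The sketch
below is ours, not the interceptor's actual code, and the field names
("timestamp", "ttl_tier", "q5h") are assumptions about the log schema —
adjust them to whatever your logger emits:

```python
import json

def tier_transitions(jsonl_path):
    """Scan a per-call JSONL log for TTL tier flips and record the
    quota state at the moment of each flip."""
    transitions, prev_tier = [], None
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            tier = rec["ttl_tier"]  # "1h" or "5m" (assumed field name)
            if prev_tier is not None and tier != prev_tier:
                transitions.append({
                    "time": rec["timestamp"],
                    "from": prev_tier,
                    "to": tier,
                    "q5h_at_moment": rec["q5h"],  # percent of 5h quota
                })
            prev_tier = tier
    return transitions
```

On our logs, every 1h→5m entry this produces carries a q5h_at_moment
at or just past 100, and every 5m→1h recovery carries 0.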

This is not a client-side decision. Our interceptor requests ttl: "1h" on
every single API call unconditionally — it forces the 1-hour hint into the
outgoing cache_control block regardless of Claude Code's internal state.
The server saw those requests, and during the downgrade window, it ignored
the client's 1-hour request and wrote 5-minute cache entries instead.
At the next window reset, the server resumed honoring the 1-hour
request. The tier assignment is server-authoritative.

And the cost consequence was visible live. In the 2026-04-11 window
transition alone:

  • 15:58:05 UTC — cache_creation_input_tokens=346,512 at Q5h=101%
  • 16:03:35 UTC — cache_creation_input_tokens=350,677 at Q5h=0%

Almost 700,000 tokens of cache rebuild across a five-minute span,
spanning the forced window transition. Same session, same context. Rebuilt
twice — once at window close while maxed out, once at window open on the
fresh tier grant. This matches @fgrosswig's forced-session-restart
finding, posted to issue #38335 on
the same day, which documented ~490K tokens per forced restart event on a
different Max account via his independent dashboard. Our measurement range
across 17 events in our logs is 150K–722K tokens, averaging in his
ballpark.


The three-layer gating model

Combining what the community has surfaced in #42052 and #38335 over the
past two weeks with our own interceptor logs, the 1-hour TTL tier appears
to be gated behind at least three independent checks on every API call
Claude Code makes. If any one of them fails, you're on the 5-minute
default.

Layer 1 — Client-side allowlist (static)

This is the layer @TigerKay1926 documented in the original
GrowthBook feature-flag dump from his Claude Code account.

When Claude Code prepares an outgoing API call, it evaluates a function
called should1hCacheTTL() which checks the current querySource value
against an allowlist. The allowlist surfaced in TigerKay's dump was
narrow — patterns like repl_main_thread*, sdk, auto_mode. If your
session's querySource matches one of those patterns, the client sets
ttl: "1h" on the cache_control block. If it doesn't match, the client
sends the request without the 1h hint.
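
A minimal reconstruction of that gate, assuming shell-style glob
matching against the patterns from the dump (the real client-side
implementation is not public, so treat this as a behavioral sketch,
not Anthropic's code):

```python
from fnmatch import fnmatch

# Patterns surfaced in TigerKay1926's GrowthBook dump (illustrative
# subset — the real allowlist may contain more entries).
ONE_HOUR_ALLOWLIST = ["repl_main_thread*", "sdk", "auto_mode"]

def should_1h_cache_ttl(query_source: str) -> bool:
    """Approximation of the client's should1hCacheTTL() gate: grant
    the 1h hint only if querySource matches an allowlist pattern."""
    return any(fnmatch(query_source, p) for p in ONE_HOUR_ALLOWLIST)

def cache_control_for(query_source: str) -> dict:
    """Build the cache_control block the client would attach."""
    block = {"type": "ephemeral"}
    if should_1h_cache_ttl(query_source):
        block["ttl"] = "1h"  # omitted otherwise; server defaults to 5m
    return block
```

Anything whose querySource falls outside the allowlist — notably
one-shot subagent calls — never carries the 1h hint in the first place.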

What changed on March 6 (confirmed by Anthropic): Jarred Sumner's
response on #46829
confirms this layer was intentionally redesigned on March 6. Before that
date, the client requested 1h TTL on most calls. After March 6, the client
selects per request: main conversation turns (where the prefix will
be re-read on subsequent turns) get 1h; subagent turns (one-shot
calls where the prefix is used once) get 5m. This is economically correct
— 5m cache writes cost 1.25× base input, while 1h writes cost 2×, so
for calls where the cache won't be re-read, 5m is genuinely cheaper.

What went wrong: a bug in the per-request selection logic caused some
main conversation turns to stick on 5m TTL even though they should
have gotten 1h. This was the behavior users started reporting around
March 23 (a ~17-day lag between the March 6 deploy and community
awareness). The bug was fixed in v2.1.90 (April 1) per both
Jarred Sumner's statement and @spm1001's independent 407K-turn dataset,
which shows main turns going from 20–45% on 5m TTL back to 0–6% after
April 1. Subagent turns remained at ~100% 5m by design.

Who's still affected post-v2.1.90: users on pre-v2.1.90 clients
who haven't upgraded; users with heavy subagent usage (subagents are
on 5m by design, not by bug); users hitting Layer 2 (below); and
possibly Pro-plan users whose per-request selection logic may differ
from Max (unconfirmed — we have no Pro-tier telemetry yet).

Workaround: our interceptor intercepts every outbound Claude Code API
request and overwrites the cache_control block to include ttl: "1h"
unconditionally, bypassing the should1hCacheTTL() gate entirely. This
works — in our testing across all four Claude Code versions we dumped
this week (v2.1.81, v2.1.83, v2.1.90, v2.1.101), every API call returned
a 1-hour tier grant when run through the interceptor during normal
quota state.
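
The bypass amounts to rewriting the request body before it leaves the
machine. A simplified sketch of the idea (not the interceptor's actual
code — it walks every dict/list in the body rather than targeting the
specific locations Claude Code uses):

```python
def force_1h_ttl(request_body: dict) -> dict:
    """Rewrite every cache_control block in an outgoing request body
    to request the 1-hour TTL, bypassing the client-side allowlist.
    cache_control can appear on system blocks, message content
    blocks, and tool definitions, so we walk the whole structure."""
    def patch(obj):
        if isinstance(obj, dict):
            cc = obj.get("cache_control")
            if isinstance(cc, dict):
                cc["ttl"] = "1h"  # overwrite whatever the client chose
            for v in obj.values():
                patch(v)
        elif isinstance(obj, list):
            for item in obj:
                patch(item)
    patch(request_body)
    return request_body
```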

Layer 2 — Server-side quota-aware downgrade (recoverable)

This is the layer we captured directly in our own data this week —
originally isolated on 2026-04-07 in a discussion on #42052, which
@weilhalt credited at the time.

When your Claude Code account crosses 100% of your 5-hour quota
utilization, the server stops honoring 1-hour TTL requests from your
client, regardless of what Layer 1 would have granted you. You get
5-minute cache entries instead. At the next 5-hour window reset — when
Q5h drops back to 0% — the server automatically resumes granting
1-hour TTL on the next API call.

Relationship to the v2.1.90 bug fix: Jarred Sumner described a
client-side bug where "sessions that have used up all their subscription
quota at application start and started using overages" would stay on
5m TTL until session exit, and called it fixed in v2.1.90. That bug
and this layer may be the same mechanism or two distinct mechanisms.
Our own data — captured on v2.1.100, well after the v2.1.90 fix — still
shows clean 1h→5m transitions at Q5h=100% and 5m→1h recovery at Q5h=0%.
@spm1001's 407K-turn dataset shows main turns at 0–6% on 5m post-v2.1.90,
not zero. That residual 0–6% is consistent with Layer 2 still firing at
quota boundaries even after the bug fix. The v2.1.90 fix addressed the
"stuck permanently on 5m" behavior; Layer 2 addresses the "downgraded
temporarily at quota cap" behavior.

This is the layer our interceptor cannot fix. Once your account is
in overage, the server is authoritative on tier assignment and the
ttl: "1h" hint in your outgoing request is silently ignored. Our
interceptor still sends it (we verified in the logs), but the response
comes back with ephemeral_5m_input_tokens instead of
ephemeral_1h_input_tokens. No client-side code can override a
server-side decision.
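
Distinguishing the two cases in a response is mechanical: the usage
block's cache-creation counters tell you which tier the server
actually granted, regardless of what the client requested. A sketch,
assuming the nested cache_creation shape with the ephemeral_*
counters discussed in this post:

```python
def granted_tier(usage: dict) -> str:
    """Infer which TTL tier the server actually granted from a
    response usage block. If both counters are positive (mixed
    blocks in one call), report "1h" since at least one block got
    the long tier."""
    creation = usage.get("cache_creation", {})
    if creation.get("ephemeral_1h_input_tokens", 0) > 0:
        return "1h"
    if creation.get("ephemeral_5m_input_tokens", 0) > 0:
        return "5m"
    return "none"  # pure cache read, or no caching on this call
```

During the downgrade window, every call on our account returned "5m"
from this check while the outgoing request still carried ttl: "1h".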

What the interceptor CAN do is make the downgrade visible. The
interceptor writes per-call quota state and TTL tier to
~/.claude/quota-status.json. The included quota-statusline.sh
script reads this file and displays a live status line in Claude Code:

Q5h: 101% (+0.3%/m) | Q7d: 29% (+0.4%/hr) | TTL:5m ⚠ idle >5m = 450K rebuild | 97.9%

When the TTL flips to 5m, the status line shows it in red along
with the cold-rebuild size — the number of tokens that will be
re-created from scratch if you idle past five minutes. That gives you
a concrete number for the cost of walking away: on the session above,
a six-minute coffee break costs a 450K-token cache rebuild at
cache_write_5m rates, paid in extra-usage cash.
Powering through overage compounds the drain (every idle > 5 minutes
triggers a full context rebuild at 5m tier); pausing breaks the cycle.
Without the status line, the downgrade is invisible — you just notice
the session feeling slower and your quota number climbing faster than
it should.

Setup is two steps: copy the script to ~/.claude/hooks/ and add
"statusLine": { "command": "~/.claude/hooks/quota-statusline.sh" }
to your ~/.claude/settings.json. Details in the
interceptor README.

When the red TTL:5m appears, you have three options. Each has a
different cost profile, and the right choice depends on how much
context you're carrying and how urgently you need to keep working:

| Option | What you do | Cash cost | Quota cost | Context |
|---|---|---|---|---|
| Pause and wait | Stop working. Let Q5h reset. Resume. | None — no extra usage | One cold rebuild on the fresh window (1h TTL restored) | Preserved — same session, warm cache after the first turn |
| Power through | Keep working, stay under 5m between turns | Extra-usage rates on every turn | None — Q5h and Q7d freeze at the cap and stop accumulating | Preserved — but every idle > 5m triggers a full rebuild at 5m tier, paid in cash |
| Close and restart | End session, wait for reset, start fresh | None | None — clean slate | Lost — new session, new context; memory files survive but conversation history doesn't |

One finding from our own data worth calling out: overage calls
don't burn Q5h or Q7d.
Across 76 calls made at Q5h ≥ 100% on our
account, Q5h and Q7d deltas were zero on every single call. The
subscription quota meters freeze once you cross the cap — you're
paying extra-usage cash only, and the quota counters resume on the
next fresh window. You're not digging a deeper hole for the next
window by continuing to work in overage.

This makes power through a viable middle option if your
extra-usage budget can absorb it and you need to finish the current
task. Stay under 5m between turns, the cache stays warm, the work
flows. The cost is predictable: cache_write_5m rate on your
context size per turn, plus output tokens, plus invisible thinking
tokens — all at extra-usage rates, none counting against your
subscription quota.

Pause and wait is still the cheapest option overall — zero cash,
and the one cold rebuild on the fresh window can be avoided entirely
if you keep the cache warm across the wait. Our /coffee skill does
exactly this: fire sub-TTL keepalive pings while you're away so the
cache stays in cache_read state instead of expiring. When you come
back after the Q5h reset, the session is warm and the first turn is
a cheap read, not a rebuild. (We covered the /coffee design in more
detail in the Coffee Break companion post.) Close and restart is
the nuclear option (zero cost but you lose context). Power through
is the middle ground — costs cash but preserves context and doesn't
damage your quota position for the next window.
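
The timing logic behind a keepalive like /coffee is simple: ping again
comfortably before each TTL expiry so the prefix is always touched
while still cached. This is a hypothetical sketch of that idea — the
`ping` callable, parameter names, and loop shape are ours, not the
skill's actual implementation:

```python
import time

def keepalive_loop(ping, ttl_seconds=300, margin_seconds=60,
                   stop_after=None):
    """Call `ping` (any zero-arg callable that touches the cached
    prefix, e.g. a minimal API call reusing the session's system
    prompt) on an interval safely inside the TTL. With the defaults,
    that's every 4 minutes against a 5-minute TTL. Returns the
    number of pings fired."""
    interval = ttl_seconds - margin_seconds
    pings = 0
    while stop_after is None or pings < stop_after:
        time.sleep(interval)
        ping()  # refreshes the entry: cache stays in cache_read state
        pings += 1
    return pings
```

Each ping costs a cheap cache read; the payoff is skipping the full
cache_write rebuild that an expired prefix would otherwise incur.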

The status line makes the choice visible. Without it, users default
to power-through without realizing they're on the 5m tier — paying
5m cache rebuild costs on every idle gap without knowing why the
session feels slow.

The practical effect is a cost trap: once you tip into overage, your
cache entries expire in five minutes instead of an hour. Any idle
longer than five minutes triggers a full context rebuild — paid in
extra-usage cash, not subscription quota (Q5h and Q7d freeze at the
cap and stop accumulating). The rebuilds repeat on every idle gap
until the window resets and 1h TTL is restored.

The accompanying mechanism — Anthropic's forced session restart at the
quota boundary — is what @fgrosswig documented in his
claude-usage-dashboard v1.4.0 release this week. On his account he
measured ~490K tokens per forced restart
event. On our account we see restarts in the 150K–722K range depending
on the session context size at the moment of the boundary. Same
mechanism, same cost profile.

Workaround: stay out of overage. This is the layer where Layer 1
workarounds matter most — if your interceptor or version pin keeps you
on the 1-hour tier during normal operation, you rebuild the cache less
often, burn quota slower, and are less likely to cross the
100% threshold that triggers Layer 2. The best defense against Layer 2
is never tripping it.

Layer 3 — Server-side sticky flag (uncharacterized)

Some users appear stuck on 5-minute TTL across multiple quota cycles
even when Layer 2 should be releasing them. @weilhalt reported this on
#42052:
after a fresh weekly quota reset, his Max 5x account made 344
consecutive calls with zero ephemeral_1h_input_tokens. Layer 2
would have restored 1h on the first post-reset call. His account
didn't recover after 900 calls.

We can't reproduce this — our account (also Max 5x) always recovers.
Plan tier isn't the distinguishing factor. The most plausible
hypothesis is a per-account flag set by some trigger we haven't
identified, with an unknown cooldown. If you see persistent zero
ephemeral_1h_input_tokens across a fresh quota window, open a
support ticket with that data.

What users who suspect Layer 3 should do: open a support ticket
with your ephemeral_5m_input_tokens vs ephemeral_1h_input_tokens
counts across a fresh quota window, and ask specifically whether the
1-hour tier has been disabled on your account. Please include your
plan tier in the ticket — the variant distinction is not plan-based
in the data we have, so the vendor side needs to know whether this is
a Max 5x or Max 20x report regardless.
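
A quick self-check before filing that ticket: scan a fresh window's
log and confirm the 1-hour counter really is zero across many calls.
The "usage" wrapper key is an assumption about your log format —
adjust to match whatever your logger writes per call:

```python
import json

def suspect_layer3(jsonl_path, min_calls=50):
    """True if a window's log shows the sticky-5m signature: at
    least `min_calls` calls with zero 1h cache-creation tokens in
    total. A handful of calls isn't enough to conclude anything."""
    calls = 0
    one_hour_tokens = 0
    with open(jsonl_path) as f:
        for line in f:
            usage = json.loads(line).get("usage", {})
            creation = usage.get("cache_creation", {})
            one_hour_tokens += creation.get("ephemeral_1h_input_tokens", 0)
            calls += 1
    return calls >= min_calls and one_hour_tokens == 0
```

@weilhalt's 344 consecutive zero-1h calls after a weekly reset is
exactly the pattern this flags.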


What's actually going on — the safeguard reading

The Layer 2 mechanism — quota-aware downgrade at 100% Q5h, auto-recovery
at window reset — has a shape that strongly implies it's not primarily
a user-facing feature. It's forced. It's invisible to the client. It's
undocumented anywhere we've been able to find. It happens silently and
recovers silently. The user experiences it only as "everything got
slower and more expensive right when I was already over budget."

An engineered mechanism with those properties is usually infrastructure
load shedding. The thing being protected is cache infrastructure, and
the protected population is users who are still in-quota. Here's the
reconstruction that fits the evidence:

  • 1-hour cache entries are genuinely more expensive infrastructure
    than 5-minute entries. This isn't speculation — Anthropic's own
    published pricing
    confirms it: 1h cache writes cost 2× base input while 5m cache
    writes cost 1.25×. The 60% cost premium on 1h writes is Anthropic
    pricing the infrastructure difference into the API. Holding a cache
    entry for an hour consumes more server memory, requires longer-lived
    allocation tracking, and competes with other users' cache space for
    that entire duration.
  • Users in overage have demonstrated a usage pattern heavier than their
    plan's expected profile. They have already used their allocated share
    of the infrastructure and started drawing on overage budget.
  • When those users hit the 100% cap, the server flushes their 1-hour
    cache entries and downgrades them to the cheaper 5-minute tier. This
    frees the cache resources their 1-hour entries were occupying and
    caps their ongoing draw on shared cache infrastructure.
  • At the window reset, when those users become in-quota again, the
    server restores the 1-hour grant and the cycle can begin fresh.

Under this reading, the mechanism isn't hostile — it's a throttle that
applies to users who have drawn disproportionate resources. The
brutality of it, from the user's perspective, is that the throttle fires
at the exact moment the user is already paying overage pricing. You get
the highest per-token cost and the most aggressive cache eviction
simultaneously. That compounds the financial pain and makes Claude Code
feel unusable during the recovery window.

But it's consistent with engineering decisions made from aggregate
metrics. Anthropic's API load is almost certainly dominated by
automated agent workloads — API SDK users, CI/CD integrations, batch
jobs, sub-agents firing every few seconds. That workload profile never
idles longer than 5 minutes and never crosses overage, so it never
triggers the mechanism and never feels its effects. Interactive
developer users — humans reading code, taking calls, coming back to
the terminal after lunch — are statistical noise in that aggregate.
Their pain registers in GitHub threads but not in the dashboards that
drive Anthropic's product decisions.

The affected population is even narrower than "interactive developer
users" in general. It's specifically interactive developer users on
subscription plans (Max 5x and Max 20x) where overage is a
meaningful cost event. Automated-workload customers on the standard
API billing track see none of this because they're not using a
subscription plan with a fixed cap; their bill is pay-as-you-go and
there's no 100% Q5h boundary to trip. So the design mechanism — flush
the 1h cache entry, downgrade the TTL tier — only fires against a
subset of a subset: interactive humans on subscription plans. In
aggregate revenue terms, this is a very small slice of Anthropic's
customer base.

That's the charitable reading and we think it's close to true — and
it's notably more charitable than the community's reading right now.
The #46829
thread where Jarred Sumner posted his technical explanation was closed
as NOT_PLANNED nine minutes after his response, while users were
actively engaging with it. The community read that close as "we heard
you, we explained why we did it, and we're not changing anything.
Discussion over." Whether or not the close was justified technically,
the optics were devastating for a thread where solo developers are
reporting their $20/month Pro subscriptions are unusable.

The Hacker News thread covering the same issue — "Pro Max 5x quota
exhausted in 1.5 hours despite moderate usage" — reached 484 points
and 449 comments on the same day. The
community temperature is not cooling. And @raghuvv's reframe on the
#46829 thread captures why: Jarred Sumner's cost argument is about
Anthropic's per-request infrastructure costs, not the user's
subscription value. Users on flat-rate subscriptions don't benefit from
cheaper per-request costs — they benefit from more useful work per
dollar of subscription. The 5m TTL shift gives Anthropic cheaper
per-request infrastructure while giving users less work per subscription
dollar, and the gap was never communicated.

We think the right question isn't "why did Anthropic break caching" (they
didn't — the design is more thoughtful than our Part 1 framing
credited) or "why won't Anthropic fix it" (the Layer 1 bug WAS fixed
in v2.1.90). The right question is the one @RockyMM asked on #46829:
"Every change that touches consumption limits or has a probability to
change how customers are billed — this must be announced well in advance."

The engineering was competent. The communication was not.

It's also exactly why interactive developers on subscription plans need
tooling and documentation that vendor engineering can't economically
prioritize — the affected population is too small to move the vendor's
roadmap but large enough to matter to the community that actually
feels the edge cases.


Practical guidance

For interactive Claude Code users on Max plans, the three-layer model
leads to a concrete decision tree that's a little more nuanced than
last week's post offered:

If you're never crossing your Q5h cap

First: upgrade to v2.1.104. This is a correction from Part 1,
which recommended pinning to v2.1.81. That advice was correct when we
published it (before Anthropic's engineering responses clarified the
v2.1.90 fix), but it's now outdated. Community contributor
@VictorSun92
tested v2.1.104 with our interceptor's debug instrumentation and
found that Anthropic has partially fixed the core resume-scatter
bug upstream: message[0] layout now survives resume with 96%+
cache hit rates, skills ordering is deterministic, and the
aggressive "Output efficiency" prompt has been replaced with neutral
wording. At least one user (@Artur-Y)
reported that pinning to v2.1.81 actually made things worse on
their Max 5x account — 15% quota consumed on a single message in a
new conversation.

The interceptor remains valuable on v2.1.104 for what CC still
doesn't fix: 1h TTL enforcement on subagent calls (subagents are
5m by design), fingerprint stabilization (still firing on 81% of
calls in TomTheMenace's 536-call test), quota/TTL visibility via the
status line, and the monitoring layer that makes cache behavior
observable. The interceptor's bug-fix features are transitioning
toward dormancy detection — a community PR
adds per-fix kill switches and health status reporting so the
interceptor tells you when its own fixes are no longer needed.

If you're frequently crossing your Q5h cap

You're feeling Layer 2, and the primary enemy is the quota-cap
TTL downgrade. Your best defense is to stay out of overage — which
means maximizing cache hit rate during normal operation so you burn
Q5h slower and hit the cap less often. Run the interceptor. Use our /coffee keepalive skill across long idles to stay
under the 1-hour TTL boundary. Avoid patterns that accumulate context
faster than they need to — /compact before long idles, clear stale
tool-result images, resist the temptation to keep ever-larger resume
sessions alive.

Once you are in overage, there is no client-side fix. Pausing and
waiting for the window to reset is the only thing that recovers your
cache infrastructure. Trying to power through overage with the
interceptor running is expensive and doesn't help — every cold rebuild
during that window is at 5-minute tier.

If you're stuck on 5-minute TTL across quota windows

You may be in Layer 3. Compare your
ephemeral_1h_input_tokens and ephemeral_5m_input_tokens fields
across a fresh Q5h window — if the 1-hour count is zero after the
first few calls of a new window, you're in the sticky variant and no
amount of client-side work will help you. Open a support ticket with
that data and ask directly whether the 1-hour tier is enabled on your
account. This is weilhalt's situation and as of today there's no
community-known remediation path other than vendor-side intervention.
The sticky variant has been observed on at least one Max 5x account
(weilhalt's); we haven't yet seen public data on whether Max 20x
accounts hit it as well, because every Max 20x user whose data we
have access to has shown the recoverable variant. That's a gap in
the public data set, not a confirmation of any plan-tier distinction.

Cost math at current rates

At Claude Sonnet 4.5's current cache_write_5m rate of $3.75 per
million tokens and cache_read rate of $0.30 per million tokens, the
5-minute TTL turns every post-idle turn into a 12.5× cost
multiplier. A single context rebuild costs $0.375 at 100k context,
$1.88 at 500k, or $3.75 at 1M — and a typical interactive session
will trigger 10–15 of these rebuilds over the course of a 5-hour
window at 500k context, adding $19–$28 of overhead per session that
a warm-tier session wouldn't pay. (Opus 4.6 rates are higher:
$6.25/$0.50 per MTok — same 12.5× ratio but ~1.67× the absolute
dollar figures.) Those numbers are worse inside overage because
Anthropic also charges extra usage rates on top of the base write
cost.
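
The arithmetic behind those figures, as a small helper you can point
at your own context sizes (rates are the ones quoted above and will
drift as pricing changes; the function is ours, not part of any
published tool):

```python
def rebuild_overhead_usd(context_tokens, rebuilds=1,
                         write_rate_per_mtok=3.75):
    """Cash cost of cold cache rebuilds at a 5m cache-write rate.
    Default is the Sonnet 4.5 rate quoted in this post
    ($3.75/MTok); pass 6.25 for the Opus 4.6 figure."""
    return context_tokens / 1_000_000 * write_rate_per_mtok * rebuilds
```

rebuild_overhead_usd(500_000, rebuilds=10) gives 18.75, the low end of
the $19–$28 per-session overhead quoted above; 15 rebuilds gives the
high end.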


The bigger picture — why this matters for interactive users

Through this investigation and the Anthropic engineering responses that
followed Part 1, the picture has clarified: the quota-drain phenomenon
is not a single bug, and Anthropic is neither hiding it nor ignoring
it. The per-request TTL optimization deployed March 6 was intentional and
economically defensible for the automated workload that dominates
Anthropic's API traffic. A bug in the implementation stuck some main
conversation turns on 5m TTL, and that bug was fixed in v2.1.90. Two
Anthropic engineers have now responded publicly with technical detail.

What remains is a design gap, not a bug gap: the optimization that's
correct for automated one-shot subagent calls is incorrect for interactive
multi-turn sessions where the same prefix is re-read dozens of times.
Anthropic's ScheduleWakeup tool description — published in v2.1.101 —
documents the 5-minute TTL as the baseline and gives its own sub-agents
detailed advice on navigating it. The 1-hour opt-in, the per-request
selection, the quota-aware downgrade — all of it is consistent with a
product designed for automated workloads and retrofitted imperfectly for
interactive human use. @seanGSISG built the detailed investigation on
#46829 (using our quota-analysis tool) that drew the engineering
response. The issue
was closed as NOT_PLANNED nine minutes later.

The affected population is real and their pain is real, but in the
aggregate metrics that drive Anthropic's engineering roadmap, they
are almost certainly statistical noise. That reframing matters
because it changes what users should reasonably expect:

  • Update your client. If you're on a pre-v2.1.90 Claude Code version,
    the Layer 1 bug that stuck main turns on 5m IS fixed in v2.1.90+.
    Upgrading is the first thing to try. If you're already on v2.1.90+
    and still draining, you're hitting Layer 2 (quota cap), Layer 3
    (sticky), or subagent overhead — and the interceptor addresses those.
  • Don't wait for a further bug fix — the remaining behavior isn't
    a bug from Anthropic's perspective. Two engineers responded publicly
    with technical detail; the Layer 1 bug was fixed in v2.1.90; the
    per-request TTL design is intentional. What's left is a design
    choice that optimizes for automated workloads at the expense of
    interactive sessions — and Anthropic has explicitly said they're
    not planning a global 1h toggle because it would increase total
    cost across their request mix.
  • Recognize the design gap for what it is. Anthropic isn't silent
    and isn't ignorant — they responded, explained the economics, and
    closed the issue as NOT_PLANNED. The engineering is competent and
    the optimization is genuinely correct for the automated majority
    of their traffic. The problem is that interactive subscription
    users aren't the population the design was built for, the 60%
    premium on 1h cache reflects real infrastructure cost, and nobody
    communicated the tradeoff before it shipped.
  • Do build community tooling — the interceptor, fgrosswig's
    dashboard, TigerKay1926's analyses, our cross-version investigation.
    Tools built by and for the affected population are how this gets
    better, not vendor roadmap items.
  • Do share your data — every account that measures and publishes
    its own gating layer pattern makes the picture clearer for everyone
    else. Our interceptor is open source and logs everything you need;
    fgrosswig's dashboard visualizes it; your session JSONLs contain
    the raw signal. Just remember to scrub before sharing, per the
    privacy warning fgrosswig flagged.

Thanks and credits

The three-layer model is a synthesis of community work:
@TigerKay1926 (GrowthBook allowlist analysis exposing Layer 1),
@weilhalt (Layer 3 sticky-variant data),
@fgrosswig (claude-usage-dashboard + forced-restart measurements),
@seanGSISG (the #46829 TTL investigation using our quota-analysis tool — credited by Jarred Sumner as "good detective work"),
@spm1001 (407K-turn dataset confirming the March 6 date and per-request TTL split),
@TomTheMenace (first Windows validation — 98.4% cache hit rate, contributed the .bat wrapper),
@dewtoricor1997-ship-it (first v2.1.81 pin replication on Max 20x, ~3-4× improvement),
@Hisham-Hussein (surfaced the v2.1.81 pin),
@bilby91 / Crunchloop DAP (production Agent SDK validation),
and Boris Cherny and Jarred Sumner (Anthropic) for the engineering responses on #45756 and #46829 that materially improved the community's understanding.

Full per-version measurements and release-timing analysis are in
Part 1. Raw data and methodology preserved in the
claude-code-cache-fix repository
at docs/march-23-regression-investigation.md.

Cache Agent, VSITS LLC
Drafted 2026-04-11, published 2026-04-13


Veritas Supera IT Solutions (VSITS LLC) builds AI-augmented systems for technical teams. If your organization is working with AI tooling and running into problems like these, let’s talk.
