Two Ways to Lose Your Cache: The TTL Mechanisms Nobody Told You About
In Part 3, we described the quota monitoring we built
into our fetch interceptor — reading anthropic-ratelimit-unified-5h-utilization
from API response headers and writing it to a local JSON file.
That monitoring capability led directly to the most consequential finding of
this investigation: Anthropic's prompt caching TTL silently downgrades from
1 hour to 5 minutes — and it can happen two completely different ways.
This is undocumented. The API accepts the 1-hour TTL request without error and
ignores it server-side. There is no warning, no header indicating the
downgrade, and no mention of this behavior in any official documentation we
could find.
Mechanism 1: The Quota-Driven Downgrade
We were tracking quota utilization alongside cache performance when we noticed
a pattern: sessions that ran over the 5-hour quota boundary had dramatically
worse cache hit rates than sessions that stayed under it — even with our
interceptor fixes active.
We tested it live. At 99% quota utilization (q5h=0.99), the API honored our
ttl: "1h" request and returned tokens under ephemeral_1h_input_tokens. We
let utilization tick past 100%. The next API call — same session, same request
structure — returned tokens under ephemeral_5m_input_tokens.
| Quota State | TTL Requested | TTL Honored | Response Field |
|---|---|---|---|
| q5h = 0.99 | 1 hour | 1 hour | ephemeral_1h_input_tokens |
| q5h = 1.00 | 1 hour | 5 minutes | ephemeral_5m_input_tokens |
The request body was identical. Claude Code was still sending
{"type": "ephemeral", "ttl": "1h"}. The server silently overrode it.
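A quick way to confirm which tier a given call landed on is to classify the usage fields named above. A minimal sketch (the exact location of the `cache_creation` object in the response is an assumption; adjust to wherever your client surfaces usage data):

```typescript
// Classify the TTL the server actually honored, based on which
// ephemeral_* field carries the cache-write tokens.
type CacheCreation = {
  ephemeral_1h_input_tokens?: number;
  ephemeral_5m_input_tokens?: number;
};

function honoredTtl(cc: CacheCreation): "1h" | "5m" | "none" {
  if ((cc.ephemeral_1h_input_tokens ?? 0) > 0) return "1h";
  if ((cc.ephemeral_5m_input_tokens ?? 0) > 0) return "5m";
  return "none"; // no cache write on this call
}
```

Logging this value per call is what let us catch the flip at the quota boundary.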
The Double Penalty
This isn't just a TTL change. It creates a compounding cost spiral.
Penalty 1: Overage pricing. Anthropic's Extra Usage FAQ states that usage
beyond the included quota is billed at "standard API rates." You're now paying
per-token at API prices instead of consuming from your subscription quota.
Penalty 2: TTL downgrade. With a 5-minute TTL instead of 1 hour, your
cache expires 12x faster. Any pause longer than 5 minutes — reading docs,
reviewing a diff, a bathroom break — means a full cache rebuild on the next
turn. More cache rebuilds means more cache_creation tokens, which means
more cost, which means you stay over the quota boundary longer.
It's a feedback loop:
Over quota → 5m TTL → more cache misses → more tokens consumed
→ deeper into overage → still 5m TTL → more cache misses → ...
The pricing difference between TTL tiers compounds the effect:
| TTL Tier | Cache Write Cost (Opus) | Cache Read Cost | Write/Read Ratio |
|---|---|---|---|
| 5-minute | $6.25/MTok | $0.50/MTok | 12.5x |
| 1-hour | $10.00/MTok | $0.50/MTok | 20x |
The 1-hour TTL costs more per write ($10 vs $6.25) but far less in aggregate
because you write once and read for an hour. The 5-minute TTL costs less per
write but forces you to write far more often.
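The break-even is easy to work out from the table. As a sketch, using the Opus rates above and assuming a fixed context size and a given number of rebuilds per hour (both inputs are illustrative, not measured):

```typescript
// Hourly cache-write cost: one write per hour at 1h TTL, versus one
// write per rebuild at 5m TTL. Rates are the Opus figures from the table.
const WRITE_RATE_PER_MTOK = { "1h": 10.0, "5m": 6.25 };

function hourlyWriteCost(
  ttl: "1h" | "5m",
  contextMtok: number,
  rebuildsPerHour: number,
): number {
  const writes = ttl === "1h" ? 1 : rebuildsPerHour;
  return writes * contextMtok * WRITE_RATE_PER_MTOK[ttl];
}

// For a 200k-token context (0.2 MTok) with 12 rebuilds/hour:
// 1h TTL comes to about $2.00/hour, 5m TTL to about $15.00/hour.
```

Even two rebuilds per hour already make the 5-minute tier more expensive (roughly $2.50 vs $2.00), so the per-write discount never pays off in an interactive session.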
On our kanfei-nowcast workload, the steady-state impact was measurable. The
Apr 3 session (over quota, 5m TTL) averaged $3.33 per Sonnet-hour. The Apr 5
session (under quota, 1h TTL) averaged $1.67 per Sonnet-hour. By hours 5-8,
when the cache was fully warm, Apr 5's hourly cost dropped below $1.00 —
compared to Apr 3's steady $3.33 with constant cache rebuilds.
The Reversibility
The good news: the downgrade is reversible. When the 5-hour utilization window
rolls and your usage drops below 100%, the 1-hour TTL returns immediately.
We observed this directly on our account. The Apr 3 session was entirely on
5-minute TTL (quota over 100%). The Apr 5 session, after the window rolled,
was on 1-hour TTL. We saw ephemeral_1h_input_tokens in the API responses
and confirmed the cache was holding across longer idle gaps.
This bidirectional behavior suggests the mechanism is straightforward: the
billing path determines the TTL. Under quota, you're on the subscription
billing path, which includes 1-hour TTL as a feature. Over quota, you're
routed to the standard API billing path, where 5-minute is the default.
But Not for Everyone
After we published our findings on GitHub issue #42052,
@TigerKay1926 — another Max 5x user — reported being stuck on 5-minute TTL for
7 consecutive days across approximately 900 API calls. This included periods
after a fresh weekly quota reset, when utilization was at 0%.
This contradicts the clean quota-boundary model we observed. Our account showed
consistent bidirectional switching tied to the 5-hour utilization threshold.
@TigerKay1926's account appeared to be locked to 5-minute TTL regardless of
quota state.
Our conclusion at the time: there may be more than one mechanism controlling
TTL assignment. The quota-driven downgrade we documented is one. Whatever was
affecting @TigerKay1926's account appeared to be another.
That hypothesis held for about 48 hours. Then @TigerKay1926 found the answer.
Mechanism 2: The Client-Side Gate
@TigerKay1926 installed our interceptor — which includes a GrowthBook feature
flag dump capability — and immediately found what we'd been looking for. Claude
Code checks a GrowthBook feature flag called tengu_prompt_cache_1h_config
containing an allowlist:
```json
"prompt_cache_1h_config": {
  "allowlist": [
    "repl_main_thread*",
    "sdk",
    "auto_mode"
  ]
}
```
The corresponding function in Claude Code's source (src/services/api/claude.ts)
checks whether the current session's querySource matches one of these patterns:
```typescript
function should1hCacheTTL(querySource?: QuerySource): boolean {
  // ...eligibility checks...
  let allowlist = getPromptCache1hAllowlist() // ...loads from GrowthBook...
  return (
    querySource !== undefined &&
    allowlist.some(pattern =>
      pattern.endsWith('*')
        ? querySource.startsWith(pattern.slice(0, -1))
        : querySource === pattern,
    )
  )
}
```
If querySource doesn't match, the client never requests ttl: "1h" on
cache control blocks. The server defaults to 5 minutes — not because of quota
state, but because nobody asked for anything else.
That's what was happening to @TigerKay1926. Their quota was fine. The server
would have honored a 1-hour request. But Claude Code's client never sent one.
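To make the gate concrete, here is the wildcard match from `should1hCacheTTL` extracted as a standalone helper (the helper name is ours; the allowlist values are the ones from the flag dump):

```typescript
const allowlist = ["repl_main_thread*", "sdk", "auto_mode"];

// A trailing '*' means prefix match; anything else must match exactly.
function inAllowlist(querySource: string): boolean {
  return allowlist.some(pattern =>
    pattern.endsWith("*")
      ? querySource.startsWith(pattern.slice(0, -1))
      : querySource === pattern,
  );
}
```

So a `querySource` of `repl_main_thread_0` passes the gate, while something like `vscode_extension` (a hypothetical value, not one we observed) would not, and that session would never request the 1-hour TTL.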
Two Mechanisms, Confirmed
| Mechanism | Trigger | Where | Reversible? |
|---|---|---|---|
| Quota downgrade | q5h crosses 100% | Server-side | Yes — recovers when window resets |
| Allowlist gating | `querySource` not in allowlist | Client-side | No — stuck until client code changes or interceptor patches it |
The critical difference: the quota downgrade is a temporary penalty that
resolves on its own. The allowlist gate is a persistent condition — if your
session type isn't in the list, you're on 5-minute TTL for every call, every
session, indefinitely.
The Fix: Interceptor v1.6.0
We shipped a fix the same day @TigerKay1926 reported the finding.
The interceptor now inspects every outgoing cache_control block. If it sees
{"type": "ephemeral"} without a ttl field, it injects ttl: "1h":
- Without fix (affected accounts): `{"type": "ephemeral"}` — no TTL specified, defaults to 5m
- With fix (v1.6.0): `{"type": "ephemeral", "ttl": "1h"}` — 1h enforced
- Already correct (unaffected accounts): `{"type": "ephemeral", "ttl": "1h", "scope": "global"}` — interceptor is a no-op
This bypasses the client-side gating entirely. The server honors whatever TTL
the client requests — the restriction was purely in the client's decision about
what to request.
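The injection itself reduces to a small transform. Here is a minimal sketch of the rule, with the request body simplified down to a single `cache_control` object (the real interceptor walks every block in the outgoing request):

```typescript
type CacheControl = { type: string; ttl?: string; [key: string]: unknown };

// Add ttl: "1h" to ephemeral blocks that don't specify one;
// leave everything else untouched, including blocks with an explicit ttl.
function enforceTtl(block: CacheControl): CacheControl {
  if (block.type === "ephemeral" && block.ttl === undefined) {
    return { ...block, ttl: "1h" };
  }
  return block;
}
```

Because the transform only fires when `ttl` is absent, accounts whose client already requests the 1-hour TTL see no change at all.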
The Cost of Being on the Wrong Side
The difference between 5-minute and 1-hour TTL sounds like a caching detail.
In practice, it's a cost multiplier.
| Scenario | Cache writes/hour | Write cost/hour (100k context, Sonnet) | Read cost | Total/hour |
|---|---|---|---|---|
| 1h TTL | 1 | $0.60 | $0.03/call | ~$0.60 + reads |
| 5m TTL | 12 (every pause > 5m rebuilds) | $4.50 | $0.03/call | ~$4.50 + reads |
That's 7.5x more expensive on cache writes for any user who takes breaks
longer than 5 minutes — which is every user in a real workflow. Reading docs,
reviewing diffs, answering Slack, getting coffee. Each pause is a full cache
rebuild.
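The table's numbers can be reproduced directly from the pricing multipliers. A quick check (a Sonnet base input rate of $3/MTok is our assumption here; the 1.25x and 2x write multipliers are per the update note at the end of this post):

```typescript
const BASE_INPUT = 3.0;                  // $/MTok, assumed Sonnet input rate
const MULT = { "5m": 1.25, "1h": 2.0 };  // cache-write multipliers over base input
const CONTEXT_MTOK = 0.1;                // 100k-token context

const writeCost5m = 12 * CONTEXT_MTOK * BASE_INPUT * MULT["5m"]; // about $4.50/hour
const writeCost1h = 1 * CONTEXT_MTOK * BASE_INPUT * MULT["1h"];  // about $0.60/hour
const ratio = writeCost5m / writeCost1h;                         // about 7.5x
```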
A user stuck on 5m TTL due to the allowlist gate pays this penalty on every
session, at every quota level, with no way to know it's happening — unless
they inspect the raw API request bodies or install monitoring that checks for
the ttl field.
The Documentation Gap
Neither mechanism is documented. We reviewed Anthropic's official docs
thoroughly:
- Prompt caching docs: Present 1-hour TTL as a feature with use cases ("agentic workflows where follow-up prompts may take longer than 5 minutes," "long chat conversations"). No mention of conditions where it's unavailable.
- Pricing docs: List cache write/read rates for both 5-minute and 1-hour TTL tiers. No footnotes about eligibility restrictions.
- Service tiers docs: No mention of TTL being tied to plan tier or quota state.
- Extra Usage FAQ: States overage uses "standard API rates." Does not mention that standard API rates imply 5-minute TTL.
The closest thing to documentation is the pricing footnote that states the
listed prompt caching prices "reflect 5-minute TTL." If you read that
carefully and know that overage routes to "standard API rates," you can infer
the quota-driven downgrade. But the connection is never made explicit. And the
allowlist gating has no documentation at all.
The API itself provides no indication either way. It accepts ttl: "1h" in the
request, returns a 200 response, and quietly caches at 5 minutes when the quota
mechanism overrides it. For the allowlist gate, the client simply never includes
the ttl field — there's nothing to override because nothing was requested. The
only way to detect either mechanism is to inspect whether tokens appear under
ephemeral_1h_input_tokens or ephemeral_5m_input_tokens in the usage
response.
What This Means in Practice
There are now two things to watch for:
If you're a heavy user who occasionally pushes past the 5-hour quota
boundary:
- Your cache TTL silently drops from 1 hour to 5 minutes at the server
- More cache misses → more tokens → deeper into overage → the cycle feeds itself
- It recovers when the quota window resets
If you're on a session type not in the allowlist (which may include
standard interactive CLI sessions):
- You've been on 5-minute TTL from your first API call
- Quota state doesn't matter — the client never requests 1h
- This doesn't resolve on its own
The practical defense for both: our interceptor handles them. For Mechanism 1,
it writes quota state to ~/.claude/quota-status.json on every API call, making
the boundary visible to hooks and dashboards. For Mechanism 2, the v1.6.0 TTL
enforcement injects ttl: "1h" on any cache control block that's missing it.
One install covers both:

```shell
npm install -g claude-code-cache-fix
```
This finding is a strong example of the community feedback loop working.
@TigerKay1926 used our tool to surface data we couldn't see from our own
account. That data confirmed our hypothesis, identified the root cause, and
the fix shipped the same day. Open source at its best.
In Part 5, we'll cover two more cost amplifiers we
discovered: image persistence in conversation history and cross-project
directory contamination.
This is Part 4 of a six-part series on Claude Code's cache management. Previous:
Part 3 — The Community Fix. Next:
Part 5 — The Hidden Costs.
Updated Apr 13, 2026: Corrected the Sonnet cost-per-hour comparison table.
The original 5m TTL row overstated hourly write cost ($9.00 → $4.50) and the
multiplier (15x → 7.5x). The Opus rate table and qualitative conclusions are
unchanged. Rates per
Anthropic's published pricing:
cache write multipliers are 1.25× (5m) and 2× (1h) of base input price.
Veritas Supera IT Solutions (VSITS LLC) builds AI-augmented systems for technical teams. If your organization is working with AI tooling and running into problems like these, let’s talk.