Update (April 17, 2026): This post was drafted April 7. In the 10 days since, the investigation expanded significantly:
- The interceptor now fixes 8+ cache bugs (not just the original three), has 151 stars, 10 contributors, and has been independently audited as legitimate.
- We built the status bar we called for in Observation 1. It ships with the interceptor (quota-statusline.sh).
- Anthropic engineers confirmed the TTL change (Observation 3) was intentional, though it remains undocumented in official docs.
- claude-code-meter and a community dashboard now provide the cost observability this post argues for.
- Opus 4.7 launched with a 2.4x Q5h burn rate from invisible adaptive thinking tokens — reinforcing every observation below. See Discussion #25.
- Anthropic moved enterprise plans to per-token billing, making cache optimization a direct cost control, not just a quota optimization.
The original observations stand. They’re stronger now.
Over the previous five posts, we’ve walked through a specific set of bugs in Claude Code’s cache management — what broke, how we found it, what we built to fix it, and what it cost. In this final post, we want to step back from the implementation details and talk about what this investigation revealed about the current state of AI tooling.
These aren’t complaints. They’re observations from practitioners who build with these tools every day and want them to get better.
Observation 1: AI Tool Costs Are Not Observable by Default
We spent roughly a week investigating Claude Code’s cache behavior. The highest-leverage thing we did was build quota monitoring into our fetch interceptor — reading utilization headers from API responses and writing them to a local file.
Before that, our cost visibility was: check the Anthropic dashboard, see a number, wonder why it was high. After, we could correlate quota state with cache behavior on a per-call basis. That’s how we found the TTL downgrade.
This shouldn’t require a custom fetch interceptor. The information is in the API responses — Anthropic is already computing and sending it. But Claude Code doesn’t surface it. There’s no built-in way to see your cache hit rate, your per-session cost, your TTL tier, or your quota trajectory.
The lesson: If you’re building AI-powered tools that consume metered resources, cost observability isn’t a nice-to-have. It’s a requirement. Users need to understand what they’re paying for while they’re using the tool, not after the bill arrives.
This is solvable. A status bar showing cache hit rate and quota utilization would have made these bugs discoverable in hours instead of weeks.
Observation 2: Integration Testing Hasn’t Caught Up to Feature Velocity
The three cache bugs we documented aren’t deep architectural problems. They’re integration issues:
- Metadata blocks that land in the wrong position after resume
- A fingerprint computed from the wrong input
- Tool definitions sent in non-deterministic order
Each of these works correctly in isolation. The resume feature works. The fingerprint computation works. Tool registration works. They break when they interact with the caching system, which expects byte-stable prefixes across calls.
This is the classic integration testing gap — features developed and tested independently, with cross-cutting concerns like caching verified only at the integration level (if at all). It’s understandable in a product iterating as fast as Claude Code. But the cost to users is real: every cache bust on a 600K-token conversation is a measurable dollar amount, not an abstract performance metric.
The lesson: For AI tools where each API call has a direct cost, cache behavior is a first-class correctness property, not a performance optimization. Integration tests that verify prefix stability across resume, tool changes, and MCP reconnections would catch these bugs before release.
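The invariant such a test would assert is byte-level and easy to sketch. `assertPrefixStable` below is a hypothetical helper operating on serialized request bodies — a stand-in for whatever Claude Code's test suite would actually use, not its real code:

```javascript
// Length of the byte-identical prefix shared by two strings.
function sharedPrefixLength(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

// A follow-up request must extend the previous one, not rewrite it:
// every byte of the previous serialized body must reappear unchanged
// at the start of the next one, or the cached prefix is lost.
function assertPrefixStable(prevBody, nextBody) {
  const shared = sharedPrefixLength(prevBody, nextBody);
  if (shared < prevBody.length) {
    throw new Error(
      `cache prefix diverged at byte ${shared}: ` +
      `"${prevBody.slice(shared, shared + 40)}" vs ` +
      `"${nextBody.slice(shared, shared + 40)}"`
    );
  }
}
```

Run that check across resume, tool registration, and MCP reconnection scenarios and all three bugs above become failing tests instead of surprise invoices.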
Observation 3: Undocumented Behavior Is a Cost Risk
The TTL downgrade at the quota boundary (Part 4) is arguably the most impactful finding of this investigation. It creates a feedback loop where exceeding your quota makes the problem worse. And it’s entirely undocumented.
We’re not suggesting this behavior is intentional obfuscation. The most likely explanation is that it’s a side effect of routing overage traffic to the standard API billing path, where 5-minute TTL is the default. The behavior makes sense mechanically — it just isn’t documented, and the API doesn’t indicate when it’s happening.
The broader point: in AI tooling, undocumented cost behavior is equivalent to a billing bug. Users can’t optimize what they can’t see. When the API silently accepts a 1-hour TTL request and caches at 5 minutes, the user has no basis for understanding their cost trajectory.
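Absent documentation, the effective TTL can at least be inferred from the outside. `cache_read_input_tokens` and `cache_creation_input_tokens` are real Messages API usage fields; the log shape below is a hypothetical one that a monitor like ours might write:

```javascript
// Infer bounds on the effective cache TTL from a log of API calls.
// Each entry: { t: epoch ms, cache_read_input_tokens, cache_creation_input_tokens }.
function inferEffectiveTtl(log) {
  // For each pair of consecutive calls, note the idle gap and whether
  // the second call read from cache. The longest gap that still produced
  // a cache read bounds the TTL from below; the shortest gap that forced
  // a fresh cache write bounds it from above.
  let longestHitGapMin = 0;
  let shortestMissGapMin = Infinity;
  for (let i = 1; i < log.length; i++) {
    const gapMin = (log[i].t - log[i - 1].t) / 60_000;
    if (log[i].cache_read_input_tokens > 0) {
      longestHitGapMin = Math.max(longestHitGapMin, gapMin);
    } else if (log[i].cache_creation_input_tokens > 0) {
      shortestMissGapMin = Math.min(shortestMissGapMin, gapMin);
    }
  }
  return { longestHitGapMin, shortestMissGapMin };
}
```

If the upper bound collapses from around 60 minutes to around 5 as you cross the quota boundary, you are watching the downgrade happen — which is how we confirmed it.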
This applies beyond Anthropic. Any AI service where pricing depends on caching, batching, or tiered rate structures has the same risk: if the rules aren’t documented, users will discover them the expensive way.
Observation 4: Power Users Will Reverse-Engineer Your Product
The community work on this investigation was remarkable. @jmarianski ran the Claude Code binary through Ghidra to reverse-engineer caching mechanisms in the native Zig layer. @VictorSun92 traced the exact code path where resumed sessions broke the cache prefix. @Renvect analyzed how images and directories accumulate in conversation context. @RebelSyntax confirmed cache invalidation patterns independently.
We built a fetch interceptor that fixes three bugs, strips accumulated images, monitors quota in real time, and diffs cache prefixes across process restarts. All at the network layer, without modifying Claude Code.
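The image-stripping piece is the simplest to illustrate. A sketch assuming the Messages API content-block shape — this is not the interceptor's actual code, just the idea:

```javascript
// Replace all but the most recent `keepLast` image blocks in a
// conversation with a short text stub, so old base64 payloads stop
// being resent (and re-cached) on every call.
function stripOldImages(messages, keepLast = 1) {
  // Collect positions of image content blocks across the conversation.
  const imageSpots = [];
  messages.forEach((msg, mi) => {
    if (!Array.isArray(msg.content)) return;
    msg.content.forEach((block, bi) => {
      if (block.type === "image") imageSpots.push([mi, bi]);
    });
  });
  // Strip everything except the newest keepLast images.
  const toStrip = imageSpots.slice(0, Math.max(0, imageSpots.length - keepLast));
  for (const [mi, bi] of toStrip) {
    messages[mi].content[bi] = {
      type: "text",
      text: "[image removed to reduce context size]",
    };
  }
  return messages;
}
```

Note the caveat that makes this subtle in practice: rewriting an old message also changes the byte prefix, so a real implementation has to strip consistently on every call, not once.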
This isn’t exceptional user behavior for developer tools. When developers hit cost anomalies in tools they depend on, they investigate. They read source code. They build instrumentation. They share findings. That’s what happened here.
The lesson: Your most engaged users — the ones on your highest-tier plans — are also the ones most capable of understanding exactly what your tool is doing. Building in the dark doesn’t save time; it just means the reverse-engineering happens in GitHub issues instead of in your documentation.
Observation 5: The 1M Context Window Changes the Economics
The move from 200K to 1M context windows is marketed as a capability upgrade. It is. But it also fundamentally changes the cost profile of every bug.
At 200K context, a cache miss rebuilds 200K tokens. At Opus input rates, that’s about $1. Noticeable but manageable.
At 1M context, a cache miss rebuilds 1M tokens. That’s $5 per miss. Three misses in a row — easy if you’re hitting the TTL downgrade — is $15. In a 30-minute window, we measured 5.25 million tokens of cache creation across 35 API calls.
Larger context windows mean:
- Higher absolute cost per cache miss
- More tokens to rebuild, which takes longer, which makes TTL expiry more likely
- Higher quota consumption rate, which pushes you toward the TTL downgrade boundary faster
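The scaling is linear and easy to sketch, using the roughly $5 per million input tokens implied by the figures above — an assumed rate; check current pricing before relying on it:

```javascript
// Back-of-envelope cost of rebuilding a cache after a miss.
// Rate is an assumption consistent with the ~$1-per-200K figure above.
const INPUT_RATE_PER_MTOK = 5; // USD per 1M input tokens (assumed)

function cacheMissCost(contextTokens, misses = 1) {
  return (contextTokens / 1_000_000) * INPUT_RATE_PER_MTOK * misses;
}

cacheMissCost(200_000);      // one miss at 200K context: $1
cacheMissCost(1_000_000, 3); // three misses at 1M context: $15
```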
The lesson: Context window increases need to be paired with improvements to cache management, not treated as independent features. The cost implications of cache bugs scale linearly with context size.
What We’d Like to See
We like Claude Code. We use it daily for serious work — multi-agent systems, production infrastructure, the kind of development where context window size and model quality genuinely matter. The models are excellent.
The tooling around those models has room to grow. Specifically:
- Built-in cost observability. Cache hit rate, TTL tier, quota utilization, per-session cost — visible in the tool, not requiring API header inspection.
- Documented cache behavior. What breaks the cache, what the TTL tiers are, how quota affects TTL. Users shouldn’t need to run live experiments to understand their cost model.
- Image lifecycle management. Automatic summarization or expiry for images in conversation history. The current behavior — carry full base64 indefinitely — is the most expensive possible default.
- Integration tests for cache stability. Prefix stability across resume, tool changes, MCP reconnections, and skill reloads as a tested invariant.
These aren’t unreasonable asks. They’re the standard expectations for any metered developer tool. AI coding assistants are still new enough that the ecosystem is figuring out norms. Our hope is that this series contributes to that conversation — with data, not just opinions.
The Full Series
- Part 1: The Problem — 100% quota burn in 2 hours, and what the numbers looked like
- Part 2: The Cache Architecture — How prompt caching works and three ways it breaks
- Part 3: The Community Fix — Building a fetch interceptor at the network layer
- Part 4: The TTL Discovery — The undocumented 1h→5m downgrade at the quota boundary
- Part 5: The Hidden Costs — Image persistence and cross-project contamination
- Part 6: What This Says About AI Tooling — You’re reading it
This series was written by Veritas Supera IT Solutions (VSITS) based on first-hand investigation on a Max 5x plan account running multi-agent workloads. All cost figures are from the Anthropic Admin Usage API. The fetch interceptor and investigation tools described in this series are our own work, built on foundations laid by community contributors @jmarianski, @VictorSun92, @TigerKay1926, @Renvect, and @RebelSyntax.
Veritas Supera IT Solutions (VSITS LLC) builds AI-augmented systems for technical teams. If you’re dealing with similar cost management challenges in your AI tooling, we’d like to hear from you.