Update (April 24, 2026): Anthropic published a post-mortem on April 23 acknowledging three issues: a reasoning effort default change (March 4–April 7), a caching bug that continuously cleared thinking on resumed sessions and “drove faster usage limit consumption” (March 26–April 10, fixed in v2.1.101), and a system prompt verbosity instruction (≤25 words between tool calls) that degraded coding quality by 3% (April 16–20, reverted April 20). Usage limits were reset for all subscribers as of April 23.

None of the three issues documented in this post — thinking token invisibility, the false rate limiter, or JSONL duplication — appears in the post-mortem. The requests we made in the final section remain unaddressed: thinking token counts are still not surfaced in API responses, client-side rate limit errors are still indistinguishable from API-side limits in the user-facing message, and PRELIM entries in local logs are still unlabeled.
In our companion post, Silent Context Degradation, we covered how Claude Code silently compacts your conversation history while you work. That post was about context quality — what the model can see.
This one is about money — specifically, quota consumption you can’t see and rate limit errors that aren’t real.
Both findings come from @ArkNill’s systematic analysis of 3,700+ captured API requests. We’re building on their work here and adding our own perspective from running multi-agent workloads on a Max 5x plan.
The Thinking Token Gap
Claude Code uses extended thinking — the model’s internal reasoning process that produces a chain-of-thought before generating a visible response. This is generally a good thing. Thinking tokens improve response quality, especially on complex tasks.
The problem: thinking tokens are invisible to the client but appear to count against your quota.
Here’s what ArkNill measured:
| Metric | Value |
|---|---|
| Visible output tokens per 1% of quota | 9,000 – 16,000 |
| Estimated total tokens per 1% of quota | 1,500,000 – 2,100,000 |
| Thinking tokens in API `output_tokens` field | Not included |
| Thinking tokens in quota consumption | Appears to count |
The gap between 16,000 visible tokens and 2,100,000 total tokens is where thinking lives. Each 1% of your 5-hour quota produces a modest amount of visible output but consumes a large volume of thinking tokens that never appear in the API response’s output_tokens field.
This matters because it breaks every cost model users try to build. If you’re tracking your token usage from API responses — which is the only client-side data available — you’re seeing a fraction of your actual consumption. Your quota meter climbs at a rate that doesn’t match your visible output, and there’s no field in the API response that explains the difference.
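The size of that fraction can be bounded with simple arithmetic from the table above. A sketch (note the per-1% totals include input and cache reads as well as thinking, so this bounds overall visibility rather than isolating thinking tokens):

```python
# Visible output tokens per 1% of quota (ArkNill's measured range)
visible_lo, visible_hi = 9_000, 16_000
# Estimated total tokens per 1% of quota (ArkNill's measured range)
total_lo, total_hi = 1_500_000, 2_100_000

# Fraction of quota consumption visible in API responses:
best_case = visible_hi / total_lo   # most favorable pairing of the ranges
worst_case = visible_lo / total_hi  # least favorable pairing

print(f"visible share of consumption: {worst_case:.2%} - {best_case:.2%}")
```

In other words, even in the best case, barely 1% of what the quota meter charges shows up in the response payload.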
What you can do about it
Update (April 17, 2026): Since this post was originally written, we’ve confirmed a workaround and quantified the impact:
`CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1` disables adaptive thinking on Opus 4.7. In our testing (Phase 2 burn rate collection through @fgrosswig’s gateway proxy), the Q5h burn rate dropped dramatically with thinking disabled. The invisible thinking tokens appear to account for roughly half of total quota consumption — consistent with the 2.4x burn rate multiplier we measured across 1,180 Opus 4.7 calls.

For contrast: OpenAI’s Codex GPT interface reports reasoning tokens explicitly as a separate line item (`reasoning: 77,473`). Anthropic’s adaptive thinking tokens remain invisible in the API response but are charged against your quota.
Our interceptor captures the anthropic-ratelimit-unified-5h-utilization header on every API call. If you’re tracking quota deltas between calls, you can estimate thinking overhead by comparing the quota consumed per call against the visible input + output tokens. But it’s an estimate — the decomposition isn’t available. The fact that a competitor surfaces this data by default makes the omission a choice, not a limitation.
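The delta-based estimate can be sketched as follows. This assumes the utilization header reports a 0–100 percentage, and the `tokens_per_percent` midpoint is our assumption drawn from ArkNill’s 1.5M–2.1M range, not a published figure:

```python
def estimate_hidden_tokens(util_before: float, util_after: float,
                           visible_tokens: int,
                           tokens_per_percent: int = 1_800_000) -> int:
    """Estimate tokens charged to quota but absent from the response.

    util_before / util_after are anthropic-ratelimit-unified-5h-utilization
    header values (assumed 0-100) captured around a single API call.
    visible_tokens is the call's input_tokens + output_tokens.
    tokens_per_percent is an assumed midpoint of the measured
    1.5M-2.1M per-1% range.
    """
    quota_tokens = (util_after - util_before) * tokens_per_percent
    return max(0, round(quota_tokens - visible_tokens))

# Example: a call that moved the meter 0.3% but showed only 22K tokens
hidden = estimate_hidden_tokens(41.2, 41.5, 22_000)
```

The result lumps together thinking tokens and any other uncounted overhead; without a server-side decomposition, that is the best a client can do.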
The False Rate Limiter
This one is particularly frustrating for multi-agent workflows.
Claude Code implements a client-side rate limiter that generates synthetic “Rate limit reached” errors without ever making an API call. The user sees a rate limit message, assumes they’ve hit Anthropic’s API limit, and waits. But the API was never contacted.
How to identify false rate limits
ArkNill identified the tell: synthetic rate limit errors have "model": "<synthetic>" and report zero token usage. A real API rate limit comes back from the server with actual model and usage data. A false one is manufactured locally.
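A detector based on that tell is straightforward. The event shape below (a JSON object with `model` and `usage` keys) is an assumption about the log format, not a documented schema:

```python
import json

def is_synthetic_rate_limit(event_line: str) -> bool:
    """Heuristic check for a client-manufactured rate limit event.

    ArkNill's tell: synthetic errors carry "model": "<synthetic>"
    and report zero token usage, whereas real API rate limits come
    back with an actual model name and usage data.
    """
    try:
        event = json.loads(event_line)
    except json.JSONDecodeError:
        return False
    usage = event.get("usage") or {}
    no_usage = not any(usage.get(k) for k in ("input_tokens", "output_tokens"))
    return event.get("model") == "<synthetic>" and no_usage
```

Run against a captured stream, this separates phantom throttling from genuine 429s before you adjust your workload around them.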
When it triggers
The client-side limiter fires when context_size × concurrent_requests exceeds an internal threshold. The exact threshold isn’t documented, but it disproportionately affects:
- Large context sessions — a single 600K context session has a much higher product than three 50K sessions
- Multi-agent workflows — running 3-5 concurrent agents (exactly our setup) pushes the product past the threshold even at modest context sizes
- Sessions after context growth — the limiter can start firing mid-session as context accumulates, even though the API would accept the request
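The trigger condition above can be modeled as a simple product check. The threshold value here is invented for illustration; the real internal value is undocumented:

```python
def would_trip_client_limiter(context_tokens: int,
                              concurrent_requests: int,
                              threshold: int = 500_000) -> bool:
    """Model of the client-side gate: context_size x concurrency.

    threshold is a placeholder, chosen only to illustrate why one
    large-context session can trip the limiter while several small
    sessions with the same total concurrency do not.
    """
    return context_tokens * concurrent_requests > threshold

# A single 600K-context session exceeds the (hypothetical) threshold,
# while three 50K-context sessions stay well under it.
single_large = would_trip_client_limiter(600_000, 1)
three_small = would_trip_client_limiter(50_000, 3)
```

It also shows the mid-session failure mode: hold concurrency fixed and let `context_tokens` grow, and a session that started under the threshold eventually crosses it.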
Why this matters
The false rate limiter creates a perverse dynamic: the users most likely to hit it are the ones on Max plans running serious workloads — the exact users paying the most and expecting the highest throughput.
When an agent hits a false rate limit, it pauses. It doesn’t retry immediately because it thinks it’s been rate-limited. In a multi-agent workflow, this introduces artificial serialization — agents waiting in line for a gate that doesn’t need to be there.
From our experience running 5 concurrent agents: unexplained pauses and “rate limit” messages were a recurring frustration. We attributed them to API-side limits and adjusted our workload patterns accordingly. Knowing that some of these were client-side phantom limits changes the calculus entirely.
What we’re building
Update (April 17, 2026): Claude Code v2.1.113 replaced the Node.js runtime with a compiled Bun binary, which killed the `--import` preload mechanism our interceptor used. We’re migrating to a local proxy architecture using `ANTHROPIC_BASE_URL` — an SDK contract that survives runtime changes. The proxy will include false rate limit detection: intercept at the request layer, detect synthetic errors by checking for the `"model": "<synthetic>"` marker, log them separately from real API errors, and optionally suppress them to let the actual API call through. Design details at #40.
The JSONL Duplication Problem
This is a local tooling issue, not a billing issue — but it will bite you if you’re building cost accounting from Claude Code’s conversation logs.
Claude Code’s local .jsonl journals contain duplicate token accounting entries created by extended thinking. Each API call generates 2-5 PRELIM entries before the FINAL entry, each carrying the same cache_read and cache_creation token counts:
| Entry Type | Frequency | Token Values |
|---|---|---|
| PRELIM | 2-5 per API call | Same as FINAL |
| FINAL | 1 per API call | Actual values |

Net effect: summing every entry inflates local token totals by 2.87x on average.
If you’re building local cost accounting from JSONL data — which is a reasonable thing to do, and something our claude-code-meter project does — you need to filter PRELIM entries or your totals will be inflated by roughly 3x. Server-side quota uses only FINAL entries; the PRELIM entries are a client-side logging artifact.
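Since the log entries carry no PRELIM/FINAL label, deduplication has to lean on structure. The sketch below assumes each log line is a JSON object whose `requestId` field groups the 2-5 PRELIM entries with their FINAL entry, and that the FINAL entry is written last; both are assumptions about an undocumented format:

```python
import json

def final_entries(jsonl_lines):
    """Keep only the last usage entry per API request.

    Assumes entries for one API call share a "requestId" and arrive
    in order, so the last line seen for each id is the FINAL entry
    with the actual token values.
    """
    latest = {}
    for line in jsonl_lines:
        entry = json.loads(line)
        rid = entry.get("requestId")
        if rid is not None:
            latest[rid] = entry  # later lines overwrite earlier PRELIMs
    return list(latest.values())
```

Summing `usage` over `final_entries(...)` rather than the raw file is what brings local totals back in line with server-side quota accounting.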
The Compound Picture
Step back and consider what a Max plan user is dealing with:
- Cache bugs (Parts 1-3) inflate token consumption 10-20x on resume
- TTL downgrade (Part 4) makes cache misses 12x more frequent past the quota boundary
- Image persistence (Part 5) carries hundreds of thousands of base64 tokens on every call
- Microcompact (companion post) silently degrades context quality while you work
- Thinking tokens consume quota at rates invisible to the client
- False rate limits throttle throughput on requests the API would accept
- JSONL duplication inflates any local accounting by ~3x
Each of these was discovered independently by different community members running different instrumentation approaches. ArkNill’s proxy captured what our interceptor couldn’t see. Our interceptor fixes what their proxy can only observe. @rwp65 identified the false rate limiter. @Sn3th documented microcompact internals.
No single user has visibility into all of these simultaneously. That’s the core problem: the observability gap between what Claude Code does and what users can see is wide enough for real money to disappear into.
What We’d Like
The pattern across all of these findings is consistent: mechanisms that affect cost and quality, operating without user visibility or control.
We’d like to see:
- Thinking token reporting in API responses — even an aggregate count would let users understand their quota consumption
- Rate limit source identification — distinguish client-side throttling from API-side limits in the user-facing message
- Compaction notifications — tell the user when context is being degraded, and which results were cleared
- JSONL deduplication — either filter PRELIM entries from the log or mark them clearly so tooling can distinguish them
These are observability features, not behavior changes. They don’t require Anthropic to change how any of these mechanisms work — just to make their operation visible.
Credit
This post draws directly from @ArkNill’s claude-code-hidden-problem-analysis — their systematic proxy-based capture approach is what made the thinking token gap and JSONL duplication measurable. Additional credit to @rwp65 for identifying the false rate limiter mechanism, and @Sn3th for microcompact analysis.
The community continues to build shared understanding of how Claude Code manages tokens and costs. If you’re instrumenting your own sessions — whether via proxy, interceptor, or log analysis — the more data points we have, the clearer the picture gets.
This post extends our Claude Code Cache Investigation series. The token analysis data is from ArkNill’s analysis (linked above). Our cache-fix interceptor (migrating to proxy architecture) and /coffee keepalive tool are at VSITS GitHub. Updated April 18, 2026 with adaptive thinking workaround, burn rate data, and proxy migration notes.
Built by Veritas Supera IT Solutions (VSITS). We build AI-augmented systems for technical teams. If you’re dealing with similar cost management challenges in your AI tooling, we’d like to hear from you.