AI & Development · 8 min read

The Bugs Go Deeper: Silent Context Degradation in Claude Code

In our cache investigation series, we documented how prompt cache bugs were causing 10-20x cost inflation on Max plans. We built a fetch interceptor to fix them. We thought we’d found the bottom.

We hadn’t.

@ArkNill and their claude-code-hidden-problem-analysis repository have done something we hadn’t: set up a transparent proxy (cc-relay) to capture 3,700+ API requests with full rate-limit headers, then systematically cataloged what Claude Code does to your conversation context between the application layer and the API. Their findings extend well beyond what our interceptor-based approach could see.

What they found: Claude Code silently guts your conversation history while you work. The context window you’re paying for — which may not even be 1M anymore (see below) — is reduced to roughly 40-80K tokens of live context by mechanisms running on every API call.


Three Compaction Systems You Can’t Disable

Claude Code runs three separate context compaction mechanisms on every API call. All three operate silently. All three bypass the DISABLE_AUTO_COMPACT and CLAUDE_AUTOCOMPACT_PCT_OVERRIDE environment variables that users assume control compaction behavior. We confirmed in the source that no DISABLE_MICROCOMPACT variable exists — the only compact-related disable vars (DISABLE_COMPACT, DISABLE_AUTO_COMPACT, DISABLE_CLAUDE_CODE_SM_COMPACT) don’t touch the microcompact or budget enforcement code paths. These are controlled exclusively via server-side GrowthBook flags.

Mechanism 1: Microcompact

Microcompact targets tool_use/tool_result pairs — the records of what tools Claude called and what they returned. It replaces actual tool results with placeholder text like [Old tool result content cleared], reducing content that may have been thousands of characters down to 1-41 characters.

ArkNill’s data across their captured sessions:

| Metric | Value |
| --- | --- |
| Microcompact clearing events detected | 327 |
| Target | Even-numbered content positions (tool_use/tool_result pairs) |
| Replacement size | 1-41 characters |
| User notification | None |

The clearing targets even-numbered positions in the content array — the tool_use and tool_result pairs that make up the working memory of a coding session. The file reads, grep results, bash outputs, and edit confirmations that Claude used to make decisions are silently replaced with placeholders.

From the model’s perspective, the context that informed earlier decisions simply disappears mid-session.
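The aftermath is straightforward to detect, which is how our interceptor tracked it. A minimal sketch, assuming the standard Messages API payload shape and the marker string observed in ArkNill's captures (the function name is ours):

```javascript
// Sketch: count microcompact placeholders in an outgoing Messages API
// payload. The marker string is the one observed in captured requests;
// countClearedResults is our name, not Claude Code's.
const CLEARED_MARKER = "[Old tool result content cleared]";

function countClearedResults(messages) {
  let cleared = 0;
  for (const message of messages) {
    if (!Array.isArray(message.content)) continue;
    for (const block of message.content) {
      if (block.type !== "tool_result") continue;
      // tool_result content can be a plain string or an array of blocks
      const text = Array.isArray(block.content)
        ? block.content.map((c) => c.text ?? "").join("")
        : String(block.content ?? "");
      if (text.includes(CLEARED_MARKER)) cleared += 1;
    }
  }
  return cleared;
}
```

Run this over the `messages` array of any captured request and the count tells you how much of the session's working memory has already been replaced.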

Mechanism 2: Budget Enforcement

A separate mechanism enforces character-count budgets on tool results. When aggregate tool result content exceeds a threshold (200,000 characters in ArkNill’s observation), budget enforcement kicks in and truncates results.

| Metric | Value |
| --- | --- |
| Budget enforcement events detected | 261 |
| Aggregate character threshold | ~200,000 characters |
| Result after enforcement | Truncated or cleared |
| User notification | None |
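ArkNill's writeup describes the observed behavior, not the implementation, so the following is a hypothetical reconstruction: evict the oldest results first until the aggregate fits under the threshold. The names and the oldest-first eviction order are our assumptions.

```javascript
// Hypothetical sketch of aggregate budget enforcement: if total
// tool_result characters exceed the threshold (~200,000 observed),
// replace the oldest results with the placeholder until the total fits.
// BUDGET, PLACEHOLDER, and oldest-first eviction are assumptions.
const BUDGET = 200_000;
const PLACEHOLDER = "[Old tool result content cleared]";

// `results` is an array of { content: string } in chronological order.
function enforceBudget(results, budget = BUDGET) {
  let total = results.reduce((sum, r) => sum + r.content.length, 0);
  return results.map((r) => {
    if (total <= budget) return r; // under budget now: keep the rest intact
    total -= r.content.length - PLACEHOLDER.length;
    return { ...r, content: PLACEHOLDER };
  });
}
```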

Mechanism 3: Per-Tool Caps

Individual tool types have their own character caps, separate from the aggregate budget. These are controlled by server-side feature flags:

| Tool | Character Cap |
| --- | --- |
| Bash | 30,000 |
| Grep | 20,000 |
| Global (other tools) | 50,000 |

A single grep result that returns 25,000 characters gets truncated to 20,000. A bash command that produces 40,000 characters of output gets truncated to 30,000. This happens before the result reaches the model.
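In code terms, the observed truncation amounts to something like this sketch. The cap values are from the table above; the function and constant names are ours, and the real implementation may differ (for instance, it may append a truncation notice rather than cutting silently):

```javascript
// Sketch of the per-tool caps as observed: each tool has its own limit,
// with a global fallback for everything else. Names are ours.
const TOOL_CAPS = { Bash: 30_000, Grep: 20_000 };
const GLOBAL_CAP = 50_000;

function capToolResult(toolName, text) {
  const cap = TOOL_CAPS[toolName] ?? GLOBAL_CAP;
  return text.length > cap ? text.slice(0, cap) : text;
}
```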


GrowthBook: Server-Controlled, User-Invisible

These thresholds aren’t hardcoded. They’re controlled via GrowthBook feature flags — server-side configuration that Anthropic can change at any time without a Claude Code release:

| Flag | Controls | Observed Value |
| --- | --- | --- |
| tengu_hawthorn_window | Aggregate character cap | 200,000 |
| tengu_pewter_kestrel | Per-tool character caps | Bash: 30K, Grep: 20K, Global: 50K |
| tengu_summarize_tool_results | Tells model to expect cleared content | true |

These flags are cached locally in ~/.claude.json — you can read them, but you can’t change them. They’re fetched from Anthropic’s servers and applied silently.
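If you want to see what values your own install is running under, the cached flags can be dug out of ~/.claude.json. A sketch that searches the parsed config for the flag names, since the exact nesting key inside the file is not something we can vouch for:

```javascript
// Sketch: pull the compaction-related flag values out of a parsed
// ~/.claude.json. The flag names are as observed; the nesting inside the
// file varies, so this walks the whole object rather than guessing a key.
const COMPACTION_FLAGS = [
  "tengu_hawthorn_window",
  "tengu_pewter_kestrel",
  "tengu_summarize_tool_results",
];

function pickCompactionFlags(config, keys = COMPACTION_FLAGS) {
  const found = {};
  const walk = (node) => {
    if (node === null || typeof node !== "object") return;
    for (const [key, value] of Object.entries(node)) {
      if (keys.includes(key)) found[key] = value;
      else walk(value);
    }
  };
  walk(config);
  return found;
}

// Usage (reads your real config file):
//   const { readFileSync } = require("node:fs");
//   const { homedir } = require("node:os");
//   const cfg = JSON.parse(readFileSync(`${homedir()}/.claude.json`, "utf8"));
//   console.log(pickCompactionFlags(cfg));
```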

The tengu_summarize_tool_results flag is telling: it instructs the model that tool results may have been cleared and to work with what remains. The system is designed around the assumption that context will be degraded.


What This Means for Working Sessions

Consider a typical development session. You’re working through a codebase:

  1. You read 10 files (tool_result: file contents)
  2. You grep for patterns across the project (tool_result: match listings)
  3. You run tests (tool_result: test output)
  4. You edit files and verify changes (tool_result: edit confirmations, file reads)

Each of these creates tool_use/tool_result pairs in the conversation history. The context grows as the model builds understanding of your codebase.

As the session continues, microcompact starts clearing the earlier results. The file contents Claude read in step 1? Replaced with [Old tool result content cleared] by step 4. The grep results that informed a refactoring decision? Gone. The model is now making decisions based on conclusions it drew from evidence it can no longer see.

This isn’t the same as the /compact command, which a user invokes deliberately and which provides a summary of the compacted context. Microcompact provides no summary. It replaces content with a placeholder and moves on.


The Effective Context Gap

The practical impact: a session advertised as having up to 1M tokens of context is operating with roughly 40-80K tokens of live context at any given time. Earlier context is progressively gutted by compaction mechanisms, leaving placeholder text where working memory used to be.

Update (April 17, 2026): The context gap is now worse than when we first wrote this. Anthropic has silently revoked the 1M context window from Max subscribers via a server-side experiment flag (context-1m-2025-08-07), dropping users from 1M to 200K tokens with no notification. This has happened at least three times — March 26, April 13, and April 17. At 200K, auto-compaction triggers every 15-20 exchanges instead of every few hours. Each compaction cycle is lossy. We experienced this firsthand: a long-running session that was coherent at 1M became confused, lost track of collaborators and project context, and duplicated work after being forced to 200K. Combined with Opus 4.7’s MRCR benchmark collapse (78.3% → 32.2% on long-context retrieval), the compound effect is a model that’s worse at long context being given less context to work with.

To be clear — some form of context management is necessary. You can’t send unbounded context to the API forever, and microcompact likely does prevent sessions from bloating to the point of failure. Without it, long sessions with heavy tool use would hit context limits faster and produce degraded model performance from sheer volume. The tradeoff isn’t wrong in principle.

But the gap between what’s advertised (1M context window, “5x more room”) and what’s delivered (40-80K of live context with silent degradation, on a window that may have been silently cut to 200K) is significant. And the fact that all three mechanisms bypass user-facing controls like DISABLE_AUTO_COMPACT means users who think they’ve opted out of automatic compaction haven’t.


What We Can See — and What’s Changing

Our cache-fix interceptor (v2.0.3, 16 fixes, 162 tests) operated at the request layer via Node.js --import preload — we could see the payload after Claude Code applied its compaction, detect the aftermath by scanning for [Old tool result content cleared] markers, and track how aggressively tool results were being gutted.

Update (April 17, 2026): Claude Code v2.1.113 replaced the Node.js runtime with a compiled Bun binary, killing the --import preload mechanism that the interceptor depended on. We’re migrating to a local proxy architecture using ANTHROPIC_BASE_URL — an SDK contract that survives runtime changes. The proxy gives us visibility into both request and response payloads, including cache tier, quota headers, and model-served-vs-requested (spoofing detection). Design details at #40.
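As a sketch of what response-side visibility buys: given a response's headers and parsed JSON body, a proxy can read the usage block and rate-limit headers that a request-only interceptor never sees. The field names follow the public Messages API response; the summarizer itself is hypothetical:

```javascript
// Hypothetical sketch of response-side inspection in a local proxy.
// usage.cache_read_input_tokens / cache_creation_input_tokens and the
// anthropic-ratelimit-* header family are documented API surface;
// summarizeResponse and its output shape are ours.
function summarizeResponse(requestedModel, headers, body) {
  return {
    servedModel: body?.model ?? null,
    // spoofing detection: did the API serve the model we asked for?
    modelMismatch: Boolean(body?.model) && body.model !== requestedModel,
    cacheReadTokens: body?.usage?.cache_read_input_tokens ?? 0,
    cacheWriteTokens: body?.usage?.cache_creation_input_tokens ?? 0,
    tokensRemaining: headers["anthropic-ratelimit-tokens-remaining"] ?? null,
  };
}
```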

The proxy also enables a new capability: real-time structural drift detection. Instead of post-hoc analysis of whether fixes fired, the proxy fingerprints every request’s structure and alerts when Anthropic ships changes that affect cache stability — before users report cost spikes.
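A minimal sketch of the idea, assuming nothing about the tool's internals: reduce each request to its structure (roles, content block types, cache_control placement), discard the text, and compare fingerprints across requests. In practice you would hash the resulting string; the names here are ours:

```javascript
// Sketch of structural fingerprinting: two requests with the same shape
// but different text produce the same fingerprint; a structural change
// (e.g. a moved cache_control breakpoint) produces a different one.
function structuralFingerprint(payload) {
  const shape = {
    model: payload.model,
    system: Array.isArray(payload.system)
      ? payload.system.map((b) => [b.type, "cache_control" in b])
      : typeof payload.system,
    messages: (payload.messages ?? []).map((m) => [
      m.role,
      Array.isArray(m.content)
        ? m.content.map((b) => [b.type, "cache_control" in b])
        : "string",
    ]),
  };
  // A real implementation would hash this; string comparison works for a sketch.
  return JSON.stringify(shape);
}
```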


Credit Where It’s Due

This analysis builds directly on @ArkNill’s systematic work in their claude-code-hidden-problem-analysis repository. Their transparent proxy approach — capturing 3,700+ API requests with full rate-limit headers — revealed mechanisms that our interceptor-based investigation couldn’t reach. Additional credit to @Sn3th for microcompact mechanism research, and @fgrosswig and @Commandershadow9 for measuring 34-143x token reduction impacts.

The community work on Claude Code cost analysis continues to be collaborative. ArkNill’s proxy-based capture and our interceptor-based fixes are complementary approaches to the same problem: understanding what Claude Code actually does with your tokens.

In our companion post, The Invisible Tax, we cover two more mechanisms ArkNill documented: extended thinking’s hidden quota impact and a client-side false rate limiter that blocks API calls that never needed blocking.


This post extends our Claude Code Cache Investigation series. The compaction data is from ArkNill’s analysis (linked above). Our cache-fix interceptor (migrating to proxy architecture) and investigation tools are at VSITS GitHub. Updated April 18, 2026 with context window revocation and proxy migration notes.

Built by Veritas Supera IT Solutions (VSITS). We build AI-augmented systems for technical teams. If you’re dealing with similar cost management challenges in your AI tooling, we’d like to hear from you.
