What Is Context Rot? Why Coding Agents Get Worse as Context Grows

Context rot is the performance degradation LLMs experience as input length increases. Here's what causes it, why it breaks coding agents, and how to fix it.

February 23, 2026 · 3 min read

Context rot is the degradation in LLM performance that happens as the input context grows longer. More tokens in, worse output out — even when the model's context window isn't close to full. Chroma's research tested 18 frontier models and found that every single one gets worse as input length increases.

• 30%+ performance drop from the lost-in-the-middle effect
• 18 frontier models tested (all degraded)
• 60% of agent time spent just retrieving context
• 70% context rot reduction with WarpGrep

What Causes Context Rot

Three mechanisms drive context rot, and they compound each other: the lost-in-the-middle effect, attention dilution at scale, and distractor interference. Understanding each one explains why simply "adding more context" makes models worse, not better.

The Lost-in-the-Middle Effect

Liu et al.'s research (Stanford/TACL 2024) showed that LLM performance drops by more than 30% when relevant information sits in the middle of the context rather than at the beginning or end. Performance follows a U-shaped curve: models attend strongly to the first and last tokens and poorly to everything between.

Position               Accuracy   Implication
Position 1 (start)     ~75%       Strong primacy bias
Position 10 (middle)   ~55%       Lost-in-the-middle blind spot
Position 20 (end)      ~72%       Strong recency bias

For a coding agent that greps for a function name, reads 8 files, and finds the relevant code in file #4 — that code is now sitting in the model's blind spot. The agent has the right information but can't effectively attend to it.
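If you control how retrieved context is assembled, the curve suggests a mitigation: put the strongest material where attention is strongest. A minimal sketch in Python (not from the paper; the scores and snippets are illustrative):

def order_for_attention(snippets):
    """Place high-scoring snippets at the edges of the prompt,
    where models attend best, and bury weak ones in the middle.

    `snippets` is a list of (relevance_score, text) pairs.
    """
    ranked = sorted(snippets, key=lambda s: s[0], reverse=True)
    front, back = [], []
    # Alternate placement: best snippet first, second-best last,
    # so the weakest material lands in the middle blind spot.
    for i, snippet in enumerate(ranked):
        (front if i % 2 == 0 else back).append(snippet)
    return front + back[::-1]

results = [(0.9, "stripe.ts: webhook handler"),
           (0.2, "fixtures/stripe_test.json"),
           (0.7, "webhooks/router.ts"),
           (0.1, "legacy/stripe_v1.ts")]
for score, text in order_for_attention(results):
    print(f"{score:.1f}  {text}")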

Attention Dilution at Scale

The mechanism is architectural. Transformer attention is quadratic: at 10,000 tokens, the model tracks 100 million pairwise relationships. At 100,000 tokens, that's 10 billion. More context doesn't just dilute relevance: attention weight is a fixed budget (the softmax sums to one), so every added token takes probability mass away from the tokens that matter.

• 100M pairwise relationships at 10K tokens
• 10B pairwise relationships at 100K tokens
• 1T pairwise relationships at 1M tokens
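The arithmetic is easy to verify yourself, assuming naive full self-attention where every token attends to every other token:

# Naive full self-attention: every token attends to every token,
# so pairwise relationships grow as n^2.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {n * n:>21,} pairwise relationships")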

As Chroma's researchers put it: "What matters more than whether relevant information is present is how that information is presented."

Distractor Interference

Chroma's study found that adding semantically similar but irrelevant content — distractors — causes further degradation beyond what context length alone explains. Distractors that are topically related to the query but factually irrelevant appeared most frequently in hallucinated responses.

This is exactly what happens in code

When a coding agent searches for a webhook handler, its context fills up with test fixtures, deprecated implementations, and similarly named functions from unrelated modules. Each one is semantically close to the target but factually irrelevant — the worst kind of distractor for an LLM.

Why Context Rot Hits Coding Agents Hardest

General chat conversations might stay under a few thousand tokens. Coding agents routinely push past 100k. A typical multi-step coding task accumulates context like sediment:

Context accumulation in a typical coding task

Step 1: Agent reads the issue description        →     500 tokens
Step 2: Agent greps, reads 4-5 candidate files   →   8,000 tokens
Step 3: Agent needs more context, reads 3 more   →   6,000 tokens
Step 4: Agent backtracks, reads test files       →   5,000 tokens
Step 5: Agent finds the right file, but carries  → ~20,000 tokens
        ↑ most of this is irrelevant — all of it hurts

Cognition measured this directly: agents spend over 60% of their first turn just retrieving context. Not editing. Not reasoning. Searching. Each search result stays in the context window for the rest of the session.

An OpenReview study on token consumption confirmed that input tokens dominate overall cost, with some runs consuming 10x more tokens than others on similar tasks. The variance was driven almost entirely by how efficiently the agent searched — not how well it coded.

Time kills accuracy

Research on long-running agents shows that every agent's success rate decreases after 35 minutes, and doubling task duration quadruples the failure rate. Longer tasks mean more accumulated context, and more context means worse performance.

What Doesn't Fix Context Rot

Bigger Context Windows Don't Help

The intuitive fix — just give the model more room — doesn't work. Chroma tested models across 8 different input lengths and found that performance degrades at every length increment, not just near the limit. A model with a 1M token context window still exhibits context rot at 50k tokens.

RAG Doesn't Scale for Code

Retrieval-augmented generation hits both mathematical and practical limits for code. Google DeepMind proved that embedding-based retrieval has a hard mathematical ceiling: 512-dimensional embeddings break down around 500K documents. BM25 (keyword search) outperformed neural embedding models by a wide margin on their benchmark.

Code search queries are structurally adversarial for embeddings. "Where does the auth middleware check JWT expiration?" requires understanding call graphs, import chains, and framework conventions. A single embedding vector can't capture these multi-hop relationships.

Compaction Helps but Doesn't Solve

Modern coding agents use context compaction — summarizing conversation history when approaching context limits. This buys time but doesn't solve the root problem: irrelevant context accumulated during search is already in the window before compaction triggers.
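A minimal sketch of how such a trigger typically works (the threshold, token counter, and summarizer are stand-ins, not any specific agent's implementation). Note that everything handed to the summarizer was already sitting in the window:

COMPACTION_THRESHOLD = 0.8  # compact once 80% of the window is used

def maybe_compact(messages, window_size, count_tokens, summarize):
    """Summarize older history when the context nears its limit.

    `count_tokens` and `summarize` stand in for whatever tokenizer
    and summarization call the agent framework provides.
    """
    used = sum(count_tokens(m) for m in messages)
    if used < COMPACTION_THRESHOLD * window_size or len(messages) <= 4:
        return messages  # nothing to do yet
    # Keep the most recent turns verbatim; collapse the rest into
    # one summary message. The noise was already in the window.
    head, tail = messages[:-4], messages[-4:]
    return [summarize(head)] + tail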

Approach            What It Does                          Why It Falls Short
Bigger windows      More room for tokens                  Rot happens at every length, not just near limits
RAG / embeddings    Vector similarity search              Math ceiling at ~500K docs; can't capture code structure
Compaction          Summarize history at limits           Irrelevant context already accumulated before trigger
Context isolation   Subagent search in separate windows   It doesn't; it prevents rot at the source

Compaction is a treatment. Context isolation is the cure.

How to Prevent Context Rot

The Subagent Architecture

The fix for context rot isn't making models better at long contexts. It's keeping their context short.

Anthropic's multi-agent research system demonstrated this directly. Their architecture — an Opus 4 lead agent delegating to Sonnet 4 subagents — outperformed a single Opus 4 agent by 90.2% on research tasks. Not because the subagents were smarter. Because the lead agent's context stayed clean.

Lead Agent

Holds task-level context: the goal, the plan, high-level progress. Never polluted with search traces or dead-end explorations.

Search Subagent

Explores in its own context window. Reads, rejects, and backtracks without polluting the parent. Returns only relevant file and line ranges.

Condensed Return

Subagent returns 50-200 tokens of precise context. The lead agent never sees the 15 files that were explored and rejected.

This is why every major coding agent has converged on the same pattern. Claude Code uses Task agents in parallel context windows. Cursor runs background search agents. Cognition built SWE-grep. The principle is universal: isolate search into a dedicated context window so the reasoning model's context stays clean.

Context Isolation in Practice

How context isolation prevents rot

# WITHOUT isolation (context rot accumulates):
1. Coding model searches for Stripe webhook handler
2. Reads 15 files — test fixtures, deprecated code, wrong modules
3. All 15 files stay in context (20,000+ tokens of noise)
4. Model finds the right file but can't attend to it effectively
→ Result: hallucinated edit, wrong file path, wasted tokens

# WITH isolation (context stays clean):
1. Coding model delegates search to a subagent
2. Subagent explores 15 files in its own context window
3. Subagent returns: "src/api/webhooks/stripe.ts, lines 47-89"
4. Coding model receives 150 tokens of precise context
→ Result: correct edit on first attempt, 70% less context rot
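In code, the boundary looks roughly like this. A hypothetical sketch: `run_subagent` and `coding_model` stand in for whatever agent framework you use, and only the condensed answer crosses back to the parent:

def search_with_isolation(query, run_subagent, coding_model):
    """Delegate code search to a subagent with its own context window."""
    # The subagent may read 15 files internally; its exploration
    # trace never enters the parent context.
    pointer = run_subagent(
        task=f"Find the code relevant to: {query}. "
             "Return only file paths and line ranges."
    )
    # The coding model receives ~150 tokens of precise context
    # (e.g. "src/api/webhooks/stripe.ts, lines 47-89") instead of
    # 20,000+ tokens of search noise.
    return coding_model(context=pointer, instruction=query)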

Measured Results

On long-horizon coding tasks, WarpGrep — an RL-trained search subagent — measured a 70% reduction in context rot and 40% speedup in end-to-end task completion. When paired with frontier models on SWE-Bench Pro, it lifts every model to #1, while being 15.6% cheaper and 28% faster than letting the coding model search on its own.

• 70% context rot reduction
• 40% faster task completion
• 15.6% cheaper than self-search
• 28% faster than self-search

Adding a model makes it cheaper

The cost reduction is counterintuitive. Adding a model to the system makes it cheaper because the expensive model stops wasting tokens on search. It sees fewer irrelevant files, generates fewer wasted tokens, and finishes sooner.

Context Engineering Over Context Capacity

The broader lesson from context rot research is that context window size is the wrong metric to optimize. What matters is context quality — the signal-to-noise ratio of what the model sees.

Anthropic calls this context engineering: the discipline of curating and maintaining the optimal set of tokens during inference. The goal isn't to fit more tokens in. It's to find the smallest possible set of high-signal tokens for the task at hand.

For coding agents, this means:

  • Isolate search into subagents with their own context windows
  • Return precise results — file and line ranges, not whole files
  • Discard exploration traces — the parent model should never see the search process
  • Compress early — summarize intermediate results before they accumulate (see the sketch below)
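
To make the last two points concrete: a hypothetical helper (none of these names come from a real library) that compresses a search hit into a pointer before it ever reaches the parent context:

def to_pointer(path, lines, snippet, max_chars=200):
    """Compress a search hit into a short, high-signal pointer:
    a file:line reference plus a truncated preview, not the whole file."""
    start, end = lines
    return f"{path}:{start}-{end}  {snippet[:max_chars]}"

print(to_pointer("src/api/webhooks/stripe.ts", (47, 89),
                 "export async function handleStripeWebhook(req) {"))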

The models are already good enough. The constraint is what you put in front of them.

Frequently Asked Questions

What is context rot in LLMs?

Context rot is the degradation in LLM performance that occurs as input context length increases. Models produce less accurate, less reliable outputs when processing longer inputs — even when the context window isn't full. Chroma's 2025 research showed that all 18 tested frontier models exhibit this behavior.

Why does context rot happen?

Context rot has three causes. First, the "lost in the middle" effect: models attend strongly to tokens at the beginning and end of context but poorly to the middle (Liu et al., 2024). Second, quadratic attention scaling means more tokens quadratically increase the pairwise relationships the model must track. Third, semantically similar distractors interfere with the model's ability to identify relevant information.

How does context rot affect coding agents?

Coding agents are especially vulnerable because they accumulate context during multi-step tasks. Each file read, grep result, and exploration dead-end stays in the context window. By the time the agent finds the right code, it may be carrying 20,000+ tokens of irrelevant context that degrades its ability to reason about and edit the relevant files.

Does a bigger context window prevent context rot?

No. Chroma's research showed performance degrades at every context length increment, not just near the limit. A 1M token context window still exhibits context rot at 50k tokens. The fix is reducing context noise, not increasing context capacity.

How do you prevent context rot?

The most effective strategy is context isolation through subagent architectures. Delegate search tasks to specialized agents that operate in their own context windows and return only relevant results. This keeps the main model's context clean. Anthropic's multi-agent system improved performance by 90% using this approach.

What is context engineering?

Context engineering is the discipline of curating and maintaining the optimal set of tokens during LLM inference. Unlike prompt engineering (which focuses on a single input), context engineering manages the entire context state across multi-turn agent interactions — including what to add, what to remove, and when to summarize.

Stop Context Rot Before It Starts

WarpGrep is an RL-trained search subagent that isolates code retrieval into its own context window. 70% less context rot, 40% faster task completion, and every frontier model lifted to #1 on SWE-Bench Pro.