Context rot is the degradation in LLM performance that happens as the input context grows longer. More tokens in, worse output out — even when the model's context window isn't close to full. Chroma's research tested 18 frontier models and found that every single one gets worse as input length increases.
What Causes Context Rot
Three mechanisms drive context rot, and they compound each other: the lost-in-the-middle effect, attention dilution at scale, and distractor interference. Understanding each one explains why simply "adding more context" makes models worse, not better.
The Lost-in-the-Middle Effect
Liu et al.'s research (Stanford/TACL 2024) showed that LLM performance drops by more than 30% when relevant information sits in the middle of the context rather than at the beginning or end. Performance follows a U-shaped curve: models attend strongly to the first and last tokens and poorly to everything between.
| Position | Accuracy | Implication |
|---|---|---|
| Position 1 (start) | ~75% | Strong primacy bias |
| Position 10 (middle) | ~55% | Lost-in-the-middle blind spot |
| Position 20 (end) | ~72% | Strong recency bias |
For a coding agent that greps for a function name, reads 8 files, and finds the relevant code in file #4, that code now sits in the model's blind spot. The agent has the right information but can't effectively attend to it.
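A minimal sketch of how a needle-position sweep like Liu et al.'s is set up. Only the prompt construction is shown (scoring the U-shaped curve would require calls to a real model); the needle text and filler documents here are illustrative.

```python
# Place one relevant "needle" document at a chosen position among
# irrelevant fillers, mirroring the lost-in-the-middle experimental setup.
def build_prompt(needle: str, distractors: list[str], position: int) -> str:
    """Insert the one relevant document at `position` (0-based) among distractors."""
    docs = distractors[:position] + [needle] + distractors[position:]
    body = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    return body + "\n\nAnswer the question using only the documents above."

needle = "The deploy key is rotated every 90 days."
fillers = [f"Unrelated operational note number {i}." for i in range(19)]

start = build_prompt(needle, fillers, 0)   # needle is Document 1: strong recall
middle = build_prompt(needle, fillers, 9)  # needle is Document 10: the blind spot
```

Sweeping `position` across the full range and measuring answer accuracy at each slot is what produces the U-shaped curve in the table above.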
Attention Dilution at Scale
The mechanism is architectural. Transformer attention is quadratic: at 10,000 tokens, the model tracks 100 million pairwise relationships. At 100,000 tokens, that's 10 billion. More context doesn't just dilute relevance — it makes the model physically worse at attending to what matters.
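The arithmetic is easy to check directly: full self-attention scores every token against every other token, so the work grows with the square of the context length.

```python
# Pairwise relationships a full-attention layer must score at a given
# context length: n tokens -> n * n query-key pairs.
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return n_tokens * n_tokens

# 10,000 tokens  -> 100,000,000 pairs
# 100,000 tokens -> 10,000,000,000 pairs
for n in (10_000, 100_000):
    print(f"{n:>7,} tokens -> {attention_pairs(n):>14,} pairwise scores")
```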
As Chroma's researchers put it: "What matters more than whether relevant information is present is how that information is presented."
Distractor Interference
Chroma's study found that adding semantically similar but irrelevant content — distractors — causes further degradation beyond what context length alone explains. Distractors that are topically related to the query but factually irrelevant appeared most frequently in hallucinated responses.
This is exactly what happens in code
When a coding agent searches for a webhook handler, its context fills up with test fixtures, deprecated implementations, and similarly-named functions from unrelated modules. Each one is semantically close to the target but factually irrelevant — the worst kind of distractor for an LLM.
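A toy illustration of why near-duplicate code makes such a strong distractor. Token overlap here is a crude stand-in for embedding similarity, and the query and code strings are invented for the example, but the effect is the same: a deprecated copy of a handler scores about as "relevant" as the live one.

```python
# Jaccard overlap on whitespace tokens: a deliberately crude similarity
# measure, used only to show that a deprecated near-duplicate scores
# about as high as the code the query actually wants.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

query = "stripe webhook handler verify signature"
live = "def handle_stripe_webhook(req): verify signature then dispatch event"
dead = "def handle_stripe_webhook_v1(req): verify signature legacy deprecated"

# The live handler and the deprecated one score within a few hundredths
# of each other -- similarity alone cannot tell them apart.
print(jaccard(query, live), jaccard(query, dead))
```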
Why Context Rot Hits Coding Agents Hardest
General chat conversations might stay under a few thousand tokens. Coding agents routinely push past 100k. A typical multi-step coding task accumulates context like sediment:
Context accumulation in a typical coding task

```text
Step 1: Agent reads the issue description       →    500 tokens
Step 2: Agent greps, reads 4-5 candidate files  →  8,000 tokens
Step 3: Agent needs more context, reads 3 more  →  6,000 tokens
Step 4: Agent backtracks, reads test files      →  5,000 tokens
Step 5: Agent finds the right file, but carries → 20,000 tokens
        ↑ most of this is irrelevant, and all of it hurts
```

Cognition measured this directly: agents spend over 60% of their first turn just retrieving context. Not editing. Not reasoning. Searching. Each search result stays in the context window for the rest of the session.
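Tallying the accumulation above makes the ratio concrete. The per-step token counts mirror the example; the estimate of how much code the final edit actually needs is an illustrative assumption.

```python
# Ledger of search context carried across a multi-step coding task.
# Step sizes follow the example above; `relevant` is a rough, assumed
# estimate of the code the edit actually depends on.
steps = [
    ("read issue description", 500),
    ("grep + read 4-5 candidate files", 8_000),
    ("read 3 more files", 6_000),
    ("backtrack into test files", 5_000),
]
carried = sum(tokens for _, tokens in steps)  # ~20,000 tokens in the window
relevant = 1_500                              # assumed size of the code that matters
print(f"carried {carried:,} tokens; roughly {100 * relevant // carried}% is signal")
```

Under these assumptions, over 90% of the carried context is noise by the time the agent starts editing.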
An OpenReview study on token consumption confirmed that input tokens dominate overall cost, with some runs consuming 10x more tokens than others on similar tasks. The variance was driven almost entirely by how efficiently the agent searched — not how well it coded.
Time kills accuracy
Research on long-running agents shows that every agent's success rate decreases after 35 minutes, and doubling task duration quadruples the failure rate. Longer tasks mean more accumulated context, and more context means worse performance.
What Doesn't Fix Context Rot
Bigger Context Windows Don't Help
The intuitive fix — just give the model more room — doesn't work. Chroma tested models across 8 different input lengths and found that performance degrades at every length increment, not just near the limit. A model with a 1M token context window still exhibits context rot at 50k tokens.
RAG Doesn't Scale for Code
Retrieval-augmented generation hits both mathematical and practical limits for code. Google DeepMind showed that embedding-based retrieval has a hard mathematical ceiling: 512-dimensional embeddings break down around 500K documents. BM25 (keyword search) outperformed neural embedding models by a wide margin on their benchmark.
Code search queries are structurally adversarial for embeddings. "Where does the auth middleware check JWT expiration?" requires understanding call graphs, import chains, and framework conventions. A single embedding vector can't capture these multi-hop relationships.
Compaction Helps but Doesn't Solve
Modern coding agents use context compaction — summarizing conversation history when approaching context limits. This buys time but doesn't solve the root problem: irrelevant context accumulated during search is already in the window before compaction triggers.
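A minimal sketch of threshold-based compaction, with a toy whitespace tokenizer and a stand-in summarizer (real agents call a model to write the summary). Note the limitation the text describes: the trigger fires only after the noise is already in the window.

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for a real tokenizer

def compact(history: list[str], budget: int) -> list[str]:
    """Summarize older turns once usage crosses ~80% of the token budget."""
    used = sum(count_tokens(turn) for turn in history)
    if used <= int(0.8 * budget):
        return history  # under threshold: noise keeps accumulating untouched
    kept, acc = [], 0
    for turn in reversed(history):  # retain the recent turns that fit
        acc += count_tokens(turn)
        if acc > budget // 2:
            break
        kept.append(turn)
    # Stand-in for a model-written summary of everything being dropped.
    summary = f"[summary of {len(history) - len(kept)} earlier turns]"
    return [summary] + kept[::-1]
```

Everything the agent read before the threshold, relevant or not, has already occupied the window and shaped its behavior; compaction only cleans up after the fact.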
| Approach | What It Does | Outcome |
|---|---|---|
| Bigger windows | More room for tokens | Rot happens at every length, not just near limits |
| RAG / embeddings | Vector similarity search | Math ceiling at 500K docs; can't capture code structure |
| Compaction | Summarize history at limits | Irrelevant context already accumulated before trigger |
| Context isolation | Subagent search in separate windows | Prevents rot at the source |
Compaction is a treatment. Context isolation is the cure.
How to Prevent Context Rot
The Subagent Architecture
The fix for context rot isn't making models better at long contexts. It's keeping their context short.
Anthropic's multi-agent research system demonstrated this directly. Their architecture — an Opus 4 lead agent delegating to Sonnet 4 subagents — outperformed a single Opus 4 agent by 90.2% on research tasks. Not because the subagents were smarter. Because the lead agent's context stayed clean.
Lead Agent
Holds task-level context: the goal, the plan, high-level progress. Never polluted with search traces or dead-end explorations.
Search Subagent
Explores in its own context window. Reads, rejects, and backtracks without polluting the parent. Returns only relevant file and line ranges.
Condensed Return
Subagent returns 50-200 tokens of precise context. The lead agent never sees the 15 files that were explored and rejected.
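The interface between the two roles can be sketched in a few lines. The function and type names here are illustrative, not any particular agent's API; in a real system the subagent would be a separate model call with its own context window, stubbed out below. The returned file path and line range are taken from the worked example later in this article.

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    path: str
    line_range: tuple[int, int]
    note: str  # one-line justification; the whole result is ~50-200 tokens

def run_search_subagent(query: str) -> SearchResult:
    # Stub: a real subagent may read dozens of files in its own context
    # window. None of that exploration is ever returned to the caller.
    return SearchResult("src/api/webhooks/stripe.ts", (47, 89),
                        "handles signature verification and dispatch")

def lead_agent_context(task: str, result: SearchResult) -> str:
    # The lead agent's context holds the task plus the condensed result,
    # never the subagent's search trace.
    return (f"Task: {task}\n"
            f"Relevant code: {result.path}:{result.line_range[0]}-"
            f"{result.line_range[1]} ({result.note})")
```

The key design choice is the return type: a path, a line range, and a one-line note, so the lead agent's window grows by a couple hundred tokens per search instead of tens of thousands.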
This is why every major coding agent has converged on the same pattern. Claude Code uses Task agents in parallel context windows. Cursor runs background search agents. Cognition built SWE-grep. The principle is universal: isolate search into a dedicated context window so the reasoning model's context stays clean.
Context Isolation in Practice
How context isolation prevents rot

```text
# WITHOUT isolation (context rot accumulates):
1. Coding model searches for Stripe webhook handler
2. Reads 15 files: test fixtures, deprecated code, wrong modules
3. All 15 files stay in context (20,000+ tokens of noise)
4. Model finds the right file but can't attend to it effectively
→ Result: hallucinated edit, wrong file path, wasted tokens

# WITH isolation (context stays clean):
1. Coding model delegates search to a subagent
2. Subagent explores 15 files in its own context window
3. Subagent returns: "src/api/webhooks/stripe.ts, lines 47-89"
4. Coding model receives 150 tokens of precise context
→ Result: correct edit on first attempt, 70% less context rot
```

Measured Results
On long-horizon coding tasks, WarpGrep — an RL-trained search subagent — measured a 70% reduction in context rot and 40% speedup in end-to-end task completion. When paired with frontier models on SWE-Bench Pro, it lifts every model to #1, while being 15.6% cheaper and 28% faster than letting the coding model search on its own.
Adding a model makes it cheaper
The cost reduction is counterintuitive. Adding a model to the system makes it cheaper because the expensive model stops wasting tokens on search. It sees fewer irrelevant files, generates fewer wasted tokens, and finishes sooner.
Context Engineering Over Context Capacity
The broader lesson from context rot research is that context window size is the wrong metric to optimize. What matters is context quality — the signal-to-noise ratio of what the model sees.
Anthropic calls this context engineering: the discipline of curating and maintaining the optimal set of tokens during inference. The goal isn't to fit more tokens in. It's to find the smallest possible set of high-signal tokens for the task at hand.
For coding agents, this means:
- Isolate search into subagents with their own context windows
- Return precise results — file and line ranges, not whole files
- Discard exploration traces — the parent model should never see the search process
- Compress early — summarize intermediate results before they accumulate
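The checklist above can be applied mechanically at the subagent boundary. One possible return format, with hypothetical field names chosen for this sketch: file:line ranges plus a one-line rationale, capped so early compression happens by construction.

```python
def condense(findings: list[dict]) -> str:
    """Emit file:line ranges with one-line rationales, capped at five hits.
    Whole-file contents and exploration traces never pass this boundary."""
    lines = [f"{f['path']}:{f['start']}-{f['end']}  # {f['why']}"
             for f in findings[:5]]  # compress early: top hits only
    return "\n".join(lines)

out = condense([
    {"path": "src/auth/jwt.ts", "start": 12, "end": 40, "why": "expiry check"},
    {"path": "src/auth/middleware.ts", "start": 5, "end": 22, "why": "entry point"},
])
print(out)
```

Whatever the exact schema, the invariant is the same: the parent model receives locations and reasons, never the raw material of the search.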
The models are already good enough. The constraint is what you put in front of them.
Frequently Asked Questions
What is context rot in LLMs?
Context rot is the degradation in LLM performance that occurs as input context length increases. Models produce less accurate, less reliable outputs when processing longer inputs — even when the context window isn't full. Chroma's 2025 research showed that all 18 tested frontier models exhibit this behavior.
Why does context rot happen?
Context rot has three causes. First, the "lost in the middle" effect: models attend strongly to tokens at the beginning and end of the context but poorly to the middle (Liu et al., 2024). Second, quadratic attention scaling: the number of pairwise relationships the model must track grows with the square of the input length. Third, semantically similar distractors interfere with the model's ability to identify relevant information.
How does context rot affect coding agents?
Coding agents are especially vulnerable because they accumulate context during multi-step tasks. Each file read, grep result, and exploration dead-end stays in the context window. By the time the agent finds the right code, it may be carrying 20,000+ tokens of irrelevant context that degrades its ability to reason about and edit the relevant files.
Does a bigger context window prevent context rot?
No. Chroma's research showed performance degrades at every context length increment, not just near the limit. A 1M token context window still exhibits context rot at 50k tokens. The fix is reducing context noise, not increasing context capacity.
How do you prevent context rot?
The most effective strategy is context isolation through subagent architectures. Delegate search tasks to specialized agents that operate in their own context windows and return only relevant results. This keeps the main model's context clean. Anthropic's multi-agent system improved performance by 90% using this approach.
What is context engineering?
Context engineering is the discipline of curating and maintaining the optimal set of tokens during LLM inference. Unlike prompt engineering (which focuses on a single input), context engineering manages the entire context state across multi-turn agent interactions — including what to add, what to remove, and when to summarize.
Stop Context Rot Before It Starts
WarpGrep is an RL-trained search subagent that isolates code retrieval into its own context window. 70% less context rot, 40% faster task completion, and every frontier model lifted to #1 on SWE-Bench Pro.