Coding agents are good at writing code. They're bad at finding it. Agents spend 60%+ of their time searching for context, and the quality of that search determines whether the agent succeeds or fails. Not the model size. Not the context window. The search.
What Is Agentic Search?
Agentic search is an AI search paradigm where a model autonomously plans and executes multi-step searches using tools — grep, file read, directory listing — with reasoning between each step. Instead of retrieving documents in a single pass, the agent iterates: it searches, reads results, decides what to search next, and stops when it has enough context.
The key distinction from traditional retrieval: the search process itself involves reasoning. The agent doesn't just match patterns. It forms hypotheses about where code might live, tests those hypotheses with tool calls, and follows causal chains across files.
A concrete example. Say the agent needs to find how authentication tokens are validated in a large codebase:
- Semantic search embeds "auth token validation" and returns the 10 nearest code chunks. Maybe the right file is in there. Maybe it returns test fixtures, deprecated handlers, and a README section about auth.
- Lexical search (grep) matches `validateToken` and returns every occurrence. The agent gets the function definition, 15 call sites, 4 test mocks, and a changelog entry.
- Agentic search greps for `validateToken`, finds the function in `auth/middleware.ts`, reads it, sees it calls `decodeJWT` from `lib/crypto`, follows that import, finds the actual validation logic, and returns the two precise file spans the coding model needs.
Agentic search follows the logical structure of the code. Semantic search and lexical search return proximity matches. That difference matters.
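The multi-hop walkthrough above can be sketched as a small loop over an in-memory stand-in for the repo: grep for the symbol, prefer the file that defines it, then follow its import. The file contents and regexes are illustrative assumptions; only the file names (`auth/middleware.ts`, `lib/crypto`) come from the example.

```python
import re

# Toy in-memory "repo" standing in for the example codebase above.
# File contents are illustrative assumptions, not real project code.
REPO = {
    "auth/middleware.ts": (
        "import { decodeJWT } from '../lib/crypto';\n"
        "export function validateToken(t: string) { return decodeJWT(t); }\n"
    ),
    "lib/crypto.ts": (
        "export function decodeJWT(t: string) { /* actual validation */ }\n"
    ),
    "tests/auth.test.ts": "mock('validateToken');\n",
}

def grep(pattern: str) -> list[str]:
    """Return files whose contents match the pattern (like `grep -l`)."""
    return [path for path, text in REPO.items() if re.search(pattern, text)]

def find_definition(symbol: str) -> str:
    """Prefer the file that *defines* the symbol over mere call sites."""
    for path in grep(rf"function {symbol}\b"):
        return path
    raise LookupError(symbol)

def follow_import(path: str, symbol: str) -> str:
    """Resolve `import { symbol } from '...'` to the imported module's file."""
    m = re.search(rf"import .*\b{symbol}\b.* from '([^']+)'", REPO[path])
    module = m.group(1).lstrip("./")   # '../lib/crypto' -> 'lib/crypto'
    return module + ".ts"

# Multi-hop: grep -> read the definition -> follow the import chain.
hop1 = find_definition("validateToken")   # the middleware file
hop2 = follow_import(hop1, "decodeJWT")   # the crypto module it imports
```

The two hops land on exactly the two file spans the walkthrough describes, while the test mock never enters the result at all.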
Agentic Search vs Semantic Search vs Lexical Search
| Dimension | Semantic Search | Lexical Search | Agentic Search |
|---|---|---|---|
| How it works | Embed query + docs, return nearest neighbors | Pattern matching (grep, ripgrep, regex) | Multi-turn reasoning with parallel tool calls |
| Strengths | Fuzzy matching, conceptual similarity | Exact matches, fast, zero indexing cost | Follows causal chains, multi-file reasoning |
| Weaknesses | No causal logic, stale embeddings | Misses semantic intent, returns noise | Higher compute cost, needs training |
| Best for | Document/knowledge-base retrieval | Symbol lookup, known identifiers | Multi-file code understanding |
| Typical latency | < 100ms | < 50ms | 2-8 seconds |
| Indexing required | Yes (embeddings) | No | No |
| Handles code structure | No — treats code as flat text | Partially — matches symbols | Yes — follows imports, call chains, data flow |
Why Semantic Search Falls Short for Code
Semantic search works by converting queries and documents into embedding vectors and finding the closest matches. For natural language documents, this is powerful — "how to reset password" matches documentation about "credential recovery" even without keyword overlap.
For code, it falls short. Google DeepMind proved a mathematical ceiling: for any embedding dimension, there is a hard cap on the number and complexity of query-document relationships a model can represent. Embeddings of size 512 break down around 500K documents. On the LIMIT benchmark, recall@100 fell below 20% for state-of-the-art embedding models.
But the deeper problem is structural. Code has causal relationships that embeddings cannot capture. The query "where does the auth middleware check JWT expiration?" requires understanding call graphs, import chains, and framework conventions. A single embedding vector flattens all of that into a point in space.
Embeddings go stale
Production codebases change constantly, and stale embeddings cause up to 20% performance declines in downstream tasks. This is why Claude Code uses no embeddings at all — Anthropic chose grep over vector search for their own agent.
Why Grep Alone Isn't Enough
Grep is fast, exact, and requires no indexing. For known identifiers — `validateToken`, `handleWebhook`, `useAuthContext` — it's the right tool.
But grep finds strings, not intent. The query "how does the billing system handle failed payments" has no single string to grep for. The logic might span `stripe/webhooks.ts`, `billing/retry.ts`, and `notifications/email.ts` — connected by imports and function calls, not shared keywords.
Grep also generates noise. Searching for `authenticate` returns the function definition, every call site, test mocks, documentation, and changelog entries. The agent has to reason about which results matter. Dumping all of them into context pollutes the model's attention.
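That filtering step can be approximated, crudely, by ranking raw grep hits before anything reaches the model's context. A sketch with made-up heuristics (the paths and scoring rules are assumptions, not a real ranker — in practice the agent reasons about relevance rather than applying fixed rules):

```python
def rank_hits(hits: list[str]) -> list[str]:
    """Order grep hits so likely-relevant files come first.
    The scoring heuristics below are illustrative assumptions."""
    def score(path: str) -> int:
        s = 0
        if "/test" in path or path.startswith("tests/"):
            s -= 2                      # mocks and fixtures: usually noise
        if path.endswith((".md", ".txt")):
            s -= 1                      # docs and changelog entries
        if "/lib/" in path or "/src/" in path or path.startswith("src/"):
            s += 1                      # implementation code
        return s
    return sorted(hits, key=score, reverse=True)

hits = ["CHANGELOG.md", "tests/auth.test.ts", "src/lib/auth.ts", "docs/auth.md"]
ranked = rank_hits(hits)                # implementation first, test mock last
```

A static ranker like this is exactly what agentic search replaces: instead of fixed rules, a model looks at the hits and decides which ones to read next.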
How Agentic Search Works for Code
The best agentic search implementations share a common architecture:
1. Multi-Turn Reasoning
The agent operates in a loop: search, read results, reason, search again. Each turn narrows the search space based on what was learned in previous turns. A typical search completes in 3-4 turns.
This is fundamentally different from one-shot retrieval. The agent doesn't need to get lucky with the first query. It can start broad ("grep for webhook"), learn from the results ("the Stripe handler is in api/webhooks/"), and then narrow ("read api/webhooks/stripe.ts lines 40-80").
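The loop can be reduced to a few lines. In this sketch, `tools` executes one tool call and `reason` stands in for the model step that inspects evidence and returns the next calls (all names and the stub policy are hypothetical):

```python
def agentic_search(query, tools, reason, max_turns=4):
    """Minimal search-read-reason loop. `tools` executes one tool call;
    `reason` returns the next batch of calls, or [] to stop early."""
    evidence = []
    calls = [("grep", query)]             # turn 1: start broad
    for _ in range(max_turns):
        results = [tools(kind, arg) for kind, arg in calls]
        evidence.extend(results)
        calls = reason(query, evidence)   # narrow based on what was learned
        if not calls:                     # enough context: stop
            break
    return evidence

# Stub tool and policy, mirroring the webhook example above.
def fake_tools(kind, arg):
    return f"{kind}:{arg}"

def fake_reason(query, evidence):
    if len(evidence) == 1:                # learned where the handler lives
        return [("read", "api/webhooks/stripe.ts")]
    return []                             # done

trace = agentic_search("webhook", fake_tools, fake_reason)
```

The structure is the whole point: the first query only has to be a reasonable starting hypothesis, because every later turn is conditioned on real evidence.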
2. Parallel Tool Calls
The most important optimization in agentic search is parallelism. Instead of issuing one grep, waiting for results, then issuing another, the agent fires 4-12 tool calls simultaneously — different grep patterns, different directories, different file globs.
Parallelism data
Cognition found that increasing parallelism from 4 to 8 searches per turn reduced turns from 6 to 4 while maintaining the same retrieval quality. Relace measured a 4x speedup from parallel execution, dropping response times from 12-24 seconds to 1-2 seconds per turn.
Each turn of tool calls incurs prefill overhead, a network roundtrip, and decoding cost. Parallelism amortizes all of that. Eight parallel greps in one turn cost roughly the same latency as one — but explore 8x more of the codebase.
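The amortization is easy to demonstrate with a toy benchmark, using `time.sleep` to simulate the fixed per-call round trip (the 50ms figure and the grep patterns are arbitrary assumptions):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def grep_call(pattern: str) -> str:
    """Stand-in tool call: each one pays a fixed round-trip latency."""
    time.sleep(0.05)                      # simulated network + prefill cost
    return f"results for {pattern}"

patterns = ["webhook", "stripe", "retry", "payment_failed",
            "invoice", "dunning", "charge", "refund"]

# Sequential: 8 calls pay 8 round trips.
t0 = time.perf_counter()
seq = [grep_call(p) for p in patterns]
seq_time = time.perf_counter() - t0

# Parallel: the same 8 calls in one "turn" pay roughly one round trip.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    par = list(pool.map(grep_call, patterns))
par_time = time.perf_counter() - t0
```

The parallel turn returns identical results in a fraction of the wall-clock time, which is the same effect the agent gets by batching tool calls per turn.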
3. Subagent Isolation
The most critical architectural decision: agentic search runs in its own context window, separate from the main coding model.
When a coding model like Opus searches a codebase itself, every file it reads stays in context. After 5-6 exploration turns, the context is polluted with rejected files, wrong matches, and dead-end grep results. Research shows that performance degrades by 30%+ when irrelevant content accumulates in the middle of the context.
Subagent architecture solves this. The search agent explores in isolation, throws away dead ends, and returns only the relevant file spans, so the coding model's context stays clean. This is why Anthropic's multi-agent system outperformed single-agent Opus by 90% — not because the subagents were smarter, but because the lead agent's context never accumulated search debris.
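A sketch of the isolation boundary: the subagent's scratch context, including every dead end, lives and dies inside one function, and only spans cross back to the caller. The `explore` trajectory here is a made-up stub:

```python
def search_subagent(query: str, explore) -> list[tuple[str, int, int]]:
    """Run exploration in an isolated scratch context and return only
    (file, start_line, end_line) spans. The full exploration transcript
    never leaves this function."""
    scratch = []                       # subagent's private context
    spans = []
    for step in explore(query):
        scratch.append(step)           # dead ends accumulate here, not upstream
        if step.get("relevant"):
            spans.append((step["file"], step["start"], step["end"]))
    return spans                       # scratch is discarded with the frame

def explore(query):
    # Stub trajectory: two dead ends, then one hit.
    yield {"file": "tests/auth.test.ts", "relevant": False}
    yield {"file": "CHANGELOG.md", "relevant": False}
    yield {"file": "auth/middleware.ts", "relevant": True,
           "start": 12, "end": 41}
```

The caller (the coding model) only ever sees the span list; the rejected files consume zero tokens of its context.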
4. Precise Output Format
Agentic search doesn't return whole files. It returns `(file, [start_line, end_line])` spans — just the relevant code. This is critical for keeping the main model's context tight.
A function definition might be 30 lines in a 500-line file. Returning all 500 lines wastes 94% of the context budget and adds noise that degrades the model's attention.
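The arithmetic from that example as a one-liner (the 30-line and 500-line figures come from the paragraph above):

```python
def span_savings(span_lines: int, file_lines: int) -> float:
    """Fraction of the context budget saved by returning a span
    instead of the whole file."""
    return 1 - span_lines / file_lines

# A 30-line function in a 500-line file: 94% of the budget saved.
saved = span_savings(30, 500)
```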
Why RL Training Matters
You can build an agentic search loop with any model — Haiku, GPT-4o-mini, Gemini Flash. The results are serviceable but inefficient: general-purpose models take 10-12 turns to find the same code that a trained search model finds in 3-4.
The difference is parallelism strategy. General models issue 1-2 tool calls per turn, exploring sequentially. Trained search models issue 8+ parallel calls per turn, diversifying hypotheses on turn 1 and converging on turns 2-3.
RL-trained search behaviors
```text
# Learned behaviors from RL training on tool-call trajectories:

1. Hypothesis diversification
   → First turn: search different directories, file patterns,
     and grep queries simultaneously (not one at a time)

2. Early stopping
   → Stop searching when marginal utility drops, rather than
     exhausting the full turn budget

3. Package resolution
   → Search into node_modules/ or site-packages/ when the
     answer isn't in application code

4. Multi-hop following
   → Find an import, follow it to the source, then follow
     that to the actual implementation — all in parallel
```

The numbers back this up. On code retrieval benchmarks, WarpGrep achieves 0.73 F1 in 3.8 steps versus Claude Haiku at 0.72 F1 in 12.4 steps. Same retrieval quality, 3x fewer turns. Polarity's Omnigrep achieves 0.475 F0.5, outperforming Claude Code by 33% and SWE-grep by 15% using a similar parallel approach.
Counterintuitive cost structure
Adding a specialized search model makes the system cheaper, because the expensive coding model spends fewer tokens reading irrelevant files. WarpGrep v2 is 15.6% cheaper and 28% faster per task on SWE-Bench Pro — while improving accuracy.
Real Implementations
Agentic search has converged as a pattern across the industry. Here's how the leading implementations work:
WarpGrep
RL-trained search subagent by Morph. Up to 8 parallel tool calls per turn across 4 turns. Weighted F1 reward signal (beta=0.5, favoring precision). Runs at ~2,500 tok/s, completing searches in under 4 seconds. Reaches #1 on SWE-Bench Pro when paired with frontier models.
SWE-grep (Cognition)
Cognition's RL-trained search model, built after measuring agents spent 60% of turns on search. Runs at 2,800 tok/s (20x faster than Haiku) with 8 parallel tool calls per turn and 4 serial turns. Integrated into Windsurf as the 'Fast Context' subagent.
Claude Code Task Agents
Parallel subagents that each run in their own context window. When the main agent needs to search, it spawns lightweight Claude instances that explore independently and return results. Up to 10 agents run in parallel.
Cursor
Hybrid approach: custom embedding model trained on agent search traces, combined with grep. Semantic search provides initial candidates; the agent uses grep and file reads to follow specific code paths. Semantic search improves accuracy by 12.5% over grep alone on large codebases.
Omnigrep (Polarity Labs)
State-of-the-art on CodeSearchEval using a general-purpose LLM with a 4-turn, 8-parallel-call loop plus chain-of-thought reasoning between turns. Outperforms both SWE-grep and Claude Code, demonstrating the architecture pattern matters as much as the model.
Frequently Asked Questions
What is agentic search?
Agentic search is an AI paradigm where a language model autonomously plans and executes multi-step searches using tools (grep, file read, directory listing) with reasoning between each step. Instead of retrieving results in a single pass, the agent iterates — searching, reading results, forming hypotheses, and refining its queries until it finds exactly what it needs. For code, this means following import chains, call graphs, and data flow across files rather than just matching keywords or embeddings.
What is the difference between RAG and agentic search?
RAG (retrieval-augmented generation) retrieves documents in a single pass: embed the query, find the nearest neighbors in a vector database, append them to the prompt. Agentic search is iterative and adaptive. The agent reasons about search results, decides what to search next, and can follow multi-hop chains of references. RAG treats retrieval as a preprocessing step; agentic search treats it as an active reasoning process. For code, this distinction is critical — RAG can't follow a function call from `handler.ts` to `utils/auth.ts` to `lib/jwt.ts`, but agentic search can.
What is the difference between semantic search and agentic search?
Semantic search uses embeddings to find conceptually similar content — it converts queries and documents to vectors and returns the closest matches. It's fast (< 100ms) and works well for natural language. Agentic search uses a reasoning model that executes tools (grep, file reads) across multiple turns, following logical structure rather than vector similarity. Semantic search answers "what looks similar to this query?" Agentic search answers "where is this logic implemented and how does it connect to other parts of the system?"
Is agentic search slower than semantic search?
Yes. Semantic search returns results in under 100ms. Agentic search typically takes 2-8 seconds. But the trade-off is worth it for code: agentic search returns precise, contextually relevant spans rather than proximity-based guesses. And because the results are more accurate, the downstream coding model spends fewer tokens on wrong files — often making the overall system faster and cheaper despite the slower search step.
Do you need to index your codebase for agentic search?
No. Agentic search uses runtime tools like grep and file reads. There's no embedding step, no vector database, and no stale index to maintain. The agent searches the live codebase as it exists right now. This is a significant advantage for active codebases where code changes frequently.
WarpGrep: Agentic Search for Coding Agents
WarpGrep is an RL-trained search subagent that finds the right code in under 4 seconds. 8 parallel tool calls per turn, precise file-span output, and a reward signal tuned for precision. Reaches #1 on SWE-Bench Pro when paired with frontier models.