WarpGrep: Fast, Parallel Code Retrieval with RL

How we trained WarpGrep, a fast context model that does the dirty work of code search, to match frontier coding models on retrieval quality in a fifth of the time

Dhruv Bhatia
Dat Quoc
Tejas Bhakta
November 11, 2025 · 5 min read

WarpSearch Architecture Diagram

Most coding agents are brilliant thinkers and terrible librarians. They can reason about code, but they burn precious time and tokens trying to find it. The result is context pollution: the model’s working memory gets filled with loosely relevant files and quality drops as latency climbs.

WarpSearch is our answer — a specialized retrieval sub‑agent trained to find exactly the right code quickly, so your main model can stay focused on the task.

Why Sub-Agents Matter

Large language models are incredible at reasoning, but they are inefficient at the many subtasks a real job involves. Unlike humans, they cannot move ideas in and out of focus at will; every extra file they read pollutes the context window and drags performance down. Sub‑agents fix this by giving a small, fast model one narrow job. In coding, that means cleaner context, faster end‑to‑end time, and leaner token usage: the main agent only sees what's relevant, and accuracy improves because retrieval is handled by a specialist.

WarpSearch: Built for Speed

WarpSearch operates in tight, predictable loops. Each query runs up to eight tool calls in parallel — grep, read, and glob — so it can explore multiple hypotheses at once. We cap the search at four turns (three to explore, one to answer). The toolset is intentionally narrow and cross‑platform so behavior stays fast and consistent on Windows, macOS, and Linux.
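
To make the loop concrete, here is a minimal sketch of the turn and tool budget described above. The tool names (grep, read, glob), the cap of eight parallel calls, and the four-turn limit come from the post; the function names and interfaces (run_tool, propose_calls, compose_answer) are illustrative assumptions, not the real WarpSearch implementation.

```python
import asyncio

MAX_TURNS = 4           # three exploration turns + one answer turn
MAX_PARALLEL_CALLS = 8  # tool calls issued concurrently per turn

async def run_tool(call):
    """Dispatch a single grep / read / glob call (placeholder)."""
    return None  # replace with real tool execution

async def search(query, propose_calls, compose_answer):
    """Bounded retrieval loop: explore for up to three turns, then answer.

    `propose_calls` and `compose_answer` stand in for the policy model; they
    are assumptions of this sketch, not the actual WarpSearch interface.
    """
    evidence = []
    for _turn in range(MAX_TURNS - 1):                        # exploration turns
        calls = propose_calls(query, evidence)[:MAX_PARALLEL_CALLS]
        if not calls:                                         # policy stops exploring early
            break
        results = await asyncio.gather(*(run_tool(c) for c in calls))
        evidence.extend(results)
    return compose_answer(query, evidence)                    # final turn: (file, [start, end]) spans
```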

How We Built It

Dataset Creation

We started with datasets similar to SWE‑Bench and expanded across hundreds of real‑world repositories (apps, libraries, and a few large monorepos). Query candidates were a mix of synthetic and real queries, normalized into “how/where/what” questions that force multi‑hop retrieval instead of exact string matches. For each repo we stratified by the dimensions below (an example record follows the list):

  • repo size (200–10k+ files) and language family
  • query type (symbol lookup, behavior tracing, routing/config, cross‑file data flow)
  • difficulty (single‑file vs multi‑file, shallow vs deep line‑range precision)
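
A single training example, stratified along those dimensions, might look like the following. The field names and types are illustrative assumptions; the actual dataset schema is not published.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    path: str    # file path relative to the repo root
    start: int   # 1-indexed start line
    end: int     # inclusive end line

@dataclass
class RetrievalExample:
    repo: str                     # version-locked, e.g. pinned to a commit SHA
    query: str                    # normalized "how/where/what" question
    query_type: str               # symbol lookup, behavior tracing, routing/config, cross-file data flow
    difficulty: str               # single-file vs multi-file, shallow vs deep line-range precision
    ground_truth: list[Span] = field(default_factory=list)    # judged (file, [start, end]) spans
    hard_negatives: list[Span] = field(default_factory=list)  # near-miss files / off-by-N windows
```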

Claude produced ground‑truth as sets of (file, [start, end]) spans. Ambiguous cases were resolved with dual‑judge agreement and programmatic checks to ensure the spans actually answer the question. We added hard negatives (near‑miss files and off‑by‑N line windows) to reward precision, deduplicated near‑duplicate prompts, and version‑locked repositories so labels remain stable over time. The evaluation objective is a precision‑weighted F‑score (Fβ with β = 0.5) computed jointly over file retrieval and line‑range retrieval. Prioritizing precision keeps the main agent’s context clean: missing one extra file is recoverable; over‑including junk is not.
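
For reference, here is a minimal sketch of a precision-weighted Fβ score over predicted versus ground-truth spans. How file-level and line-level scores are combined into the joint metric is an assumption of this sketch (a simple average), not a detail from the post.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def span_lines(spans):
    """Expand (path, start, end) spans into a set of (path, line) pairs."""
    return {(path, line) for path, start, end in spans for line in range(start, end + 1)}

def retrieval_score(predicted, gold, beta: float = 0.5) -> float:
    """Joint score over file retrieval and line-range retrieval (averaged here)."""
    pred_lines, gold_lines = span_lines(predicted), span_lines(gold)
    pred_files = {p for p, _ in pred_lines}
    gold_files = {p for p, _ in gold_lines}

    def pr(pred, gold):
        tp = len(pred & gold)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        return precision, recall

    file_f = f_beta(*pr(pred_files, gold_files), beta)
    line_f = f_beta(*pr(pred_lines, gold_lines), beta)
    return 0.5 * (file_f + line_f)
```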

Training Process

WarpSearch was trained with an end‑to‑end RL loop. The policy issues tool calls; the environment returns tool outputs; the terminal reward is the weighted F‑score against ground truth, with light shaping for early correct hits (a reward sketch follows the list below). Over training, the agent learned to:

  • budget eight parallel calls per turn and diversify hypotheses
  • prune dead ends quickly to preserve the four‑turn budget
  • stop early when marginal utility falls below a learned threshold
  • return tight line ranges instead of whole files
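
As referenced above, here is a sketch of the terminal reward with light shaping. The linear early-hit bonus and the `shaping_coeff` value are illustrative assumptions; only the F-score terminal reward and the idea of rewarding early correct hits come from the post.

```python
def trajectory_reward(f_score: float, first_hit_turn, max_turns: int = 4,
                      shaping_coeff: float = 0.05) -> float:
    """Terminal reward: weighted F-score plus a small bonus for early correct hits.

    `f_score` is the Fβ (β = 0.5) over files and line ranges at episode end;
    `first_hit_turn` is the index of the first turn that surfaced a correct span
    (or None if no turn did). The bonus shape is an assumption of this sketch.
    """
    reward = f_score
    if first_hit_turn is not None:
        reward += shaping_coeff * (max_turns - first_hit_turn) / max_turns
    return reward
```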

We evaluate under the same constraints used at train time: up to eight parallel tool calls per turn, maximum of four turns (three exploration, one answer), scoring both file and line ranges.

The Results

Metric | WarpSearch | Frontier Models | Improvement
Weighted F‑score (β = 0.5) | 0.626 | 0.62 | ~1x (parity)
Average Time | 5s | 25s | 5x faster

In large repositories (1,000+ files), that five‑second average holds up. Complex, multi‑file questions finish within the four‑turn budget. And because the main agent only sees the relevant slices, downstream prompts stay short and focused.

RL Setup and Architecture

Training is built for throughput and stability on real tool‑use trajectories:

  • Split resources cleanly: dedicated inference workers (vLLM) generate rollouts continuously while training workers consume a queue (see the rollout-queue sketch after this list). No weight‑swap thrash.
  • Controlled staleness (staleness_threshold > 0): rollout workers generate trigger_parameter_sync_step × ppo_mini_batch_size samples before syncing weights; trainers take multiple gradient steps on increasingly stale data. Effective sample size stayed ≈0.99 with staleness=0.5; per‑token importance weights are clipped at 5 to bound variance (see the importance‑weighting sketch after this list). Result: ~1.6× throughput with no measurable sample‑efficiency loss.
  • Partial rollout interruption: long‑tail generations can stall syncs, so we added sleep()/resume() to vLLM. We snapshot KV cache, sync weights, then resume in‑flight generations with the new policy. This alone yielded ~2.35× faster end‑to‑end training on long sequences.
  • In‑flight weight updates (PipelineRL): stream weight deltas to inference workers via NCCL during generation. vLLM pauses briefly (milliseconds), loads new weights, and continues — producing naturally mixed‑policy sequences (tokens 0–100 from step t, 101–2000 from step t+k) without explicit sync barriers.
  • Why sample efficiency holds up: token‑level importance weighting and discounting reduce the impact of early‑stale tokens; clipping prevents variance explosions.
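
A minimal sketch of the first point, the producer/consumer split between inference and training workers. The queue size, batch size, and helper names are assumptions, not the actual trainer code.

```python
import queue

rollout_queue: queue.Queue = queue.Queue(maxsize=512)       # bounded buffer between the two pools

def rollout_worker(policy_snapshot, make_rollout):
    """Inference worker (e.g. backed by vLLM): generate trajectories continuously."""
    while True:
        rollout_queue.put(make_rollout(policy_snapshot))    # blocks only when the buffer is full

def training_worker(take_gradient_step, batch_size: int = 32):
    """Training worker: consume rollouts as they arrive and step the policy."""
    while True:
        batch = [rollout_queue.get() for _ in range(batch_size)]
        take_gradient_step(batch)                           # may take several steps before the next weight sync
```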
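
And for the staleness point, a sketch of per-token importance weighting with clipping applied to rollouts generated by an older policy. The surrogate loss form (a truncated-importance-sampling REINFORCE estimator) and the variable names are illustrative; only the token-level weighting and the clip at 5 come from the post.

```python
import torch

def stale_rollout_pg_loss(new_logprobs: torch.Tensor,   # log prob of each token under the current policy
                          old_logprobs: torch.Tensor,   # log prob under the (stale) policy that generated it
                          advantages: torch.Tensor,     # per-token advantage estimates
                          mask: torch.Tensor,           # 1 for action tokens, 0 for prompt/padding
                          clip: float = 5.0) -> torch.Tensor:
    """Policy-gradient loss on possibly stale rollouts.

    Per-token importance weights correct for the gap between the behavior
    policy and the current policy; clipping them bounds the variance.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)          # per-token importance weight
    ratio = torch.clamp(ratio, max=clip)                    # clip at 5 to bound variance
    loss = -(ratio.detach() * advantages * new_logprobs)    # truncated-IS REINFORCE estimator
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```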

Related reading for the curious:

  • Fast, parallel code search with multi‑turn RL: https://cognition.ai/blog/swe-grep
  • Efficient RL: https://arxiv.org/pdf/2509.19128

The future isn’t bigger models — it’s tighter systems where small, fast specialists do the heavy lifting so your smartest model can think clearly.