Skip to content
Tim Frenzel

// Insight

A-RAG: give the model the search box, not the pipeline

6 min read
RAGagentic-retrievalretrieval

The retrieval debate this archive settled in November had a loose end. The SEC-filings head-to-head showed vectors beating structure-walking, with intelligence belonging at the edges of retrieval rather than in the middle. A-RAG is what the fully developed edge looks like. Instead of a fixed pipeline, the model gets three retrieval primitives as tools, keyword search, semantic search, and chunk read, and decides for itself what to look up, at what granularity, and when to stop. The pipeline disappears; the search box remains.

The hierarchy of the three interfaces is the design’s load-bearing choice. Keyword search casts wide and cheap, the BM25 reflex for entities and exact terms. Semantic search catches the conceptual neighbors keywords miss. Chunk read is the fine instrument: once the agent knows roughly where the answer lives, it reads the specific passage rather than retrieving more candidates. A multi-hop question then becomes what it actually is, a sequence of dependent lookups, with the model carrying intermediate findings forward and choosing the next primitive based on what the last one returned, the long-horizon tool-use capability applied to the retrieval problem specifically.

A-RAG: the model drives, the primitives serve
QuestionAgent plans first lookupKeyword search: wide, cheapSemantic search: conceptualChunk read: preciseObserve, carry findings forwardNext primitive or answer
Granularity is a per-step decision the model makes, not a pipeline constant.

The numbers, including the one that matters most

On multi-hop QA with a GPT-5-mini backbone the gaps are not subtle. On 2WikiMultihopQA, A-RAG reaches 89.7% LLM-judged accuracy against 50.2% for naive RAG. HotpotQA: 94.5 against 81.2. MuSiQue, the hardest of the three: 74.1 against 52.8. The same architecture on the weaker GPT-4o-mini backbone still wins everywhere, by single digits on the easier sets and seven points on MuSiQue.

LLM-judged accuracy, GPT-5-mini backbone (%)
2Wiki, A-RAG89.72Wiki, naive RAG50.2HotpotQA, A-RAG94.5HotpotQA, naive RAG81.2MuSiQue, A-RAG74.1MuSiQue, naive RAG52.8
The gap widens with the harder benchmark and the stronger backbone, in both directions.

The token accounting kills the obvious objection before it forms. Agent-driven retrieval sounds expensive, all those steps. A-RAG answers HotpotQA questions on roughly 2,737 retrieved tokens against naive RAG’s roughly 5,400, because precise lookups replace bulk stuffing. The paper’s own ablation makes the point from the other side: give the agent naive retrieval interfaces without the hierarchy and the token count explodes to 27,000-56,000 per question. Agency without good interfaces is waste; interfaces without agency are the ceiling the field has been bumping against; the combination is cheaper and better at once.

Retrieved tokens per HotpotQA question
A-RAG, hierarchical interfaces2737Naive RAG pipeline5400A-RAG, naive interfaces27455
Agency with good interfaces halves the bill; agency with bad ones quintuples it.

The operational governance follows from that middle row. Once the model chooses its own step count, per-query cost and latency become distributions with tails; the production controls are budget caps per query, a step ceiling with a graceful fallback to the one-shot pipeline, and telemetry on the step-count distribution itself. A query population whose average steps drift upward over weeks is telling you something changed, in the corpus, the questions, or the model, before any accuracy metric says so, the same drift-monitoring instinct that mode-escalation logging gave batch pipelines.

The scaling result is the strategic finding. Allow more interaction steps, 5 to 20, and accuracy climbs roughly 8 points on the strong backbone. Scale the backbone’s reasoning effort from minimal to high and both GPT-5 variants gain around 25%. Read those together: fixed pipelines cap what better models can deliver, while interface-driven retrieval converts every model improvement into retrieval improvement automatically. The naive-RAG numbers barely move when the backbone strengthens, because the pipeline was the bottleneck. An architecture that rides the model curve beats one that fights it, the same compounding argument that decided the build-versus-rent questions across this archive.

Reconciling November, and what a filings desk does now

The two retrieval verdicts of the season compose into one instruction. November’s benchmark said do not replace the vector machinery with model reasoning. A-RAG says do let the model command that machinery. Both results draw the same line through the stack: embeddings and indexes are unreasonably good at finding things, models are unreasonably good at deciding what to find next. The failure mode of the last two years was confusing those roles in either direction. The four-agent fintech pipeline from the autumn now reads like the transitional form, fixed agents for fixed failure modes; A-RAG generalizes it by handing the whole decision space to the model with primitives clean enough to be used well.

For a filings workload the translation is direct, with the caveats attached. The benchmarks here are open-domain Wikipedia QA, and financial corpora bring the boilerplate-similarity and orphaned-number pathologies that open-domain sets do not test, which means the gains need re-measuring on 10-Ks before anyone rewires production. The primitives would need the finance-grade versions, hybrid search with metadata filters rather than bare BM25. And per-query latency becomes a distribution with a long tail once step count is model-chosen, which means budget caps per query, the same metering discipline every agentic system in this archive has eventually needed. The 150-question harness from November remains the right instrument: same corpus, same questions, A-RAG against your current stack, one weekend. Score accuracy, tokens, latency tail, and step-count distribution together, because the architecture’s case is the joint improvement rather than any single column.

The primitive set also has an obvious fourth member on a finance corpus, and designing for it now costs nothing. Filings carry structured data, XBRL-tagged fundamentals, exhibit tables, defined-term registries, that neither keyword nor semantic search retrieves faithfully and chunk read retrieves wastefully. A structured-lookup primitive, query a field, get the tagged value with its provenance, would let the agent route numerical questions to numerical infrastructure, the same separation of duties that keeps arithmetic in calculators instead of token streams. The hierarchy generalizes: primitives are cheap to add, the agent learns to choose among them, while every addition shifts work from reasoning over raw text to querying purpose-built indexes.

One more translation step matters for regulated deployment: the primitives are also the permission boundary. A fixed pipeline touches documents through one service account doing one thing; an agent wielding chunk read touches raw document content wherever its reasoning wanders, which makes per-primitive access control the natural enforcement point. Keyword and semantic search over an entitlement-filtered index, chunk read gated by the same document permissions the human analyst carries, budgets enforced at the interface layer where they cannot be reasoned around: the identity-and-delegation gap the protocol layer has not yet standardized gets handled, for retrieval at least, by putting the controls inside the primitives themselves. Design the interfaces once with entitlements native, and every future model upgrade inherits the controls for free.

The result worth sitting with is the scaling curve. Every retrieval architecture this archive has reviewed was a snapshot, tuned for today’s models and obsolete with tomorrow’s. This is the first one whose performance is an increasing function of model capability with the interfaces held fixed. Retrieval infrastructure that appreciates instead of depreciating changes what building it means: the primitives, the indexes, plus the budget controls are the durable asset, while the intelligence that drives them upgrades itself.

Give the model clean retrieval primitives and the granularity decision: 89.7 versus 50.2 on multi-hop QA at half the tokens, with gains that grow as backbones improve, because the pipeline was the bottleneck all along.

Working on AI that needs to ship?

I help funds, fintechs, and data teams take AI from prototype to production.