Skip to content
Tim Frenzel

// Insight

RAG vs long-context: the routing trick that keeps the accuracy and cuts the bill

6 min read
RAGlong-contextcost

The honest headline is that long-context wins on accuracy. In this Google study, feeding the whole document into context beats retrieval by 7.6% on Gemini-1.5-Pro and 13.1% on GPT-4O. The headline that matters more for a desk is the second one. A simple router, Self-Route, matches long-context accuracy at 39 to 65% lower cost. That is the result worth building on.

The study is the cleanest treatment yet of a decision every team faces. You have a 10-K and a question. Do you stuff the entire filing into the model’s context, which is accurate and expensive, or retrieve the relevant passages, which is cheap and lossy? Until now that was a gut call. This paper turns it into a measured trade-off.

What the numbers say

Long-context is the accuracy ceiling. On GPT-4O, feeding the full context scores 48.7 on average across the datasets, against 32.6 for retrieval. That is a large gap. It is the reason teams reach for long context despite the cost. The problem is the cost itself: pushing a whole filing through the model on every query, at scale, is a serious bill in tokens and latency.

The insight that breaks the trade-off is an observation about the queries. Over 60% of them produce the same answer whether you use retrieval or full context. For most questions, the cheap path is already correct. The expensive path only earns its keep on the minority of questions that genuinely need the whole document.

GPT-4O: accuracy by method (avg across datasets)
RAG32.6Self-Route48.9Long context48.7

Self-Route turns that observation into a method. It lets the model answer with retrieval first and, in the same step, judge whether it can answer confidently from the retrieved passages. If it can, the cheap answer stands. If it cannot, the query is escalated to full context. The result is the bar that matters: Self-Route scores 48.9 on GPT-4O, level with full context at 48.7, while spending 39% less. On Gemini-1.5-Pro the saving reaches 65%.

Why this is the right shape for a quant pipeline

Because it is the same logic a desk applies to every expensive resource. You do not run the heavy computation on every input. You run a cheap screen. You escalate only the cases that need the expensive treatment. Self-Route is that pattern applied to context: a cheap first pass, a confidence check, then a costly fallback reserved for the hard minority. A research lead reading a cost report recognizes this immediately, because it is how you keep a high-volume pipeline affordable without giving up accuracy on the cases that matter.

The confidence check is the part to engineer carefully. Self-Route works because the model can usually tell when the retrieved passages are insufficient. That self-assessment is not free of error. A miscalibrated check either escalates too much, losing the savings, or too little, losing the accuracy. The routing decision is a model output like any other, which means it deserves measurement: track how often escalated queries actually improved, and tune the threshold against your own data.

Engineering the confidence gate

Everything in Self-Route rides on one component: the model’s judgment of whether the retrieved passages are enough. Get that judgment right and you keep long-context accuracy at retrieval cost. Get it wrong in either direction and the method fails. Escalate too eagerly and you pay the full-context bill on questions that did not need it. Escalate too rarely and you serve wrong answers the cheap path could not support. The gate is the whole game.

That makes the gate a model output you have to measure, like any other. The honest test is simple. For a sample of escalated queries, check whether full context actually changed the answer. If most escalations did not improve anything, your gate is too cautious. You are leaving savings on the table. For a sample of non-escalated queries, spot-check the answers against full context. If the cheap path was wrong on questions the gate kept, your gate is too loose. You are trading accuracy for cost without meaning to. Tune the threshold against your own data, because the right setting depends on your documents and your tolerance for error.

The cost math is what makes the effort worthwhile at scale. A desk running document QA over a coverage universe asks the same kinds of questions thousands of times a day. At that volume, paying full-context price on every query is a large, recurring bill. Most of it buys nothing, because the cheap path was already correct on the majority. Self-Route turns that bill into a small one plus a targeted premium on the hard minority. The 39 to 65% saving in the paper is not a one-off. It is a structural reduction that compounds every day the pipeline runs.

There is a second-order benefit worth naming. Because Self-Route logs which queries escalated, it hands you a map of where your retrieval is weakest. A cluster of escalations around a particular document type, a particular question shape, or a particular part of the corpus is a signal that your chunking or your index is failing there. The routing data is free instrumentation on the retrieval system underneath it. A desk that reads those logs learns where to invest in better retrieval, which over time shrinks the escalation rate and the bill with it. The gate does not just save money on each query. It tells you how to save more.

The architecture also degrades gracefully, which matters for production. If the confidence gate fails open, escalating everything, you fall back to pure long-context: more expensive, still accurate. If it fails closed, escalating nothing, you fall back to pure retrieval: cheaper, less accurate, and visibly so in your evaluation. Neither failure is catastrophic, and both are detectable from the escalation rate you are already tracking. Compare that to a single fixed choice between RAG and long-context, where the failure mode is silent: you live with the accuracy or the cost you picked, with no signal telling you the trade was wrong.

How I would use it

As the default architecture for document QA at scale, not a choice between two camps. Run retrieval as the first pass, add a confidence gate, and reserve full context for the queries that fail the gate. Measure the escalation rate, because it tells you both your cost and whether the gate is calibrated. For financial documents the economics are stark: most questions about a filing are answerable from a few pages. The few that need the whole document are worth paying for. Self-Route is how you stop paying for the whole document on every question while keeping the accuracy on the ones that need it. The win is structural. The same router that saves the money also tells you, query by query, where your retrieval still needs work, which turns a cost optimization into a continuous improvement loop you can actually act on. That is the second payoff most teams never even notice.

Most document questions are answerable from a few passages. Route the cheap path by default and escalate only the hard minority to full context, for long-context accuracy at a fraction of the cost.

Working on AI that needs to ship?

I help funds, fintechs, and data teams take AI from prototype to production.