RAG for financial documents: a field guide
Grounding a language model in your own financial documents is the request that comes up in almost every client conversation. It is harder than it looks, and the model is rarely the weak point. On financial filings, retrieval is the bottleneck: hand the model the wrong evidence and it answers the wrong question fluently. This is the field guide I wish people had before they greenlit a filing-QA project: why it is hard, the moves that work, and the evidence behind each.
Retrieval-augmented generation, the idea of fetching relevant passages and feeding them to a model at answer time, was introduced by Lewis and colleagues in 2020. The concept is simple. The simplicity is misleading. The engineering, on financial documents specifically, is where projects succeed or quietly stall. The rest of this guide is about that engineering, and about the judgment around it.
Why is RAG on filings so hard?
Start with the most clarifying result in the literature. FinanceBench tested GPT-4-Turbo on questions about real filings under different retrieval conditions. Handed the right pages, it answered 85% correctly. Pointed at a shared vector store and left to retrieve for itself, it managed 19%. The model was the same in both runs. The retrieval did all the damage.
That 66-point gap is the whole problem in one comparison. The capability to answer is already there, sitting at 85% when the evidence is correct. The accuracy collapses in the step that finds the evidence. Two failure modes do most of the damage on filings specifically, and naming them tells you what to fix.
The first is the orphaned number. Standard chunking slices a document into fixed pieces, which severs a figure from the company, segment, and period that give it meaning. A chunk that reads “revenue grew 14%” loses the fact that it is Acme’s cloud segment in fiscal 2023. The retriever then ranks that detached number by surface similarity, and hands the model a figure with no anchor. The second is the near-identical distractor. Revenue for fiscal 2023 and fiscal 2022 read almost identically to a general embedding model, which cheerfully returns the wrong year. In ordinary prose these distinctions are cosmetic. In a filing they are the entire answer.
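The orphaned number is easy to reproduce. The sketch below is a toy, assuming a made-up two-sentence excerpt and a naive fixed-width splitter (`fixed_chunks` is a hypothetical helper, not any library's chunker). It shows the chunk that holds the figure losing the company, segment, and period.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-width chunking: slice the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical excerpt: the entity and period appear only in the first sentence.
excerpt = (
    "Acme Corp fiscal 2023 results, cloud segment. "
    "Compared with the prior year, revenue grew 14%."
)

chunks = fixed_chunks(excerpt, size=50)

# The chunk a retriever would index for the growth figure.
orphan = next(c for c in chunks if "14%" in c)

for c in chunks:
    print(repr(c))
```

The chunk containing "14%" carries no mention of Acme, the cloud segment, or fiscal 2023, which is the detached figure described above.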
The proven moves, in order of leverage
The good news is that retrieval failure is a solved problem, or close to it. The fixes are known, measured, and stackable. Take them in the order that returns the most accuracy per unit of effort.
The highest-leverage move is to stop orphaning your chunks. Contextual Retrieval prepends a short, document-aware sentence to each chunk before indexing, restoring the company, period, and topic the raw text had lost. The chunk that gets indexed now carries “Acme 2023 10-K, cloud segment” along with the number. The effect is large and measured.
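A minimal sketch of the idea, under a simplifying assumption: the context sentence here is built from a metadata template, whereas the approach described above generates it per chunk with an LLM. `DocMeta` and `contextualize` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocMeta:
    """Document-level facts that a raw chunk tends to lose (illustrative fields)."""
    company: str
    form: str
    fiscal_year: int


def contextualize(chunk: str, meta: DocMeta, section: str) -> str:
    """Prepend a short document-aware header to a chunk before indexing.

    Assumption: a fixed template stands in for the LLM-written context sentence.
    """
    header = f"{meta.company} {meta.fiscal_year} {meta.form}, {section}."
    return f"{header} {chunk}"


meta = DocMeta(company="Acme", form="10-K", fiscal_year=2023)
indexed_text = contextualize("Revenue grew 14%.", meta, section="cloud segment")
print(indexed_text)
```

The string that gets embedded and keyword-indexed is the contextualized one, while the original chunk can still be what is shown to the reader.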
In Anthropic's published tests, contextual embeddings alone cut the top-20 retrieval failure rate from 5.7% to 3.7%. Adding a keyword index that catches exact entity and number matches takes it to 2.9%. A reranker on top reaches 1.9%, a two-thirds reduction overall. The moves stack because each attacks a different way retrieval fails, which is the key insight: there is no single fix, only a short pipeline of complementary ones.
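The relative reductions follow directly from the failure rates quoted above. A short check of the arithmetic, using only those four figures:

```python
# Failure rates quoted in the text, in percent.
baseline = 5.7
stages = {
    "contextual embeddings": 3.7,
    "+ keyword index": 2.9,
    "+ reranker": 1.9,
}

# Relative reduction versus the baseline, rounded to one decimal place.
reductions = {
    name: round(100 * (baseline - rate) / baseline, 1)
    for name, rate in stages.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% fewer retrieval failures")
```

The last stage works out to roughly 66.7%, which is the two-thirds figure.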
The second move is to stop using a generic embedding model. A general model is trained to place semantically similar text together, which is precisely the wrong objective when the fiscal year is the load-bearing token. Finance-tuned embeddings change the geometry of the space to encode the distinctions that matter in finance.
On the same retrieval task, a finance-tuned model reached Recall@1 of 62.8% against 39.2% for the best general model, and on FinanceBench it lifted answer accuracy by eight points. It is a one-line swap with no effect on the rest of the pipeline, which makes it the cheapest large gain on this whole list. The largest improvements land on exactly the queries a desk cares about: date-specific, company-specific, and forward-looking ones.
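For readers who want to reproduce this kind of comparison on their own filings, Recall@k is simple to compute. The sketch below assumes one gold passage per question and uses invented passage IDs. It is an illustration of the metric, not the evaluation code behind the figures above.

```python
def recall_at_k(ranked: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Share of questions whose gold passage appears in the top-k results.

    Assumes exactly one gold passage ID per question.
    """
    if len(ranked) != len(gold):
        raise ValueError("need one ranking per gold label")
    hits = sum(1 for ids, target in zip(ranked, gold) if target in ids[:k])
    return hits / len(gold)


# Hypothetical retriever output for three questions (best match first).
ranked = [
    ["acme_fy23_rev", "acme_fy22_rev"],
    ["acme_fy22_rev", "acme_fy23_rev"],  # wrong year ranked first
    ["beta_fy23_rev", "acme_fy23_rev"],
]
gold = ["acme_fy23_rev", "acme_fy23_rev", "beta_fy23_rev"]

r1 = recall_at_k(ranked, gold, k=1)
r2 = recall_at_k(ranked, gold, k=2)
print(f"Recall@1={r1:.3f} Recall@2={r2:.3f}")
```

The second question is the near-identical distractor in miniature: the right passage is retrieved, but the wrong fiscal year outranks it, so it counts as a miss at k=1.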
The third and fourth moves are already in the failure-rate numbers above. Run a keyword index alongside the vector search, so that an exact ticker or line-item label is matched literally rather than blurred into a semantic neighbor. Then add a reranker that reorders the shortlist with a model that reads the query and the candidate passages together. Pair a finance-tuned embedding with a keyword index and you recover both the semantic match and the exact match that either alone would lose. The reranker does the final cleanup.
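The text does not say how to merge the keyword and vector results. One common choice, shown here as an assumption rather than the method behind the cited numbers, is reciprocal rank fusion. The passage IDs are invented, and the constant `k=60` is a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for "Acme fiscal 2023 revenue".
vector_hits = ["acme_fy22_rev", "acme_fy23_rev", "acme_fy23_costs"]   # wrong year first
keyword_hits = ["acme_fy23_rev", "acme_fy23_costs", "acme_fy22_rev"]  # literal "2023" match

shortlist = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(shortlist)
```

In this toy case the keyword list's literal match on the year pulls the correct passage to the top of the fused shortlist, which is what a reranker would then reorder.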
When should you not use RAG at all?
Sometimes the cheapest reliable answer is to skip retrieval. A study of RAG versus long-context models found that feeding the whole document into a long-context model beats retrieval on accuracy, at a higher cost per query. Its Self-Route hybrid is the move to copy. Answer with cheap retrieval first, let the model judge whether the retrieved passages are sufficient, and escalate only the hard minority to full context. That kept accuracy comparable to long context at 39% to 65% lower cost, depending on the model. For a filing that fits inside the context window, the architecture is mostly this routing decision. A team that reaches for chunked retrieval by reflex is often solving a problem it could route around.
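The routing logic itself is small. The sketch below shows the control flow only: `retrieve` and `answer_from` are hypothetical callables standing in for a retriever and an LLM call, and the toy versions underneath exist purely so the example runs. The sentinel string is also an assumption.

```python
from typing import Callable

UNANSWERABLE = "UNANSWERABLE"  # assumed sentinel the model is prompted to return


def self_route(
    question: str,
    retrieve: Callable[[str], list[str]],
    answer_from: Callable[[str, list[str]], str],
    full_document: str,
) -> tuple[str, str]:
    """Try cheap retrieval first; escalate to the full document only on refusal."""
    draft = answer_from(question, retrieve(question))
    if draft.strip().upper() != UNANSWERABLE:
        return draft, "rag"
    return answer_from(question, [full_document]), "long_context"


# Toy stand-ins so the sketch executes; a real system would call a model here.
DOC = "Cloud revenue grew 14% in fiscal 2023. Goodwill was unchanged."


def toy_retrieve(question: str) -> list[str]:
    return ["Cloud revenue grew 14% in fiscal 2023."]


def toy_answer(question: str, context: list[str]) -> str:
    keyword = question.split()[-1].rstrip("?").lower()
    for passage in context:
        if keyword in passage.lower():
            return passage
    return UNANSWERABLE


easy = self_route("What happened to revenue?", toy_retrieve, toy_answer, DOC)
hard = self_route("What happened to goodwill?", toy_retrieve, toy_answer, DOC)
print(easy[1], hard[1])
```

The savings come from the first branch: most questions end there, and only the refusals pay for the long-context call.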
What does this cost to build and run?
The honest part of any field guide is the bill. None of these moves is free, and the cost profile decides whether they fit your corpus. Contextual chunking generates a sentence per chunk with an LLM at indexing time. For a static archive that is a one-time cost amortized over every future query, and it is clearly worth it. For a corpus that updates constantly (news, intraday filings, a live transcript feed), that indexing cost recurs on every new document. The calculation belongs in your pipeline budget rather than your assumptions. A finance-tuned embedding model is a near-free swap, with no per-query cost beyond the embedding itself. A reranker adds a model call per query, modest but real. The Self-Route gate is the one that saves money, escalating only the hard queries to the expensive long-context path.
The reason to itemize this is that the right architecture depends on your query volume and your corpus churn. A desk that queries a fixed set of filings thousands of times a day amortizes the indexing cost to nothing and should build the full stack. A desk that touches a fast-moving corpus a few times has a different calculation, and may lean harder on long-context routing than on heavy indexing. There is no single correct architecture. There is the one that fits your numbers.
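A back-of-envelope model makes the trade-off concrete. Every number below is invented for illustration, and the formula is a deliberately simple steady-state assumption: the daily indexing spend is spread across that day's queries and added to the per-query cost.

```python
def cost_per_query(
    index_cost_per_doc: float,
    new_docs_per_day: float,
    queries_per_day: float,
    query_cost: float,
) -> float:
    """Per-query cost with daily indexing spend amortized over daily queries."""
    if queries_per_day <= 0:
        raise ValueError("queries_per_day must be positive")
    return query_cost + index_cost_per_doc * new_docs_per_day / queries_per_day


# Hypothetical desk A: near-static archive, heavy query load.
static_desk = cost_per_query(
    index_cost_per_doc=0.50, new_docs_per_day=1, queries_per_day=5000, query_cost=0.002
)

# Hypothetical desk B: fast-moving corpus, a handful of queries.
churn_desk = cost_per_query(
    index_cost_per_doc=0.50, new_docs_per_day=400, queries_per_day=20, query_cost=0.002
)

print(f"static: {static_desk:.4f} per query, churn: {churn_desk:.3f} per query")
```

With these made-up inputs, indexing is a rounding error for the first desk and dominates the bill for the second, which is the case for leaning on long-context routing instead.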
The discipline that makes it safe
Everything above improves the odds of retrieving the right evidence. None of it guarantees a correct answer, and on financial documents a confident wrong number is the failure that actually matters. This is where a model-risk mindset earns its place, and where most of the engineering blogs go quiet.
Three controls are not optional on anything that feeds a decision. Require citations: every answer should point to the source passage a reader can check in seconds, which turns the system from an oracle into an assistant whose work you can verify. Build in abstention: a system that can say it does not know is worth more than one that always answers, because a wrong figure in a memo does more harm than a blank. Measure on your own questions: benchmark accuracy on your filings, your formats, and your edge cases, because the only evaluation that settles a deployment is the one built from the work you actually do. A system that scores well on a public benchmark and has never seen your documents has told you almost nothing.
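The first two controls can be enforced mechanically. The sketch below assumes the model returns its answer together with verbatim quotes from the retrieved passages. `Answer`, `gate`, and the abstention message are hypothetical names for illustration.

```python
from dataclasses import dataclass

ABSTAIN = "Not enough evidence in the retrieved passages to answer."


@dataclass(frozen=True)
class Answer:
    text: str
    citations: tuple[str, ...]  # verbatim quotes the model claims as support


def gate(answer: Answer, passages: list[str]) -> str:
    """Release an answer only if every citation is found verbatim in a passage."""
    if not answer.citations:
        return ABSTAIN
    for quote in answer.citations:
        if not any(quote in passage for passage in passages):
            return ABSTAIN
    return answer.text


passages = ["Acme 2023 10-K, cloud segment. Revenue grew 14% year over year."]

supported = Answer("Cloud revenue grew 14%.", ("Revenue grew 14% year over year.",))
invented = Answer("Cloud revenue grew 41%.", ("Revenue grew 41%",))
uncited = Answer("Revenue grew.", ())

print(gate(supported, passages))
print(gate(invented, passages))
```

This checks only that the cited text exists in the evidence, not that the answer follows from it, so it complements evaluation on your own questions rather than replacing it.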
One more control belongs in any serious deployment: a human in the loop for the answers that matter. Not every query needs review. The ones that feed a published number, a client memo, or a trade decision do. The right design routes high-stakes answers to a person with the citations attached. The check then takes seconds rather than a re-derivation. Far from a failure of automation, this is what makes the automation deployable in a regulated setting, where the cost of a confident wrong number is measured in more than embarrassment. The system does the retrieval and the drafting. The human owns the sign-off on what counts, which on a quant desk is simply how the work has always been done, now with a faster first draft.
The field-guide checklist
If you are greenlighting a filing-QA project, this is the short version. Fix the chunks first, with document-aware context, because that is the largest single gain. Use a finance-tuned embedding model and a keyword index together, then rerank the shortlist. Add a confidence gate that escalates hard queries to long context rather than forcing retrieval to carry everything. Require citations and allow the system to abstain. Evaluate on your own documents rather than a public leaderboard. Do those things, in that order, and you will have addressed the failure modes that sink most projects before they reach production.
Where teams go wrong
The failures are predictable. They cluster. The most common is to blame the model. A team sees wrong answers, concludes the LLM is not good enough, and swaps in a bigger one, when the real fault is a retriever handing it the wrong page. FinanceBench is the antidote to that instinct. The second mistake is to demo on easy questions. A system that answers “what was revenue” looks impressive until a user asks for the year-over-year change in the cloud segment excluding a one-time charge, which needs both the right passages and real reasoning over them. The third is to skip evaluation, shipping on the strength of a good demo, then discovering the error rate only when a wrong number reaches a client. The fourth is to over-engineer, bolting on an agentic loop before the one-shot retrieval underneath it works, which just iterates over bad evidence more slowly. Get the base right, measure honestly on hard questions, and add complexity only when the measurement says to.
The bottom line
RAG on financial documents is fundamentally a retrieval-and-discipline problem, and both halves are solved well enough to build on with confidence. The model can already answer the question once the right evidence is in front of it, which FinanceBench demonstrates starkly. The work is getting that evidence there, with contextual chunking, finance-tuned embeddings, keyword matching, and reranking, then refusing to answer when the evidence is thin. Skip retrieval entirely when the document fits in context and a router says to. None of this is exotic. It is a sequence of known moves, applied in the right order, wrapped in a verification gate that treats a wrong number as the real risk it is. That sequence is the difference between a demo that impresses and a system a desk can actually trust.
On filings, retrieval is the bottleneck the model gets blamed for. Fix the chunks, tune the embeddings, rerank, route the hard queries to long context, and refuse to answer without evidence: the moves are known, the discipline is the hard part.