Contextual Retrieval: fixing the chunk that forgot where it came from
A chunk pulled from a filing usually loses the one thing that gives it meaning. The paragraph says revenue rose 14%. It no longer says which company, which segment, or which quarter. Anthropic’s Contextual Retrieval fixes that at indexing time. The effect is large. In Anthropic’s reported results, contextual embeddings plus contextual BM25 and reranking cut the top-20 retrieval failure rate by 67%, from 5.7% to 1.9%. That is a direct hit on the failure mode that wrecks RAG on financial documents.
The problem it solves is the one FinanceBench makes brutally clear: on filings, retrieval is the bottleneck, not the model. Standard chunking severs a number from its context. A vector search then ranks that orphaned chunk by surface similarity and hands the model a figure detached from the entity and period it belongs to. The answer comes out fluent and wrong.
How it works
The idea is almost embarrassingly simple. Before you embed a chunk, you ask an LLM to write a short, document-aware sentence that situates it, for example: “This chunk is from Acme’s 2023 10-K, discussing the cloud segment’s revenue.” That context gets prepended to the chunk before both the embedding and the keyword index are built. The chunk now carries the entity, period, and topic that the raw text had lost.
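The indexing step can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: `contextualize_chunks` and `generate_context` are hypothetical names, and the stub below stands in for the LLM call that would write the situating sentence.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    generate_context: Callable[[str, str], str],
) -> List[str]:
    """Return each chunk with a situating sentence prepended.

    generate_context is a hypothetical callable (document, chunk) -> sentence;
    in practice it would wrap an LLM call.
    """
    contextualized = []
    for chunk in chunks:
        context = generate_context(document, chunk).strip()
        # Fall back to the raw chunk if no context was produced.
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def stub_generate_context(document: str, chunk: str) -> str:
    # Stand-in for the LLM call; returns the example sentence from the text.
    return "This chunk is from Acme's 2023 10-K, discussing the cloud segment's revenue."


raw_chunks = ["Revenue grew 14% year over year, driven by the cloud segment."]
indexed_texts = contextualize_chunks("(full filing text)", raw_chunks, stub_generate_context)
# indexed_texts is what both the embedding model and the keyword index would see.
```

The point of the design is that the same contextualized string feeds both indexes, so the two retrieval paths agree on what each chunk is about.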
The gains stack. Contextual embeddings alone cut the failure rate by 35%, from 5.7% to 3.7%. Adding a contextual BM25 keyword index, which catches exact entity and number matches that pure embeddings miss, raises the reduction to 49%, leaving a 2.9% failure rate. A reranking step on top reaches 67%, leaving 1.9%. Each layer attacks a different way retrieval fails, and together they cut the failure rate to a third of where it started.
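The percentages are relative reductions against the 5.7% baseline, which is easy to check. The snippet below only redoes the arithmetic on the figures quoted above; the variable names are mine.

```python
BASELINE = 5.7  # top-20 retrieval failure rate, in percent

# Failure rate (percent) after each layer is added, as quoted in the text.
stages = {
    "contextual embeddings": 3.7,
    "plus contextual BM25": 2.9,
    "plus reranking": 1.9,
}


def relative_reduction(baseline: float, rate: float) -> float:
    """Fraction of the baseline failure rate that was eliminated."""
    return (baseline - rate) / baseline


reductions = {
    name: round(100 * relative_reduction(BASELINE, rate))
    for name, rate in stages.items()
}
# Share of the original failures that remain after all three layers.
remaining_share = stages["plus reranking"] / BASELINE
```

A 67% reduction leaves about one third of the original failures, not one quarter.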
Why the pieces matter for financial text
The contextual prefix is the headline. The BM25 layer is the part a quant should appreciate. Financial questions hinge on exact matches: a specific ticker, a precise line-item label, a particular fiscal period. Pure vector search blurs those into semantic neighbors. A query for one company’s 2023 figure can retrieve a near-identical 2022 table. A keyword index preserves the exact match that embeddings smear, and pairing the two recovers a signal that either alone would lose. The reranker then does the final cleanup, reordering the shortlist with a model that looks at the query and the candidates together.
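Pairing the two rankings needs some fusion rule. The text above does not say which one to use, so the sketch below assumes reciprocal rank fusion, a common choice; the chunk IDs and the constant `k=60` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each list contributes 1 / (k + rank) to a chunk's score, so a chunk
    ranked well by both retrievers beats one ranked well by only one.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical rankings: the vector search prefers the near-identical 2022
# table, the keyword search prefers the exact 2023 match.
vector_ranking = ["acme_2022_table", "acme_2023_table", "beta_2023_table"]
keyword_ranking = ["acme_2023_table", "acme_2023_mdna", "acme_2022_table"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

In this toy case the 2023 table comes out on top because both retrievers rank it highly, while the 2022 table is first for only one of them. A reranker would then reorder the fused shortlist.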
None of this changes the model. It changes what the model is handed. That is the same lesson FinanceBench taught from the other direction: the capability is already there. The accuracy lives in getting the right evidence in front of it. Contextual Retrieval is one of the cleanest ways to close that gap.
The catch a desk has to price
The context is generated once per chunk, by an LLM, at indexing time. For a static corpus that is a one-time cost you amortize over every future query. It is clearly worth it. For a corpus that updates constantly (news, intraday filings, a live transcript feed), that indexing cost recurs. The question becomes whether the retrieval gain justifies re-contextualizing every new document. Prompt caching makes the per-chunk cost small, which helps. The calculation still belongs in your pipeline budget rather than your assumptions.
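A back-of-envelope version of that calculation might look like the sketch below. It assumes a simplified pricing model in which the full document is billed at the normal input price once and at a cheaper cached price for every later chunk. The function name, parameters, and numbers are placeholders, not real prices or measurements.

```python
def contextualization_cost(
    num_chunks: int,
    doc_tokens: int,
    chunk_tokens: int,
    context_tokens: int,
    input_price: float,
    cached_input_price: float,
    output_price: float,
) -> float:
    """Estimated cost of contextualizing one document.

    Prices are in currency units per million tokens. Assumption: the document
    is read at full price for the first chunk and at the cached price after.
    """
    if num_chunks <= 0:
        return 0.0
    doc_input = doc_tokens * input_price + (num_chunks - 1) * doc_tokens * cached_input_price
    chunk_input = num_chunks * chunk_tokens * input_price
    output = num_chunks * context_tokens * output_price
    return (doc_input + chunk_input + output) / 1_000_000


# Placeholder numbers for illustration only.
params = dict(
    num_chunks=100,
    doc_tokens=50_000,
    chunk_tokens=500,
    context_tokens=80,
    input_price=1.0,
    output_price=5.0,
)
with_cache = contextualization_cost(cached_input_price=0.1, **params)
without_cache = contextualization_cost(cached_input_price=1.0, **params)
```

Multiply the per-document figure by the number of new documents per day to get the recurring line item for a fast-moving corpus.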
There is also a dependency worth naming. The quality of the prefix depends on the LLM that writes it. A careless context sentence can mislead as easily as a good one helps. Spot-check the generated context on a sample of your own documents before you trust the index built on top of it.
A worked example on a filing
Walk a single chunk through it. The raw chunk reads: revenue grew 14% year over year, driven by the cloud segment. On its own that sentence is a trap. There are dozens of companies whose cloud segment grew 14% in some year. A vector index cannot tell them apart. Retrieve that chunk for the wrong company and the model produces a confident, specific, wrong answer.
Now add the context. Before indexing, an LLM prepends: this passage is from Acme Corp’s fiscal 2023 10-K, in the segment-results section. The chunk that gets embedded now contains the company, year, and section. A query about Acme’s 2023 cloud revenue matches it cleanly. A query about a different company no longer collides with it. The exact-match BM25 layer reinforces this, because the company name and the fiscal year are now literal tokens in the index, recoverable by a keyword search even when the embedding is ambiguous.
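The keyword side of this example can be made concrete with a toy score. The snippet below counts shared tokens between the query and each chunk. It is a deliberately crude stand-in for BM25, and "Beta Inc" is a made-up second company added for contrast.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, chunk: str) -> int:
    """Number of distinct query tokens that appear literally in the chunk."""
    return len(tokens(query) & tokens(chunk))


raw = "Revenue grew 14% year over year, driven by the cloud segment."
acme = (
    "This passage is from Acme Corp's fiscal 2023 10-K, "
    "in the segment-results section. " + raw
)
# Hypothetical second filing containing the identical raw sentence.
beta = (
    "This passage is from Beta Inc's fiscal 2023 10-K, "
    "in the segment-results section. " + raw
)

query = "Acme 2023 cloud revenue"
scores = {
    "raw": keyword_overlap(query, raw),
    "acme": keyword_overlap(query, acme),
    "beta": keyword_overlap(query, beta),
}
```

The raw chunk matches only "cloud" and "revenue". The Acme chunk matches all four query tokens and the other company's chunk matches three, so the identical sentences are no longer interchangeable.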
This is the same gap FinanceBench measured from the model’s side. The model answered 85% of questions correctly once handed the right pages, and only 19% when left to find them. Contextual Retrieval attacks the finding-them step directly, by making each chunk carry the identifying facts that let the retriever find it. It does not make the model smarter. It makes the evidence findable, which is the half of the problem that was actually broken.
The honest caveat is that the prefix is itself a model output. A context sentence that misattributes a chunk, that assigns it to the wrong company or period, poisons the index in a way that is hard to notice later. The fix is the same discipline you apply to any generated artifact in the pipeline. Sample the generated prefixes, check them against the source documents, and treat a high error rate in the context step as the bug it is before you build retrieval on top of it.
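A first-pass spot check can be automated before a human reads the sample. The sketch below assumes each record carries the company and fiscal year from the document's own metadata next to the generated prefix. The field names are hypothetical, and a literal string match is only a coarse filter.

```python
import random
from typing import Dict, List


def spot_check_prefixes(
    records: List[Dict[str, object]], sample_size: int, seed: int = 0
) -> List[Dict[str, object]]:
    """Sample records and return those whose prefix omits the known entity or year.

    Each record is assumed to have 'company', 'fiscal_year', and 'prefix' keys.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(records, min(sample_size, len(records)))
    flagged = []
    for record in sample:
        prefix = str(record["prefix"]).lower()
        company_ok = str(record["company"]).lower() in prefix
        year_ok = str(record["fiscal_year"]) in prefix
        if not (company_ok and year_ok):
            flagged.append(record)
    return flagged


records = [
    {
        "company": "Acme Corp",
        "fiscal_year": 2023,
        "prefix": "This passage is from Acme Corp's fiscal 2023 10-K, in the segment-results section.",
    },
    {
        # Deliberately wrong year, to show what a misattribution looks like.
        "company": "Acme Corp",
        "fiscal_year": 2023,
        "prefix": "This passage is from Acme Corp's fiscal 2022 10-K, in the segment-results section.",
    },
]
flagged = spot_check_prefixes(records, sample_size=2)
error_rate = len(flagged) / len(records)
```

Anything flagged goes to a human reviewer. A clean pass here does not prove the prefixes are right, only that they name the expected company and year.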
One more practical note on where this fits. Contextual Retrieval is a preprocessing step, which means it composes with everything else in the stack rather than replacing it. You still need a good base embedding model. You still want the reranker. You still measure end-to-end accuracy on your own questions. What the contextual prefix adds is a cheap, high-leverage improvement to the input every other component sees. In a stack that already has hybrid retrieval and reranking, adding contextual chunking is one of the few remaining moves that improves every downstream step at once, because it improves the quality of what gets indexed in the first place. That is the test of a good preprocessing step: it makes the components you already have work better, without forcing you to rip any of them out.
How I would use it
I would use it as a default preprocessing step for any retrieval over filings, transcripts, or research notes. Build contextual embeddings and a contextual BM25 index, add a reranker, and measure the failure rate on a set drawn from your own documents. Treat the indexing cost as a line item, especially if your corpus moves fast. The technique is cheap, model-agnostic, and aimed precisely at the way financial retrieval breaks, which makes it one of the highest-leverage upgrades available to a filing-QA system. On a system like that, the combination of low cost and high leverage is rare enough that skipping it is hard to justify once you have measured the gain.
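Measuring the failure rate on your own set takes little code. The sketch below uses a simplified definition: a query fails if none of its labeled relevant chunks appears in the top k results. That may differ from the exact metric behind the published figures, and the query and chunk IDs are made up.

```python
from typing import Dict, List, Set


def retrieval_failure_rate(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 20,
) -> float:
    """Fraction of labeled queries with no relevant chunk in the top k."""
    if not relevant:
        raise ValueError("need at least one labeled query")
    failures = 0
    for query_id, gold_ids in relevant.items():
        top_k = set(retrieved.get(query_id, [])[:k])
        if not (top_k & gold_ids):
            failures += 1
    return failures / len(relevant)


# Hypothetical labels and retriever output for three questions.
relevant = {"q1": {"c1"}, "q2": {"c7"}, "q3": {"c9"}}
retrieved = {"q1": ["c3", "c1"], "q2": ["c2", "c4"], "q3": ["c9"]}

rate_at_20 = retrieval_failure_rate(retrieved, relevant, k=20)
rate_at_1 = retrieval_failure_rate(retrieved, relevant, k=1)
```

Run it once on the plain index and once on the contextual one, with the same questions and the same k, so the comparison isolates the preprocessing change.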
Retrieval on filings fails because chunks lose their context. Add it back before you index: a contextual prefix, a keyword layer, a reranker. In Anthropic’s reported results, the failure rate then falls by two thirds.