
Let the market choose which source your RAG trusts

June 11, 2026 · 7 min read

RAG · financial-signals · point-in-time

Financial RAG systems rank evidence by textual relevance. Around a real corporate event, the most similar article often just restates the headline the market already priced. This paper from two MIT researchers keeps the language model frozen and moves all the adaptation into the retrieval layer, learning which source family has historically been useful for each event type and horizon from matured market feedback. On an 89-stock Nasdaq universe over 2021 to 2023, the frozen reader with source memory lifts held-out macro-F1 from 0.438 to 0.471 and a diagnostic portfolio Sharpe from 0.52 to 0.84 net of a 10bps cost proxy, while a LoRA-tuned reader helps less.

The setup is event-driven and strict about time. Each prediction is anchored to a timestamped news item. The system retrieves point-in-time news and SEC filing passages observable before the decision. A frozen Llama-3.1-8B-Instruct then outputs a residual-impact call of plus one, zero, or minus one at the one-, three-, and five-day horizons. After each horizon matures, the realized residual return, the stock move after stripping out QQQ through a horizon-matched market model, becomes a label. That label updates an external Beta-style source memory indexed by source family, event type, and horizon. The reader never sees a gradient. Only the ranking changes.
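The labeling step can be sketched in a few lines. This is a minimal illustration, not the paper's code: the least-squares market model, the function names, the sample returns, and the 1% neutral band (`threshold`) are all assumptions chosen for the example.

```python
from statistics import mean


def fit_market_model(stock_rets, market_rets):
    """Least-squares fit of stock = alpha + beta * market on a pre-event window."""
    mx, my = mean(market_rets), mean(stock_rets)
    var = sum((x - mx) ** 2 for x in market_rets)
    cov = sum((x - mx) * (y - my) for x, y in zip(market_rets, stock_rets))
    beta = cov / var
    return my - beta * mx, beta


def residual_label(stock_ret, market_ret, alpha, beta, threshold=0.01):
    """Map a matured residual return to a +1 / 0 / -1 feedback label.

    threshold is a hypothetical neutral band, not a value from the paper.
    """
    residual = stock_ret - (alpha + beta * market_ret)
    if residual > threshold:
        return 1
    if residual < -threshold:
        return -1
    return 0


# Hypothetical estimation window of horizon-matched returns (QQQ and one stock).
qqq = [0.010, -0.020, 0.030, 0.000]
stock = [0.001 + 1.5 * r for r in qqq]
alpha, beta = fit_market_model(stock, qqq)

# After the horizon matures: stock +5.0%, QQQ +1.0% over the same window.
print(residual_label(0.050, 0.010, alpha, beta))
```

Fitting the model on returns measured at the same horizon as the prediction is one way to read "horizon-matched"; the estimation window has to end before the decision time for the label to stay point-in-time.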

Adaptation lives in the retriever; the reader stays frozen

The Beta-style memory scores source family × event type × horizon. Final rank = relevance + lambda × clipped utility, with lambda = 0.30 and the memory contribution capped at 0.06. Feedback moves the ranking, never the weights.
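A rough sketch of how such a memory and rank adjustment could fit together follows. Only lambda = 0.30 and the 0.06 cap come from the article. The success/failure update rule, the centering of utility at 0.5, the symmetric clip, and the pseudo-count `PRIOR` are assumptions made for illustration.

```python
from collections import defaultdict

LAMBDA = 0.30            # memory strength, from the article
MAX_CONTRIBUTION = 0.06  # cap on the memory's share of the score, from the article
PRIOR = 5.0              # hypothetical pseudo-count per side; shrinks thin cells to neutral


class SourceMemory:
    """Beta-style counts keyed by (source_family, event_type, horizon)."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0.0, 0.0])  # key -> [successes, failures]

    def update(self, source_family, event_type, horizon, success):
        # Called only after the horizon has matured and the label is known.
        cell = self.counts[(source_family, event_type, horizon)]
        cell[0 if success else 1] += 1.0

    def utility(self, source_family, event_type, horizon):
        wins, losses = self.counts.get((source_family, event_type, horizon), (0.0, 0.0))
        posterior_mean = (wins + PRIOR) / (wins + losses + 2 * PRIOR)
        return posterior_mean - 0.5  # zero for an unseen cell


def final_score(relevance, utility):
    # Clip utility so that LAMBDA * utility never exceeds the cap in either direction.
    cap = MAX_CONTRIBUTION / LAMBDA
    clipped = max(-cap, min(cap, utility))
    return relevance + LAMBDA * clipped


memory = SourceMemory()
for _ in range(30):  # hypothetical history: filings paid off on earnings events
    memory.update("filing", "earnings", 3, success=True)

filing = final_score(0.70, memory.utility("filing", "earnings", 3))
news = final_score(0.74, memory.utility("news", "earnings", 3))
print(filing > news)
```

With these made-up numbers the filing passage overtakes a news article that is 0.04 more relevant, but it could not overtake one that is 0.10 more relevant. That is the sense in which the clip keeps textual relevance primary.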

The design targets a mismatch this archive has circled all year. Textual similarity is not market usefulness. Near-duplicate news repeats information already in the price, while a buried filing passage may carry the guidance or risk language that moves a stock over a week. FinanceBench showed retrieval over filings fails most of the time. Finance-tuned embeddings lifted recall by teaching the retriever finance vocabulary. This paper adds a different lever: let realized residual returns tell the retriever which well of evidence has paid off before, conditioned on the kind of event that just fired.

The paper’s own worked example makes the gap concrete. On 30 April 2020 Apple reported a Q2 beat. The most relevant article led with the beat. The fact that moved the stock sat in the 8-K: Apple withdrew guidance for the coming quarter, a COVID-era signal a pure relevance ranker buries under the louder headline.

One event, two retrieval philosophies (AAPL, 30 Apr 2020)

Source memory had learned that for earnings events the filing has paid off, so it lifts the guidance withdrawal above the louder beat.

The result that separates this from the usual factor-zoo backtest is a control rather than the headline number. When the authors replace each feedback label with a random draw from the class distribution, the source-memory gain collapses back to the no-memory baseline, 0.441 macro-F1 against 0.438. Shuffling the event-type key weakens the gain without erasing it, so source utility carries some global structure but works best conditioned on what happened. The lift is real source-trust learning rather than the incidental reordering any reranker produces. For anyone who has watched a learned signal turn out to be a relabeled coin flip, that ablation is the first thing to check. The paper leads with it.
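Both controls are simple to reproduce on any feedback log. The sketch below assumes a hypothetical record layout (dicts with `event_type` and `label` fields) and fixed seeds. It shows the two perturbations only, not the paper's evaluation harness.

```python
import random
from collections import Counter


def random_label_control(labels, seed=0):
    """Replace every feedback label with a draw from the empirical class distribution."""
    rng = random.Random(seed)
    counts = Counter(labels)
    classes = sorted(counts)
    weights = [counts[c] for c in classes]
    return rng.choices(classes, weights=weights, k=len(labels))


def shuffle_event_keys(records, seed=0):
    """Permute event types across records, keeping each record's label and source."""
    rng = random.Random(seed)
    event_types = [r["event_type"] for r in records]
    rng.shuffle(event_types)
    return [{**r, "event_type": e} for r, e in zip(records, event_types)]


# Hypothetical feedback log.
records = [
    {"source_family": "filing", "event_type": "earnings", "label": 1},
    {"source_family": "news", "event_type": "earnings", "label": 0},
    {"source_family": "news", "event_type": "guidance", "label": -1},
    {"source_family": "filing", "event_type": "m&a", "label": 0},
]
noise_labels = random_label_control([r["label"] for r in records], seed=1)
shuffled = shuffle_event_keys(records, seed=1)
```

Feeding `noise_labels` into the memory keeps the class balance but destroys the link between evidence and outcome. Feeding `shuffled` keeps that link and breaks only the event-type conditioning, which matches the two ablations described above.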

Held-out macro-F1: from random reranking to learned source trust

Random labels add nothing (0.441 against 0.438). The lift appears only when real outcomes are keyed to event type and credited to the evidence the reader actually cited.

The gain also lands where it pays. Scored only on the positive and negative events a trader acts on, non-neutral macro-F1 rises from 0.311 to 0.356, clear of the dominant do-nothing class that flatters any all-class average.
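A toy example shows why the all-class average flatters. The sketch assumes one plausible reading of the metric, averaging per-class F1 over the +1 and -1 classes only. The paper may define it differently, and the labels below are invented.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for a single class, returning 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over the chosen classes."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Invented labels: six neutral events, two positive, two negative.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, -1, -1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, -1, 0]

all_class = macro_f1(y_true, y_pred, classes=(-1, 0, 1))
non_neutral = macro_f1(y_true, y_pred, classes=(-1, 1))
print(round(all_class, 3), round(non_neutral, 3))
```

A reader that defaults to the neutral call scores well on the majority class and pulls the all-class number up, while the non-neutral average reflects only the calls a trader would act on.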

The more provocative finding is architectural. Adapting the retrieval layer beat fine-tuning the reader on the same labels.

Diagnostic portfolio Sharpe, net of 10bps cost

Same 89-stock universe, 2021 to 2023; QQQ ran 0.41 over the window. Source memory on a frozen reader tops a tuned reader, and tuning adds little once retrieval adapts.

A LoRA reader trained on the same matured labels improved static retrieval modestly, then added almost nothing once the source memory switched on. The reason is the label. Residual returns are noisy market feedback rather than clean semantic annotation. Pouring them straight into reader weights invites the model to memorize lexical accidents. A small Bayesian table indexed by source and horizon absorbs that noise more gracefully, with count-based shrinkage on thin cells and a clip that keeps textual relevance primary. The single knob is well-behaved. Sweeping the memory strength from zero to 0.30 lifts the diagnostic Sharpe from 0.50 to 0.80. Pushing it higher decays performance as source utility starts to override the text score. The maximum memory contribution is capped near 0.06, enough to reorder relevance-comparable candidates while leaving textual relevance in charge.

Read this as a model-risk practitioner and the audit trail is what stands out. SEC passages are timestamped by EDGAR acceptance datetime rather than filing date. A filing is eligible only if its acceptance precedes the decision time. The market-context card excludes anything observable after the close. The reported future-document rate is 0.00%, with near-duplicate anchor matches below 1% and the reader citing a genuine retrieved id in about 88% of outputs. That discipline is exactly what Look-Ahead-Bench found most LLM-on-history setups get wrong. A retrieval design that bakes point-in-time eligibility into the candidate pool is harder to contaminate than one that filters after the scoring.
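The eligibility rule is easy to express as a filter applied before any scoring. The sketch below uses a hypothetical `Passage` record and made-up ids and timestamps. The one thing it takes from the paper is the rule that a passage must be observable strictly before the decision time, with filings stamped by EDGAR acceptance datetime.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Passage:
    passage_id: str
    source_family: str
    observable_at: datetime  # EDGAR acceptance datetime for filings, publish time for news


def eligible_pool(passages, decision_time):
    """Keep only passages observable strictly before the decision time."""
    return [p for p in passages if p.observable_at < decision_time]


def future_document_rate(retrieved, decision_time):
    """Share of retrieved passages that were not yet observable at decision time."""
    if not retrieved:
        return 0.0
    late = sum(1 for p in retrieved if p.observable_at >= decision_time)
    return late / len(retrieved)


decision = datetime(2022, 5, 3, 14, 30, tzinfo=timezone.utc)
candidates = [
    Passage("news-1", "news", datetime(2022, 5, 3, 13, 0, tzinfo=timezone.utc)),
    Passage("8k-7", "filing", datetime(2022, 5, 3, 14, 29, tzinfo=timezone.utc)),
    # Dated the same day, but accepted after the decision: must be excluded.
    Passage("8k-9", "filing", datetime(2022, 5, 3, 20, 5, tzinfo=timezone.utc)),
]

pool = eligible_pool(candidates, decision)
print([p.passage_id for p in pool], future_document_rate(pool, decision))
```

Filtering the candidate pool first means the ranker never scores a future document, so the audit metric is zero by construction instead of by a post-hoc check.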

The honest reading keeps the numbers in proportion. The portfolio is a fixed diagnostic rule rather than an optimized strategy, a long-only top-ten sleeve on a three-day hold. The Sharpe of 0.84 beats the QQQ benchmark and the equal-weight universe. Much of the edge is risk reduction. Annualized volatility falls from 19.1% to 16.9% and the drawdown narrows. The signal looks more like a better-behaved tilt than a fresh source of return. Turnover runs near 9% a day, which a 10bps proxy prices kindly for a real Nasdaq book that pays spread and impact. The universe is 89 names and two source families. None of that sinks the contribution. It bounds how far a 0.84 travels before live frictions reprice it.
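A back-of-envelope calculation puts a scale on that friction. It takes the 9% daily turnover and 16.9% volatility from the paper and assumes 252 trading days, a cost charged in proportion to turnover, and gross returns and volatility that do not change with the cost level. The paper's turnover convention (one-way or two-way) is not stated here, so treat the output as an order of magnitude.

```python
TRADING_DAYS = 252  # assumption: standard trading-day count per year


def annual_cost_drag(daily_turnover, cost_bps, trading_days=TRADING_DAYS):
    """Annual return lost to costs when each unit of turnover pays cost_bps."""
    return daily_turnover * (cost_bps / 10_000) * trading_days


def sharpe_drag(daily_turnover, cost_bps, annual_vol):
    """Approximate Sharpe points lost to costs, holding volatility fixed."""
    return annual_cost_drag(daily_turnover, cost_bps) / annual_vol


for bps in (10, 20, 30):
    drag = annual_cost_drag(0.09, bps)
    print(f"{bps}bps: {drag:.2%} a year, about {sharpe_drag(0.09, bps, 0.169):.2f} Sharpe")
```

Under these assumptions each 10bps of cost removes a little over two percentage points of annual return, so the gap between a 10bps proxy and realistic all-in costs is not a rounding error against a Sharpe of 0.84.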

What a desk should take is the shape, which is cheap to copy. Keep a strong reader fixed, attach a small interpretable memory that scores where evidence has paid off by event type and horizon, then let matured outcomes move the retrieval ranking instead of the model. It composes with the rest of the filings stack this archive has documented: the reranking and small-to-big defaults, and the near-duplicate curation that standardized filings demand. It also travels past finance. Any retrieval system with a delayed ground-truth signal (fraud review, claims triage, support deflection) can score its sources the same way. The eighteen years I have spent watching signals decay teach the one caution the paper already flags. A source that paid off through 2021 to 2023 earned it in that regime. The memory has to keep updating or it becomes one more stale weight.

The cheapest adaptation in financial RAG may be learning where to retrieve from rather than how to read: a frozen reader with a small market-fed source memory beat a fine-tuned one. The random-label control shows the edge is signal, not reranking noise.
