Look-Ahead-Bench: the cardinal sin of backtesting gets a meter

January 24, 2026 · 6 min read

Tags: look-ahead-bias, benchmark, LLM-evaluation

Every quant learns the cardinal sin in week one: never let information from the future touch a decision dated in the past. LLMs commit it by construction. A model trained through mid-2024 has read the financial history of its training window: the earnings surprises, the crashes, which stocks worked. Any backtest inside that window grades the model on material it memorized. Look-Ahead-Bench turns that suspicion into a number, and the number is brutal: standard LLMs lose 15 to 22 percentage points of alpha the moment the evaluation crosses their training cutoff.

The design is the dual-period experiment every desk should have been running already. Period one sits inside the candidate models’ training windows: April 2021 through September 2023, with a buy-and-hold return of +25.32%. Period two sits safely beyond the cutoffs: July through December 2024, with a nearly identical buy-and-hold return of +24.75%, deliberately matched so that market regime cannot explain the difference. Each model runs the same trading-decision tasks in both periods. Alpha decay is simply the period-two alpha minus the period-one alpha. A model with genuine skill should decay mildly, the way real signals do. A model that was reciting its training data falls off a cliff.
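The decay arithmetic is small enough to sketch. The snippet below is a minimal illustration, not the benchmark's own code: the function name `alpha_decay` and the dictionary layout are assumptions, and the three input pairs are the alphas quoted in this post.

```python
def alpha_decay(alpha_in_sample: float, alpha_out_of_sample: float) -> float:
    """Decay in percentage points: period-two alpha minus period-one alpha."""
    return round(alpha_out_of_sample - alpha_in_sample, 2)


# (period-one alpha, period-two alpha) in percentage points, as quoted in this post.
quoted = {
    "DeepSeek 3.2": (20.73, -1.04),
    "Pitinf (large tier)": (6.02, 7.32),
    "Momentum baseline": (7.96, 5.75),
}

decays = {name: alpha_decay(a_in, a_out) for name, (a_in, a_out) in quoted.items()}

for name, decay in decays.items():
    print(f"{name}: {decay:+.2f}")
```

The sign convention matters: a negative decay means the alpha shrank after the cutoff, and a positive one means it held or grew.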

Figure: The dual-period audit. Matched market conditions across periods isolate the cutoff as the only variable.

The table that ends the argument

The results separate cleanly into two regimes. The general-purpose models, Llama 3.1 and DeepSeek 3.2 among them, post period-one alphas between +13.81 and +20.73 points that look like discovered skill. Past their cutoffs, the same models produce -3.42, +4.02, and -1.04. DeepSeek’s collapse is the most instructive: the highest in-sample alpha in the table, +20.73, becomes -1.04 out of sample, a decay of -21.77 points. The Pitinf models, trained point-in-time so that each evaluation date sees only the history available at that date, post modest alphas in period one and hold them in period two, with decay actually slightly positive, +0.31 to +1.30, improving with scale.

Table: Alpha in percentage points per period; decay is the audit number, sign included.

Read the first column the way a fraud examiner would: the in-sample alpha of a general LLM is not a capability estimate, it is a memorization estimate dressed as one.

The paper’s forensic details make the mechanism undeniable. Models reproduce closing prices to within 1% for dates inside their training window. Prompted about 2023, they volunteer that NVIDIA surged 190%, which is exactly the knowledge a point-in-time decision must not have. For calibration, the paper runs classical baselines through the same harness: a momentum strategy earns +7.96 points of alpha in period one and keeps +5.75 in period two, the mild decay an honest signal shows, while mean reversion shows a decay of -10.78 points, a reminder that strategy fragility exists independently of language models.
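The price-recall probe can be approximated in-house. The sketch below assumes you have already collected a model's recalled closing prices for a set of dates and have the true closes to compare against. The function `recall_hit_rate`, the cutoff date used to split the sample, and every price shown are hypothetical; only the 1% tolerance echoes the figure quoted above.

```python
def recall_hit_rate(recalled: dict[str, float], actual: dict[str, float],
                    tolerance: float = 0.01) -> float:
    """Fraction of dates where the recalled close is within `tolerance` of the actual close."""
    common = [d for d in recalled if d in actual]
    if not common:
        raise ValueError("no overlapping dates to compare")
    hits = sum(
        abs(recalled[d] - actual[d]) <= tolerance * abs(actual[d]) for d in common
    )
    return hits / len(common)


# Made-up prices for illustration only; dates are ISO strings so they sort correctly.
actual = {"2023-03-01": 100.0, "2023-03-02": 102.0,
          "2024-09-03": 110.0, "2024-09-04": 108.0}
recalled = {"2023-03-01": 100.4, "2023-03-02": 101.5,
            "2024-09-03": 97.0, "2024-09-04": 120.0}

ASSUMED_CUTOFF = "2024-01-01"  # hypothetical cutoff, not any real model's
in_window = {d: p for d, p in recalled.items() if d < ASSUMED_CUTOFF}
post_cutoff = {d: p for d, p in recalled.items() if d >= ASSUMED_CUTOFF}

in_rate = recall_hit_rate(in_window, actual)
post_rate = recall_hit_rate(post_cutoff, actual)
print(f"inside window: {in_rate:.0%}, past cutoff: {post_rate:.0%}")
```

A high hit rate inside the window paired with a low one past it is consistent with recall rather than forecasting, though a probe this small is suggestive at best.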

What this changes operationally

The benchmark arrives as the missing instrument for a problem this archive has circled all year. The moving-targets signal needed an encoder whose knowledge ended before the evaluation window. FinDPO’s eye-watering backtest was a teaching case in exactly this leak. Every LLM-derived signal in production today carries the same exposure, mostly unmeasured. What changed in January is that the measurement is now a protocol anyone can run: pick two regime-matched windows straddling your model’s cutoff, run your actual task in both, difference the alphas. Alpha decay across the cutoff is the LLM version of the out-of-sample test; a signal that cannot survive it was never a signal.

Building the in-house version has two traps worth naming before someone automates it. First, regime matching is harder than picking two windows with similar index returns: the periods should also resemble each other in volatility, rate environment, and sector leadership, or the decay number confounds cutoff effects with regime effects, which is exactly the confound the original design used its matched buy-and-hold returns to remove. Second, the cutoff itself is fuzzier than the model card admits, since post-training and preference data often postdate the pretraining cutoff. That argues for placing the out-of-sample window well past every date the vendor discloses rather than one day past the headline one.
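A first-pass regime check can be sketched as follows. It compares two windows on cumulative return and annualized volatility only; rate environment and sector leadership would need their own inputs. The function names, the tolerances, and the 252-day annualization are illustrative assumptions, and the return series are synthetic.

```python
import math
import statistics


def window_stats(daily_returns: list[float]) -> dict[str, float]:
    """Cumulative return and annualized volatility for one window of simple daily returns."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    vol = statistics.stdev(daily_returns) * math.sqrt(252)  # 252 trading days assumed
    return {"cum_return": growth - 1.0, "ann_vol": vol}


def regimes_match(a: list[float], b: list[float],
                  return_tol: float = 0.02, vol_tol: float = 0.05) -> bool:
    """True if both windows agree on return and volatility within the given tolerances."""
    sa, sb = window_stats(a), window_stats(b)
    return (abs(sa["cum_return"] - sb["cum_return"]) <= return_tol
            and abs(sa["ann_vol"] - sb["ann_vol"]) <= vol_tol)


# Synthetic windows: all three share the same average daily return,
# but the third is far more volatile.
calm_1 = [0.001, -0.002, 0.003, 0.000, 0.002] * 20
calm_2 = [0.002, -0.002, 0.002, 0.001, 0.001] * 20
choppy = [0.03, -0.028, 0.031, -0.03, 0.001] * 20

print(regimes_match(calm_1, calm_2))
print(regimes_match(calm_1, choppy))
```

The tolerances are the real design decision here. Set them too loose and the decay number absorbs regime differences; set them too tight and no pair of windows qualifies.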

The triage for signals already in production follows the same logic. Inventory every LLM-derived signal by the model’s cutoff against the evaluation window that justified deployment. Anything validated entirely inside the cutoff gets re-tested on post-cutoff data at the next refresh, with position sizing held back until the decay number comes in. The cost is one evaluation cycle; the alternative is discovering the decay live, at full size, the way mean reversion discovered its own in the table above.
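The inventory step lends itself to a small script. This is a sketch under assumptions: the `Signal` record, its field names, the three labels, and the example entries are all hypothetical. The `effective_cutoff` field is meant to hold the latest date the vendor discloses for any training stage, not only the headline pretraining cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Signal:
    name: str
    effective_cutoff: date  # latest disclosed training date (assumed field)
    eval_start: date        # start of the backtest that justified deployment
    eval_end: date          # end of that backtest


def triage(signal: Signal) -> str:
    """Classify a signal by where its validation window sits relative to the cutoff."""
    if signal.eval_end <= signal.effective_cutoff:
        return "retest"   # validated entirely inside the training window
    if signal.eval_start <= signal.effective_cutoff:
        return "partial"  # window straddles the cutoff; only the tail is clean
    return "clean"        # validated entirely on post-cutoff data


# Hypothetical inventory entries for illustration.
inventory = [
    Signal("earnings-tone", date(2024, 6, 30), date(2021, 4, 1), date(2023, 9, 30)),
    Signal("guidance-drift", date(2024, 6, 30), date(2023, 10, 1), date(2024, 12, 31)),
    Signal("filing-risk", date(2023, 12, 31), date(2024, 7, 1), date(2024, 12, 31)),
]

labels = {s.name: triage(s) for s in inventory}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Anything labeled `retest` would have its sizing held back until a post-cutoff decay number exists. How to treat `partial` is a judgment call; one conservative option is to re-score only the post-cutoff tail of the window.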

Three operational consequences follow for a model-risk function. First, any backtest of an LLM-assisted strategy whose window overlaps the model’s training data is inadmissible as deployment evidence, full stop; the table above is what that overlap is worth, 15 to 22 points of phantom alpha. Second, vendor claims need the cutoff question asked first: a demo on 2023 data from a model trained through 2024 demonstrates retrieval, and pricing it as forecasting is paying alpha fees for a history book. Third, the scaling result on the point-in-time side matters for the build decision: Pitinf alphas grow with model size, +6.02 in-sample at the large tier, held at +7.32 out of sample, which says genuine capability does live in these architectures once the leak is sealed, improving with scale rather than evaporating.

Reading a vendor demo after this paper

One question changes the meeting: where does the evaluation window sit relative to the cutoff?

The honest caveats keep the verdict calibrated. A six-month out-of-sample period is one regime; the paper itself flags multi-period extensions as ongoing. The Pitinf models are the authors’ own, which is the usual conflict to note even when the methodology is sound. And the harness measures one task family, trading decisions from textual information, leaving extraction, summarization, and analysis tasks for the same treatment later. None of this dulls the core instrument: the dual-period design transfers to any task with dated ground truth.

The deeper read is about where LLM-in-finance evaluation is heading this year. The era of quoting benchmark scores from inside the training distribution is closing, the same way in-sample Sharpe quoting closed a generation ago, and for the same reason: someone built the instrument that makes the sin visible. The desks that internalize this first get a quiet edge, earned by being the only ones whose evidence means what it claims rather than by better models. Point-in-time discipline took fundamentals decades to standardize. The text side now has its first ruler, seventeen years after I first watched a backtest die of a vendor data revision.

Standard LLMs shed 15 to 22 points of alpha the moment evaluation crosses their training cutoff, while point-in-time models hold steady: in-sample LLM alpha is memorization in costume, and the audit now takes an afternoon.
