Implementation risk: same strategy, five engines, 3.71 points of divergence
Ask five backtesting engines to run the identical strategy on identical data and you assume you will get the identical answer. This paper checked, across five independent open-source engines, fifteen benchmark strategies, 180 S&P 500 stocks, and four transaction-cost regimes. The assumption holds exactly until costs enter. At zero transaction costs the engines agree to the decimal, with a divergence of 0.000 percent; with realistic costs the same logical strategy returns answers up to 3.71 percentage points apart, and the gap scales almost perfectly with turnover, at a Spearman rank correlation of 0.93. The audit also surfaced seven previously undocumented defects across three engines, found because someone finally compared outputs across engines instead of trusting each one in isolation.
The mechanism is mundane, which is why it went unmeasured for so long. Engines implement costs at different points in the order lifecycle, round and fill differently, time their rebalances against price marks differently, and each choice is individually defensible. The deltas compound per trade, so low-turnover strategies barely notice while high-turnover rotation strategies, where the 3.71-point divergence lives, accumulate engine personality into their results. None of the engines is wrong; the strategy’s reported performance is simply a joint product of the logic and the machinery, and only the logic was ever written down.
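To make the compounding concrete, here is a minimal sketch with two cost conventions that are each defensible, agree exactly at zero cost, and drift apart as turnover rises. The two conventions, the function names, and the return series are illustrative assumptions, not the behavior of any engine in the paper.

```python
def terminal_value(returns, turnover, cost_rate, convention):
    """Compound a unit portfolio under one of two hypothetical cost conventions."""
    value = 1.0
    cost = cost_rate * turnover  # per-period cost as a fraction of portfolio value
    for r in returns:
        if convention == "deduct_at_trade":
            # Cost comes out of capital first, then the reduced capital earns r.
            value *= (1.0 - cost) * (1.0 + r)
        elif convention == "net_from_return":
            # Cost is subtracted from the period return instead.
            value *= 1.0 + r - cost
        else:
            raise ValueError(f"unknown convention: {convention}")
    return value


def divergence_points(returns, turnover, cost_rate):
    """Absolute gap in cumulative return between conventions, in percentage points."""
    a = terminal_value(returns, turnover, cost_rate, "deduct_at_trade")
    b = terminal_value(returns, turnover, cost_rate, "net_from_return")
    return abs(a - b) * 100.0


if __name__ == "__main__":
    monthly_returns = [0.01] * 120  # made-up data: 1% a month for ten years
    for turnover in (0.1, 1.0, 2.0):
        gap = divergence_points(monthly_returns, turnover, cost_rate=0.001)
        print(f"turnover={turnover:.1f} per period, gap={gap:.4f} points")
```

The per-period difference between the two conventions is the product of cost, turnover, and return, which is why it vanishes at zero cost and grows with trading intensity. Real engines differ in more ways than this (rounding, fill timing, rebalance marks), so treat the sketch as the simplest member of a family.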
A vocabulary for a problem every desk has met
The paper’s lasting contribution may be metrological: four named quantities that turn an anecdote every quant owns into something a validation function can report. Engine sensitivity measures how much a metric moves across implementations. The implementation uncertainty interval puts a band around any reported figure, the error bar backtests have always deserved and never carried. The divergence amplification factor tracks how design choices like turnover inflate the band. The conclusion stability index asks the question committees actually care about: does the investment decision survive the engine choice? On these fifteen strategies it does, stability index of one, every engine agreeing on the sign even while disagreeing on the size.
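A rough sketch of how the four quantities might be computed from per-engine results follows. The formulas are assumptions chosen to match the verbal descriptions above (range across engines, min-to-max band, ratio of ranges, share of engines on the majority side of a threshold); the paper's exact definitions may differ, and the numbers are invented for illustration.

```python
def engine_sensitivity(values):
    """Assumed definition: range of one metric across engines."""
    return max(values) - min(values)


def uncertainty_interval(values):
    """Assumed definition: the min-to-max band across engines."""
    return (min(values), max(values))


def amplification_factor(baseline_values, stressed_values):
    """Assumed definition: how much wider the band is for a stressed design
    (for example higher turnover) than for a baseline design."""
    baseline = engine_sensitivity(baseline_values)
    if baseline == 0:
        raise ValueError("baseline sensitivity is zero; the ratio is undefined")
    return engine_sensitivity(stressed_values) / baseline


def conclusion_stability(values, threshold=0.0):
    """Assumed definition: share of engines on the majority side of the threshold.
    A value of 1.0 means every engine agrees on the sign."""
    above = sum(1 for v in values if v > threshold)
    return max(above, len(values) - above) / len(values)


if __name__ == "__main__":
    # Hypothetical annual returns (percent) for one strategy on five engines.
    low_turnover = [5.0, 5.5, 5.25, 5.0, 5.5]
    high_turnover = [12.0, 11.5, 10.0, 12.5, 11.0]
    print("sensitivity:", engine_sensitivity(high_turnover))
    print("interval:", uncertainty_interval(high_turnover))
    print("amplification:", amplification_factor(low_turnover, high_turnover))
    print("stability:", conclusion_stability(high_turnover))
```

Passing a hurdle rate as `threshold` instead of zero turns the stability check into the committee's actual question: do all engines put the strategy on the same side of the bar?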
The sign-stability result cuts both ways and the paper is straight about it. For these benchmark strategies, implementation risk changed magnitudes rather than decisions, which is genuinely reassuring for go/no-go calls on robust strategies. The reassurance thins exactly where decisions get close: a marginal strategy whose Sharpe clears the bar by a tenth is well inside a 0.75-point band, let alone a 3.71-point one, and marginal cases are where allocation committees spend their time. An uncertainty interval that is irrelevant for the obvious winners and decisive for the borderline candidates is not a footnote; it is the difference between funding and passing on every strategy that lives near the threshold.
Two readings extend the result beyond its own tables. The five engines here are open-source, inspectable, and community-hardened; proprietary vendor engines and in-house frameworks receive no such cross-examination, which makes the published divergence a floor rather than a ceiling. Nobody runs this comparison against a closed engine, by construction, leaving the desks most confident in their tooling with the least measurement behind the confidence. And the seven defects matter beyond their count: every one was found by comparing outputs across implementations, none by the engines’ own test suites. That is the oldest result in software reliability, N-version comparison catching what self-testing cannot, arriving in a domain that has run production capital on single-version trust for decades.
The governance translation
This result completes a chain this blog has been assembling for two years. FinanceBench showed the retrieval layer silently deciding LLM accuracy. Look-Ahead-Bench showed the training cutoff silently deciding backtest validity. Now the execution engine joins the list of unexamined components that were always part of the model. The pattern has a name in model-risk practice: the model boundary keeps being drawn too small, around the algorithm, when the reportable object is the full pipeline from data to number. Pin the engine version the way you pin model weights, record it the way TiMi-style systems stamp their strategy hash onto every fill, and treat an engine migration as a model change with parallel-run evidence, because that is what the divergence numbers say it is.
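One way to pin and record the engine is a small run manifest attached to every backtest result. The sketch below is an assumption about what such a record could hold; the `RunManifest` fields, engine names, and version strings are hypothetical, and only the standard library is used.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record stored next to every reported backtest number."""
    engine_name: str
    engine_version: str
    strategy_sha256: str
    data_sha256: str


def sha256_hex(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def build_manifest(engine_name, engine_version, strategy_source, data_bytes):
    """Hash the strategy source and the input data so both are pinned with the engine."""
    return RunManifest(
        engine_name=engine_name,
        engine_version=engine_version,
        strategy_sha256=sha256_hex(strategy_source.encode("utf-8")),
        data_sha256=sha256_hex(data_bytes),
    )


def requires_parallel_run(previous: RunManifest, current: RunManifest) -> bool:
    """Treat any change of engine or engine version as a model change."""
    return (previous.engine_name, previous.engine_version) != (
        current.engine_name,
        current.engine_version,
    )


if __name__ == "__main__":
    source = "def signal(prices): return prices[-1] > prices[0]"
    old = build_manifest("engine_a", "1.4.2", source, b"pinned-dataset")
    new = build_manifest("engine_a", "1.5.0", source, b"pinned-dataset")
    print(requires_parallel_run(old, new))
```

Because the strategy and data hashes are unchanged between the two manifests, any movement in the reported metric after the upgrade can be attributed to the engine alone.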
The question generalizes past backtesting to every numerical pipeline a desk trusts on faith: the risk engine, the attribution system, the cost model itself. Each is an implementation of logic someone believes is unambiguous; each would likely show its own divergence band if anyone ran the comparison. Backtesting merely went first because five open implementations existed to compare.
The cheap-to-adopt protocol falls straight out of the paper’s design. Any strategy heading to committee at nonzero costs gets a second-engine confirmation run, with the divergence reported beside the headline metric as an implementation uncertainty interval. High-turnover proposals, the 3.71-point class, earn a third engine. The cost is hours of compute against the alternative this paper quietly documents: seven engine defects that nobody had found, because every desk validated its strategy inside the same machinery that was distorting it. Cross-engine confirmation is the backtesting version of reconciling against an independent system, a control finance applies everywhere else money is computed, finally arriving at the place where strategies are born.
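The protocol is simple enough to encode as a pre-committee check. The sketch below assumes a hypothetical turnover cutoff for the high-turnover class and a hypothetical report format; neither comes from the paper, and a desk would set its own.

```python
def engines_required(cost_bps, annual_turnover, high_turnover_cutoff=10.0):
    """How many independent engines a proposal needs before committee.

    The cutoff (turnover in multiples of capital per year) is a placeholder.
    """
    if cost_bps <= 0:
        return 1  # engines agreed exactly at zero cost in the paper's results
    if annual_turnover >= high_turnover_cutoff:
        return 3
    return 2


def committee_line(metric_name, results, cost_bps, annual_turnover):
    """Format the headline metric with its implementation uncertainty band.

    `results` maps engine name to metric value; the first entry is the primary run.
    """
    needed = engines_required(cost_bps, annual_turnover)
    if len(results) < needed:
        raise ValueError(f"need {needed} engines, got {len(results)}")
    values = list(results.values())
    headline = values[0]
    return (
        f"{metric_name}: {headline:.2f} "
        f"[{min(values):.2f}, {max(values):.2f}] across {len(values)} engines"
    )


if __name__ == "__main__":
    runs = {"engine_a": 11.2, "engine_b": 10.4}  # invented results
    print(committee_line("annual_return_pct", runs, cost_bps=10, annual_turnover=2.0))
```

Raising an error when too few engines have been run makes the control hard to skip: a high-turnover proposal with only two runs never produces a committee line at all.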
The durable artifact a desk should build from this paper is a canonical-strategy regression suite: a dozen strategies spanning the turnover spectrum, pinned data, expected metric ranges per engine version. Run it on every engine upgrade the way a software team runs tests on every dependency bump; any drift in the canon then flags an implementation change before it contaminates live research. The suite costs a day to assemble and runs in minutes. It converts engine risk from an annual audit surprise into a continuous, boring control, which is the correct fate for every risk on this list.
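A minimal sketch of the drift check at the heart of such a suite is below. The strategy names, engine names, and expected ranges are placeholders; a real suite would load them from a file pinned alongside the data.

```python
def find_drift(expected_ranges, observed):
    """Return every canonical result that is missing or outside its expected range.

    Keys are (strategy, engine) pairs; expected values are (low, high) bounds
    recorded against the previous engine version.
    """
    drift = []
    for key, (low, high) in expected_ranges.items():
        if key not in observed:
            drift.append((key, None, (low, high)))  # run is missing entirely
            continue
        value = observed[key]
        if not (low <= value <= high):
            drift.append((key, value, (low, high)))
    return drift


if __name__ == "__main__":
    # Hypothetical canon: annual return (percent) expected per strategy and engine.
    expected = {
        ("buy_and_hold", "engine_a"): (7.9, 8.1),
        ("monthly_momentum", "engine_a"): (10.0, 10.6),
        ("daily_rotation", "engine_a"): (12.0, 13.5),
    }
    after_upgrade = {
        ("buy_and_hold", "engine_a"): 8.0,
        ("monthly_momentum", "engine_a"): 10.3,
        ("daily_rotation", "engine_a"): 11.4,  # moved outside its range
    }
    for key, value, bounds in find_drift(expected, after_upgrade):
        print(f"DRIFT {key}: observed {value}, expected {bounds}")
```

Spanning the turnover spectrum matters here: in this made-up run the low-turnover strategies pass and only the high-turnover one flags, which is the pattern the paper's turnover result would lead you to expect from a change in cost handling.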
One prediction, offered with the confidence of someone who has watched this movie in other domains: implementation uncertainty intervals will migrate from this paper into due-diligence questionnaires within a couple of years, the way deflated Sharpe ratios did. Allocators learn new questions slowly, then all at once. The desks that can already answer “our number is X plus or minus the engine band, sign-stable across three implementations” will find the question a gift.
Identical strategies diverge by up to 3.71 points across backtesting engines once costs enter, scaling with turnover at 0.93 correlation: the engine is part of the model, so pin it, band it, and confirm the marginal calls on a second implementation.