Sort on the bound, not the point: uncertainty-adjusted ML portfolios
Every ML asset-pricing pipeline built on the Gu-Kelly-Xiu template, the lineage that runs through to the transformer-in-the-SDF result, does the same thing at the last step: rank stocks by the model’s point forecast, buy the top decile, short the bottom. This paper asks the question that step has been begging for years. The model produces an uncertainty estimate alongside every forecast, essentially for free. The standard pipeline throws it away at exactly the moment it becomes useful. Sorting on uncertainty-adjusted bounds instead of raw points improves portfolio performance across models and decades, with the gains coming mainly from lower volatility.
The construction is deliberately practical. For each stock, the method pools historical prediction residuals, then builds a bound around the point forecast with half-width set by a chosen quantile of those absolute residuals, the quantile-of-residuals approach in the Steinberger-Leeb and jackknife+ lineage of conformal-style inference. No retraining, no new model, no distributional assumptions strong enough to argue about. The sort then ranks stocks on the conservative end of the bound rather than the point. A stock with a high forecast and wide uncertainty slides down the ranking; a stock with a moderate forecast the model has historically nailed climbs.
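The construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the toy numbers, and the choice to rank the long leg on the lower bound and the short leg on the upper bound are my assumptions about what "the conservative end" means in practice.

```python
import numpy as np

def uncertainty_bounds(point_forecast, residual_history, q=0.05):
    """Bounds around each forecast from that stock's own past errors.

    point_forecast   : shape (n_stocks,), the model's point forecasts
    residual_history : shape (n_stocks, n_periods), past forecast errors
                       (realized minus predicted); NaN where missing
    q                : quantile of absolute residuals used as the half-width
    """
    half_width = np.nanquantile(np.abs(residual_history), q, axis=1)
    return point_forecast - half_width, point_forecast + half_width

def long_short_legs(lower, upper, frac=0.1):
    """Assumed convention: long the highest lower bounds, short the lowest upper bounds."""
    n = max(1, int(round(frac * len(lower))))
    long_idx = np.argsort(-lower)[:n]
    short_idx = np.argsort(upper)[:n]
    return long_idx, short_idx

# Toy data (hypothetical): stocks 0 and 3 have noisy histories, 1 and 2 tight ones.
forecasts = np.array([0.03, 0.02, -0.01, -0.02])
residuals = np.array([
    [0.05, -0.05, 0.05, -0.05],
    [0.005, -0.005, 0.005, -0.005],
    [0.005, -0.005, 0.005, -0.005],
    [0.05, -0.05, 0.05, -0.05],
])
lower, upper = uncertainty_bounds(forecasts, residuals, q=0.05)
long_idx, short_idx = long_short_legs(lower, upper, frac=0.25)
```

In the toy data a point sort would buy stock 0 and short stock 3. The bound sort instead buys stock 1 and shorts stock 2, the names with moderate forecasts and tight error histories.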
What it does to the numbers
The evaluation runs the full CRSP common-stock universe, monthly, 1967 through 2016, with the standard rolling scheme: train on 1967-1986, validate on 1987-1991, test from 1992 onward, refitting annually across the usual model zoo of penalized regressions, principal-components methods, random forests, boosted trees, and neural networks of one to five layers.
The baseline panel confirms what the literature has reported since 2020: point-prediction sorts work. Boosted trees lead with an annualized 39.45% at 18.40% volatility, a 2.14 Sharpe before costs. Shallow networks land between 1.45 and 1.73. The uncertainty adjustment then moves the needle where it matters. The one-layer network improves from a 1.48 Sharpe to 1.86 at the 5% quantile level. Principal-components regression rises from 1.22 to 1.56.
Because the gains come mainly from lower volatility rather than higher returns, this is precisely the trade a portfolio manager would choose to make. Return improvements photograph well in papers, while volatility reductions compound quietly: they cut drawdowns, reduce margin strain, and let the same risk budget hold a larger position. The paper’s robustness section adds the result that makes the method credible rather than lucky: the gains persist when the bounds are built from partial or misspecified uncertainty information. You do not need the right uncertainty model. You need any disciplined one.
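The Sharpe figures quoted above support a quick back-of-envelope check. The sketch below first recovers the boosted-tree Sharpe from the quoted return and volatility, then asks how much volatility would have to fall to explain each Sharpe gain if mean return were unchanged. That last condition is my simplifying assumption, not a number from the paper, as is treating the quoted return as an excess return.

```python
# Figures quoted in the article for the boosted-tree point sort.
ann_return_pct = 39.45
ann_vol_pct = 18.40
# Assumption: the quoted return is already an excess return.
sharpe = ann_return_pct / ann_vol_pct

def implied_vol_cut(sharpe_before, sharpe_after):
    """Fractional volatility reduction that would explain a Sharpe gain
    if the mean return stayed the same (a simplifying assumption)."""
    return 1.0 - sharpe_before / sharpe_after

nn1_cut = implied_vol_cut(1.48, 1.86)  # one-layer network
pcr_cut = implied_vol_cut(1.22, 1.56)  # principal-components regression

print(f"boosted trees Sharpe: {sharpe:.2f}")
print(f"implied vol cut, one-layer network: {nn1_cut:.1%}")
print(f"implied vol cut, PCR: {pcr_cut:.1%}")
```

Under that assumption both improvements correspond to roughly a one-fifth cut in volatility. For the one-layer network, a fixed volatility target could then carry about 1.26 times the position.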
Why flexible models gain the most
The cross-model pattern carries the economics. Gains are largest for the flexible estimators, the networks and tree ensembles, and smallest for the rigid linear ones. The mechanism is intuitive once stated: flexible models have heterogeneous confidence across the cross-section, very sure about some names, guessing about others, which gives a sort that conditions on confidence real information to exploit. A linear model is roughly equally uncertain everywhere, leaving the bound little to reorder. Uncertainty adjustment is the complement to model flexibility, which means the shops running the fanciest models are leaving the most on the table by ignoring it.
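A toy example makes the mechanism concrete. The numbers below are invented for illustration. A uniform half-width stands in for the "roughly equally uncertain everywhere" linear model, and a dispersed one stands in for a flexible model.

```python
import numpy as np

# Hypothetical point forecasts for five stocks, already in descending order.
forecasts = np.array([0.030, 0.025, 0.020, 0.015, 0.010])

# Hypothetical half-widths: identical for every stock versus widely dispersed.
uniform_hw = np.full(5, 0.010)
hetero_hw = np.array([0.030, 0.002, 0.015, 0.001, 0.004])

def rank_order(scores):
    """Stock indices ordered from the highest score to the lowest."""
    return np.argsort(-scores, kind="stable")

point_rank = rank_order(forecasts)
uniform_rank = rank_order(forecasts - uniform_hw)  # lower bound, uniform uncertainty
hetero_rank = rank_order(forecasts - hetero_hw)    # lower bound, dispersed uncertainty

print(point_rank, uniform_rank, hetero_rank)
```

Subtracting the same half-width from every forecast leaves the ranking untouched. Only dispersion in the half-widths gives the bound anything to reorder, which is the sense in which flexible models have more to gain.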
The paper’s neural-network dispersion deserves a flag of its own, because it echoes through every ML-factor conversation I have. The one-to-three-layer networks earn Sharpes from 1.45 to 1.73; the four- and five-layer versions collapse below 1, sometimes below 0.75 after costs. Depth hurts here, consistently, across the whole 1992-2016 test window. Anyone whose pipeline inherited a deep architecture from a domain where depth pays should treat that panel as a free ablation study.
The quantile level should be treated as the risk dial it is. At the 1% level the bounds are tight and the sort barely departs from the point ranking; at 10% the bounds widen, the sort turns conservative, and more high-uncertainty names drop out of the extremes. The 5% results quoted above sit at the sensible middle, while the right setting for a live book is a tuning decision against your own turnover and capacity, not a constant to inherit. Two practical notes from the construction itself: bound-based rankings should be more stable month to month than point rankings, since they damp exactly the noisy forecasts that flip deciles, a turnover dividend worth measuring in any replication. And model uncertainty concentrates where tradability is worst, in small, illiquid, hard-to-borrow names; a bound-based sort that avoids them is partly rediscovering a liquidity screen, which is fine, as long as the capacity analysis prices the overlap honestly.
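Both points, the dial and the turnover dividend, are cheap to instrument. The sketch below uses a simulated residual pool and a hypothetical `leg_turnover` helper of my own. It shows the half-width growing with the quantile level and gives one simple way to measure how much of a leg changes from month to month.

```python
import numpy as np

# Hypothetical residual pool for one stock (simulated, not real data).
rng = np.random.default_rng(0)
abs_resid = np.abs(rng.normal(0.0, 0.10, size=240))

# Half-width at each quantile level: a higher level gives a wider bound.
half_widths = {q: float(np.quantile(abs_resid, q)) for q in (0.01, 0.05, 0.10)}

def leg_turnover(prev_members, curr_members):
    """Share of last month's leg that is no longer held this month."""
    prev, curr = set(prev_members), set(curr_members)
    if not prev:
        return 0.0
    return 1.0 - len(prev & curr) / len(prev)

# Example: one of four names rotates out of the long leg.
example_turnover = leg_turnover([1, 2, 3, 4], [2, 3, 4, 5])
```

Running `leg_turnover` on the point-sorted and bound-sorted deciles over the same months is one direct way to test whether the stability claim holds in a replication.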
The cost on the other side of the ledger is modest and worth stating. Bound-based sorts hold back from high-conviction, high-uncertainty names, which forfeits some of the lottery-ticket upside point sorts occasionally capture. After the standard 20-basis-point cost assumption the ordering survives, with the best post-cost configurations still well above 1 and the adjusted networks holding economically meaningful spreads. Nothing here repeals the usual caveats: a single panel, US equities, and the multiple-testing haircut every reported Sharpe owes. The within-paper comparison is the result: same models, same data, same costs, better risk-adjusted outcomes from one added step.
The platform read
For a multi-strategy platform the attraction is that this composes with everything already running. It does not compete with the forecast models, the factor blocks, or the portfolio optimizer; it inserts between forecasting and construction, where most pipelines currently pass a single number forward. The implementation is a residual archive per asset and a quantile lookup, infrastructure most research stacks half-maintain already for diagnostics.
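As a rough picture of how small that infrastructure is, here is a sketch of a per-asset residual archive with a quantile lookup. The class name, the rolling-window length, and the minimum-observation rule are all assumptions of mine, not details from the paper.

```python
from collections import defaultdict, deque

import numpy as np

class ResidualArchive:
    """Rolling store of absolute forecast errors, keyed by asset (hypothetical design)."""

    def __init__(self, max_len=120):
        # Each asset keeps only its most recent `max_len` absolute residuals.
        self._store = defaultdict(lambda: deque(maxlen=max_len))

    def record(self, asset_id, forecast, realized):
        """Append the absolute error once the realized return is known."""
        self._store[asset_id].append(abs(realized - forecast))

    def half_width(self, asset_id, q=0.05, min_obs=12):
        """Quantile of stored absolute residuals, or None if history is too short."""
        hist = self._store.get(asset_id)
        if hist is None or len(hist) < min_obs:
            return None
        return float(np.quantile(np.fromiter(hist, dtype=float), q))

# Usage with toy numbers: a window of 3 drops the oldest, largest error.
archive = ResidualArchive(max_len=3)
for realized in (0.10, 0.02, 0.02, 0.02):
    archive.record("AAA", forecast=0.0, realized=realized)
hw = archive.half_width("AAA", q=0.05, min_obs=3)
```

Returning `None` for thin histories is a deliberate choice in this sketch. It forces the caller to decide how to treat new listings rather than silently assigning them a zero-width bound.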
The pattern also rhymes across this archive in a way worth noticing. Uncertainty heads made LLM outputs deployable by attaching confidence to each generation; conformal-style bounds make ML forecasts deployable by attaching confidence to each prediction. Same move, different asset class: the second number is where the governance lives. A sort that documents why it avoided a name, the model’s own measured uncertainty, is also a sort a risk committee can interrogate, which is never true of a raw point ranking.
Watching ML pipelines mature on desks teaches one durable lesson: the upgrades that stick are the ones that subtract risk without adding moving parts. This is one of those. The forecast was never the deliverable. The decision was, and decisions have always needed the error bar.
The ML model already tells you which forecasts to distrust; sorting on uncertainty-adjusted bounds instead of points turns that free information into lower volatility, which is the improvement that compounds.