
CAFPO: deep RL on learned factors, before costs

September 30, 2025 · 6 min read

reinforcement-learning, factor-models, long-short

Take the architecture seriously before taking the table at face value. CAFPO, Conditional Auto-encoded Factor-based Portfolio Optimisation, compresses stock-level returns into a small set of latent factors conditioned on 94 firm characteristics, then feeds those factors to a deep RL agent, PPO or DDPG, that outputs continuous long-short weights. The design swaps the Fama-French factor block out of the RL pipeline and lets the network learn its own conditioning, which is the right experiment to run. A SHAP pass on top attributes the allocations back to characteristics, producing the economically intuitive explanations a risk committee would ask for.

Figure: CAFPO, learned factors inside the allocation loop. A learned conditioning block replaces Fama-French inside the RL loop; SHAP explains allocations.
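To make the shape of the conditioning block concrete, here is a minimal sketch of one forward pass through a conditional factor model. It is an illustration under stated assumptions, not the paper's network: the layers are single linear maps with random, untrained weights, the factor count `K = 5` is an assumed value, and every function and variable name is hypothetical. Only the 200-stock universe and the 94 characteristics come from the text.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def conditional_factor_forward(chars, returns, w_beta, w_factor):
    """One linear forward pass of a conditional factor model (illustrative).

    chars:    N x P firm characteristics, one row per stock
    returns:  length-N stock returns for the same period
    w_beta:   K x P weights mapping characteristics to factor loadings
    w_factor: K x N weights compressing returns into K latent factors
    """
    # Loadings are conditioned on characteristics: beta_i = w_beta @ z_i
    betas = [matvec(w_beta, z) for z in chars]
    # Latent factors are weighted combinations of returns: f = w_factor @ r
    factors = matvec(w_factor, returns)
    # Each stock's fitted return is its loadings dotted with the factors
    fitted = [sum(b * f for b, f in zip(beta, factors)) for beta in betas]
    return betas, factors, fitted


rng = random.Random(0)
N, P, K = 200, 94, 5  # N and P from the article; K = 5 is an assumption
chars = [[rng.gauss(0, 1) for _ in range(P)] for _ in range(N)]
returns = [rng.gauss(0, 0.02) for _ in range(N)]
w_beta = [[rng.gauss(0, 0.1) for _ in range(P)] for _ in range(K)]
w_factor = [[rng.gauss(0, 0.1) for _ in range(N)] for _ in range(K)]

betas, factors, fitted = conditional_factor_forward(chars, returns, w_beta, w_factor)
print(len(betas), len(betas[0]), len(factors), len(fitted))
```

The point of the sketch is the data flow: the K-dimensional `factors` vector (and, depending on the design, the loadings) is what an RL allocator would consume in place of a hand-specified Fama-French block.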

The evaluation runs January 2000 through December 2020 on an annually refreshed universe of the 200 largest US stocks by market cap, chosen to isolate stock selection. Against that setup, the comparison table is stark. CAFPO with PPO posts a 24.58% total compound return and a 0.94 Sharpe out of sample. Every classical baseline is negative over the same two decades: equal-weight at -42.62%, value-weight at -31.36%, historical Markowitz at -0.72%. The DRL baselines scrape past zero, vanilla at 1.45% and the Fama-French-conditioned variant at 2.87%.

Total compound return and Sharpe, 2000-2020 out of sample, top-200 universe, before costs.

The construction explains the strangeness of those baselines. Long-short books built to isolate selection skill have no market beta to ride, so the equity rallies inside those two decades do nothing for them; what remains is pure cross-sectional skill minus noise. The Sterling column repeats the verdict in drawdown units, CAFPO at 0.07 against near-zero and negative readings everywhere else. Both columns agree these are quiet books separated by a real but small edge. That framing matters for what the table can claim: it ranks selection methods cleanly. It says nothing about whether the best of them clears the bar of being worth running.

Read the table twice

The first read says CAFPO wins by a mile, and within the paper’s frame it does. The relative ordering is the legitimate result: learned conditional factors beat hand-specified Fama-French factors beat no factors at all, inside an identical RL harness. That ordering is evidence the conditioning block matters, consistent with what transformer-based covariance work found from the estimation side: the structure you impose on the cross-section is where the value hides.

The second read prices the absolute numbers. A 24.58% total return over twenty years compounds to roughly one percent a year, which a 0.94 Sharpe can only accompany if the book runs at very low volatility. A construction where every classical benchmark loses money across two decades that included two historic bull markets tells you these are long-short books whose absolute economics are thin. Thin economics have no room for friction. The paper does not mention transaction costs anywhere. An RL agent emitting continuous weight adjustments over 200 names is a turnover machine by temperament. One modest cost assumption per rebalance could consume the entire annualized edge. Until that line exists in the table, this is a ranking of methods rather than a strategy.
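The arithmetic behind that second read fits in a few lines. The sketch below annualizes the reported total return, backs out the volatility implied by the reported Sharpe, and asks how much turnover a given cost level could absorb before the edge is gone. The 20-year span follows the article's own phrasing, the 10 basis point cost is a hypothetical placeholder, and the Sharpe-to-volatility step assumes a zero risk-free rate; none of these come from the paper.

```python
def annualized_return(total_return, years):
    """Geometric annual rate implied by a total compound return."""
    return (1.0 + total_return) ** (1.0 / years) - 1.0


def implied_volatility(annual_return, sharpe):
    """Volatility implied by a Sharpe ratio, assuming a zero risk-free rate."""
    return annual_return / sharpe


def breakeven_turnover(annual_return, cost_bps):
    """Annual turnover (in multiples of capital) at which costs equal the edge."""
    return annual_return / (cost_bps / 10_000.0)


total = 0.2458      # CAFPO-PPO total compound return, from the article
sharpe = 0.94       # CAFPO-PPO Sharpe, from the article
years = 20          # assumption: "twenty years", as the article phrases it
cost_bps = 10.0     # hypothetical cost per unit of turnover

ann = annualized_return(total, years)
vol = implied_volatility(ann, sharpe)
turnover_cap = breakeven_turnover(ann, cost_bps)

print(f"annualized return:  {ann:.2%}")
print(f"implied volatility: {vol:.2%}")
print(f"break-even turnover at {cost_bps:.0f} bps: {turnover_cap:.1f}x capital per year")
```

Under these assumptions the annual rate lands near 1.1%, so any cost model worth the name has to be compared against an edge of that size, not against the 24.58% headline.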

The second flag is stability, and it comes from the paper’s own variants. Swap PPO for DDPG and the result collapses from 24.58% to 6.25% total, Sharpe from 0.94 to 0.42. When the headline return falls by three-quarters and the Sharpe by more than half on a change of RL algorithm, the reported number is a draw from a wide distribution, not a property of the method. The factor block may be doing real work while the RL layer adds variance on top, exactly the decomposition a desk would want before believing either part. Walk-forward stability of the latent factors, seed sensitivity, plus a deflated-Sharpe accounting for the configurations tried would settle it. None are reported.
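For readers who have not met the deflated Sharpe ratio, a minimal sketch follows, using the Bailey and López de Prado formulation as I understand it: the observed Sharpe is tested against the maximum Sharpe one would expect from pure luck across the number of configurations tried. Every input in the example is hypothetical, because the paper reports neither the trial count nor the dispersion of trial Sharpes; monthly observations over 240 months are also an assumption.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, trial_sharpe_variance):
    """Expected best per-period Sharpe among n_trials skill-less configurations."""
    if n_trials < 2:
        raise ValueError("need at least two trials")
    z = _STD_NORMAL.inv_cdf
    return math.sqrt(trial_sharpe_variance) * (
        (1.0 - EULER_GAMMA) * z(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * z(1.0 - 1.0 / (n_trials * math.e))
    )


def deflated_sharpe(sr, n_obs, n_trials, trial_sharpe_variance, skew=0.0, kurt=3.0):
    """Probability that the per-period Sharpe `sr` exceeds the luck benchmark.

    sr and trial_sharpe_variance are per-period (not annualized);
    skew and kurt describe the strategy's return distribution.
    """
    sr0 = expected_max_sharpe(n_trials, trial_sharpe_variance)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _STD_NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


monthly_sr = 0.94 / math.sqrt(12)  # reported Sharpe, assumed annualized from monthly data
n_months = 240                     # assumption: monthly observations over twenty years
trial_var = 0.01                   # hypothetical variance of per-period trial Sharpes

dsr_few = deflated_sharpe(monthly_sr, n_months, n_trials=2, trial_sharpe_variance=trial_var)
dsr_many = deflated_sharpe(monthly_sr, n_months, n_trials=50, trial_sharpe_variance=trial_var)
print(f"DSR with 2 trials:  {dsr_few:.3f}")
print(f"DSR with 50 trials: {dsr_many:.3f}")
```

The useful property is the direction of travel: holding everything else fixed, the same Sharpe looks less convincing as the number of configurations tried goes up, which is why the trial count belongs in the paper.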

The re-run a desk would actually do

The good news is that every objection above is testable with the paper’s own frame. Four amendments turn the ablation into evidence a committee could use. Put costs inside the environment rather than the postmortem: a per-trade charge in the simulator plus a turnover penalty in the reward, because an RL agent only learns to trade less when trading costs something during training. Walk the autoencoder forward: refit it on a rolling window and measure how stable the latent loadings are across adjacent refits; a factor block that reinvents itself every year is a risk model no one can govern. Attribute the variance: freeze the factor block and sweep allocator seeds, then freeze the allocator and swap factor blocks, turning the PPO-versus-DDPG gap from an anecdote into a decomposition. And respect capacity: the top-200 universe is the most liquid corner of the market, where an edge this thin will not survive a move down the liquidity spectrum.
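The first of those amendments, costs inside the environment, is small enough to sketch. The function below computes a one-step reward that charges a per-trade cost and then subtracts an extra turnover penalty, so the agent sees the price of trading during training. The names, the 10 basis point charge, and the penalty coefficient are all hypothetical, and the sketch ignores weight drift between rebalances for simplicity.

```python
def cost_aware_reward(prev_weights, new_weights, asset_returns,
                      cost_bps=10.0, turnover_penalty=0.0):
    """One-step reward net of trading costs, with an optional turnover penalty.

    prev_weights / new_weights: portfolio weights before and after the rebalance
    asset_returns:              per-asset returns realized over the next period
    cost_bps:                   hypothetical cost charged per unit of turnover
    turnover_penalty:           extra reward-shaping coefficient on turnover
    """
    # Turnover is the total absolute weight traded at this rebalance
    turnover = sum(abs(n - p) for p, n in zip(prev_weights, new_weights))
    # Gross P&L of the new book over the period
    gross = sum(w * r for w, r in zip(new_weights, asset_returns))
    # Costs hit the realized return; the penalty only shapes the reward
    net = gross - turnover * cost_bps / 10_000.0
    reward = net - turnover_penalty * turnover
    return reward, net, turnover


reward, net, turnover = cost_aware_reward(
    prev_weights=[0.5, -0.5],
    new_weights=[0.3, -0.3],
    asset_returns=[0.01, -0.02],
    cost_bps=10.0,
    turnover_penalty=0.001,
)
print(f"turnover={turnover:.2f} net={net:.4f} reward={reward:.4f}")
```

Keeping `net` separate from `reward` matters for evaluation: the penalty is a training device, while the performance table should report returns net of the actual cost charge only.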

The sample period deserves one more note. The 2000-2020 window spans two major crashes and two long recoveries, which is genuinely good regime coverage. It also ends before 2021, leaving the latest rate regime entirely untested. A desk re-running this would extend the window first, before any of the fancier amendments, because the cheapest test of a learned factor model is simply more out-of-sample time.

What survives the skepticism

Three things, none of them the headline. The conditional autoencoder as a factor block is worth lifting into any allocation pipeline that currently hard-codes its factors, because the FF-DRL-versus-CAFPO gap is the cleanest evidence in the paper. The SHAP attribution layer is the governance dividend: a learned factor model that can say which characteristics drove a weight is deployable in a way a black box never is, the same property that made KAN-style architectures interesting for committee-facing work. And the experimental frame itself, identical RL harness with swappable factor blocks, is a good template for in-house ablations.

One governance note on the SHAP layer before the verdict. Attribution computed once, on the published model, is a snapshot of a moving system: the moment the autoencoder refits, the factor meanings can drift while the explanation dashboard keeps its old labels. The deployable version re-runs the attribution at every refit and treats a large shift in the characteristic profile as a model-change event, with the same review that any recalibrated risk model would trigger.
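One way to operationalize that check, sketched under assumptions: summarize each refit's attributions as a normalized profile of mean absolute SHAP value per characteristic, then measure the distance between consecutive profiles. The SHAP values are taken as precomputed inputs here, and the 0.25 threshold and all function names are hypothetical choices a desk would calibrate for itself.

```python
def characteristic_profile(shap_rows):
    """Share of total mean |SHAP| attributed to each characteristic (sums to 1)."""
    n_features = len(shap_rows[0])
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / len(shap_rows)
        for j in range(n_features)
    ]
    total = sum(mean_abs)
    if total == 0:
        raise ValueError("all attributions are zero")
    return [m / total for m in mean_abs]


def profile_shift(old_profile, new_profile):
    """Total variation distance between two profiles, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(old_profile, new_profile))


def is_model_change_event(old_shap_rows, new_shap_rows, threshold=0.25):
    """Flag a refit whose characteristic profile moved more than the threshold."""
    shift = profile_shift(
        characteristic_profile(old_shap_rows),
        characteristic_profile(new_shap_rows),
    )
    return shift > threshold


# Toy attributions: rows are observations, columns are characteristics
before_refit = [[1.0, -1.0, 0.0], [1.0, 1.0, 0.0]]
after_refit = [[0.0, 0.0, 2.0], [0.0, 0.0, -2.0]]

print(profile_shift(characteristic_profile(before_refit),
                    characteristic_profile(after_refit)))
print(is_model_change_event(before_refit, after_refit))
```

A distance on normalized profiles is deliberately scale-free, so a refit that merely rescales all attributions does not trigger a review, while one that hands the explanation to a different set of characteristics does.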

My platform-side read: this is a research result about representation learning wearing a strategy paper’s clothes. The right response is neither dismissal nor deployment. Re-run the ablation on your own universe with your own cost model and your own turnover penalties in the reward, then see whether the learned-factor edge survives contact with implementation. The paper proves the conditioning block earns its place in the architecture; it proves nothing yet about the P&L. Those are different claims, and a twenty-year out-of-sample table can establish only the first.

CAFPO’s learned factors beat Fama-French inside an identical RL harness, which is real evidence about representation; one percent a year before costs, shrinking by three-quarters when PPO is swapped for DDPG, is not yet evidence about money.
