
The transformer enters the SDF: complexity wins asset pricing

February 8, 2025 · 10 min read

Tags: asset-pricing, SDF, transformers, complexity

Empirical asset pricing has spent fifty years on a parsimony diet: three factors, then five, then six, each addition litigated for a decade. Kelly, Kuznetsov, Malamud, and Xu arrive from the opposite direction. They implant a transformer directly in the stochastic discount factor, let attention share information across the entire cross-section of stocks, and report an out-of-sample Sharpe ratio of 4.57 on US equities from 1968 to 2022, against 1.77 for the best classical factor model in the comparison. The result is not a fluke of one architecture; it is the latest and largest entry in a research program arguing that in pricing, complexity is a virtue.

The foundational half of that program is the companion paper, Didisheim, Ke, Kelly, and Malamud’s “APT or AIPT?”, which earns the theory citation here. Where Ross’s APT conjectured that a small number of factors govern returns, AIPT demonstrates the opposite empirically: stochastic discount factors built from up to 360,000 random Fourier factors, a complexity ratio of 1,000 against the 360-month training window, deliver out-of-sample tangency Sharpe ratios near 3.7, roughly 2.6 times the low-complexity equivalents. That is more parameters than observations, by three orders of magnitude, with out-of-sample performance rising in the excess. Every instinct trained on classical statistics objects; the double-descent literature that machine learning normalized explains why the objection fails once complexity is paired with proper shrinkage.

The ladder from factors to attention

The transformer paper’s cleanest contribution is incremental attribution: each architectural ingredient is added separately, so each Sharpe gain is assignable. An SDF with no cross-asset information, in which each stock is priced from its own characteristics, reaches 3.6, already double the classical benchmark. Adding linear attention, which shares information across assets with no nonlinearity and identical inputs, lifts it to 3.9, with an alpha t-statistic of 6.8 over the attention-free version. Stacking depth and nonlinearity carries the full model to 4.31 and then 4.57.

Figure: Out-of-sample SDF Sharpe, US stocks 1968-2022. Each rung adds one ingredient; the attention step alone carries a 6.8 alpha t-statistic.

The decomposition isolates the economic claim: most of the leap over classical models comes from complexity itself, while the distinctly transformer-shaped gain comes from letting stocks price each other.
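The alpha t-statistic behind each rung is, in generic form, a spanning regression: regress the richer model's SDF portfolio returns on the simpler model's and test the intercept. The sketch below is a minimal version under stated assumptions: the function name `alpha_tstat` is hypothetical, the return series are synthetic stand-ins rather than anything from the paper, and the standard errors are plain homoskedastic OLS where a careful study would probably use a robust estimator.

```python
import numpy as np

def alpha_tstat(candidate, benchmark):
    """Regress candidate returns on a constant and benchmark returns.

    Returns (alpha, t-statistic of alpha) with classical OLS standard
    errors. Assumption: homoskedastic, serially uncorrelated residuals.
    """
    y = np.asarray(candidate, dtype=float)
    x = np.asarray(benchmark, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[0], coef[0] / np.sqrt(cov[0, 0])

# Synthetic monthly returns, purely illustrative.
rng = np.random.default_rng(0)
T = 600
base = 0.01 + 0.01 * rng.standard_normal(T)      # stand-in: attention-free SDF
extra = 0.002 + 0.005 * rng.standard_normal(T)   # stand-in: what attention adds
richer = base + extra

alpha, t_stat = alpha_tstat(richer, base)
print(f"alpha per month: {alpha:.4f}, t-stat: {t_stat:.1f}")
```

A significant intercept says the richer portfolio earns something the simpler one does not span, which is the sense in which each rung's gain is "assignable".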

Cross-asset attention is a learned, conditional version of something quants have always done by hand: peer comparisons, industry adjustments, lead-lag relationships. The difference is that the model discovers from the data which stocks inform which, and updates that map as conditions change. The hand-built covariance structures this blog examined in the fall look, in hindsight, like its fixed special case.

Figure: Cross-asset attention inside the SDF. Peer comparison as a learned, conditional operation rather than a fixed industry map.

What the machinery actually is

Two technical details from the paper repay the reading time, because they locate where the magic is and is not. The first is that linear attention, the version with no nonlinearity at all, is formally equivalent to a regression with roughly 2.3 million parameters built from interactions of the underlying characteristics. Most of the transformer’s mystique dissolves into that statement: cross-asset attention is a disciplined way to generate and shrink an enormous interaction space, the AIPT program executed through a different factory. The genuinely new behavior arrives with the softmax. The paper proves a selectivity lemma showing how the softmax sharpens attention onto a conditional subset of stocks, which is what upgrades static interactions into state-dependent peer selection, the difference between always comparing a stock to its industry and choosing tonight which comparison matters.
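To see why the linear case reduces to interactions while the softmax does not, here is a minimal single-head sketch in NumPy. It is a generic attention layer, not the paper's architecture: the shapes, the weight matrices `W_q`, `W_k`, `W_v`, the `temperature` knob, and the `effective_peers` diagnostic are all assumptions chosen for illustration. Without the softmax, the layer collapses algebraically to each stock's own characteristics multiplied by the cross-sectional second-moment matrix `X.T @ X`, a fixed polynomial in the inputs. With the softmax, each row of weights concentrates on a handful of peers.

```python
import numpy as np

def scores(X, W_q, W_k):
    """Entry (i, j): how strongly stock j's characteristics inform stock i."""
    return (X @ W_q) @ (X @ W_k).T / np.sqrt(W_q.shape[1])

def linear_attention(X, W_q, W_k, W_v):
    # No nonlinearity: raw scores weight the value vectors directly.
    return scores(X, W_q, W_k) @ (X @ W_v)

def softmax_weights(X, W_q, W_k, temperature=1.0):
    S = scores(X, W_q, W_k) / temperature
    S = S - S.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(S)
    return A / A.sum(axis=1, keepdims=True)

def effective_peers(A):
    """Inverse Herfindahl of each row: N if uniform, 1 if one peer dominates."""
    return 1.0 / (A ** 2).sum(axis=1)

rng = np.random.default_rng(0)
N, d, k = 200, 8, 4                      # stocks, characteristics, head width
X = rng.standard_normal((N, d))          # synthetic characteristics
W_q = rng.standard_normal((d, k))
W_k = rng.standard_normal((d, k))
W_v = rng.standard_normal((d, 1))

# Linear attention equals own characteristics times a cross-sectional moment.
out_linear = linear_attention(X, W_q, W_k, W_v)
M = W_q @ W_k.T / np.sqrt(k)
collapsed = X @ (M @ (X.T @ X) @ W_v)

# Softmax attention: row-stochastic weights that select peers.
A = softmax_weights(X, W_q, W_k)
out_softmax = A @ (X @ W_v)
peers = effective_peers(A)
soft_peers = effective_peers(softmax_weights(X, W_q, W_k, temperature=10.0))

print("linear attention collapses:", np.allclose(out_linear, collapsed))
print("mean effective peers:", peers.mean(), "vs", soft_peers.mean(), "at high temperature")
```

The collapse check is the toy version of the regression equivalence; the effective-peer count is one simple way to watch selectivity appear once the softmax is switched on.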

The second detail is a depth result with a practitioner moral: performance saturates at two transformer blocks. The 4.57 does not come from architectural enormity; it comes from one round of conditional information sharing applied well, then diminishing returns. Anyone budgeting a replication should hear that as good news, since the result lives at a scale a research cluster can train, rather than in frontier-lab territory. The bitter-lesson framing this invites, general computation plus data beating hand-crafted structure, lands differently in pricing than in language: here the hand-crafted structure being retired is the factor zoo itself, six factors at a time, each with a sponsor and a story.

The era result deserves its own exhibit, because it answers the practitioner’s first objection, that factor models worked fine until recently. In the post-2002 subsample every classical benchmark deflates toward noise, the best of them at a 0.95 Sharpe and the worst at 0.46, the familiar story of crowded, published factors decaying. The complex models deflate too, but to 3.37.

Figure: Post-2002 subsample, the era that humbled factor models. Everything decays after publication and crowding; complexity decays from a much higher floor.

What the theory actually claims

AIPT’s argument is worth restating precisely, because it is a statement about misspecification rather than magic. A low-dimensional factor model is a strong prior that almost surely excludes the true SDF; a high-dimensional model with ridge shrinkage spans a vastly larger space and lets the data pull toward the truth it contains. The random Fourier construction makes this concrete: nonlinear transformations of the same JKP characteristics everyone uses, multiplied into hundreds of thousands of managed portfolios, with shrinkage doing the discipline that variable selection used to fake.
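The construction described above can be sketched in a few lines. This is a minimal toy under stated assumptions, not the authors' code: the characteristics and returns are synthetic, the helper names (`make_rff`, `managed_portfolio_returns`, `ridge_sdf_weights`) are hypothetical, and the ridge formula is the generic shrunk mean-variance solution applied to the factor second-moment matrix. The point is only to show the three moving parts: random nonlinear features of characteristics, one managed portfolio per feature, and a ridge penalty `z` standing in for variable selection.

```python
import numpy as np

def make_rff(d, n_pairs, gamma, rng):
    """Draw random projection weights once; reuse them every period."""
    W = gamma * rng.standard_normal((d, n_pairs))
    def features(X):
        proj = X @ W
        return np.hstack([np.sin(proj), np.cos(proj)])   # (N, 2 * n_pairs)
    return features

def managed_portfolio_returns(chars, rets, features):
    """One factor return per feature: feature-weighted average of stock returns.

    chars: (T, N, d) characteristics known before the return is realized.
    rets:  (T, N) subsequent excess returns.
    """
    N = rets.shape[1]
    return np.stack([features(chars[t]).T @ rets[t] / N for t in range(len(rets))])

def ridge_sdf_weights(F, z):
    """Shrunk tangency weights: solve (z I + F'F / T) w = mean(F)."""
    T, P = F.shape
    return np.linalg.solve(z * np.eye(P) + F.T @ F / T, F.mean(axis=0))

# Synthetic panel, purely illustrative.
rng = np.random.default_rng(0)
T, N, d, n_pairs = 120, 50, 5, 150
chars = rng.standard_normal((T, N, d))
signal = 0.02 * np.tanh(chars[:, :, 0] * chars[:, :, 1])   # nonlinear, by design
rets = signal + 0.1 * rng.standard_normal((T, N))

features = make_rff(d, n_pairs, gamma=1.0, rng=rng)
F = managed_portfolio_returns(chars, rets, features)       # (T, 2 * n_pairs)
train, test = F[:60], F[60:]

complexity_ratio = train.shape[1] / train.shape[0]         # factors per training month
w = ridge_sdf_weights(train, z=1e-3)
oos = test @ w                                             # out-of-sample SDF portfolio returns
oos_sharpe = np.sqrt(12.0) * oos.mean() / oos.std(ddof=1)
print(f"complexity ratio: {complexity_ratio:.1f}, OOS Sharpe on toy data: {oos_sharpe:.2f}")
```

Rerunning with different `n_pairs` and `z` values is the toy analogue of tracing the virtue-of-complexity slope; the Sharpe it prints describes the synthetic panel only.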

Figure: Two worldviews of the pricing kernel. The empirical scoreboard, 3.7 vs 1.4 at matched discipline, currently reads one way.

The interpretability objection arrives on schedule. The honest response is that it bites less here than elsewhere. An SDF is a portfolio; its positions, exposures, and turnover are all inspectable ex post even when the function generating them is not, which puts it closer to the auditable-output side of the interpretability spectrum than a black-box classifier. What a committee cannot get is a three-sentence story for why the weights are what they are this month. Whether that story was ever more than ceremony for classical factors is a question the post-2002 panel asks pointedly.

The implementability haircut

Now the part the papers leave to the reader, which is where a practitioner earns their keep. An out-of-sample SDF Sharpe of 4.57 is a statement about a frictionless tangency portfolio rebalanced monthly across the full CRSP universe, including the small, illiquid names where characteristic signals run strongest. It is not a strategy quote. The haircuts arrive in a known order: transaction costs on monthly turnover across thousands of names; capacity limits that bind hardest precisely where the model loves most; implementation lag between signal and fill; and the multiple-testing discount owed by any result selected from a research program of this size, however honest the out-of-sample protocol. Industry replications of the adjacent complexity literature have found the ordering robust and the levels far more modest once real-world constraints enter, which is the expected fate and not a refutation.

The right reading is that the gap between 1.77 and 4.57 measures recoverable structure that classical models leave on the table, while the tradable fraction of that gap is an open empirical question each desk must price for its own costs and capacity.

Even a quarter of it surviving implementation would be the largest methodological gain in systematic equity in a generation; the experiment is precisely the kind a disciplined replication culture exists to run.

The objections that survive contact with the papers are two, both worth respecting. The first is selection across the research program: this is the strongest result from a prolific group running many specifications across many papers, which means the program-level discount applies even where each paper’s own out-of-sample protocol is clean, the standard posture this archive takes toward any family of results with one ancestry. The second is economic rather than statistical: an SDF with thousands of conditional positions is, mechanically, a high-turnover long-short machine, and critics reading it as sophisticated exploitation of microcap mispricings and short-horizon reversal are raising exactly the right question, because those are the alphas implementation eats first. Both objections sharpen the replication agenda rather than dissolving the result; neither touches the within-paper ladder, which is the cleanest part of the evidence.

The platform translation

For a multi-strategy platform, the actionable content is a reallocation of research effort, worth stating concretely. The complexity program shifts value away from the activity most factor teams spend their year on, arguing individual signals into the model, and toward three kinds of infrastructure. Data breadth first: the JKP characteristic library is the fuel for every result above, which means a desk’s edge increasingly lives in the breadth and cleanliness of its characteristic panel rather than the cleverness of any single signal built on it. Shrinkage discipline second: ridge parameters and validation protocol carry the statistical load that variable selection used to fake, which makes the tuning harness a governed artifact rather than a notebook setting. Monitoring third, because a 360,000-factor SDF cannot be reviewed signal by signal; it is reviewed the way this archive keeps concluding all complex models must be, at the portfolio layer, through exposures, turnover, concentration, and drawdown behavior, with factor-attribution tooling translating weights into committee language after the fact.
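Portfolio-layer review is straightforward to prototype. The sketch below is one possible shape for it, with everything labeled as an assumption: the function name `portfolio_diagnostics`, the choice of inverse-Herfindahl as the concentration measure, and an additive cumulative-PnL drawdown suited to a long-short book. It takes only a weights panel and a returns panel, which is the point: nothing here needs to know how the weights were generated.

```python
import numpy as np

def portfolio_diagnostics(weights, returns):
    """Summaries computed from positions alone.

    weights: (T, N) positions held over each period.
    returns: (T, N) asset returns over the same periods.
    """
    w = np.asarray(weights, dtype=float)
    r = np.asarray(returns, dtype=float)

    pnl = (w * r).sum(axis=1)
    gross = np.abs(w).sum(axis=1)
    net = w.sum(axis=1)

    # Notional traded at each rebalance (buys plus sells).
    traded = np.abs(np.diff(w, axis=0)).sum(axis=1)

    # Concentration: effective number of names by absolute weight.
    shares = np.abs(w) / gross[:, None]
    effective_names = 1.0 / (shares ** 2).sum(axis=1)

    # Drawdown of cumulative PnL from its running peak (peak starts at zero).
    cum_pnl = np.cumsum(pnl)
    peak = np.maximum.accumulate(np.concatenate([[0.0], cum_pnl]))[1:]
    drawdown = cum_pnl - peak

    return {
        "gross_exposure": gross,
        "net_exposure": net,
        "traded_notional": traded,
        "effective_names": effective_names,
        "max_drawdown": drawdown.min(),
    }

# Tiny two-asset, three-period example.
weights = np.array([[0.5, -0.5], [0.5, -0.5], [-0.5, 0.5]])
returns = np.array([[0.02, 0.0], [-0.04, 0.0], [0.0, 0.0]])
report = portfolio_diagnostics(weights, returns)
print(report)
```

Factor exposures would slot in the same way, as the weights panel multiplied by whatever characteristic or risk-model loadings the committee already reads.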

The replication experiment for a desk that wants its own number is mercifully well-defined. Reconstruct the managed-portfolio universe on your own data vendor, run the AIPT recipe at three or four complexity levels to verify the virtue-of-complexity slope survives your data hygiene, then add the linear-attention layer and check the increment against the paper’s 0.3. Costs enter as a final overlay: net the monthly turnover at your execution schedule across capacity tiers; the surviving Sharpe at each tier is your answer to the only question that matters here, how much of the frictionless 4.57 is yours. Teams that ran this protocol on the earlier complexity papers report the slope survives and the level halves or worse, which would still leave the most attractive systematic result in years.
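The cost overlay in that last step is simple arithmetic. A minimal sketch, assuming a flat per-unit-turnover cost at each tier: the function names, the tier values in basis points, and the synthetic return and turnover series are all hypothetical, and a real desk would substitute its own execution model and let cost rise with trade size.

```python
import numpy as np

def annualized_sharpe(monthly):
    m = np.asarray(monthly, dtype=float)
    return np.sqrt(12.0) * m.mean() / m.std(ddof=1)

def net_sharpe_by_tier(gross_monthly, turnover_monthly, cost_bps_tiers):
    """Charge turnover * cost each month, then recompute the Sharpe ratio.

    gross_monthly:    frictionless strategy returns.
    turnover_monthly: notional traded each month as a multiple of capital.
    cost_bps_tiers:   assumed all-in cost per unit traded, in basis points.
    """
    g = np.asarray(gross_monthly, dtype=float)
    to = np.asarray(turnover_monthly, dtype=float)
    return {bps: annualized_sharpe(g - to * bps / 1e4) for bps in cost_bps_tiers}

# Synthetic inputs, purely illustrative.
rng = np.random.default_rng(0)
T = 240
gross = 0.02 + 0.02 * rng.standard_normal(T)   # high frictionless Sharpe
turnover = rng.uniform(2.0, 4.0, T)            # 200% to 400% traded per month

tiers = [0, 10, 30, 50]
net = net_sharpe_by_tier(gross, turnover, tiers)
for bps in tiers:
    print(f"{bps:>3} bps: net Sharpe {net[bps]:.2f}")
```

The table this produces, one net Sharpe per cost tier, is the desk-specific answer to how much of the frictionless number survives.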

The bottom line

A research program now spans theory and instantiation: AIPT establishes that many weak factors beat few strong ones when shrinkage replaces selection; the transformer paper shows the gain compounds when stocks are allowed to price each other through learned attention. The numbers are gross, the universe is frictionless, the haircut is real and unpriced. What survives all discounting is the directional claim, made twice now with sixty years of data: parsimony in asset pricing was a computational constraint masquerading as a principle, with the constraint now gone. The desks that treat that as a research agenda rather than a leaderboard will spend the next several years finding out how much of the 4.57 was ever theirs to keep.

The historical rhyme is exact enough to navigate by: when Markowitz mean-variance arrived, the estimation technology of the day could not support it; the profession then spent decades building shrinkage, factor structure, and robust optimization to make the theory usable. Estimation caught up to theory once before, slowly; the institutions that funded that work patiently owned the result, which is the precedent worth budgeting against rather than the leaderboard. AIPT hands this generation the same homework in reverse, a working empirical machine waiting for its implementation theory, with costs, capacity, and turnover discipline as the chapters still to be written.

A transformer in the SDF posts 4.57 out of sample against 1.77 for the best classical factor model, while the theory behind it says why: many weak factors plus shrinkage beat few strong ones, because parsimony was never a virtue, only a constraint.
