Time-series foundation models in finance: what transfers and what does not

December 15, 2025 | 10 min read

Tags: time-series, foundation-models, forecasting, TSFM

For two years the time-series foundation model pitch has hovered over every quant research meeting: pretrain once on a vast corpus of generic series, forecast anything zero-shot, retire your feature pipeline. This paper finally runs the test at the scale the question deserves, over 18 million daily excess returns. The answer comes in two halves. Off-the-shelf foundation models fail on returns outright, posting negative R-squared and trailing a plain gradient-boosted tree, while the same architectures pretrained from scratch on financial data recover most of the gap. Rent does not work. Build might.

The honest framing matters because the negative half kills a procurement fantasy and the positive half opens a research program. Both halves carry numbers, which is what separates this from the vibes that have surrounded TimesFM, Google’s decoder-only foundation model whose 2023 zero-shot results on generic benchmarks started the hope, and its Amazon counterpart Chronos. Those models earned their reputations honestly on electricity load, traffic, and retail demand. Daily equity returns are a different animal, and now there is a measurement.

What did they actually test?

The evaluation is the most complete of its kind. Two model families anchor it, Chronos in five sizes from roughly 8M to 710M parameters and TimesFM in four sizes from 8M to 500M, with ten further foundation models relegated to an appendix. The task is next-day excess return forecasting, univariate, from return history alone. The data spans roughly 10,000 US securities with training data from 1990 and out-of-sample evaluation from 2001 through 2023, extended to 94 countries across seven major international markets. Lookback windows run from 5 to 512 trading days. The benchmark to beat is deliberately unexotic: CatBoost, a tuned gradient-boosted tree.

[Figure: The evaluation frame. Out of sample 2001-2023; the benchmark is a tuned CatBoost, picked to be hard to beat.]
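The univariate setup described above, next-day return from return history alone, can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the function name `make_windows` and the toy return series are assumptions introduced here.

```python
from typing import List, Tuple


def make_windows(returns: List[float], lookback: int) -> List[Tuple[List[float], float]]:
    """Pair each `lookback`-day history with the return of the following day."""
    if lookback < 1:
        raise ValueError("lookback must be positive")
    samples = []
    # t indexes the target day; the context is the `lookback` days before it,
    # so the target never leaks into its own context.
    for t in range(lookback, len(returns)):
        context = returns[t - lookback:t]
        target = returns[t]
        samples.append((context, target))
    return samples


# Toy daily excess returns (hypothetical values, for illustration only).
daily_excess = [0.010, -0.020, 0.005, 0.000, 0.012, -0.007, 0.003]
samples = make_windows(daily_excess, lookback=5)
```

With seven observations and a 5-day lookback only two samples exist, a reminder of how quickly long windows consume a short history.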

That design answers the right question. Most foundation-model demos cherry-pick series with visible seasonality, exactly what pretraining corpora are full of. Returns are the adversarial case: signal-to-noise near zero, no stable periodicity, distribution shift as a way of life. A model class that claims generality has to survive here to deserve the word.

How badly does zero-shot fail?

Badly, and with a pattern. The largest Chronos at the longest window manages an R-squared of -1.37%. The 500M TimesFM posts -2.80% with directional accuracy below the coin flip. CatBoost, the reference tree, sits at -0.10% averaged across windows, which on next-day returns is the difference between unusable and merely humble. The economics follow the fit: zero-shot TimesFM returns an annualized -1.47% in the paper’s portfolio frame, while CatBoost’s signal supports 46.50% annualized with a 6.79 Sharpe at the 252-day window.

[Figure: Next-day return forecasting, out of sample; every model negative, the tree least so.]
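For readers who want to see how an R-squared can go negative, here is a minimal sketch of the two fit metrics quoted above. It assumes one common convention in return prediction, where the denominator is the sum of squared realized returns, so a forecast of zero scores exactly 0. The paper's exact definition is not stated in this article, and the numbers below are hypothetical.

```python
from typing import Sequence


def r2_oos(realized: Sequence[float], forecast: Sequence[float]) -> float:
    """Out-of-sample R-squared against a zero-forecast benchmark (assumed convention)."""
    sse = sum((r - f) ** 2 for r, f in zip(realized, forecast))
    sst = sum(r ** 2 for r in realized)
    return 1.0 - sse / sst


def directional_accuracy(realized: Sequence[float], forecast: Sequence[float]) -> float:
    """Share of days where forecast and realized return agree on 'up' versus 'not up'."""
    hits = sum(1 for r, f in zip(realized, forecast) if (r > 0) == (f > 0))
    return hits / len(realized)


# Hypothetical values: a confident forecaster that is wrong on three of four days.
realized = [0.01, -0.02, 0.005, -0.01]
overconfident = [0.03, 0.02, -0.02, 0.01]
zero = [0.0] * len(realized)

r2_bad = r2_oos(realized, overconfident)
r2_zero = r2_oos(realized, zero)
hit_rate = directional_accuracy(realized, overconfident)
```

Under this convention, any forecast with larger squared error than "predict nothing" lands below zero, which is why a confident but wrong prior is penalized so heavily.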

Pause on those portfolio numbers before quoting them anywhere. A 6.79 Sharpe is a paper-frame artifact, gross of costs on a daily-rebalanced long-short over thousands of names, the kind of construction that exists to compare signals rather than to be traded. The right reading is relative: the tree’s signal is dramatically more useful than the foundation models’, and none of these numbers survive contact with implementation arithmetic. The relative ordering is the result.
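The "implementation arithmetic" can be made concrete with a rough sketch. Everything numeric here is an assumption chosen for illustration: the return series, the 100% daily turnover, and the 10 basis point cost per unit of turnover are not from the paper.

```python
import math
import statistics
from typing import List, Sequence

TRADING_DAYS = 252  # assumed annualization factor


def annualized_sharpe(daily_returns: Sequence[float]) -> float:
    """Mean over sample standard deviation of daily returns, scaled by sqrt(252)."""
    mean = statistics.fmean(daily_returns)
    vol = statistics.stdev(daily_returns)
    return mean / vol * math.sqrt(TRADING_DAYS)


def net_of_costs(daily_gross: Sequence[float],
                 daily_turnover: Sequence[float],
                 cost_per_unit: float) -> List[float]:
    """Subtract a linear trading cost proportional to turnover from each day's return."""
    return [g - t * cost_per_unit for g, t in zip(daily_gross, daily_turnover)]


# Hypothetical gross long-short returns: 20 bps per day on average.
gross = [0.003, 0.001] * 126
turnover = [1.0] * len(gross)  # assumed full daily rebalance
net = net_of_costs(gross, turnover, cost_per_unit=0.001)  # assumed 10 bps

gross_sharpe = annualized_sharpe(gross)
net_sharpe = annualized_sharpe(net)
```

In this toy case a cost equal to half the daily gross return halves the Sharpe, since a constant cost lowers the mean and leaves volatility unchanged. The point is the direction and size of the drag, not the specific levels.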

Why zero-shot fails is more instructive than that it fails: pretraining corpora teach shapes, and returns have no shapes to teach.

Electricity load has daily cycles a model can memorize and transfer. Equity returns offer near-white noise with faint, unstable conditional structure, which a generic prior actively obscures. The model arrives confident about patterns that do not exist here, the transfer-learning equivalent of bringing seasonal intuitions to a martingale.

The lookback window turns out to be a result of its own. At the 5-day window, the small zero-shot Chronos posts an R-squared of -77.07%, a number so far below zero it reads like a typo and is not: with five observations of near-noise, the generic prior dominates completely and forecasts shapes that are not there. The same model pretrained on financial data at the same window sits at -3.18%, still poor, no longer absurd. Stretch to 512 days and both versions become merely bad, -1.27% zero-shot against -0.59% pretrained on financial data. The gradient across windows says transformers need history to wash out their priors, while the tree-based benchmark holds its modest fit across every window tested. For a desk this is a deployment constraint in disguise: any TSFM-style model in a short-lookback seat, intraday or post-event, is operating exactly where the architecture is weakest.

Does fine-tuning rescue it?

Mostly no, which is the paper’s quietly devastating middle finding. Fine-tuning the pretrained checkpoints on financial data yields limited improvements at best and makes most models worse. Where it does help fit, as with the largest Chronos, the gain does not translate into economic terms. The generic prior is not a head start to adapt; it is a bias to unlearn, and unlearning a trillion-point prior with a comparatively small financial dataset turns out to be the hard direction. Anyone who has tried to fine-tune a sentiment model trained on product reviews into a filings model has met the same wall.

What does pretraining from scratch buy?

This is the half that opens the research program. Take the same Chronos architecture, throw away the generic weights, pretrain on financial data alone: the small variant’s R-squared at the 5-day window improves from -77.07% to -3.18%, and at 512 days from -1.27% to -0.59%, now supporting 36.84% annualized at a 5.42 Sharpe in the paper’s frame. A from-scratch 20M TimesFM reaches 30.36% at 3.66. Scaling the pretraining data globally and augmenting with the JKP factor library pushes the small Chronos to 51.74% directional accuracy against CatBoost’s 51.16%, with portfolio economics of 41.89% at 6.78 Sharpe against the tree’s 47.25% at 6.46.

[Figure: The evidence ladder, same architectures throughout. Each rung uses the same model class; only the training data and recipe change.]

Read the ladder carefully, because the top rung is genuinely surprising while remaining honest. With finance-native pretraining plus data scaling, the foundation-model recipe pulls roughly even with the tuned tree on directional accuracy and paper Sharpe. The authors note that with hyperparameter care the architectures can outperform even without the data scaling. Yet the caveat the paper states plainly survives every rung: even pretrained from scratch, the foundation models remain less effective on goodness-of-fit than the benchmark, meaning the tree still extracts more signal per unit of data than the transformer does.

[Figure: Paper-frame Sharpe, window 512 (gross, comparative only). Relative ordering is the result; none of these levels survive implementation costs.]

One scaling detail merits its own sentence: expanding pretraining from US to global data lifted the linear models’ R-squared by 0.43 to 0.60 points, turning negative fits positive, while ensemble baselines marginally deteriorated. More data helps the data-hungry architectures most, which is the scaling-law story in miniature and the strongest argument that the from-scratch program has room left to run.

What should a desk do with this?

The build-vs-borrow decision this paper reframes has sat unresolved in my platform conversations all year. The resolution is unusually clean. Borrowing weights is dead on arrival for returns: every pretrained checkpoint tested arrives with a prior that hurts, and adaptation cannot remove it. Borrowing architectures is alive and measured: the recipe, a tokenize-and-predict design of the kind Kronos applied to candlesticks, works on returns when the pretraining corpus is finance from the first gradient step. The asset a desk would actually build is a pretraining corpus and pipeline, cross-sectional, global, survivorship-clean, point-in-time, which is the same data discipline that decides whether a learned covariance model is deployable. The moat is the corpus rather than the checkpoint.

What building the corpus actually involves needs a concrete paragraph, because it is the line item proposals underestimate. The paper’s augmented setup leans on two ingredients beyond raw returns: global breadth, the 94-country panel that turned negative linear fits positive, plus the JKP factor library, the standardized cross-sectional characteristics that gave the models conditioning information beyond price history. Reproducing that in-house means survivorship-clean global price histories, point-in-time characteristics, and corporate-action hygiene across markets with different conventions, the unglamorous data engineering that every learned model in this archive ultimately stands on. None of it is research. All of it is the actual cost.
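"Point-in-time" is the part of that list most often gotten wrong, so a small sketch may help. The idea is that a characteristic becomes usable only from the date it was actually available, never from the period it describes. The helper `as_of`, the record layout, and the values below are hypothetical, not taken from the paper or the JKP library.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple


def as_of(records: List[Tuple[date, float]], when: date) -> Optional[float]:
    """Return the latest value whose availability date is on or before `when`.

    `records` must be sorted by availability date. Returns None when nothing
    had been published yet, instead of peeking at a later value.
    """
    dates = [d for d, _ in records]
    i = bisect_right(dates, when)
    return records[i - 1][1] if i > 0 else None


# Hypothetical characteristic, keyed by the date it became public.
book_to_market = [
    (date(2020, 3, 15), 0.42),
    (date(2020, 6, 14), 0.47),
]

before_first = as_of(book_to_market, date(2020, 3, 14))
on_release = as_of(book_to_market, date(2020, 3, 15))
between = as_of(book_to_market, date(2020, 6, 13))
latest = as_of(book_to_market, date(2020, 12, 31))
```

Keying on availability date rather than fiscal period end is the design choice that matters; a training row built the day before a release must see the older value or none at all.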

CatBoost’s persistence has earned respect rather than embarrassment. Gradient-boosted trees keep winning tabular and short-history problems because their inductive bias matches the data: axis-aligned splits on engineered features, robust to noise, indifferent to sequence length. Returns forecasting at daily horizons is closer to a tabular problem than a language problem, which is the structural reason the transformer needs a finance-native corpus just to pull even. The lesson generalizes past forecasting: when the data does not look like the data the architecture was invented for, the burden of proof sits with the architecture.

Three governance notes before anyone budgets for it. First, the strongest result in the whole comparison still belongs to CatBoost on fit, which makes the null hypothesis for any internal proposal a tuned tree with good features, the benchmark that has quietly embarrassed deep architectures on tabular problems for a decade. Second, every economic number above is gross and frame-bound; replication on your own universe should produce its own cost-adjusted versions before any capability claim enters a committee deck. Third, the from-scratch result is one paper deep. It deserves the same one-paper skepticism this archive applies everywhere else. The right response is a contained replication rather than a platform commitment.

The deployment posture that falls out is the familiar one: rent nothing, replicate cheaply, and let your own out-of-sample numbers decide whether the foundation-model recipe earns a place beside the tree it has not yet beaten.

A two-person replication on a single market with the paper’s own ladder, zero-shot, fine-tuned, from-scratch, takes a quarter and answers the question for your data. That is a small price for retiring a two-year-old hype cycle, in whichever direction your numbers point.

Where the program goes next is visible from the ladder’s slope. The from-scratch models tested here are small by language-model standards, and the pretraining corpora are a fraction of what a global point-in-time data operation could assemble. The global-data result says the curve was still rising when the paper stopped. Conditioning beyond univariate return history, the cross-sectional information every practical signal already uses, is the obvious next rung. None of that is certain to close the remaining gap to the tree. All of it is testable with the same ladder, which is the paper’s most durable gift: a protocol that turns the next hype cycle into a measurement.

The bottom line

The first serious measurement of time-series foundation models on returns splits the hype cleanly. Generic pretraining does not transfer: negative R-squared, sub-coin-flip direction, economics that embarrass the simplest benchmark. Finance-native pretraining transfers substantially: most of the gap closes, the paper-frame economics pull even, and global data scaling shows the curve still rising. The tuned tree survives as the fit champion, which keeps the burden of proof exactly where it belongs. Foundation models for finance are not a product you can buy this year; they are a capability you would have to build, and for the first time there is evidence saying the build is worth a pilot.

Zero-shot time-series foundation models fail on returns, negative R-squared and all, while the same architectures pretrained from scratch on financial data pull even with a tuned tree: the moat is the corpus, and renting weights was never going to work.
