Tim Frenzel

// Insight

The LLM-trading-agent survey: a skeptic's reading of the backtests

The useful part of this survey is the section nobody quotes. It catalogs LLM trading agents reporting annualized returns of 15% to 30% over the strongest baseline. Then it quietly documents evaluations too weak to trust those numbers. Read it for the backtesting practices, because that is where the field gives itself away.

The map of architectures is worth having. The survey splits the work into two roles. An LLM-as-Trader makes the decision directly, driven by news, by layered memory and reflection, by debate between role-playing agents, or by reinforcement learning. An LLM-as-Alpha-Miner does something more familiar to a quant: it generates alpha factors that feed a downstream trading system, rather than pulling the trigger itself.

The two roles a trading LLM plays
[Figure: prices, filings, and news feed an LLM agent (profile, memory, reflection), which outputs either a direct trade or an alpha factor.]
As a trader the model acts on the decision; as an alpha miner it emits factors for a conventional system to trade. The second role is the one a desk can govern.

Where the evidence falls apart

The survey is candid about the holes. They are the holes a quant looks for first. The median backtest covers just 1.3 years, with start and end dates chosen arbitrarily. A 1.3-year window is not a regime test. It is a single market mood with a result attached. Coverage is confined to US and Chinese equities, with derivatives, bonds, and commodities almost entirely absent. Few studies account for trading costs at all, and costs alone can turn a reported edge into a loss.
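To see how quickly costs eat a headline number, here is a minimal sketch in Python. Every figure in it is an illustrative assumption, not a number from the survey: a 20% gross annual return, an agent that sells and rebuys the whole book each trading day, and 5 basis points per side. The function name `net_annual_return` and the linear cost model are simplifications of my own.

```python
def net_annual_return(gross_annual_return, one_way_turnover, cost_bps_per_side):
    """Subtract a linear trading cost from a gross annual return.

    one_way_turnover: traded notional per year as a multiple of capital,
    counting buys and sells separately.
    cost_bps_per_side: assumed all-in cost (spread, fees, slippage) per trade.
    """
    annual_cost = one_way_turnover * cost_bps_per_side / 10_000
    return gross_annual_return - annual_cost


# Hypothetical inputs: full sell-and-rebuy on each of 252 trading days
# means 2 * 252 = 504 one-way turns of capital per year.
gross = 0.20
turnover = 2 * 252
net = net_annual_return(gross, turnover, cost_bps_per_side=5)

print(f"gross {gross:.1%}, cost {gross - net:.1%}, net {net:.1%}")
```

Under those assumptions the cost line is larger than the gross return. A real cost model would add market impact and borrow, which only makes the gap wider.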

The two omissions that should stop you are the ones the survey flags by their absence: no discussion of survivorship bias, and no discussion of data-snooping. Those are the first two questions on any model-risk checklist. A strategy backtested on the names that still trade today, tuned until the numbers looked good, will report a handsome return that means nothing. The LLM does not change that arithmetic. It just produces the overfit faster.
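Survivorship control is mostly a data-hygiene problem, and a small sketch shows the shape of it. The tickers, dates, `Listing` record, and `universe_as_of` helper below are all hypothetical; the point is only that the tradable universe on a backtest date must be rebuilt from listing history, not read off today's index.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Listing:
    ticker: str
    listed: date
    delisted: Optional[date] = None  # None means still trading


def universe_as_of(listings, as_of):
    """Return the tickers that were actually tradable on `as_of`."""
    return sorted(
        item.ticker
        for item in listings
        if item.listed <= as_of
        and (item.delisted is None or item.delisted > as_of)
    )


# Hypothetical listing history.
listings = [
    Listing("AAA", date(2010, 1, 4)),
    Listing("BBB", date(2012, 6, 1), delisted=date(2021, 3, 15)),
    Listing("CCC", date(2022, 2, 1)),
]

backtest_date = date(2020, 1, 2)
point_in_time = universe_as_of(listings, backtest_date)
todays_names = universe_as_of(listings, date(2023, 1, 3))

print("point-in-time universe:", point_in_time)
print("today's names (biased if used for 2020):", todays_names)
```

Using today's names for a 2020 backtest drops the stock that later delisted and includes one that did not exist yet. Both errors flatter the result.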

What a credible evaluation needs

The constructive read of the survey is a checklist for the evaluation it shows to be missing. A trading-agent result worth a second look would test across at least one full regime change, not a 1.3-year window that catches a single market mood. It would extend beyond US and Chinese equities into the asset classes where the claimed edge has to survive different microstructure. It would price in realistic transaction costs, because an uncosted return is a gross number masquerading as a net one. It would address survivorship and data-snooping head-on, with point-in-time data and a deflated-Sharpe haircut for the number of strategies tried.
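The deflated-Sharpe haircut can be sketched in a few lines. This follows the deflated Sharpe ratio of Bailey and López de Prado as I understand it: estimate the Sharpe you would expect from the best of N skill-less trials, then ask how likely it is that the observed Sharpe beats that bar. The inputs are illustrative assumptions, not figures from the survey, and the Sharpe here is per period (daily), not annualized.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected best Sharpe among n_trials strategies with no true skill."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * STD_NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * STD_NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_variance,
                    skew=0.0, kurtosis=3.0):
    """Probability that the observed Sharpe exceeds the best-of-N null bar.

    sharpe and sharpe_variance are per-period; skew and kurtosis describe
    the strategy's returns (defaults assume normal returns).
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return STD_NORMAL.cdf(z)


# Illustrative assumptions: daily Sharpe of 0.10 over roughly 1.3 years
# (328 trading days), with a 0.0025 variance of Sharpe across trials.
few_trials = deflated_sharpe(0.10, n_obs=328, n_trials=2, sharpe_variance=0.0025)
many_trials = deflated_sharpe(0.10, n_obs=328, n_trials=100, sharpe_variance=0.0025)

print(f"DSR with 2 trials:   {few_trials:.3f}")
print(f"DSR with 100 trials: {many_trials:.3f}")
```

The same backtest looks far less convincing once you admit how many variants were tried. The catch is that N has to be reported honestly, and most papers do not report it at all.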

None of that is exotic. It is the standard a quant applies to any new signal before it sees capital. The survey is useful precisely because it shows how far the LLM-trading literature still sits from that standard. The capability is moving fast. The evaluation discipline has not caught up. Until it does, the returns are claims rather than evidence.

How to read a result like this

Discount the headline and inspect the harness. A 15% to 30% annualized return over 1.3 years, on US and Chinese equities, with no cost model and no survivorship control, is not evidence of alpha. It is evidence that the evaluation was generous. The same return with a decade of out-of-sample data across asset classes, with costs included and a deflated-Sharpe haircut for the strategies tried, would be worth a serious look.

This is why the alpha-miner role is the more interesting one for a desk. A model that emits named factors lets you drop them into a validation harness you already trust, with point-in-time data and honest cost accounting. A model that trades directly hands you a number you have to take on faith. On a multi-strategy book, faith is the one input that never clears risk.

The returns in this literature are loud and the evaluations are quiet. Judge an LLM trading agent by its backtest harness, never by its headline Sharpe.
