// Insight
An LLM alpha factory meets the multiple-testing problem
Formulaic alpha mining has had one production-line upgrade per decade: hand-crafted ratios, then genetic programming, now language models. This framework is a clean instance of the current generation. An LLM receives structured inputs (OHLCV history, technical indicators, and sentiment scores for the target company and its related peers) and writes formulaic alphas: mathematical expressions meant to capture a signal. Those alpha values then feed a stable of predictors (Transformer, LSTM, TCN, SVR, and random forest) that forecast prices. The authors report that the LLM-generated alphas significantly improve predictive accuracy, with the natural-language reasoning behind each formula doubling as documentation.
The architecture choice worth noticing is the separation of duties. The LLM never predicts a price. It writes features, while conventional models do the forecasting, which keeps the unverifiable part of the system (a language model’s judgment) out of the final numerical claim. The reasoning trace attached to each formula is genuinely useful: a factor that arrives with its own stated hypothesis is easier to review, easier to reject, and easier to retire when its premise stops holding. That is an upgrade over genetic-programming factor zoos, whose expressions tend to arrive meaning nothing.
The industry has a public reference point for what this factory’s output competes against. 101 Formulaic Alphas, Kakushadze’s 2016 paper documenting real production alphas, reports average holding periods of 0.6 to 6.4 days and, the number that matters here, an average pairwise correlation of just 15.9% across the set. Production libraries prize decorrelation, because formulaic alphas are run as large ensembles in which each formula is a weak signal and the combiner does the heavy lifting. The marginal value of a new formula is its orthogonality to the existing library rather than its standalone accuracy. An LLM factory tuned to maximize predictive accuracy is optimizing the wrong objective; the prompt should demand candidates that are decorrelated from what the library already holds, with marginal contribution measured after neutralizing against incumbent factors.
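One way to make "marginal contribution after neutralizing against incumbent factors" concrete is to regress the candidate on the existing library and score only the residual. The sketch below is a minimal illustration on synthetic data, not the paper's method: the function names (`neutralize`, `corr`), the two incumbent factors, and the simulated returns are all assumptions, and a real pipeline would do this cross-sectionally per date.

```python
import math
import random

def demean(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    a, b = demean(a), demean(b)
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def neutralize(candidate, incumbents, tol=1e-12):
    """Residual of the candidate after projecting out the incumbents.

    Uses Gram-Schmidt on demeaned factors, which matches the residual of an
    OLS regression (with intercept) of the candidate on the incumbents.
    """
    basis = []
    for factor in incumbents:
        r = demean(factor)
        for b in basis:
            coef = dot(r, b) / dot(b, b)
            r = [ri - coef * bi for ri, bi in zip(r, b)]
        if dot(r, r) > tol:  # skip incumbents that add no new direction
            basis.append(r)
    resid = demean(candidate)
    for b in basis:
        coef = dot(resid, b) / dot(b, b)
        resid = [ri - coef * bi for ri, bi in zip(resid, b)]
    return resid

# Synthetic illustration (assumed data): the candidate is mostly a
# repackaging of incumbent f1, which is the factor that actually predicts.
random.seed(0)
n = 500
f1 = [random.gauss(0, 1) for _ in range(n)]
f2 = [random.gauss(0, 1) for _ in range(n)]
fwd_ret = [0.3 * a + random.gauss(0, 1) for a in f1]
candidate = [0.9 * a + 0.2 * b + 0.3 * random.gauss(0, 1)
             for a, b in zip(f1, f2)]

raw_ic = corr(candidate, fwd_ret)        # standalone information coefficient
resid = neutralize(candidate, [f1, f2])
marginal_ic = corr(resid, fwd_ret)       # what the candidate adds to the library
print(f"raw IC: {raw_ic:.3f}  marginal IC: {marginal_ic:.3f}")
```

A candidate can look strong on its own while contributing little once the library's exposure is removed; the marginal figure is the one an ensemble cares about.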
Crowding is the system-level version of the same worry. Firms drawing formulas from the same base models with similar prompts will converge on similar expressions, the herding dynamic that LLM market simulations flagged from the trading side. A formula your factory wrote in an afternoon was plausibly written by three competitors the same week, with the decay profile that implies.
The improvement worth pricing is interpretability per factor, because the statistical problem underneath has not moved an inch. A factory that can write a thousand plausible formulas is a multiple-testing machine, the same one quant research has been disciplining since long before language models. Test enough expressions against the same history and some will fit it beautifully by chance. The LLM makes candidates cheaper and better-argued. Cheaper candidates mean more tests. More tests mean the bar for believing any single discovery has to rise, mechanically, with the number of formulas tried.
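How fast the bar rises can be sketched with a Bonferroni correction, chosen here only because it is the simplest conservative adjustment; the article does not prescribe a specific method. The conversion from t-statistic to Sharpe ratio assumes independent, identically distributed returns, and the 5% family-wise error rate and 10-year history are illustrative assumptions.

```python
from statistics import NormalDist

def bonferroni_t_threshold(n_tests, alpha=0.05):
    """Two-sided z threshold that keeps the family-wise error rate <= alpha."""
    per_test_alpha = alpha / n_tests
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)

def required_sharpe(n_tests, years, alpha=0.05):
    """Annualized Sharpe needed to clear the threshold.

    Uses the approximation t ~= Sharpe * sqrt(years), which assumes
    i.i.d. returns (an assumption, and an optimistic one).
    """
    return bonferroni_t_threshold(n_tests, alpha) / years ** 0.5

# The bar for any single formula as the factory tries more of them.
for n_tests in (1, 10, 100, 1000):
    t = bonferroni_t_threshold(n_tests)
    sr = required_sharpe(n_tests, years=10)
    print(f"{n_tests:>5} formulas tried: |t| >= {t:.2f}, Sharpe >= {sr:.2f}")
```

Bonferroni treats the tests as independent, which overstates the penalty when candidates are correlated; less conservative procedures exist, but the direction of the adjustment is the same.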
The paper’s own evidence stops short of where a desk would need it. Reported gains are in predictive accuracy against baselines, not in portfolio terms after costs. Accuracy improvements on price prediction routinely evaporate between the forecast and the fill: turnover, costs, capacity, and crowding all live in that gap. A sentiment-derived formula has a particular fragility here, since sentiment data is short-history, vendor-dependent, and regime-sensitive, the precise profile that overfits a backtest window. The earlier LLMFactor work had the same shape: named, human-readable factors from text, with the economic significance left as an exercise.
What would make the factory safe to operate is the harness around it, and none of it is exotic. Pre-register each formula’s hypothesis before testing; the reasoning trace makes this nearly free. Charge every candidate against a multiple-testing budget, deflating the acceptance threshold as the count grows. Hold out time as well as tickers, with an embargo period the factory never sees. Then retire factors on schedule unless they re-qualify out of sample. Treat each formula as a position with a half-life rather than a discovery with tenure, tracked on a dashboard of per-factor decay and crowding. The reasoning trace gives you the retirement memo for free: when the stated hypothesis stops being true, the factor goes. A preference-trained sentiment model taught the same lesson from the other direction this summer: the method can be real while the reported backtest is the least trustworthy artifact in the paper.
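The harness described above can be sketched in a few dozen lines. Everything here is hypothetical scaffolding rather than anything from the paper: the `Ledger` and `Factor` classes, the 90-day review window, the 10-observation embargo, and the Bonferroni-style threshold are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from statistics import NormalDist
from typing import List, Optional

def embargoed_split(n_obs, test_frac=0.2, embargo=10):
    """Chronological train/test split with an untouched gap between them."""
    test_start = int(n_obs * (1 - test_frac))
    train_end = test_start - embargo
    if train_end <= 0:
        raise ValueError("not enough history for this split")
    return range(0, train_end), range(test_start, n_obs)

@dataclass
class Factor:
    name: str
    hypothesis: str                  # written down before any test is run
    registered: date
    expires: Optional[date] = None   # set only if the factor qualifies

@dataclass
class Ledger:
    alpha: float = 0.05
    review_days: int = 90
    tried: List[Factor] = field(default_factory=list)

    def register(self, name, hypothesis, today):
        """Pre-register a candidate; every registration spends budget."""
        factor = Factor(name, hypothesis, today)
        self.tried.append(factor)
        return factor

    def threshold(self):
        """Acceptance bar, deflated by the number of candidates tried."""
        n = max(len(self.tried), 1)
        return NormalDist().inv_cdf(1 - self.alpha / (2 * n))

    def evaluate(self, factor, oos_t_stat, today):
        """Accept (or re-qualify) on out-of-sample evidence only."""
        passed = abs(oos_t_stat) >= self.threshold()
        factor.expires = today + timedelta(days=self.review_days) if passed else None
        return passed

    def is_live(self, factor, today):
        """A factor lapses on schedule unless it has re-qualified."""
        return factor.expires is not None and today < factor.expires

train_idx, test_idx = embargoed_split(n_obs=1000)
ledger = Ledger()
start = date(2025, 1, 2)
candidate = ledger.register("peer_sentiment_gap",
                            "Peer sentiment leads the target's returns", start)
accepted = ledger.evaluate(candidate, oos_t_stat=2.4, today=start)
print(len(train_idx), len(test_idx), accepted, ledger.is_live(candidate, start))
```

Calling `evaluate` again at review time with fresh out-of-sample evidence either extends the expiry or clears it, which is the "position with a half-life" treatment in code form.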
The note-sized verdict: adopt the pattern, meter the output. An LLM that writes documented, hypothesis-bearing factors is a real productivity gain for a research pipeline that already has discipline. For one that does not, it is a faster way to fool yourself, with better prose attached to each mistake.
LLM-written alphas arrive cheaper and better-documented than genetic programming ever managed, while the multiple-testing arithmetic stays exactly as brutal: the factory is only as good as the rejection discipline around it.