LLMFactor: named factors from news, and the backtest that complicates them
The appealing idea in LLMFactor is that it does not stop at a sentiment score. It uses sequential knowledge-guided prompting to pull named, human-readable factors out of financial news, then combines them with price history to predict the next move. The contribution worth keeping is the readable factor, not the accuracy number. The accuracy, once you read it honestly, is a useful lesson in why news-to-alpha is hard.
The method runs in three stages. It matches news to stocks and extracts the relationships between companies, identifies the factors moving a price from the article, then predicts direction using those factors and recent price history. The output is the part a quant likes: a list of named drivers you can read, argue with, and track. Instead of a single sentiment number, you get something closer to a research note that says which catalyst the model thinks is moving the stock.
Read the results honestly
Across four datasets, directional accuracy sits in the high 50s to mid 60s percent. On a binary up-or-down call, where a coin flip scores 50%, that is a modest edge, and it is not uniform across datasets.
LLMFactor beats its best baseline on StockNet by about three points and on the Chinese CMIN-CN set by about four. It loses on CMIN-US by nearly three points and trails badly on EDT, where a specialised model scores 75.7 against LLMFactor’s 59.1. A method that wins on two datasets and loses on two is not a signal you take to a committee. It is a prompt-engineering result that needs the same scrutiny as any other.
The honest framing matters here. A paper that reported only StockNet and CMIN-CN would read as a clear win. The full table reads as a wash on accuracy, with the real contribution sitting elsewhere. That is a more useful result, because it tells you where the value is and is not, rather than selling a single cherry-picked number.
One more caution about the metric itself. Directional accuracy weights a one-cent move the same as a ten-percent move. A model can be right on the small, unimportant days and wrong on the few that matter and still post a respectable number. For a trading signal, the distribution of when you are right matters more than how often. A binary hit rate hides that completely.
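A toy illustration makes the point concrete. The sketch below is not from the paper; the returns and predictions are invented, chosen so the model is right on every small day and wrong on the two large ones.

```python
def hit_rate(predicted, realized):
    """Fraction of days where the predicted sign matches the realized sign."""
    hits = [p * r > 0 for p, r in zip(predicted, realized)]
    return sum(hits) / len(hits)


def signed_pnl(predicted, realized):
    """Sum of returns from holding +1 or -1 in the predicted direction each day."""
    return sum(p * r for p, r in zip(predicted, realized))


# Hypothetical daily returns: eight small days, then two large ones.
realized = [0.001, -0.001, 0.002, 0.001, -0.002, 0.001, 0.001, -0.001, -0.08, 0.10]
# Predictions (+1 up, -1 down): correct on the eight small days, wrong on both large days.
predicted = [1, -1, 1, 1, -1, 1, 1, -1, 1, -1]

print(f"hit rate:   {hit_rate(predicted, realized):.0%}")
print(f"signed pnl: {signed_pnl(predicted, realized):+.3f}")
```

The hit rate here is 80% while the summed return is about -0.17. The same accuracy figure could sit on top of a profitable or a ruinous strategy, which is why the hit rate alone settles nothing.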
Where the quant scrutiny goes
Named-factor extraction is a genuine bridge between NLP and the factor-zoo discipline a desk already lives in. That is exactly why it deserves the discipline’s full suspicion. Three questions decide whether named factors are alpha or artifact.
The first is look-ahead. News-based prediction is wired for leakage, because the article, the label, and the model’s training data can overlap in time in ways that flatter a backtest. A model pretrained on text through a given date, evaluated on events before that date, may simply remember how the story ended. Without strict point-in-time data and a model whose knowledge cutoff sits before the test window, a directional accuracy in the 60s can evaporate. This is the single most common way news-to-alpha results fail to replicate. It is invisible unless you design the test to rule it out.
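One way to design that test is to filter the evaluation set before any prompt runs. The sketch below is a minimal version under stated assumptions: `NewsEvent`, its fields, and `point_in_time_sample` are hypothetical names, and the model's knowledge cutoff is taken as a single known date, which in practice is often fuzzy.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NewsEvent:
    ticker: str
    published_at: datetime  # when the article became available to a reader
    label_start: datetime   # start of the return window being predicted


def point_in_time_sample(events, model_cutoff):
    """Keep events published after the model's cutoff whose label window
    does not start before the article was published."""
    kept = []
    for event in events:
        unseen_by_model = event.published_at > model_cutoff
        label_follows_news = event.label_start >= event.published_at
        if unseen_by_model and label_follows_news:
            kept.append(event)
    return kept


# Hypothetical events and cutoff, for illustration only.
cutoff = datetime(2023, 6, 30)
events = [
    NewsEvent("AAA", datetime(2023, 3, 1, 9), datetime(2023, 3, 1, 16)),   # before cutoff
    NewsEvent("BBB", datetime(2023, 9, 1, 18), datetime(2023, 9, 1, 16)),  # label precedes news
    NewsEvent("CCC", datetime(2023, 9, 4, 8), datetime(2023, 9, 4, 16)),   # clean
]
clean = point_in_time_sample(events, cutoff)
print([e.ticker for e in clean])
```

Only the third event survives. A filter like this shrinks the test set, sometimes sharply, and that shrinkage is itself informative about how much of the original backtest was exposed to leakage.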
The second is multiple testing. A prompt that extracts factors is a factory for candidate signals. A factory that tries enough formulas will find one that fits noise. A prompt you tune until the backtest looks good is exactly that factory wearing different clothes. The deflated-Sharpe instinct applies to prompts as much as to formulas: every variation you tried is a test you have to pay for in your significance threshold. A 62% accuracy that survived fifty prompt iterations is not the same evidence as a 62% from the first prompt you wrote.
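The cost of those iterations can be put in rough numbers. The sketch below is a deliberately crude stand-in for a proper deflated-Sharpe calculation: it asks how often pure coin-flip models reach a given hit count, first for one prompt and then for the best of many. The 100-day sample size is an assumption, and treating prompt variants as independent is a simplification, since variants of one prompt are usually correlated.

```python
from math import comb


def binomial_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 0.5): the chance that a coin-flip
    model scores k or more hits in n calls."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def best_of_trials_p(n, k, trials):
    """Chance that at least one of `trials` independent coin-flip models
    scores k or more hits in n calls."""
    p_single = binomial_tail(n, k)
    return 1 - (1 - p_single) ** trials


# Assumed setup: 62 hits out of 100 test days.
single = best_of_trials_p(100, 62, trials=1)
fifty = best_of_trials_p(100, 62, trials=50)
print(f"one prompt:    {single:.3f}")
print(f"fifty prompts: {fifty:.3f}")
```

Under these assumptions a single coin-flip prompt reaches 62 of 100 roughly 1% of the time, while the best of fifty does so roughly 40% of the time. Correlation between prompt variants lowers the second figure, but the direction of the correction does not change.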
The third is the gap between a directional hit rate and a tradable strategy. A 62% up-or-down call says nothing about magnitude, costs, or capacity. It does not tell you whether the wins are bigger than the losses, whether the edge survives the spread, or whether it exists in names large enough to trade. On a multi-strategy book, those are the terms that decide whether a signal survives contact with execution. A hit rate in the low 60s on liquid names is often a hit rate the costs eat alive.
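The arithmetic behind that last sentence is short enough to write down. This is a sketch with invented inputs, not figures from the paper: average win, average loss, and round-trip cost are all hypothetical and expressed as fractional returns per trade.

```python
def expected_edge_per_trade(hit_rate, avg_win, avg_loss, round_trip_cost):
    """Expected return per trade: wins minus losses minus costs.
    avg_win and avg_loss are positive magnitudes."""
    gross = hit_rate * avg_win - (1 - hit_rate) * avg_loss
    return gross - round_trip_cost


def breakeven_hit_rate(avg_win, avg_loss, round_trip_cost):
    """Hit rate at which the expected edge per trade is exactly zero."""
    return (avg_loss + round_trip_cost) / (avg_win + avg_loss)


# Hypothetical case: losses larger than wins, 5 bps round-trip cost.
edge = expected_edge_per_trade(0.62, avg_win=0.004, avg_loss=0.006, round_trip_cost=0.0005)
needed = breakeven_hit_rate(avg_win=0.004, avg_loss=0.006, round_trip_cost=0.0005)
print(f"edge per trade: {edge:+.4f}")
print(f"breakeven hit rate: {needed:.0%}")
```

With these inputs a 62% hit rate loses about 3 basis points per trade, and breakeven sits at 65%. Change the win and loss sizes and the conclusion flips, which is the point: the hit rate cannot be judged without them.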
A concrete way to use it
Strip out the prediction and keep the extraction. That is when LLMFactor becomes a useful tool. The valuable output is the structured list of named drivers the model pulls from each article: this stock, this catalyst, this relationship to that supplier. Treated as features rather than as a signal, those drivers feed the research process a desk already runs.
Picture the workflow. Each night the model reads the day’s news for your universe and emits, per name, a short list of candidate factors with the sentence that supports each one. An analyst sees a structured digest rather than a sentiment score: which names had a supply-chain factor fire, which had a guidance change, which had a regulatory event. The analyst keeps the source link for every claim. Nothing has to be taken on trust. The factors that recur and survive validation graduate into the factor library. The ones that do not are discarded with a clear record of why.
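The digest described above could be represented with a small record type. The sketch below is one possible shape, not the paper's output format: `CandidateFactor`, its fields, and `build_digest` are hypothetical names, and the sample rows and URLs are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateFactor:
    ticker: str
    factor: str      # short label, e.g. "supply-chain disruption"
    evidence: str    # the sentence from the article that supports the factor
    source_url: str  # link back to the article, kept for the audit trail


def build_digest(factors):
    """Group candidate factors by label so an analyst can see which names
    fired on each one. Claims without evidence or a source are dropped."""
    digest = defaultdict(list)
    for f in factors:
        if not f.evidence or not f.source_url:
            continue  # a claim that cannot be audited does not enter the digest
        digest[f.factor].append(f)
    return dict(digest)


# Placeholder rows for illustration only.
nightly = [
    CandidateFactor("AAA", "supply-chain disruption", "Supplier halted shipments.", "https://example.com/a"),
    CandidateFactor("BBB", "guidance change", "Management cut its full-year outlook.", "https://example.com/b"),
    CandidateFactor("CCC", "supply-chain disruption", "Port closure delayed deliveries.", "https://example.com/c"),
    CandidateFactor("DDD", "regulatory event", "", "https://example.com/d"),  # no evidence sentence
]
digest = build_digest(nightly)
for label, rows in digest.items():
    print(label, [r.ticker for r in rows])
```

Requiring an evidence sentence and a source link at the point of ingestion is the design choice that matters here. It keeps the audit trail a property of the data rather than a habit the analyst has to remember.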
Used this way, the modest accuracy stops mattering, because you are not trading the model’s prediction. You are trading a factor you validated yourself, found faster because the model surfaced and labelled it. That is the honest value of LLMFactor: a feature factory with a built-in audit trail, run under the same point-in-time discipline as any other source of candidate signals.
What is worth keeping
The interpretability, handled with care. A named factor you can read is worth more than an opaque signal with the same hit rate, because you can sanity-check it against economic sense, monitor when it stops working, and explain it when it does. If the model says a stock moved on a supply-chain disruption, you can check whether that is plausible, watch whether the factor keeps firing on similar news, and retire it when the relationship breaks. An opaque classifier with the same accuracy gives you none of that.
That is real value. It is independent of the modest, mixed accuracy. The honest read of LLMFactor is that it makes news-derived signals legible, which is a genuine step, while reminding you that legibility is not the same as edge. A readable wrong signal is still wrong. The legibility helps you find out faster, which is worth a great deal in research even when the signal does not survive.
Treat the factors as hypotheses to validate under a strict point-in-time backtest, deflated for the many prompts you tried, before any of them sizes a position. Used that way, LLMFactor is a feature-generation tool with an unusually good audit trail. Used as a turnkey signal off the reported accuracy, it is a replication failure waiting to happen.
LLMFactor’s worth is making news signals readable: a named factor you can interrogate beats an opaque one. It still has to survive a point-in-time, multiple-testing-aware backtest.