// Insight

From text to alpha: the signal is in what firms stop talking about

October 11, 2025 · 6 min read

NLP-alpha · disclosures · LLM-signals

Most disclosure-text signals still run on counting: how often a firm says “margin,” how positive the adjectives are. This paper measures something closer to what an experienced analyst actually reads. The signal is the change: when a firm quietly stops talking about a metric it used to emphasize, that shift predicts underperformance. The authors call the behavior “moving targets”: management steering attention away from numbers that stopped cooperating.

The method is the upgrade worth studying. From each earnings call transcript, an LLM extracts context-aware spans about specific metrics, keeping the qualifiers that give a number its meaning. Consecutive quarters are then compared by embedding similarity, scoring how far the firm has drifted from what it previously emphasized. The contrast is with a named-entity-recognition baseline, the keyword-counting approach most desks still run, which sees the same transcripts but loses context and picks up non-metric noise.
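The quarter-over-quarter comparison can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the paper's scoring formula: the function names are hypothetical, the three-dimensional vectors stand in for real encoder output, and "one minus the average best-match cosine similarity" is just one plausible way to turn embedding similarity into a drift score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def moving_targets_score(prev_spans, curr_spans):
    """Drift away from last quarter's emphasized metrics.

    For each span emphasized last quarter, find its closest span this
    quarter; the score is one minus the average of those best matches.
    Both edge-case conventions below are assumptions of this sketch.
    """
    if not prev_spans:
        return 0.0  # nothing was emphasized, so nothing to drift from
    if not curr_spans:
        return 1.0  # every previously emphasized metric went silent
    best = [max(cosine(p, c) for c in curr_spans) for p in prev_spans]
    return 1.0 - sum(best) / len(best)

# Toy embeddings: one axis per metric, purely for illustration.
last_quarter = [[1.0, 0.0, 0.0],   # e.g. a span about gross margin
                [0.0, 1.0, 0.0]]   # e.g. a span about subscriber growth
same_story = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
dropped_metric = [[1.0, 0.0, 0.0]]  # the second metric is no longer mentioned

print(moving_targets_score(last_quarter, same_story))      # 0.0
print(moving_targets_score(last_quarter, dropped_metric))  # 0.5
```

The asymmetry is deliberate: the score loops over what was emphasized before and asks whether it is still there, so adding new metrics this quarter does not raise it, while dropping an old one does.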

[Figure: Moving targets: scoring emphasis drift across quarters. The score measures drift from previously emphasized metrics rather than sentiment or counts.]

What the numbers actually say

The evaluation covers S&P 100 firms from January 2010 through December 2024, 5,615 firm-quarter observations across 64 quarters, with equal-weighted quintile portfolios in a calendar-time framework. Firms in the highest moving-targets quintile underperform those in the lowest: the Q5-minus-Q1 spread runs at a monthly three-factor alpha of -0.47% (t = -2.32) and a five-factor alpha of -0.50% (t = -2.43). The same construction on the NER baseline produces a three-factor alpha of -0.18% with a t-statistic of -0.89, statistically nothing. The cross-sectional regressions point the same way with less conviction: the moving-targets coefficient comes in at -0.0465 (t = -1.70, significant only at the 10% level) under LLM extraction against a wrong-signed 0.0107 (t = 1.10) for the keyword version, the same story in a second framework.
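For readers who have not built one, the sort behind a Q5-minus-Q1 number looks roughly like this. It is a sketch of a single formation period only, with made-up tickers, scores, and returns; the function name is hypothetical, and the factor regression that turns a monthly spread series into an alpha is left out.

```python
def quintile_spread(scores, next_returns):
    """Equal-weighted return of the top score quintile minus the bottom.

    scores and next_returns are dicts keyed by ticker. Names are ranked
    by score, the top and bottom fifths are each averaged with equal
    weights, and the difference is returned.
    """
    ranked = sorted(scores, key=scores.get)  # ascending by score
    leg = len(ranked) // 5
    if leg == 0:
        raise ValueError("need at least five names to form quintiles")

    def mean_return(names):
        return sum(next_returns[n] for n in names) / len(names)

    q1, q5 = ranked[:leg], ranked[-leg:]
    return mean_return(q5) - mean_return(q1)

# Ten fictional tickers: higher drift score, lower subsequent return.
scores = {f"T{i}": i / 10 for i in range(10)}
next_returns = {f"T{i}": 0.02 - 0.004 * i for i in range(10)}

print(round(quintile_spread(scores, next_returns), 4))  # -0.032
```

A negative spread here means the high-drift names lagged the low-drift names, which is the sign the paper reports.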

[Figure: Monthly 3-factor alpha magnitude, Q5-Q1 spread (bps). The semantic signal is significant at t = -2.32; the keyword version is noise at -0.89.]

The honest summary is that semantic extraction turned an insignificant signal into a significant one, which is a methods result before it is an alpha result.

The economics deserve sober framing. Roughly fifty basis points a month on a quintile spread of S&P 100 names is real but modest, the t-statistics clear conventional thresholds without ceiling room, and nothing is net of costs. A spread portfolio short the shiftiest mega-caps is cheap to trade but easy to crowd. What the paper establishes is the direction and the mechanism, with magnitudes a desk should expect to shrink under the multiple-testing discipline any text signal owes before deployment.
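To see how little headroom those t-statistics carry, here is a back-of-envelope check. The assumptions are mine: a normal approximation to the t-distribution, and a Bonferroni correction as one conservative example of multiple-testing discipline. Neither comes from the paper.

```python
from statistics import NormalDist

def two_sided_p(t_stat):
    """Two-sided p-value under a standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

def survives_bonferroni(t_stat, n_tests, alpha=0.05):
    """True if the result stays significant after dividing alpha by n_tests."""
    return two_sided_p(t_stat) < alpha / n_tests

# How many tried variants can each reported t-statistic absorb?
for t_stat in (-2.32, -2.43, -0.89):
    for n_tests in (1, 2, 3, 4):
        print(t_stat, n_tests, survives_bonferroni(t_stat, n_tests))
```

Under these assumptions the three-factor result (t = -2.32) survives two tests and fails at three, and the five-factor result (t = -2.43) fails at four. A desk that tried a handful of extraction prompts or similarity thresholds before settling on one has already spent that budget.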

The leakage audit this signal class owes

Before the mechanism earns trust, the measurement deserves an audit, because semantic signals carry a leakage channel that keyword signals do not. The embedding model itself is trained on text, and unless its training cutoff predates the evaluation window, it may encode knowledge of how disclosure language evolved, which firms prospered, what phrasing preceded trouble. A point-in-time discipline that scrubs fundamentals but embeds 2010 transcripts with a 2024-trained encoder has reintroduced look-ahead through the side door. The clean protocol embeds each quarter with a model whose knowledge ends before it, or at minimum demonstrates the signal survives an encoder swap. The span-extraction design helps here, since the LLM extracts and the embeddings compare, keeping the generative model out of the scoring path. The audit still belongs in any replication. It is the same skepticism toward learned components that the RLVR evaluation literature normalized for reasoning claims.
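The clean protocol is easy to state in code. The registry below is entirely hypothetical: the encoder names and cutoff dates are placeholders, and a real replication would need a documented training cutoff for each model it uses.

```python
from datetime import date

# Hypothetical registry: encoder name -> last date covered by its training data.
ENCODERS = {
    "encoder-2009": date(2009, 12, 31),
    "encoder-2015": date(2015, 6, 30),
    "encoder-2021": date(2021, 9, 30),
}

def point_in_time_encoder(call_date, encoders=ENCODERS):
    """Pick the most recent encoder whose training cutoff precedes the call."""
    eligible = {name: cutoff for name, cutoff in encoders.items()
                if cutoff < call_date}
    if not eligible:
        raise LookupError(f"no encoder with a training cutoff before {call_date}")
    return max(eligible, key=eligible.get)

print(point_in_time_encoder(date(2010, 4, 20)))  # encoder-2009
print(point_in_time_encoder(date(2018, 1, 30)))  # encoder-2015
```

Both quarters in a comparison should be embedded with the same encoder, chosen by the later call's date, so the similarity is computed in a single vector space. The weaker encoder-swap check amounts to rerunning the whole backtest with a different entry from the registry and confirming the sign and rough size of the spread hold.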

One more channel hides in the data itself. Transcript vendors revise: corrected speaker attributions, cleaned transcription errors, occasionally restated passages. A backtest built on today’s vendor archive is reading the corrected record, while live deployment reads the raw same-day version. The gap is small per transcript and systematic in aggregate, which is precisely the profile of biases that survive into production unnoticed. Point-in-time text deserves the same as-reported discipline point-in-time fundamentals get.
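As-reported text is the same lookup that as-reported fundamentals use. The sketch below assumes the archive keeps every revision with a publication timestamp, which many vendor feeds do not; the data layout and the function name are illustrative.

```python
from datetime import datetime

def as_of_version(versions, as_of):
    """Return the transcript text a live system would have seen at as_of.

    versions is a list of (published_at, text) pairs for one call. The
    latest version published at or before as_of wins; None means the
    transcript was not yet available.
    """
    available = [v for v in versions if v[0] <= as_of]
    if not available:
        return None
    return max(available, key=lambda v: v[0])[1]

# One call with a raw same-day transcript and a later vendor correction.
versions = [
    (datetime(2020, 2, 5, 18, 0), "raw same-day transcript"),
    (datetime(2020, 2, 19, 9, 0), "corrected transcript"),
]

print(as_of_version(versions, datetime(2020, 2, 6, 8, 0)))   # raw same-day transcript
print(as_of_version(versions, datetime(2020, 3, 1, 8, 0)))   # corrected transcript
print(as_of_version(versions, datetime(2020, 2, 5, 12, 0)))  # None
```

A backtest that always takes the last element of the list is reading the corrected record; one that calls the lookup with the trade-date timestamp reads what production would have read.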

The construction details also reward attention before anyone trades the spread. Equal-weighted quintiles on roughly a hundred names put about twenty stocks per leg, so single-name idiosyncrasies move the portfolio more than smooth alpha numbers suggest. The calendar-time framework is the right discipline, while the quarterly horizon means each position must survive three months of public availability, an eternity for a transcript-derived score any competitor can recompute the morning after the call.
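The twenty-names-per-leg point can be put in rough numbers. The inputs are assumptions for illustration: a 6% monthly idiosyncratic volatility per stock, and idiosyncratic returns that are independent across names with equal volatility, so an equal-weighted leg diversifies at the square root of its size.

```python
import math

def leg_idio_vol(single_name_vol, n_names):
    """Idiosyncratic volatility of an equal-weighted leg of independent names."""
    return single_name_vol / math.sqrt(n_names)

def spread_idio_vol(single_name_vol, n_per_leg):
    """Idiosyncratic volatility of long-minus-short with two independent legs."""
    return math.sqrt(2.0) * leg_idio_vol(single_name_vol, n_per_leg)

monthly_vol = 0.06  # assumed single-name idiosyncratic vol, not from the paper
print(round(leg_idio_vol(monthly_vol, 20), 4))     # 0.0134
print(round(spread_idio_vol(monthly_vol, 20), 4))  # 0.019
```

Under those assumptions the spread carries close to 2% of monthly idiosyncratic noise against a reported alpha near 0.5%, so a bad month in one or two names can easily swamp the signal in any single period.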

Why the mechanism is the durable part

Management discretion over emphasis is one of the oldest reads in fundamental analysis. Every seasoned analyst tracks which metric vanished from the deck this quarter. What changed is that the read now scales: an LLM preserving contextual qualifiers can do across decades of transcripts what a human does for twenty names. This is the same lesson finance-tuned embeddings taught retrieval: the value is rarely the model, usually the representation. Counting keywords represents a transcript as a bag of tokens. Tracking metric spans across time represents it as a sequence of management choices.

Racing telemetry works on the identical principle. An engineer staring at absolute lap times learns almost nothing; the screen that matters shows deltas, where this lap diverged from the last one, which sector the gap opened in. A driver’s pace is a noisy level. A driver’s change is information. Disclosure text behaves the same way: the level of optimism is noise, while the delta in what management chooses to emphasize is a decision someone made.

The practical checklist before anyone trades this: replicate beyond the S&P 100, where 5,615 firm-quarters of mega-caps leave both capacity and breadth on the table; test the decay horizon, since a quarterly signal from public transcripts invites front-running once known; and watch the regime dependence, because emphasis-shifting may correlate with the earnings-pressure cycle. The formulaic-alpha factory note from last month applies its full force here too: one documented hypothesis, tested once, is exactly what a research pipeline should want, provided it goes into the library’s audit ledger like everything else.

There is a second-order use that may outlast the spread portfolio. A moving-targets score is a screening overlay as much as a signal: a risk team reviewing a watchlist gains a cheap, systematic flag for management evasiveness, the quantitative cousin of the analyst instinct that something in the story changed. Overlay uses are kinder to modest alphas, since a flag needs ordering power rather than tradable magnitude. The score also composes with the other text signals this archive has tracked. Prompt-extracted factors read the level of the narrative while moving targets reads its derivative; a desk running both holds two nearly orthogonal views of the same transcript for the cost of one extraction pipeline.

For my own work on fraud-adjacent text models, the transferable insight is the framing. Evasive drift in corporate communication is measurable, predictive, and cheap to compute once the extraction is right. Whether it prices a spread portfolio or flags a deteriorating counterparty, the construction is the same: extract what was emphasized, score what changed, treat silence about a former favorite metric as data.

The alpha is modest and the mechanism is durable: an LLM that tracks which metrics management quietly abandons turns analyst intuition about evasive disclosure into a scalable, testable signal.
