// Insight
A stress gauge from the cross-section: simple model, fragile-market features
Every risk function runs some version of a stress dashboard, usually realized volatility with extra steps. This paper builds the disciplined version: a Market Stress Probability Index that forecasts the probability of high stress in US equities one month ahead, from eleven interpretable fragility features computed on the monthly cross-section of individual stocks. The model is an L1-regularized logistic regression in an expanding window; the result is a calibrated probability rather than an alarm: AUC of 0.800 out of sample against 0.752 for a benchmark built on lagged returns and realized volatility.
The feature list reads like a market internist’s checklist rather than a feature dump: cross-sectional return dispersion, skewness, kurtosis, mean absolute return, the fractions of stocks with moves beyond plus and minus five percent, stock count, log volume, dollar volume, turnover. Each is a fragility symptom a practitioner can narrate, dispersion widening as correlations break, the negative-tail fraction swelling before the index notices, volume signatures shifting as liquidity thins. The L1 penalty then keeps only the symptoms that earn their coefficient, in a model whose entire state can be printed on one slide. Out of sample from 2005 through 2024, after a 120-month training runway, the index rises ahead of the major episodes, with prominent spikes into 2008-2009 and the 2020 COVID shock.
One feature pairing rewards a second look: the fraction of stocks with moves beyond minus five percent sits beside the fraction beyond plus five. Including the euphoric tail is not decoration. Melt-up breadth is a fragility symptom in its own right, speculative dispersion that precedes air pockets, which single-sided drawdown indicators structurally cannot see. A stress gauge that watches both tails of the cross-section encodes the practitioner knowledge that blow-off tops and washouts are phases of the same instability, the kind of design detail that separates features chosen by someone who has watched markets from features chosen by a grid search.
Why the modest numbers are the credible kind
The improvement over the benchmark is real and undramatic, which is exactly the profile to trust in this literature. Where the gains concentrate tells the operational story: the precision-recall AUC moves from 0.444 to 0.538, a 21 percent relative gain on the metric that matters for rare events, since stress months are the minority class and a gauge is judged by its behavior when it fires. Calibration improves on both scores, Brier from 0.116 to 0.106 and log loss from 0.400 to 0.352, meaning the probabilities mean what they say, the property that separates an instrument from a mood ring.
A vol-targeting book might trim exposure as the gauge crosses thirty percent; a committee might require written justification for gross additions above fifty. The same gauge serves both without re-estimation, which is what makes it infrastructure rather than a strategy.
The simple-model choice deserves its own appreciation, third entry in a pattern this year keeps confirming. CatBoost embarrassed the foundation models on returns; time-weighting matched regime-weighting on responsive VaR bases; here a logistic regression with good features beats the obvious benchmark while staying fully inspectable. The intelligence lives in the feature engineering, the cross-sectional fragility view rather than the index-level volatility view, leaving the model one job: weigh eleven testimonies without inventing a twelfth. For a monitoring tool that must survive committee scrutiny and regulator curiosity, every coefficient being nameable is a feature worth more than another point of AUC.
The edges, stated plainly
Three boundaries belong in any deployment memo. A one-month horizon at monthly cadence makes this a posture gauge, informing hedging budgets and gross limits, useless for intraday de-risking by construction. The 2005-2024 window contains two great stress events; the spike record is two confirmations rather than a statistical law, with the no-precedent caveat applying to whatever the next crisis does differently. And an expanding window never forgets, which stabilizes coefficients while letting ancient regimes vote on current fragility; a desk replication should test a rolling variant against the published expanding one, since the choice encodes a real belief about whether market microstructure from 2005 still describes 2026.
Operationally, the gauge itself is only half the deployment; the threshold policy is the other half, with its own backtest owed. A probability stream becomes action through cutoffs, trim at thirty, escalate at fifty, and each cutoff implies a realized history of interventions whose costs and saves can be simulated against the same 2005-2024 record the gauge was scored on. Two desks consuming the same MSPI with different thresholds are running different risk programs.
Worth stating what the gauge is not, because mislabeling kills good instruments. It is not a timing signal: a 21 percent precision-recall improvement is meaningful for posture and budgeting while remaining far too coarse to trade against directly. The first desk that shorts the market because the dashboard crossed sixty will discover the difference between calibration and clairvoyance. Monitoring instruments earn their keep by changing the conversation in the room, what would have to be true for us to add risk here, rather than by replacing the people having it.
The adoption case belongs in a monitoring stack rather than a trading system. The marginal cost is eleven features off data most desks already license, a regression that fits in seconds, plus a dashboard line whose movements can be explained in committee with sentences rather than saliency maps. Against that cost, a fifth more precision-recall on rare-event warning and better-calibrated probabilities on the events that define careers. Monitoring instruments rarely get papers this clean, because the rewards in the literature flow to alpha claims; the desks that read past that incentive get the better dashboard.
Eleven nameable fragility features and a logistic regression deliver AUC 0.800 and a 21% precision-recall gain over the volatility benchmark: a calibrated stress probability whose every coefficient survives committee questioning is monitoring infrastructure done right.
Working on AI that needs to ship?
I help funds, fintechs, and data teams take AI from prototype to production.