// Insight

FinRL-DeepSeek: an LLM news signal wired into a risk-aware RL agent

March 1, 2025 · 6 min read

reinforcement-learning · LLM-signals · trading

Every desk that mixes discretionary and systematic work has the same idea sketched on a whiteboard: let a language model read the news and turn it into a signal, then let a trained policy trade on it. FinRL-DeepSeek is that sketch, built and open-sourced. It fuses an LLM’s reading of financial news with a risk-aware reinforcement-learning agent, and ships the code to reproduce it. That makes it a template you can run rather than a result you take on faith.

What does it actually do?

The pipeline has two halves. First, an LLM reads financial news and turns it into a structured signal. The paper draws news from the FNSPID dataset and extracts a sentiment, a risk assessment, and trade recommendations for each name. Second, a reinforcement-learning agent takes that signal as an input and decides allocations across the Nasdaq-100. That is the whole idea: a language model supplies the view, and a trained policy turns the view into positions. The paper runs the signal through three different language models (DeepSeek V3, Qwen 2.5, and Llama 3.3), which is itself useful. It tests whether the approach leans on one model or holds across several, a small robustness check most single-model papers skip.
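To make the hand-off between the two halves concrete, here is a minimal sketch of how a per-name signal could be folded into the agent's observation. The `NewsSignal` schema, the 1-to-5 scales, and the neutral default for names with no news are my assumptions for illustration, not the repository's actual data format.

```python
from dataclasses import dataclass

# Hypothetical schema: field names and the 1..5 scales are assumptions
# for illustration, not the paper's actual LLM output format.
@dataclass(frozen=True)
class NewsSignal:
    ticker: str
    sentiment: int  # 1 (very negative) .. 5 (very positive)
    risk: int       # 1 (low risk) .. 5 (high risk)

NEUTRAL = 3  # midpoint of the 1..5 scale


def scale(score: int) -> float:
    """Map a 1..5 score to [-1, 1], with the midpoint mapping to 0."""
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return (score - NEUTRAL) / 2.0


def build_observation(tickers, prices, signals):
    """Per ticker, emit [price, sentiment feature, risk feature].

    Tickers with no news that day get neutral (0.0) features.
    """
    by_ticker = {s.ticker: s for s in signals}
    obs = []
    for t in tickers:
        s = by_ticker.get(t)
        obs.append(prices[t])
        obs.append(scale(s.sentiment) if s else 0.0)
        obs.append(scale(s.risk) if s else 0.0)
    return obs


obs = build_observation(
    ["AAPL", "MSFT"],
    {"AAPL": 190.0, "MSFT": 410.0},
    [NewsSignal("AAPL", sentiment=5, risk=2)],
)
```

The design choice worth noticing is the neutral default: most names have no news on most days, so the policy has to see "no signal" as a distinct, harmless input rather than a missing value.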

Figure: FinRL-DeepSeek, news to position. An LLM turns each news item into a structured signal, then a reinforcement-learning agent allocates on it. The agent's objective penalizes tail losses (CVaR), so risk preference is trained in rather than enforced afterward.

The RL algorithm is the detail worth noticing. It is not vanilla policy optimization. It is CPPO, a variant that optimizes a Conditional Value-at-Risk objective, which means the agent is trained to limit tail losses rather than only to maximize expected return. CVaR is the average loss in the worst slice of outcomes (for example, the worst 5% of days), the left tail a risk manager actually worries about. Putting it directly into the objective is the right instinct.
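For readers who want the definition pinned down, this is a minimal sketch of an empirical CVaR estimate over a sample of returns. It illustrates the quantity only. It is not how CPPO estimates or optimizes it during training, and the 20% tail in the example is chosen so the arithmetic is easy to follow.

```python
import math


def cvar(returns, alpha=0.05):
    """Average loss over the worst `alpha` fraction of outcomes.

    Returned as a positive number when the tail is a loss.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(returns)))  # number of tail outcomes
    tail = sorted(returns)[:k]                   # the k worst returns
    return -sum(tail) / k


daily = [-0.10, -0.04, 0.01, 0.02, 0.03, 0.01, 0.00, 0.02, -0.01, 0.02]
tail_loss = cvar(daily, alpha=0.2)  # mean of the two worst days, sign flipped
```

With ten observations and a 20% tail, the estimate averages the two worst days (-10% and -4%), giving a tail loss of about 7%.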

Why the risk-aware objective matters

This is where the design shows judgment. A naive RL trading agent maximizes expected return, which in backtest rewards exactly the behavior that blows up in practice: concentrated bets that happened to pay off in-sample. Optimizing a CVaR objective instead pushes the agent to care about the worst cases, which is what survives contact with a real drawdown. Building the risk preference into the training objective is sounder than maximizing return and clamping risk afterward, because the agent learns to trade within the risk budget rather than against it. A return-maximizer with risk limits bolted on will spend its time pressing against those limits. A CVaR-trained agent treats the tail as part of the goal.
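A toy comparison shows why the objective changes the ranking. The sketch below scores two made-up return streams, first on mean return alone and then on mean return minus a tail penalty. The penalized score is a simplification I am using for illustration. CPPO handles the CVaR term inside policy optimization rather than as a post-hoc score, and the `penalty` weight and the return numbers are invented.

```python
import math
from statistics import mean


def cvar(returns, alpha):
    """Average loss over the worst `alpha` fraction of outcomes."""
    k = max(1, math.ceil(alpha * len(returns)))
    return -sum(sorted(returns)[:k]) / k


def risk_adjusted_score(returns, alpha=0.2, penalty=0.5):
    """Mean return minus a penalty on tail loss (illustrative, not CPPO)."""
    return mean(returns) - penalty * cvar(returns, alpha)


# Invented per-period returns: a concentrated book with one ugly period,
# and a diversified book with a lower mean and a shallow tail.
concentrated = [0.09, 0.08, 0.07, 0.06, -0.20]
diversified = [0.02, 0.01, 0.02, 0.01, -0.01]

prefers_concentrated_on_mean = mean(concentrated) > mean(diversified)
prefers_diversified_on_score = (
    risk_adjusted_score(diversified) > risk_adjusted_score(concentrated)
)
```

On mean return the concentrated book wins (about 2% against 1% per period). Once the worst period is penalized, the ranking flips, which is the behavior the CVaR objective is meant to train in.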

There is a reason this resonates with how a real desk works, beyond the math. A discretionary trader who reacts to news already manages tail risk by instinct, sizing down when conviction is low and the downside is ugly. A naive return-maximizing agent has no such instinct. Encoding CVaR into the objective is an attempt to give the machine the risk discipline a good trader already carries, which is the right ambition for a system meant to touch real capital. Whether this implementation achieves it is an empirical question. The intent is the part worth borrowing.

The reproducibility is the point

For a practitioner, the open code matters as much as the method. Most trading-with-LLM papers report a backtest you cannot inspect, with a Sharpe ratio you simply have to trust. FinRL-DeepSeek ships its code. The news dataset is public. The asset universe is a standard index. That combination means you can run it, change the LLM, swap the universe, and test each piece against your own assumptions. A template you can take apart is worth far more than a number you cannot verify, which is precisely the standard the trading-agent survey found most of the field failing to meet.

Where to keep your skepticism

The honest cautions are the ones the abstract does not settle. I cannot see the performance numbers in the version available to me. The size of any edge is unverified. A reproducible template is not the same as a profitable one. Three hazards deserve specific attention before anyone gets excited.

The first is the news data. If the news signal is not strictly point-in-time, the backtest leaks future information. An LLM reading news with any lookahead will manufacture an edge that vanishes the moment it trades live, and news-driven backtests are unusually prone to this. The second is transaction costs and turnover, the perennial killers of any signal that rebalances on a news flow. A strategy that reacts to every headline can trade itself broke before the signal pays. The third is the LLM signal itself. As the analyst-forecasting work showed, an LLM’s read can be confident and wrong. The quality of the sentiment-and-risk signal is its own model-risk question, not a given. The whole strategy rests on it.
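The point-in-time hazard is the easiest of the three to guard against in code. As a minimal sketch, the filter below only exposes news published at least some latency before the decision time. The `(published_at, headline)` tuple layout and the five-minute latency are assumptions for illustration, and it presumes every timestamp is a true publication time in one consistent timezone.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def visible_news(news, decision_time, latency=timedelta(minutes=5)):
    """Return items published at least `latency` before `decision_time`.

    `news` is a list of (published_at, headline) tuples sorted by
    published_at. Anything later than the cutoff is treated as unseen.
    """
    cutoff = decision_time - latency
    times = [published_at for published_at, _ in news]
    return news[: bisect_right(times, cutoff)]


news = [
    (datetime(2024, 1, 2, 9, 0), "pre-market guidance update"),
    (datetime(2024, 1, 2, 15, 58), "late-session headline"),
    (datetime(2024, 1, 2, 16, 30), "after-close earnings release"),
]
seen = visible_news(news, decision_time=datetime(2024, 1, 2, 16, 0))
```

For a 16:00 decision, only the 09:00 item is visible. The 15:58 headline is excluded by the latency buffer and the 16:30 release is in the future, which is exactly the item a careless daily join would leak into that day's signal.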

What a real evaluation would require

If you wanted to know whether the LLM signal actually adds value, the test is specific. Run the RL agent with the news signal and without it, on the same universe, the same period, the same costs, and compare. That ablation is the only thing that isolates the contribution of the language model from the RL policy, which would trade with or without it. Then layer in the realism: strictly point-in-time news the model could not have seen before the market did; a transaction-cost model calibrated to the turnover the agent actually generates; a walk-forward design across at least one regime change, because a news-reactive strategy that only saw a calm bull market has proven nothing. Report a deflated performance figure that accounts for the many configurations tried. None of this is exotic. It is the standard a desk applies to any new signal. It is exactly what separates a template from a strategy.
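The ablation and the cost adjustment fit in a few lines. The sketch below charges a flat proportional cost on turnover and compares the signal run against the no-signal baseline net of costs. The flat 20 bps rate, the all-cash starting book, and the return and weight numbers are invented for illustration. A calibrated model would also account for spread and market impact.

```python
def turnover(prev_w, new_w):
    """Sum of absolute weight changes across assets."""
    return sum(abs(n - p) for p, n in zip(prev_w, new_w))


def net_returns(gross, weights, cost_bps=20.0):
    """Subtract proportional trading costs from per-period gross returns.

    weights[t] is the portfolio held during period t. The book starts
    in cash, so the first period pays for building the position.
    """
    cost_rate = cost_bps / 10_000
    prev = [0.0] * len(weights[0])
    out = []
    for r, w in zip(gross, weights):
        out.append(r - cost_rate * turnover(prev, w))
        prev = w
    return out


def mean_delta(with_signal, without_signal):
    """Mean return of the signal run minus the no-signal baseline."""
    return (sum(with_signal) / len(with_signal)
            - sum(without_signal) / len(without_signal))


# Invented two-asset, two-period example: the signal run earns more gross
# but flips its whole book, while the baseline holds steady.
gross_signal, w_signal = [0.004, 0.004], [[1.0, 0.0], [0.0, 1.0]]
gross_base, w_base = [0.003, 0.003], [[0.5, 0.5], [0.5, 0.5]]

gross_edge = mean_delta(gross_signal, gross_base)
net_edge = mean_delta(
    net_returns(gross_signal, w_signal),
    net_returns(gross_base, w_base),
)
```

In this made-up case the signal run leads by 10 bps per period gross and trails by 10 bps net, because flipping the book costs more than the signal earns. The same comparison, run per walk-forward window, is the number worth reporting.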

How I would use it

As a starting harness rather than a finished strategy. The value is the scaffolding: a clean way to plug an LLM news signal into a risk-aware RL policy, with the wiring already done. Take it and replace the convenient pieces with honest ones (point-in-time news, a realistic cost model, your own universe, a walk-forward test through a regime change), then see whether the LLM signal adds anything once the optimism is stripped out. If it does, you have a genuinely new source of signal worth pursuing. If it does not, you have learned that cheaply, on someone else’s code rather than a quarter of your own team’s work. Either way, the contribution is the template. The discipline you bring to testing it is what decides whether the idea is real. The code gives you the harness for free. The judgment is the part you still have to supply, exactly as it always was.

FinRL-DeepSeek is the LLM-news-into-RL-policy hybrid, built, risk-aware, and open. Treat it as a harness to test on point-in-time data with real costs rather than a strategy to deploy. The open code is what makes that test cheap.
