TiMi: the LLM writes the bot, the bot trades alone

November 1, 2025 · 6 min read

trading-agents, LLM-strategy, execution

The recurring fantasy in LLM-trading papers is a model pondering each tick. TiMi is built on the opposite premise, the architecturally honest one: language models design, code, and tune the strategy offline, while a deterministic CPU-only bot does the actual trading with no model anywhere in the hot path. The system’s end-to-end action latency is 137 milliseconds, of which internal decision logic takes 5. The other 132 are market data retrieval and order transmission, which is what an execution profile is supposed to look like.

Four agents share the offline work. A macro-analysis agent mines market-wide patterns from technical indicators into a general strategy set. A strategy-adaptation agent customizes those strategies per trading pair, with volatility-adaptive parameters. A bot-evolution agent compiles the result into executable Python through a layered design that keeps decision logic, reusable functions, and tunable parameters separate. A feedback-reflection agent closes the loop with the system’s most distinctive move: it converts observed failures into linear-programming constraints and re-solves for parameters, which is mathematical optimization rather than vibes-based prompt revision.

Two loops, two clock speeds

Figure: The language model never touches the hot path; 137ms end-to-end, 5ms of internal logic.

This is the division of labor this archive keeps arriving at from different directions. The agentic-enterprise principle that anything reproducible belongs in code, the formulaic-alpha factory where the LLM writes features rather than forecasts, and now the trading version: the model’s judgment is a compile-time input, while runtime behavior stays deterministic, auditable, and fast. A bot whose decision logic is frozen code can be reviewed before deployment and replayed after an incident. A model deciding live can be neither.

The layered bot design carries a governance dividend the paper undersells. Decision logic, reusable functions, and tunable parameters live in separate layers, which maps cleanly onto how a desk already governs change: a parameter retune flows through the fast lane, a new function gets review, a strategy-layer change triggers the full sign-off. When the reflection agent re-solves its linear program, it touches the parameter layer alone, leaving the reviewed logic untouched. Optimization stays inside a boundary compliance can describe in one sentence. Contrast that with re-prompting a live agent, where every adjustment is potentially a strategy change and nobody can say which.
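The three-layer separation can be sketched in a few lines. This is a minimal illustration, not TiMi's generated code: the moving-average crossover, the `Params` fields, and the function names are all assumptions chosen to show that a retune swaps the parameter object while the reviewed logic stays byte-for-byte the same.

```python
from dataclasses import dataclass, replace

# Parameter layer: plain immutable data, the only thing a retune may change.
# Field names and defaults are hypothetical.
@dataclass(frozen=True)
class Params:
    fast_window: int = 3
    slow_window: int = 5
    entry_threshold: float = 0.0

# Function layer: reusable, side-effect-free helpers.
def sma(prices, window):
    """Simple moving average of the last `window` prices."""
    if len(prices) < window:
        raise ValueError("not enough prices for the requested window")
    return sum(prices[-window:]) / window

# Decision layer: frozen logic that reads parameters but never edits them.
def decide(prices, params):
    """Return 'buy', 'sell', or 'hold' from a moving-average spread."""
    spread = sma(prices, params.fast_window) - sma(prices, params.slow_window)
    if spread > params.entry_threshold:
        return "buy"
    if spread < -params.entry_threshold:
        return "sell"
    return "hold"

prices = [100.0, 101.0, 102.0, 103.0, 104.0]
base = Params()
retuned = replace(base, entry_threshold=2.0)  # parameter-layer change only

print(decide(prices, base), decide(prices, retuned))
```

The same `decide` function produces different actions under the two parameter sets, which is the property the fast lane relies on: the diff under review is a data change, not a logic change.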

The pipeline’s middle stage earns a note too. Strategies pass through simulation-based refinement before deployment, where the mathematical reflection loop does its work on simulated failures first. The published evidence for that loop is concrete: prototype bots returning around 2% in simulation grew past 20% cumulative after hierarchical optimization. That gap, between a strategy idea and its tuned implementation, is where systematic edges are actually won or lost, as anyone who has watched a good signal die of bad parameters will confirm.

The numbers, with the right denominator

Evaluation runs on 213 trading pairs each across US stock index futures, mainstream cryptocurrencies, and altcoin futures, with live trading from January to April 2025. TiMi posts annualized returns of 6.4%, 8.0%, and 13.7% with Sharpe ratios of 0.74, 0.79, and 0.86 respectively. The altcoin segment, where inefficiencies presumably live, is the interesting comparison set: every baseline class loses to TiMi there.

Figure: Altcoin futures, live Jan-Apr 2025: Sharpe ratio. MACD landed at -0.85 on the same segment; four months of live data, annualized.

Hold the enthusiasm to calibrated levels. Four months of live trading is a quarter of a market regime. The drawdowns are heavy relative to the returns: 20.3% to 32.8% maximum drawdown against single-digit-to-low-teens annualized gains. And annualizing a January-to-April window flatters whatever those months favored. What four months can support is a relative claim: against six baselines spanning classical, learned, and LLM-agent methods on identical pairs, the decoupled architecture won everywhere it was tested. Two distribution-level details strengthen that claim more than the headline returns. Cross-pair return variance runs 11.03% against 29.64% for DDPG, with under 2% of pairs hitting catastrophic tail events, which is the uniformity a multi-pair deployment actually needs. A strategy that averages well by winning enormously on a few pairs and dying on others is undeployable however good its mean.
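To see what annualizing does to a short window, it helps to run the figures backwards. The sketch below assumes a four-month window (a third of a year) and shows both geometric and simple scaling, because the annualization convention behind the reported numbers is not stated here; `deannualize` is a hypothetical helper, and the outputs are rough implied window returns, not published results.

```python
def deannualize(annual_return, months, compounding=True):
    """Back out the return over `months` implied by an annualized figure."""
    fraction = months / 12
    if compounding:
        # Geometric convention: (1 + r_window) ** (12 / months) = 1 + r_annual
        return (1 + annual_return) ** fraction - 1
    # Simple convention: r_annual = r_window * (12 / months)
    return annual_return * fraction

# Annualized returns quoted above, in the order the segments are listed.
reported = {
    "US stock index futures": 0.064,
    "mainstream cryptocurrencies": 0.080,
    "altcoin futures": 0.137,
}

for segment, annual in reported.items():
    geometric = deannualize(annual, months=4)
    simple = deannualize(annual, months=4, compounding=False)
    print(f"{segment:<28} geometric {geometric:.2%}   simple {simple:.2%}")
```

Under either convention the return actually earned in the window is roughly a third of the headline figure, and that smaller number is the one to hold against the maximum drawdowns.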

Action efficiency is its own result class, with the cleanest numbers in the paper. The 137-millisecond end-to-end breakdown allocates 85ms to market data retrieval, 5ms to internal decision logic, and 47ms to order transmission, meaning the system spends about 4% of its latency budget deciding and about 96% on I/O.
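The arithmetic is easy to check. The stage timings below are the ones quoted above; the dictionary keys are just labels chosen for this sketch.

```python
# Stage latencies in milliseconds, as quoted in the breakdown above.
stages_ms = {
    "market_data_retrieval": 85,
    "decision_logic": 5,
    "order_transmission": 47,
}

total_ms = sum(stages_ms.values())
shares = {name: ms / total_ms for name, ms in stages_ms.items()}
io_share = shares["market_data_retrieval"] + shares["order_transmission"]

for name, ms in stages_ms.items():
    print(f"{name:<22} {ms:>3} ms  {shares[name]:6.1%}")
print(f"{'total':<22} {total_ms:>3} ms")
print(f"I/O share: {io_share:.1%}")
```

The decision share comes out a little under 4% before rounding, so the headline split is, if anything, slightly generous to the decision logic.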

Figure: Where the 137ms goes (ms). The decision is 4% of the budget; I/O is the rest, as an execution profile should look.

A capital-utilization read backs it: a 1.53 profit-to-loss ratio per unit invested, with higher deployment rates than the learning-based baselines that hold capital idle while their networks deliberate. Speed here is not about racing anyone to the queue. It is about the bot never being the bottleneck in its own loop.

The baseline that matters most is TradingAgents at 0.57 Sharpe on altcoins, because it represents the rival philosophy: LLM agents debating in the decision loop itself. TiMi beating it while running deterministic execution is one data point for a position I have held since these systems started appearing on the desk: conversation is a research-time activity, and anything still talking at execution time is in the wrong place. The AlphaAgents committee earns its tokens during analysis, where minutes are cheap. TiMi extends the same logic one step further down the stack, all the way to the order.

What a desk should actually take

The reflection loop is the component worth stealing. Converting trading failures into LP constraints, then re-solving the parameter set subject to them, gives the system a tuning mechanism with checkable properties instead of another round of prompt adjustment: a solution either satisfies every encoded constraint or the program is reported infeasible. It is the same instinct that made GEPA’s reflective evolution compelling at the prompt layer, applied here to numeric parameters where the math is native. The bot-evolution validation noted earlier backs it concretely: the move from around 2% to past 20% cumulative in simulation came after hierarchical optimization of existing prototypes, which is the gap between a strategy idea and a tuned implementation.
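A toy version of the failures-to-constraints loop looks like this. Everything specific in it is an assumption for illustration: the two parameters, the linear objective, and the coefficients of each "failure" constraint are made up, and the solver is brute-force vertex enumeration for two variables rather than whatever LP machinery TiMi actually uses.

```python
from itertools import combinations

def solve_lp_2d(objective, constraints, tol=1e-9):
    """Maximize objective[0]*x + objective[1]*y subject to a*x + b*y <= rhs.

    Each constraint is a tuple (a, b, rhs). Uses vertex enumeration, which
    is fine for a two-parameter sketch with a bounded feasible region and
    is not a substitute for a real LP solver.
    """
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel constraint lines do not meet in a vertex
        x = (r1 * b2 - r2 * b1) / det  # Cramer's rule
        y = (a1 * r2 - a2 * r1) / det
        if all(a * x + b * y <= rhs + tol for a, b, rhs in constraints):
            value = objective[0] * x + objective[1] * y
            if best is None or value > best[0]:
                best = (value, x, y)
    if best is None:
        raise ValueError("no feasible vertex: constraints are inconsistent")
    return best  # (objective value, x, y)

# Hypothetical parameters: x = position size (fraction of capital),
# y = take-profit distance (%). The objective is a made-up linear proxy.
objective = (2.0, 1.0)
box = [
    (1.0, 0.0, 1.0),   # x <= 1
    (-1.0, 0.0, 0.0),  # x >= 0
    (0.0, 1.0, 4.0),   # y <= 4
    (0.0, -1.0, 0.0),  # y >= 0
]
# Each simulated failure becomes one more inequality (illustrative values).
failure_constraints = [
    (1.0, 1.0, 3.0),  # x + y <= 3
    (3.0, 1.0, 4.0),  # 3x + y <= 4
]

active = list(box)
history = [solve_lp_2d(objective, active)]
for constraint in failure_constraints:
    active.append(constraint)  # the failure narrows the feasible region
    history.append(solve_lp_2d(objective, active))

for value, x, y in history:
    print(f"x={x:.2f}  y={y:.2f}  objective={value:.2f}")
```

Each added constraint can only shrink the feasible region, so the objective never improves from one re-solve to the next; what the loop buys is a parameter set that provably respects every failure encoded so far.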

The operational hook to add on day one is version binding: hash the strategy layer, snapshot the parameter vector, and stamp both onto every fill the bot reports. Any incident then maps to an exact bot version and an exact parameter state, which turns post-mortems from archaeology into lookup. Config-as-code discipline is cheap to bolt onto a system that already separates its layers this cleanly.
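Version binding fits in a dozen lines of standard-library Python. The sketch below is one way to do it under stated assumptions: the fill record's fields, the `EXAMPLE-PERP` pair name, and the inline strategy source are hypothetical, and in practice the source text would be read from the deployed strategy file.

```python
import hashlib
import json

def strategy_hash(source_text):
    """SHA-256 hex digest of the strategy layer's source text."""
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()

def params_snapshot(params):
    """Canonical JSON of the parameter vector: sorted keys, fixed separators,
    so equal parameters always serialize to identical bytes."""
    return json.dumps(params, sort_keys=True, separators=(",", ":"))

def stamp_fill(fill, source_text, params):
    """Return a copy of the fill carrying the bot version and parameter state."""
    snapshot = params_snapshot(params)
    return {
        **fill,
        "strategy_sha256": strategy_hash(source_text),
        "params": snapshot,
        "params_sha256": hashlib.sha256(snapshot.encode("utf-8")).hexdigest(),
    }

# Hypothetical strategy source, parameters, and fill record.
STRATEGY_SOURCE = (
    "def decide(spread, p):\n"
    "    return 'buy' if spread > p['entry_threshold'] else 'hold'\n"
)
params = {"entry_threshold": 0.5, "fast_window": 3}
fill = {"pair": "EXAMPLE-PERP", "side": "buy", "qty": 1.0, "price": 100.0}

stamped = stamp_fill(fill, STRATEGY_SOURCE, params)
retuned = stamp_fill(fill, STRATEGY_SOURCE, {**params, "entry_threshold": 0.7})

print(stamped["strategy_sha256"] == retuned["strategy_sha256"])  # same logic
print(stamped["params_sha256"] == retuned["params_sha256"])      # new params
```

A retune changes the parameter hash and leaves the strategy hash alone, so a fill log filtered on either field answers "which logic?" and "which tuning?" separately.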

The honest unknowns: capacity is unaddressed, costs inside the live results are not itemized, and a strategy set mined from technical indicators inherits every regime-dependence worry that class has carried for decades. The architecture would survive all three concerns even if the returns did not, which is the right way around. Designs outlive backtests. TiMi’s lasting contribution is a working existence proof that LLM-grade strategy synthesis and production-grade execution discipline are compatible, provided the model stays on its side of the compile line.

TiMi keeps the language model at compile time and the trading at runtime: 5ms of deterministic logic, and six baselines’ worth of evidence that the decoupling, more than any single strategy, is the edge.
