Long reads
The pillar pieces of this archive: field guides and deep evaluations, each built to stand on its own as the one piece to read on its subject. Ten to fifteen minutes apiece, several original figures each, evidence attached throughout.
Building the agentic enterprise: a field guide
Everyone is shipping agents and most will stall after the demo. The architecture that holds up in production, and the pre- and post-launch discipline that decides which agents survive, with the evidence attached.
DeepSeek-V4: a million tokens of context, on weights you can own
Two MIT-licensed MoE models, V4-Pro at 1.6T parameters and V4-Flash at 284B, ship with 1M-token default context and a production sparse-attention design. For document-heavy quant work that cannot leave the building, the cost calculus just moved again.
Agentic reasoning, unified: a map for deciding where agents belong
A 29-author survey organizes agentic reasoning into three layers (foundational, self-evolving, and collective) and splits inference-time orchestration from post-training optimization. The taxonomy doubles as a decision tool for where agentic loops help a research desk and where they multiply p-hacking.
Time-series foundation models in finance: what transfers and what does not
The first comprehensive test of TimesFM and Chronos on 18 million daily returns answers the question every quant has been asking: zero-shot transfer fails outright, finance-native pretraining recovers most of the gap, and a tuned gradient-boosted tree still wins on fit.
GDPval: measuring models against working professionals
OpenAI's benchmark grades frontier models against real deliverables from professionals averaging 14 years of experience. The best model wins or ties 47.6% of blind comparisons. What that number means, and how to build your own version, matter more than the headline.
Kronos: a foundation model for candlesticks, and the scrutiny it invites
Kronos applies the language-model recipe to market data: tokenize 12 billion candlesticks, train a decoder to predict the next one, read off forecasts. The zero-shot numbers are large. The quant's job is to ask the questions a benchmark cannot answer, about leakage, regime, and whether forecast skill survives the cost of trading on it.
AlphaEvolve: automated discovery, and why the evaluator is the whole game
AlphaEvolve pairs Gemini with an automated evaluator in an evolutionary loop and finds things people missed, including a 4x4 matrix-multiplication algorithm that improves on Strassen's 1969 result. For a quant, the template is automated strategy discovery, and the lesson is severe: the loop optimizes against your evaluator with superhuman efficiency, leaks included.
Does RL really incentivize reasoning? A caution for the backtest
A sober study finds RL makes reasoning models better at the first try without expanding what they can ultimately solve. The quant analogy is exact: do not mistake variance reduction for alpha, in a model or in a trading agent.
Granular metric extraction from filings: traceability and verification beyond summarization
Clients want an agent that reads the 10-K and returns the number. Extraction, not summarization, is the hard part, and benchmarks say models fail it more than half the time. The build guide for doing it with a source on every figure and a verification gate.
The transformer enters the SDF: complexity wins asset pricing
Kelly and coauthors implant a transformer in the stochastic discount factor and report an out-of-sample Sharpe of 4.57 against 1.77 for the best classical factor model, on sixty years of US stocks. The companion theory says why: in pricing, more factors keep winning.
DeepSeek-R1: frontier reasoning goes open
R1 matches OpenAI's o1 on hard math and code, ships openly, and distills into small models you can host. Why the distillation result, not the benchmark parity, is what changes build-vs-buy for a quant desk.
RAG for financial documents: a field guide
Grounding an LLM in your own filings is hard because retrieval, not the model, is the bottleneck. The proven moves that fix it, each with the evidence attached, and the discipline that makes the result safe to use.
Model Context Protocol: the integration layer finally gets a standard
Anthropic's MCP is an open protocol that lets any model reach any data source or tool through one interface. Why a standard, modeled on LSP, is what a quant platform's integration layer has been missing.
OpenAI Swarm: a teaching toy with a lesson worth stealing
Swarm is an experimental, MIT-licensed framework built on two primitives, agents and handoffs. It is not for production. The handoff pattern, though, is the right mental model for a research-agent stack.
Llama 3.1 405B: a frontier model you can run behind your own firewall
Meta's 405B is the first openly available model that matches the closed frontier on knowledge, math, and code. Why that changes the build-vs-buy math for a quant desk that cannot send data to an API.
Kolmogorov-Arnold Networks for time series: a volatility model a risk committee can read
On real implied-volatility data, T-KAN matches an LSTM with about sixty times fewer parameters and stays interpretable. The result, the architecture, and where the story gets oversold.