// Insights

Long reads

The pillar pieces of this archive: field guides and deep evaluations, each built to stand on its own as the one piece to read on its subject. Most run ten to fifteen minutes, with several original figures each and evidence attached throughout.


Jun 23, 2026 · 11 min read

Can a sparse autoencoder audit your finance LLM?

Sparse autoencoders decompose an LLM into millions of readable features. A 2025 result steers a finance model's risk appetite through them. The same year, the labs that built them found that linear probes beat them and that random networks pass the same tests. For a model-risk desk, that makes them a discovery tool, not yet audit evidence.

Jun 17, 2026 · 11 min read

What agent memory actually costs to run

The first systems characterization of agent memory prices ten systems on one H100. Energy per correct answer spans 47x, driven by memory construction rather than query-time serving, a cost every accuracy benchmark hides.

Jun 7, 2026 · 20 min read

Building the agentic enterprise: a field guide

Everyone is shipping agents and most will stall after the demo. The architecture that holds up in production, and the pre- and post-launch discipline that decides which agents survive, with the evidence attached.

Apr 26, 2026 · 10 min read

DeepSeek-V4: a million tokens of context, on weights you can own

Two MIT-licensed MoE models, V4-Pro at 1.6T parameters and V4-Flash at 284B, ship with 1M-token default context and a production sparse-attention design. For document-heavy quant work that cannot leave the building, the cost calculus just moved again.

Feb 10, 2026 · 10 min read

Agentic reasoning, unified: a map for deciding where agents belong

A 29-author survey organizes agentic reasoning into three layers (foundational, self-evolving, and collective) and splits inference-time orchestration from post-training optimization. The taxonomy doubles as a decision tool for where agentic loops help a research desk and where they multiply p-hacking.

Dec 15, 2025 · 10 min read

Time-series foundation models in finance: what transfers and what does not

The first comprehensive test of TimesFM and Chronos on 18 million daily returns answers the question every quant has been asking: zero-shot transfer fails outright, finance-native pretraining recovers most of the gap, and a tuned gradient-boosted tree still wins on fit.

Oct 5, 2025 · 10 min read

GDPval: measuring models against working professionals

OpenAI's benchmark grades frontier models against real deliverables from professionals averaging 14 years of experience. The best model wins or ties 47.6% of blind comparisons. What that number means, and how to build your own version, matter more than the headline.

Aug 5, 2025 · 10 min read

Kronos: a foundation model for candlesticks, and the scrutiny it invites

Kronos applies the language-model recipe to market data: tokenize 12 billion candlesticks, train a decoder to predict the next one, read off forecasts. The zero-shot numbers are large. The quant's job is to ask the questions a benchmark cannot answer, about leakage, regime, and whether forecast skill survives the cost of trading on it.

Jun 7, 2025 · 11 min read

AlphaEvolve: automated discovery, and why the evaluator is the whole game

AlphaEvolve pairs Gemini with an automated evaluator in an evolutionary loop and finds things people missed, including a 4x4 matrix-multiplication algorithm better than any since 1969. For a quant the template is automated strategy discovery, and the lesson is severe: the loop optimizes your evaluator with superhuman efficiency, leaks included.

Apr 20, 2025 · 10 min read

Does RL really incentivize reasoning? A caution for the backtest

A sober study finds RL makes reasoning models better at the first try without expanding what they can ultimately solve. The quant analogy is exact: do not mistake variance reduction for alpha, in a model or in a trading agent.

Mar 9, 2025 · 10 min read

Granular metric extraction from filings: traceability and verification beyond summarization

Clients want an agent that reads the 10-K and returns the number. Extraction, not summarization, is the hard part, and benchmarks say models fail it more than half the time. The build guide for doing it with a source on every figure and a verification gate.

Feb 8, 2025 · 10 min read

The transformer enters the SDF: complexity wins asset pricing

Kelly and coauthors implant a transformer in the stochastic discount factor and report an out-of-sample Sharpe of 4.57 against 1.77 for the best classical factor model, on sixty years of US stocks. The companion theory says why: in pricing, more factors keep winning.

Feb 3, 2025 · 10 min read

Building AI that ships?

If you’re past the demo and into production, I’d love to compare notes.
