DeepSeek-R1: frontier reasoning goes open
Reasoning models stopped being a closed-lab secret in January. DeepSeek-R1 matches OpenAI’s o1 on hard math and code, it is openly available, and its reasoning distills into small models you can run yourself. R1 is the moment frontier-level reasoning became something a quant desk can own rather than only rent through an API.
For two years the best reasoning lived behind an endpoint. o1 made that concrete and kept it closed. R1 breaks the pattern on every axis that matters to a practitioner. The weights are public. The method is documented in detail. And, most importantly for a desk, the reasoning can be compressed into models small enough to serve cheaply on your own hardware. That combination is what turns a benchmark result into a deployment option.
How was R1 trained?
With reinforcement learning. The surprising part is how far that alone goes. The first model, DeepSeek-R1-Zero, was trained by large-scale RL with no supervised fine-tuning at all. It was never shown human reasoning traces to imitate. It was rewarded for getting verifiable answers right. The reasoning behavior (long chains of thought, self-checking, backtracking) emerged on its own. The optimization used Group Relative Policy Optimization (GRPO), which estimates how good an answer is by comparing it with a group of sampled attempts at the same prompt rather than training a separate value model, a leaner setup than the usual reinforcement-learning machinery.
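The group comparison can be made concrete with a small sketch. It covers only the advantage estimate: each sampled answer is scored against the mean and spread of its own group. The function name, the `eps` guard, and the choice of population standard deviation are assumptions for illustration, and the sketch leaves out the rest of the objective (the clipped policy update and the penalty for drifting from a reference model).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    rewards: one scalar reward per sampled attempt at the SAME prompt.
    Returns (reward - group mean) / group std, so the group itself acts
    as the baseline and no separate value model is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids dividing by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four attempts at one problem: two judged correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct attempts get a positive advantage, wrong ones a negative one.
```

When every attempt in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason the difficulty of the training problems matters.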
R1-Zero had rough edges: mixed languages and traces that were hard to read. The full DeepSeek-R1 added a cold-start stage: a small amount of high-quality supervised data to clean up the reasoning before more reinforcement learning. The result reads clearly and scores at the frontier.
What R1-Zero actually proved
The pure-RL result is the part a researcher should sit with, because it is a genuine scientific finding rather than an engineering tweak. The reasoning a model learns is the chain of thought, the step-by-step working that prompting research first surfaced in 2022. Until now, getting a model to reason well meant showing it examples of good reasoning to imitate. R1-Zero shows the behavior can be incentivized instead of demonstrated. Reward the right final answers at scale. The model discovers for itself that working step by step, checking, and backtracking get it there.
The paper documents an emergent behavior it calls an “aha moment,” where the model spontaneously learns to allocate more thinking time to a hard problem and to re-evaluate an approach partway through. Nobody programmed that. It fell out of optimizing for correct answers. For a quant, the interesting implication is methodological: if reasoning can be trained by reward on verifiable problems, then the door is open to training a model on your own verifiable tasks, the ones where an answer can be checked, rather than waiting for a vendor to ship the capability.
There is a caveat to the pure-RL story that keeps it honest. R1-Zero learned to reason, and what it produced was often unreadable, mixing languages and skipping the legible steps a human wants to see. Pure reward optimization finds whatever works rather than whatever is interpretable, which is why the production R1 needed the cold-start stage to make the reasoning presentable. For a quant, that is a familiar lesson: an objective optimized literally gives you exactly what you asked for, which is not always what you meant. Train on verifiable rewards and you get verifiable answers, with no guarantee the path to them is one you can audit unless you ask for that too.
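To make "ask for that too" concrete, here is a minimal sketch of a rule-based reward that pays for a correct final answer and separately for a legible structure. The `<think>` and `<answer>` tags follow the template the paper describes for R1-Zero. The function name, the 0.5 weight, and the exact-match check are hypothetical choices for illustration, not the paper's implementation.

```python
import re

# Expect reasoning inside <think>...</think>, then the final answer
# inside <answer>...</answer>, with nothing else around them.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)

def reward(completion: str, expected: str) -> float:
    """Rule-based reward: a format component plus an accuracy component."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0  # unstructured output earns nothing, even if it is right
    format_reward = 0.5  # hypothetical weight for following the template
    answer = match.group(2).strip()
    accuracy_reward = 1.0 if answer == expected.strip() else 0.0
    return format_reward + accuracy_reward

good = reward("<think>2 + 2 is 4</think><answer>4</answer>", "4")
wrong = reward("<think>2 + 2 is 5</think><answer>5</answer>", "4")
unstructured = reward("The answer is 4", "4")
```

The point of the split is that anything you want from the trace, such as structure, a single language, or visible steps, has to appear as its own term in the reward. The accuracy term alone will not produce it.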
Does it actually match o1?
On the benchmarks that define the category, yes.
On AIME 2024, R1 scores 79.8% against o1’s 79.2%. On MATH-500, 97.3% against 96.4%. It trails on GPQA Diamond, 71.5% against 75.7%, and on Codeforces it posts a 2029 rating against o1’s 2061. The honest summary is parity: level with o1 on competition math, a touch behind on graduate-level science and competitive coding. For an open model arriving months after o1, that is the headline result.
Two cautions keep the parity honest. These are pass@1 scores on public competition sets, which o1 and R1 may both have seen, so absolute numbers flatter both. And competition math is the home turf of reasoning models, the cleanest possible test. Parity here is real and meaningful. It is also parity on the friendliest benchmark, which is a narrower claim than across-the-board equality.
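For readers who have not met the metric, pass@1 here means the chance that a single sampled answer is correct, estimated by sampling several answers per problem and averaging. A minimal sketch, with a hypothetical function name and toy data:

```python
def pass_at_1(correct_by_problem):
    """Estimate pass@1 from repeated sampling.

    correct_by_problem: one list per problem, holding a True/False
    correctness flag for each sampled answer to that problem.
    """
    per_problem = [sum(flags) / len(flags) for flags in correct_by_problem]
    return sum(per_problem) / len(per_problem)

# Toy data: two problems, two samples each.
score = pass_at_1([[True, False], [True, True]])
```

Because it averages over samples, pass@1 is a different quantity from a majority-vote score on the same problems, so the two should not be compared directly.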
Why distillation is the real story for a desk
The o1-matching scores get the attention. The distillation gets the deployment. The R1 team took the big model’s reasoning and distilled it into small dense models built on Qwen and Llama, from 1.5 billion parameters up to 70 billion. The small models inherit a large share of the capability.
R1-Distill-Qwen-7B reaches 55.5% on AIME 2024. The 14B reaches 69.7%. The 32B reaches 72.6%, within striking distance of the full R1 and ahead of where o1-mini sat. A 32-billion-parameter model that reasons near the frontier fits on a single high-memory GPU.
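The single-GPU claim is easy to sanity-check with arithmetic. The sketch below counts memory for the weights only. The function name is hypothetical, and real serving also needs room for the KV cache and activations, which grow with the long outputs a reasoning model produces.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

fp16_gb = weight_memory_gb(32, 16)  # 16-bit weights
int4_gb = weight_memory_gb(32, 4)   # 4-bit quantized weights
```

At 16 bits the 32B weights come to 64 GB, which is why "high-memory" here means an 80 GB class card. At 4 bits they come to 16 GB, at whatever accuracy cost the quantization carries on your tasks.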
The paper adds a finding that matters for how you build. Distilling the big model’s reasoning into a small one beats running reinforcement learning directly on the small model. You cannot shortcut the process by RL-ing a 7B model from scratch and expect the same result. The reasoning has to be learned at scale, then compressed. For a desk, that is a clear recipe: take the open distilled model rather than trying to reproduce the RL training, which needs resources a lab has and you do not.
Put rough economics on it. A distilled 32B runs on one high-memory GPU, which a desk can rent or own for a fixed cost and amortize across unlimited queries. The same reasoning workload on a closed o1-class API is metered per token, and reasoning tokens are not cheap, because the model generates a long hidden chain for every answer. For a research copilot fielding thousands of queries a day, or a signal agent reasoning over a coverage universe every morning, the fixed-cost open model wins on unit economics once the volume is real, before you even count the data-residency and reproducibility benefits.
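A break-even calculation makes "once the volume is real" checkable. The sketch below is deliberately crude: it ignores engineering time, utilization, and latency. Every number in the usage line is an illustrative placeholder, not a quoted price, so substitute your own GPU cost and API tariff.

```python
def breakeven_queries_per_day(
    gpu_cost_per_day: float,
    tokens_per_query: float,
    api_price_per_million_tokens: float,
) -> float:
    """Daily query volume above which a fixed-cost GPU beats a metered API.

    tokens_per_query should include the hidden reasoning tokens,
    since a metered reasoning API bills for those as well.
    """
    api_cost_per_million_queries = tokens_per_query * api_price_per_million_tokens
    return gpu_cost_per_day * 1_000_000 / api_cost_per_million_queries

# Placeholder inputs only: 50 per day for the GPU, 5,000 tokens per query,
# 10 per million tokens on the API.
threshold = breakeven_queries_per_day(50.0, 5_000, 10.0)
```

With those placeholder inputs the threshold is 1,000 queries a day. The exact figure matters less than the shape: the longer the reasoning chain per query, the lower the volume at which owning the hardware wins.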
What it changes for build-vs-buy
Open weights at the frontier already changed the build-vs-buy calculus for general capability. R1 extends that change to reasoning specifically. R1, and especially its distilled versions, lets you host a reasoning model on sensitive data, pin the version for a reproducible backtest, and price the workload by your own hardware rather than a per-token tariff. For an in-house research copilot, or a signal-generation agent that has to reason over filings and news, that moves frontier reasoning from a capability you can only call to one you can actually deploy.
The strategic point is the speed of the shift. o1 shipped in December as a closed frontier. By late January an open model matched it and distilled it onto hardware a desk already owns. Whatever moat reasoning capability looked like it might have, the distance between rented and owned has collapsed faster than almost anyone expected.
Picture the concrete case. A desk wants a research copilot that reasons over its internal notes, recent filings, and position data, none of which can be sent to a vendor. Six months ago that meant choosing between a capable closed model it was not allowed to use and a weak open model that could not really reason. R1’s distilled versions remove the dilemma. The copilot runs on the 32B model inside the firewall, reasons at a level that was frontier-only in December, and leaves no data outside the building. For a compliance-bound desk that is not a marginal convenience. It is the difference between having a reasoning copilot and not having one, on exactly the data where the capability is most valuable.
How should a desk adopt this?
Concretely, the path is short. Start with one of the distilled models, the 14B or 32B depending on your hardware, rather than the full 671B R1, which is a serving project in itself. Point it at the reasoning-heavy workloads first, where its strength shows and a general model struggles. Pin the exact weights and record them alongside any result. A backtest or a memo then stays reproducible months later. Keep a frontier API in reserve for the rare problems where the open model visibly struggles, and measure how often that actually happens rather than assuming it is often.
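Pinning the weights can be as simple as hashing the files you serve and storing the hash next to every result. A minimal sketch using only the standard library follows. The function names, the directory layout, and the record fields are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path

def fingerprint_weights(weights_dir: str) -> str:
    """SHA-256 over every file under weights_dir, in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(Path(weights_dir).rglob("*")):
        if path.is_file():
            # Include the relative path so renamed files change the hash.
            digest.update(path.relative_to(weights_dir).as_posix().encode())
            with path.open("rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()

def run_record(model_name: str, weights_dir: str, result: dict) -> str:
    """Serialize a result together with the exact weights that produced it."""
    return json.dumps(
        {
            "model": model_name,
            "weights_sha256": fingerprint_weights(weights_dir),
            "result": result,
        },
        sort_keys=True,
    )
```

Reproducing a result also requires recording the decoding settings (temperature, seed, maximum length) and the prompt template, which this sketch omits.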
The mistake to avoid is treating R1 as a drop-in replacement for the whole stack on day one. It is a reasoning specialist. The right deployment routes the hard, multi-step reasoning to it and keeps a fast, cheap model on the high-volume single-step work, the same triage that governs any expensive capability. What R1 changes is that the reasoning specialist can now live in-house, on your data, rather than only behind a vendor’s endpoint.
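The triage can be sketched as a small router. Everything here is hypothetical: the models are stub callables, and `needs_reasoning` is a toy rule standing in for whatever classifier or heuristic a desk would actually use. The counter is there so the share of traffic on each route is measured rather than assumed.

```python
from collections import Counter
from typing import Callable

Model = Callable[[str], str]

class Router:
    """Send multi-step questions to the reasoner, everything else to the fast model."""

    def __init__(self, fast: Model, reasoner: Model,
                 needs_reasoning: Callable[[str], bool]):
        self.fast = fast
        self.reasoner = reasoner
        self.needs_reasoning = needs_reasoning
        self.counts = Counter()  # how often each route was taken

    def ask(self, prompt: str) -> str:
        route = "reasoner" if self.needs_reasoning(prompt) else "fast"
        self.counts[route] += 1
        model = self.reasoner if route == "reasoner" else self.fast
        return model(prompt)

    def share(self, route: str) -> float:
        total = sum(self.counts.values())
        return self.counts[route] / total if total else 0.0

# Stub models and a toy routing rule, purely for illustration.
router = Router(
    fast=lambda p: "fast:" + p,
    reasoner=lambda p: "reasoner:" + p,
    needs_reasoning=lambda p: "why" in p.lower() or len(p) > 200,
)
```

The same pattern extends to a third route for the reserve frontier API, with its own counter, so the rate of escalation becomes a number on a dashboard.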
Where to keep your skepticism
Three cautions before you build on it. First, benchmark parity is not task parity. R1 matching o1 on competition math says little about how it handles your messy, domain-specific research questions, which is exactly why you measure on your own tasks before trusting either. Second, the distilled models are narrower than the full R1. They inherit the reasoning on the kinds of problems they were distilled on. They can be brittle off that distribution. A desk’s real questions are often off-distribution. Third, a reasoning model hallucinates like any other. A long, confident chain of thought can walk straight to a wrong answer with more conviction than a short one. The verification gate belongs here as much as anywhere, arguably more, because the fluency of the reasoning makes the error harder to spot.
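One cheap form of verification gate is agreement across several independent samples: accept an answer only when enough of them reach the same result, and send the rest to review. This is a sketch of one possible gate, not a method the source prescribes, and the function name and the 0.6 threshold are assumptions. Agreement is evidence, not proof, because a model can be consistently wrong.

```python
from collections import Counter

def consensus_gate(answers, min_agreement=0.6):
    """Return (answer, agreement) if the top answer clears the threshold,
    otherwise (None, agreement) to signal that a human or a checker
    should look before the result is used.

    answers: final answers from several independent samples of one question.
    """
    top, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement >= min_agreement:
        return top, agreement
    return None, agreement

accepted = consensus_gate(["12", "12", "12", "15", "12"])
rejected = consensus_gate(["1", "2", "3"])
```

Where an answer can be checked directly, by recomputing a figure or running a test, that check is a stronger gate than agreement and should replace it.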
There is a fourth point easy to miss in the excitement. R1’s training leaned on problems with verifiable answers (math, code, logic) because that is what reinforcement learning can reward cleanly. The reasoning it learned is sharpest exactly there. A great deal of real research work is not cleanly verifiable: judging whether a thesis is plausible, weighing conflicting evidence, deciding what is worth investigating at all. The model’s reasoning is genuinely strong on the checkable and less proven on the open-ended, which happens to be where much of an analyst’s value actually sits. Match the tool to the verifiable parts of the work and you get the most from it.
One practical wrinkle deserves a flag. A model tuned hard for reasoning can regress on the mundane things a general model does without thinking: clean instruction-following, well-formed structured output, reliable tool and function calling. A reasoning specialist that occasionally mangles the JSON your pipeline expects is a real integration headache, not a hypothetical one. The fix is the same routing logic as everywhere else. Send the reasoning to R1 and keep a dependable general model for the plumbing, rather than asking one model to be excellent at both.
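In code, that split can look like a validation step between the reasoning model and the pipeline. The sketch below checks that the output parses as JSON and carries the keys the pipeline needs, and otherwise hands the raw text to a `repair` callable. That callable is a hypothetical stand-in for a call to the dependable general model, and the key names in the usage example are placeholders.

```python
import json
from typing import Callable, Iterable, Optional

def parse_or_repair(raw: str, required_keys: Iterable[str],
                    repair: Callable[[str], str]) -> dict:
    """Accept well-formed output as is; otherwise try one repair pass."""
    keys = set(required_keys)

    def valid(text: str) -> Optional[dict]:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return None
        if isinstance(data, dict) and keys.issubset(data):
            return data
        return None

    data = valid(raw)
    if data is None:
        # Hand the mangled text to the general model for reformatting only.
        data = valid(repair(raw))
    if data is None:
        raise ValueError("output is not valid JSON with the required keys")
    return data

# Usage with a stub repair function standing in for the general model.
clean = parse_or_repair('{"ticker": "ABC", "signal": 1}',
                        ["ticker", "signal"], repair=lambda text: text)
```

Limiting the repair to a single pass keeps failures loud: if the second attempt also fails, the pipeline raises instead of guessing.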
The bottom line
R1 is the moment reasoning went open. It matches o1 on competition math and runs close behind on graduate-level science and competitive coding, it is freely available, and its reasoning compresses into models small enough to host. For a quant desk, the distilled 32B is the artifact to notice: frontier-style reasoning on a single GPU, on your own data, reproducible and yours. The cautions are the standard ones. Measure on your tasks, mind the narrower distribution of the distilled models, and keep verifying outputs that a fluent chain of thought makes seductive. The shift underneath them is real. It happened in weeks rather than years. That pace, more than any single benchmark score, is the real story here.