Tim Frenzel

// Insight

s1: buying reasoning with a budget you control

7 min read
test-time-compute, reasoning, open-models

s1 makes a quietly radical claim: you can buy reasoning accuracy with inference compute, cheaply, and control the spend with a dial you hold yourself. It fine-tunes an open 32B model on a thousand curated examples, then adds a trick called budget forcing that makes the model think longer on demand. It is the test-time-compute idea from o1, but open, small, and with the dial in your hands rather than the vendor’s.

The recipe is striking for how little it needs. The base is Qwen2.5-32B-Instruct, an ordinary open model. The training set, called s1K, is 1,000 questions paired with worked reasoning traces, selected for difficulty, diversity, and quality. A thousand examples, not a million, was enough to teach the model to reason at a level that competes with far larger systems. The traces were distilled from a stronger reasoning model, so what s1 really shows is that a small, careful sample of good reasoning can transfer most of the skill. Curation substitutes for scale, which is a lesson that keeps recurring and keeps being underused.

The thousand-example result deserves a second look, because it cuts against the reflex that more data is always better. What transferred was a behavior rather than a body of facts: the habit of working a problem step by step. That is cheaper to teach than it sounds. For many capabilities, the binding constraint is the quality of a small sample rather than the size of a large one.
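To make the three selection criteria concrete, here is a minimal sketch of a curation pass. It is not the s1K pipeline: the `Example` fields, the `min_difficulty` threshold, and the round-robin rule are all illustrative assumptions. It only shows the general shape, which is to filter on quality and difficulty first, then spread the picks across domains.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    domain: str
    difficulty: float  # hypothetical score, higher means harder
    well_formed: bool  # hypothetical quality flag


def select_curated(pool, k, min_difficulty=0.5):
    """Pick up to k examples: quality and difficulty filters, then domain round-robin."""
    # Quality and difficulty: drop malformed or easy items.
    kept = [e for e in pool if e.well_formed and e.difficulty >= min_difficulty]

    # Group by domain, hardest first within each group.
    by_domain = defaultdict(list)
    for e in kept:
        by_domain[e.domain].append(e)
    for items in by_domain.values():
        items.sort(key=lambda e: e.difficulty, reverse=True)

    # Diversity: take one example per domain per round until k is reached.
    selected = []
    while len(selected) < k and any(by_domain.values()):
        for domain in sorted(by_domain):
            if by_domain[domain] and len(selected) < k:
                selected.append(by_domain[domain].pop(0))
    return selected
```

The point of the round-robin step is that a pure difficulty ranking would let one hard domain crowd out the rest. Any real pipeline would need its own definitions of quality and difficulty.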

Figure: Budget forcing, a dial on how long the model thinks. When the model reaches the end of its reasoning, appending "Wait" makes it keep going and reconsider, while inserting the end-of-thinking token cuts it short and moves to the final answer. The same trick, in both directions, sets a compute budget at inference with no retraining involved.

The dial: budget forcing

Budget forcing is the mechanism. It is almost comically simple. When the model tries to stop thinking, you append the word “Wait” to its output, which prompts it to keep going, reconsider its approach, and often catch its own error. When you want it to stop, you insert the end-of-thinking token and force the final answer. The same lever, pushed in either direction, sets a compute budget at inference time without touching the weights.
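The two directions of the lever can be sketched as a short decoding loop. This is a sketch under stated assumptions, not the paper's code: `generate` is a hypothetical callable that continues a text and stops either at the token limit or when the model would end its thinking, `END_THINK` is a placeholder for whatever delimiter the model's template actually uses, and tokens are approximated by whitespace-separated words.

```python
from typing import Callable

# Placeholder delimiter (assumption); the real token depends on the model's chat template.
END_THINK = "<|end_of_thinking|>"


def budget_force(
    prompt: str,
    generate: Callable[[str, int], str],
    min_waits: int = 0,
    max_thinking_tokens: int = 2048,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Return prompt + reasoning trace + END_THINK, ready for final-answer decoding.

    generate(text, max_new_tokens) is an assumed interface: it continues `text`
    and returns early when the model tries to stop thinking.
    """
    trace = ""
    waits_used = 0
    while True:
        remaining = max_thinking_tokens - count_tokens(trace)
        if remaining <= 0:
            break  # Budget exhausted: cut the thinking short.
        trace += generate(prompt + trace, remaining)
        under_budget = count_tokens(trace) < max_thinking_tokens
        if waits_used < min_waits and under_budget:
            # The model tried to stop early: nudge it to keep reasoning.
            trace += " Wait"
            waits_used += 1
            continue
        break
    # Force the end of thinking so the next decode step produces the answer.
    return prompt + trace + END_THINK
```

Setting `min_waits` turns the dial up and lowering `max_thinking_tokens` turns it down. With a real model you would swap in the tokenizer's own count and the actual delimiter.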

AIME 2024: same model, more thinking (%)
s1-32B, no intervention: 50
s1-32B + budget forcing: 57

The payoff is measurable. Appending “Wait” lifted s1-32B from 50% to 57% on the 2024 AIME competition-math set, with no retraining, purely from more deliberation. Across MATH and AIME24 the model exceeds o1-preview by up to 27%. A 32B open model, tuned on a thousand examples, reaching that bar is a result worth sitting with, because it says the capability behind o1 is not a deep moat. The reasoning behavior can be induced cheaply. The compute that drives it can be metered by anyone.

Why should a quant care?

Because it turns thinking time into a parameter you can set, measure, and defend. o1 proved that more inference compute buys more accuracy. But it kept that dial behind a vendor API, with the reasoning hidden and the budget opaque. s1 puts the same dial in your own infrastructure. That changes what you can do with it.

You can hold the compute budget fixed across a backtest, which makes a result reproducible in a way that a vendor endpoint updating underneath you never can. You can raise the budget for a hard research question, such as factor hypothesis generation or a thorny derivation, and lower it for routine screening, spending deliberation only where it pays. You can inspect the reasoning trace, because the model runs on your hardware. For a model-risk function, a controllable, auditable compute knob is worth more than a few points of raw capability locked inside an endpoint you cannot see into. The dial is the deliverable, more than the benchmark score.
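One way to make "held fixed" auditable is to treat the budget as a frozen configuration object and store its fingerprint next to every backtest result. The sketch below is illustrative: the class name, the field names, and the 16-character fingerprint length are assumptions, not anything the s1 release prescribes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ThinkingBudget:
    """Hypothetical record of the inference settings behind a result."""

    model_id: str
    max_thinking_tokens: int
    forced_waits: int
    seed: int

    def fingerprint(self) -> str:
        # Stable hash of the settings, suitable for logging with each run.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]
```

Two runs with the same fingerprint used the same dial settings. A mismatch is an immediate flag that a result may not be comparable. The frozen dataclass also prevents a budget from being mutated halfway through a backtest.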

Picture it in a research workflow. You are screening a universe of hypotheses, hundreds of them, most of which will not survive. You set the thinking budget low, because a cheap first pass is enough to discard the obvious failures. The handful that survive go to a second pass with the budget turned up, where the model reasons hard about confounds, data availability, and how each idea might be spurious. The compute follows the value, because you control the dial. A vendor endpoint with a fixed reasoning mode cannot do that. A model whose budget you set can.
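The two-pass workflow is easy to express in code. In this sketch, `score` is a hypothetical callable that runs the model on one hypothesis at a given thinking budget and returns a plausibility score. The budgets and the threshold are placeholder values to be tuned on your own work.

```python
from typing import Callable, Sequence


def two_pass_screen(
    hypotheses: Sequence[str],
    score: Callable[[str, int], float],
    low_budget: int = 256,
    high_budget: int = 4096,
    keep_threshold: float = 0.5,
) -> tuple[list[tuple[str, float]], int]:
    """Cheap first pass over everything, expensive second pass over survivors.

    Returns the survivors ranked by their second-pass score, plus the total
    thinking budget allotted across both passes.
    """
    # Pass 1: low budget, discard the obvious failures.
    survivors = [h for h in hypotheses if score(h, low_budget) >= keep_threshold]

    # Pass 2: high budget, only for the ideas that earned it.
    ranked = [(h, score(h, high_budget)) for h in survivors]
    ranked.sort(key=lambda pair: pair[1], reverse=True)

    allotted = len(hypotheses) * low_budget + len(survivors) * high_budget
    return ranked, allotted
```

The returned budget total makes the trade visible. Screening everything at the high budget would cost `len(hypotheses) * high_budget`, so the saving grows with the fraction of ideas the first pass discards.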

This is the same make-versus-buy logic that runs through any decision about critical infrastructure. When reproducibility, auditability, or cost control matters, you bring the capability in-house. s1 is evidence that frontier-style reasoning has crossed that line for any team willing to host a 32B model.

The honest limits

Three caveats keep this from being magic. The first is scope. The results are on competition math, where problems have clean, checkable answers. That is the friendliest possible terrain for longer reasoning, and whether budget forcing helps as much on a messy, open-ended research question is unproven. The second is the ceiling. Budget forcing extrapolates only so far. The paper itself shows the gains from appending “Wait” flatten out, which means the dial has a top end rather than scaling without limit. More thinking helps until it does not.
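Because the dial has a top end, it is worth locating that top end on your own tasks rather than assuming it. A minimal sketch: given accuracy measured at several budgets, return the last budget before the marginal gain drops below a threshold. The function name and the `min_gain` default are assumptions, and the numbers in the usage note are made up for illustration, not taken from the paper.

```python
def plateau_budget(curve: list[tuple[int, float]], min_gain: float = 0.01) -> int:
    """Return the budget after which extra thinking stops paying.

    `curve` holds (budget, accuracy) pairs measured on your own evaluation set.
    """
    if not curve:
        raise ValueError("curve must contain at least one (budget, accuracy) pair")
    points = sorted(curve)
    # Walk adjacent budget levels and stop at the first step that gains too little.
    for (budget, acc), (_, next_acc) in zip(points, points[1:]):
        if next_acc - acc < min_gain:
            return budget
    # Gains never flattened within the measured range.
    return points[-1][0]
```

For example, with synthetic measurements `[(1, 0.2), (2, 0.4), (4, 0.5), (8, 0.505)]`, the step from 4 to 8 gains less than a point, so the function returns 4. If the result is the largest budget you measured, the curve has not flattened yet and the sweep should go further.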

The third is what the headline is really measuring. Beating o1-preview on a few math sets is a real result. It is also a narrow one. A quant should read s1 as a strong proof of concept that a controllable reasoning dial is buildable and cheap, rather than as a finished tool ready to drop into a research pipeline. The contribution is the method. The method is what travels.

There is a fourth caveat worth stating plainly, because it tempers the headline. The thousand reasoning traces were distilled from a stronger model. s1 is, in part, compressing an existing reasoning capability into a smaller open one rather than conjuring it from nothing. That is a genuine and useful result. It is a different claim from independent reasoning ability. Read s1 as evidence that strong reasoning can be transferred cheaply onto open weights, rather than as proof that a 32B model invents that reasoning on its own. The distinction matters when you decide how far to trust it on a problem the teacher never saw.

The bottom line

s1 is the more useful half of the test-time-compute story for a practitioner. o1 showed the effect is real and powerful. s1 shows it is cheap, open, and controllable, which are the properties that decide whether a desk can actually use it. The recipe, a thousand curated examples and a one-word trick, is almost too simple to believe. The result holds on the benchmarks where it has been tested. Treat the thinking budget as exactly what it is: a tunable cost-accuracy dial, to be set deliberately per task, measured on your own work, and held fixed when a result has to be reproduced.

The broader signal should interest a desk as much as the method. If a thousand examples and a one-word trick can put competition-grade reasoning on open weights you host, the distance between rented frontier reasoning and owned reasoning is shorter than the price list suggests. That is a strategic fact as much as a technical one. A desk that already hosts its own models is positioned to act on it sooner than most.

s1 turns reasoning into a dial you hold: a small open model, a thousand examples, a one-word trick that trades inference compute for accuracy. The value to a quant is a thinking budget you can fix, measure, and audit yourself.
