// Insight

Qwen3: one open model with a dial between thinking and throughput

June 28, 20253 min read

open-weightsMoEreasoninghybrid-thinking

The benchmark table is not the news. Qwen3, Alibaba’s April 2025 open-weights family, is competitive with the strong reasoning models on coding and math, which is roughly what you would expect by now. The feature worth a desk’s attention is the packaging. Qwen3 is a single model with a dial between two regimes: deep step-by-step reasoning for hard problems, and fast cheap inference for everything else, switchable per request.

The lineup is broad. Six dense models from 0.6B to 32B, plus two Mixture-of-Experts models: the flagship Qwen3-235B-A22B, which activates 22B of 235B, and Qwen3-30B-A3B, which activates 3B of 30B. All of it is Apache-2.0, which for a quant shop is the part that turns a research curiosity into something you can actually deploy without a licensing conversation. The MoE design matters for the same practical reason: the 235B flagship runs at the inference cost of a 22B model, because only the active experts fire on each token. You get the capability of 235B at the bill of 22B in compute, though the full weights still have to be loaded into memory.

The dial is the real contribution. The same weights run in a thinking mode that produces an explicit chain of reasoning before answering, or a non-thinking mode that responds immediately. You switch with a parameter on the request, or mid-conversation with a /think or /no_think instruction. A thinking budget caps how much the model deliberates before it has to answer.

One model, two regimes, two desk jobs

The same weights cover the research regime, where you want deliberation and can afford it, and the throughput regime, where you want a fast cheap answer at scale. The thinking budget is the knob between them.

For a desk that runs both kinds of work, the dial maps directly onto the two regimes you actually live in. Research is low-volume and reasoning-heavy: you want the model to deliberate, where a handful of slow expensive calls is fine. Signal scoring is the opposite, high-volume and shallow: you are running the same prompt across a coverage universe thousands of times, and what you want is a fast cheap answer with no reasoning overhead at all. Running both regimes off one set of self-hosted weights means one deployment, one model to validate, and budgets you tune per workload rather than per model.

That is the same self-hosting argument that vLLM V1 makes on the serving side, met from the model side. Open weights make the capability free. The serving engine decides whether you can afford it. A hybrid model adds a third lever: you decide, per request, how much compute the answer is worth. The reasoning that DeepSeek-R1 delivered as a dedicated model, Qwen3 delivers as a mode you can switch off when the task does not need it.

The honest caveat is the usual one for a benchmark-topping open release. Competitive on public benchmarks is not the same as good on your problem. A thinking budget is a knob you still have to tune against your own latency and cost constraints rather than a setting that tunes itself. The reasoning mode also inherits the limits any reasoning model has, including the efficiency-not-capability ceiling that careful evaluation keeps finding. None of that detracts from the packaging. A permissively licensed model that lets you spend reasoning compute where it pays and skip it where it does not is a genuinely useful thing to be able to self-host.

Qwen3’s eight Apache-2.0 models matter less than its dial: one model that switches between budget-capped chain-of-thought and fast cheap inference. Self-host once, spend reasoning compute on research, and drop to throughput mode for high-volume signal scoring. The MoE flagship runs at a 22B inference cost for 235B of capability.

Working on AI that needs to ship?

I help funds, fintechs, and data teams take AI from prototype to production.

Get in touch Read the book →