Tim Frenzel

// Insight

DeepSeek-V3.1: one model, a thinking switch

Tags: DeepSeek, open-weights, hybrid-reasoning

DeepSeek released V3.1 this week. The design decision worth studying is consolidation. Where the V3 chat line and the R1 reasoning line used to be separate models, V3.1 is a single set of weights, 671B total parameters with 37B active per token, that switches between thinking and non-thinking modes through its chat template. Set the thinking flag and the template ends the prompt with an opening <think> token; leave it off and the template emits the closing </think> token instead, so the same weights answer directly. The context window is 128K tokens. The license is MIT.

The model card’s own benchmark table shows what the switch buys. The deltas are large enough to be an allocation decision rather than a curiosity. AIME 2024 pass@1 goes from 66.3 in non-thinking mode to 93.1 with thinking on. LiveCodeBench moves from 56.4 to 74.8. GPQA Diamond goes from 74.9 to 80.1. Same weights, one template token, nearly twenty-seven points of competition math.

V3.1 pass@1, non-thinking vs thinking (model card)
Benchmark        Non-thinking   Thinking
AIME 2024        66.3           93.1
LiveCodeBench    56.4           74.8
GPQA Diamond     74.9           80.1
The deltas from flipping one template token, per DeepSeek's published table.

The reading that matters is the shape of those gaps. Deliberation pays most where problems are deep and verifiable, competition math and code, and least where the question is mostly knowledge retrieval, as GPQA Diamond’s five-point gain suggests. That shape is your routing policy in miniature: send the hard, checkable work through the thinking path and let everything else go direct at a fraction of the cost and latency.
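The arithmetic behind that shape is short enough to check by hand. The sketch below copies the six scores quoted above from the model card and ranks the benchmarks by how much thinking mode adds. The names `SCORES` and `thinking_gains` are illustrative, not part of any DeepSeek tooling.

```python
# (non-thinking, thinking) pass@1 scores as quoted above from the model card.
SCORES = {
    "AIME 2024": (66.3, 93.1),
    "LiveCodeBench": (56.4, 74.8),
    "GPQA Diamond": (74.9, 80.1),
}


def thinking_gains(scores):
    """Return (benchmark, gain in points) pairs, largest gain first."""
    gains = [
        (name, round(think - direct, 1))  # round away float noise
        for name, (direct, think) in scores.items()
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


for name, gain in thinking_gains(SCORES):
    print(f"{name}: +{gain} points with thinking on")
```

The ranking comes out as AIME 2024 (+26.8), LiveCodeBench (+18.4), GPQA Diamond (+5.2), which is the order in which escalation to thinking mode is most likely to repay its extra tokens.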

A dial on your side of the API

This is the third deliberation dial to ship in a month. The three designs make a tidy spectrum of control. GPT-5’s router decides for you, invisibly. Qwen3’s hybrid modes put a soft switch in the prompt. V3.1 hard-wires the switch into the template of an MIT-licensed model you can host yourself, which means the dial is not just visible but yours to automate. A batch pipeline over a filing corpus can run non-thinking extraction across everything, score its own confidence, and re-run the difficult residue with thinking on. Two passes, one model, one deployment. The switch itself is an argument in the tokenizer call:

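from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")
msgs = [{"role": "user", "content": doc_prompt}]
# add_generation_prompt=True appends the assistant prefix, which is where the think token lands
fast = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True, thinking=False)  # extraction pass
deep = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True, thinking=True)   # hard cases only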
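The two-pass loop described above might look roughly like the sketch below. Everything in it is an assumption for illustration: `generate` is a hypothetical callable wrapping whatever serving stack you use, the confidence score is whatever self-assessment you choose to extract, and the 0.8 threshold is arbitrary. A stub stands in for the model so the control flow can be run as written.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: (prompt, thinking) -> (answer, confidence in [0, 1]).
Generate = Callable[[str, bool], Tuple[str, float]]


@dataclass
class Result:
    doc_id: str
    answer: str
    confidence: float
    mode: str  # "non-thinking" or "thinking", kept with every output


def two_pass(docs: Dict[str, str], generate: Generate,
             threshold: float = 0.8) -> List[Result]:
    """Run every document direct; re-run low-confidence ones with thinking on."""
    results = []
    for doc_id, prompt in docs.items():
        answer, conf = generate(prompt, False)      # pass 1: direct answer
        mode = "non-thinking"
        if conf < threshold:                        # pass 2: escalate the residue
            answer, conf = generate(prompt, True)
            mode = "thinking"
        results.append(Result(doc_id, answer, conf, mode))
    return results


def fake_generate(prompt: str, thinking: bool) -> Tuple[str, float]:
    """Stub model: long prompts get low confidence unless thinking is on."""
    if thinking:
        return f"deep:{prompt}", 0.95
    return f"fast:{prompt}", 0.9 if len(prompt) < 20 else 0.4


docs = {"a": "short filing", "b": "a much longer and more ambiguous filing"}
for r in two_pass(docs, fake_generate):
    print(r.doc_id, r.mode, r.confidence)
```

Keeping `mode` on the result record is the design choice that matters here: it is what lets you audit later which path produced which answer.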

Racing has had this argument settled for years. A superbike carries an engine-map switch on the left bar: full power for a fresh tyre, a softer map when the rear goes off. The rider flips it mid-lap because the rider owns the consequences of the choice. Nobody on a pit wall would accept an ECU that silently picked the map for them and changed its logic between sessions. V3.1 puts the map switch on your handlebar; the closed routers keep it on someone else’s pit wall.

Two details in the card hint at where DeepSeek is pointed. Non-thinking mode holds 66.0 on SWE-bench Verified in agent mode, a sign the fused model is being tuned for tool-use loops where a separate reasoning model would be too slow. And the model is trained with the UE8M0 FP8 scale format on both weights and activations, a quantization choice aimed at compatibility with upcoming accelerator hardware. Neither detail changes the desk case today. Both say the fused-mode design, which folds what the V3 and R1 lineages shipped as two products into one, is now DeepSeek’s main line.

The practitioner verdict is straightforward. For cost-sensitive batch analysis of large corpora, a self-hosted model with an explicit, scriptable deliberation switch is the right shape: you meter your own spend, you log which mode produced which answer, and your reproducibility story holds up in front of a validator. The capability ceiling sits below the closed frontier, and for the workloads where the switch matters most, extraction at scale with escalation on the stubborn tail, that ceiling is rarely the binding constraint. Log the mode alongside every output and the escalation rate becomes a monitoring signal in its own right: a corpus whose share of hard cases suddenly doubles is telling you something changed in the data before any accuracy metric does.
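As a rough illustration of the monitoring idea, the sketch below computes the share of outputs that needed the thinking pass and flags a batch whose share is at least double a baseline. The function names, the mode labels, and the factor of two are assumptions chosen to match the prose, not an established convention.

```python
from collections import Counter


def escalation_rate(modes):
    """Share of outputs that were produced by the thinking pass."""
    if not modes:
        return 0.0
    return Counter(modes)["thinking"] / len(modes)


def escalation_alert(baseline_modes, current_modes, factor=2.0):
    """True when the current rate is at least `factor` times the baseline rate."""
    baseline = escalation_rate(baseline_modes)
    current = escalation_rate(current_modes)
    if baseline == 0.0:
        # No escalations in the baseline: any escalation at all is a change.
        return current > 0.0
    return current >= factor * baseline


# Logged modes for last month's corpus and this week's batch (made-up counts).
baseline = ["non-thinking"] * 90 + ["thinking"] * 10
current = ["non-thinking"] * 75 + ["thinking"] * 25

print(escalation_rate(baseline), escalation_rate(current))
print(escalation_alert(baseline, current))
```

Because the signal comes from the logged mode alone, it needs no labels and can fire before any accuracy metric has been recomputed.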

DeepSeek-V3.1 makes deliberation a template token you control: 66.3 to 93.1 on AIME from one flag, in an MIT model you can host, meter, and audit yourself.
