
Kimi K2 Thinking: 300 tool calls on weights you can own

November 9, 2025 · 6 min read

open-weights, agentic, tool-use

The number that matters in Moonshot’s Kimi K2 Thinking release is not the trillion parameters. It is 200 to 300: the count of sequential tool calls the model card claims it sustains while holding coherent, goal-directed behavior, where prior models degraded after 30 to 50 steps. Long-horizon tool use is the capability that separates an agent that researches from an agent that fetches; it just arrived in open weights.

The architecture follows the now-familiar sparse recipe at larger scale: one trillion total parameters with 32B active per token, across 61 layers whose MoE blocks route each token to 8 of 384 experts plus one shared expert. Context runs to 256K. The deployment-relevant choice is native INT4 through quantization-aware training, which roughly doubles inference speed at what the card calls lossless quality. It is the same weights-fit-the-hardware engineering that made gpt-oss’s MXFP4 the quiet headline of that release. The license is modified MIT, commercially usable.
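To put the sparsity and the INT4 choice in rough numbers, here is a back-of-envelope sketch using only the model-card figures quoted above. It assumes every parameter is stored at the same bit width and counts weights only; real checkpoints typically keep some layers at higher precision and need extra memory for scales, KV cache, and activations, so treat the output as an order-of-magnitude guide.

```python
# Back-of-envelope figures from the model-card numbers quoted in the article.
TOTAL_PARAMS = 1_000_000_000_000   # one trillion total parameters
ACTIVE_PARAMS = 32_000_000_000     # 32B active per token
ROUTED_EXPERTS = 384               # routed experts per MoE block
EXPERTS_PER_TOKEN = 8              # routed experts selected per token


def active_fraction(active: int, total: int) -> float:
    """Share of all parameters touched for a single token."""
    return active / total


def weight_bytes(params: int, bits_per_param: int) -> float:
    """Weights-only storage; assumes a uniform bit width (a simplification)."""
    return params * bits_per_param / 8


if __name__ == "__main__":
    print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
    print(f"routed experts used per token: {EXPERTS_PER_TOKEN / ROUTED_EXPERTS:.1%}")
    for bits in (16, 4):
        gb = weight_bytes(TOTAL_PARAMS, bits) / 1e9
        print(f"{bits}-bit weights: ~{gb:,.0f} GB")
```

Under those assumptions about 3.2% of the parameters fire per token, and the weights shrink from roughly 2,000 GB at 16 bits to roughly 500 GB at 4 bits. That is still a multi-GPU footprint, but a very different one.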

[Figure: K2 Thinking, by the card. Model-card specifications; the sparse recipe at trillion scale, packaged for self-hosting.]

Where it wins, where it trails

The benchmark profile is unusually legible about what this model is for. On Humanity’s Last Exam with tools, K2 Thinking posts 44.9 against GPT-5’s 41.7 and Claude Sonnet 4.5’s 32.0. On BrowseComp, the agentic-search benchmark, it scores 60.2 against 54.9 and 24.1. Flip to SWE-bench Verified and the order reverses: 71.3 for K2 against GPT-5’s 74.9 and Sonnet 4.5’s 77.2. Competition math with a Python interpreter is saturated for everyone, 99.1 versus 99.6 and 100.0.

[Figure: Agentic benchmarks, per the model card (%). The same card shows SWE-bench Verified reversed: 71.3 vs 74.9 and 77.2.]

Vendor-reported numbers earn the usual discount. The shape is still informative. The model leads precisely on benchmarks that reward sustained tool orchestration and trails on single-domain coding craft, which is a coherent specialization rather than a leaderboard accident. Moonshot tuned for the agent loop: search, browse, synthesize, repeat for hundreds of steps. That profile maps onto research workflows better than onto software engineering, which happens to be the right trade for a desk that wants an analyst rather than a programmer.

Why long horizons are the hard capability

The 30-to-50-step ceiling the card references is worth understanding, because it is a context problem before it is an intelligence problem. Every tool call appends its results to the conversation: search output, file contents, error messages, all of it accumulating while the original goal recedes hundreds of thousands of tokens into the past. Attention degrades as context grows, the model starts optimizing for the most recent page of output rather than the mission, and step 60 quietly forgets what step 3 was for. K2’s design answer is interleaved thinking, reasoning tokens woven between tool invocations, letting the model re-derive where it stands before acting again rather than pattern-matching on the latest tool dump. The 256K window buys room; the interleaving spends it on staying oriented.
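The accumulation argument is easy to make concrete. The sketch below counts how many tool-call steps fit in a fixed window if every step appends a constant number of tokens. The prompt size and per-step token counts are hypothetical, "256K" is assumed to mean 262,144 tokens, and the model says nothing about attention quality; it only measures the raw budget.

```python
def steps_until_full(window_tokens: int, prompt_tokens: int, tokens_per_step: int) -> int:
    """Whole tool-call steps that fit before the context window overflows."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    remaining = window_tokens - prompt_tokens
    return max(remaining // tokens_per_step, 0)


WINDOW = 256 * 1024  # assumption: "256K" read as 262,144 tokens

if __name__ == "__main__":
    # Hypothetical per-step footprints: reasoning plus appended tool output.
    for per_step in (600, 1_200, 5_000):
        n = steps_until_full(WINDOW, prompt_tokens=2_000, tokens_per_step=per_step)
        print(f"{per_step:>5} tokens/step -> window full after {n} steps")
```

With these made-up figures, a harness that appends about 5,000 tokens per step runs out of room after 52 steps, while one that holds each step near 1,200 tokens reaches 216. How much each tool dump weighs matters as much as how large the window is.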

The INT4 decision deserves more credit than quantization choices usually get. Most quantized checkpoints are post-training conversions, accuracy negotiated after the fact. Quantization-aware training bakes the precision constraint into the optimization itself, which is how the card can claim a roughly 2x inference speedup at lossless quality. For agentic workloads the speedup compounds: a 300-step trajectory is 300 sequential inference passes, so per-step latency multiplies into minutes of wall-clock difference per task. The cost arithmetic compounds the same way. An agent that burns a few hundred thousand tokens per research task prices very differently on owned hardware than through a metered API. It is the experimentation-repricing argument from the serving side, now applied to whole agent trajectories.
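The compounding is simple multiplication, sketched here with a hypothetical per-step latency. The 4-second baseline is an illustrative assumption, not a measured figure, and the calculation covers model time only, ignoring however long the tools themselves take.

```python
def trajectory_seconds(steps: int, seconds_per_step: float) -> float:
    """Wall-clock model time for a strictly sequential trajectory."""
    return steps * seconds_per_step


def minutes_saved(steps: int, baseline_s: float, speedup: float) -> float:
    """Minutes saved per task when each step runs `speedup` times faster."""
    fast_s = baseline_s / speedup
    saved = trajectory_seconds(steps, baseline_s) - trajectory_seconds(steps, fast_s)
    return saved / 60


if __name__ == "__main__":
    # Hypothetical: 4 s per inference pass before the 2x speedup.
    print(f"{minutes_saved(300, baseline_s=4.0, speedup=2.0):.0f} minutes saved per task")
```

At 300 steps, halving a 4-second pass saves 10 minutes per task. The same 2x on a single chat turn would save 2 seconds.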

One benchmark-hygiene note applies with extra force to agentic scores. A “with tools” benchmark measures the harness as much as the model: the sandbox, the tool implementations, the retry policy, all of it varies across vendors, and none of it transfers to your environment. An HLE-with-tools score from Moonshot’s sandbox and one from OpenAI’s are two different experiments sharing a question set. The cross-model ordering is suggestive; the absolute levels are unportable. Your own harness, on your own tasks, is the only number that prices a deployment.

The desk case, updated

The open-weights deliberation thread this archive has tracked all year just gained its agentic chapter. DeepSeek-V3.1 put the thinking switch in your hands; gpt-oss put o4-mini-adjacent reasoning on one GPU; K2 Thinking adds the long-horizon agent. A research workflow that reads filings, queries databases, runs calculations, and drafts a memo is a few hundred tool calls end to end. Until now the models that could sustain that arc lived behind APIs, with the data-residency and reproducibility problems that entails for a regulated desk. A modified-MIT checkpoint that holds 300 steps changes the default architecture for in-house research agents.

[Figure: The open-weights desk stack, as of November 2025. Three releases, three capabilities: metered deliberation, single-GPU reasoning, long-horizon agency.]

Two operational realities temper the enthusiasm. First, a trillion-parameter MoE at INT4 still wants serious multi-GPU hardware, well beyond gpt-oss’s single-card territory; self-hosting is feasible for a fund, not for a laptop. Second, an agent that runs 300 autonomous steps is 300 opportunities to drift, which makes kernel-level observability and budget caps prerequisites rather than nice-to-haves. The longer the leash, the better the telemetry has to be. Long-horizon autonomy and independent supervision are complements; deploying the first without the second is how an impressive demo becomes an incident report.
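A budget cap and a tool whitelist need very little code. The following is a minimal in-process sketch, not a supervision system: the `ToolGuard` class and its exception names are hypothetical, and a guard living inside the agent's own process is exactly what independent, outside-the-process telemetry is meant to back up.

```python
import time
from typing import Any, Callable


class BudgetExceeded(RuntimeError):
    """Raised when a task tries to exceed its tool-call cap."""


class ToolNotAllowed(PermissionError):
    """Raised when a task calls a tool that is not whitelisted."""


class ToolGuard:
    """Hypothetical wrapper: whitelist, hard call cap, and an audit log."""

    def __init__(self, allowed: set[str], max_calls: int) -> None:
        self.allowed = set(allowed)
        self.max_calls = max_calls
        self.calls = 0
        self.log: list[dict[str, Any]] = []

    def _record(self, name: str, status: str) -> None:
        self.log.append({"tool": name, "status": status, "ts": time.time()})

    def call(self, name: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if name not in self.allowed:
            self._record(name, "denied")
            raise ToolNotAllowed(f"tool {name!r} is not whitelisted")
        if self.calls >= self.max_calls:
            self._record(name, "over_budget")
            raise BudgetExceeded(f"cap of {self.max_calls} tool calls reached")
        self.calls += 1  # attempts count against the budget, even if they fail
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record(name, "error")
            raise
        self._record(name, "ok")
        return result
```

Every attempt lands in `guard.log`, including denied and over-budget ones, so a drifting trajectory leaves a trail even when it is stopped.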

The pilot that makes sense costs little. Stand the model up on a research node, replay a week of real analyst tasks in shadow mode, and grade the trajectories against your frontier-API baseline with the same blind pairwise protocol GDPval demonstrated at scale: same tasks, unlabeled outputs, your own seniors ranking them. Cap the tool budget per task, whitelist the tools, and log every call from outside the process. Hardware sets the entry fee, since a trillion parameters at INT4 still means a multi-GPU node rather than a workstation. That is a real line item and a fraction of what the same workload costs through a metered API at three hundred calls per task, once the agent runs daily.
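The blinding step of such a pilot can be sketched in a few lines. This is a generic blind A/B harness under stated assumptions, not a reproduction of GDPval's protocol: the function names, the `candidate`/`baseline` labels, and the vote format are all hypothetical, and ties are not modelled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlindPair:
    task_id: str
    left: str         # output shown to the grader on the left
    right: str        # output shown to the grader on the right
    left_source: str  # hidden from graders; used only when unblinding


def make_blind_pairs(outputs: dict[str, dict[str, str]], seed: int) -> list[BlindPair]:
    """outputs maps task_id -> {"candidate": text, "baseline": text}."""
    rng = random.Random(seed)  # seeded so the assignment can be audited later
    pairs = []
    for task_id in sorted(outputs):
        sources = ["candidate", "baseline"]
        rng.shuffle(sources)  # randomize which system appears on which side
        left_src, right_src = sources
        texts = outputs[task_id]
        pairs.append(BlindPair(task_id, texts[left_src], texts[right_src], left_src))
    return pairs


def candidate_win_rate(pairs: list[BlindPair], votes: dict[str, str]) -> float:
    """votes maps task_id -> "left" or "right", as picked by a grader."""
    wins = 0
    for pair in pairs:
        picked_left = votes[pair.task_id] == "left"
        candidate_on_left = pair.left_source == "candidate"
        if picked_left == candidate_on_left:
            wins += 1
    return wins / len(pairs)
```

Graders see only `left` and `right`; the `left_source` field stays with whoever runs the pilot until the votes are in.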

The composite verdict: the agentic frontier is no longer exclusively closed. For tool-heavy research loops where data cannot leave the building, K2 Thinking is the first open checkpoint that plausibly carries the whole workflow, benchmarked credibly where it claims to be strong, honest about where it is not, and licensed for the deployment that matters.

K2 Thinking holds 200-300 tool calls on open weights and beats GPT-5 where agents live, on search and tool orchestration: the long-horizon research agent just became something a desk can own.
