// Insight

gpt-oss-120b and gpt-oss-20b: OpenAI's open weights, on your hardware

August 17, 20256 min read

open-weightsMoEon-prem

OpenAI shipping open weights again is the kind of event you date things by. gpt-oss-120b and gpt-oss-20b, released this month under Apache 2.0 with a model card on arXiv, are the company’s first open-weight language models since GPT-2 in 2019. The headline for a regulated desk is simple: near-frontier reasoning now runs on hardware you own, under a license that lets you do almost anything with it. The larger model fits on a single 80GB GPU. The smaller one runs in 16GB, which is laptop and edge territory.

The strategic read matters less than the operational one. Plenty of strong open-weight models exist; DeepSeek-R1 made open reasoning weights mainstream in January, and Llama 3.1 405B put open frontier-scale weights on the table a year earlier. What gpt-oss adds is OpenAI’s post-training recipe, distilled and reinforcement-trained for reasoning and tool use, in a package sized for commodity deployment rather than a cluster.

The architecture, in numbers

Both models are mixture-of-experts transformers. The numbers are worth holding in your head because they explain the economics. The 120b has 116.8B total parameters but activates only 5.1B per token: 36 layers, each with 128 experts of which a router selects the top 4 per token. The 20b holds 20.9B parameters with 3.6B active, across 24 layers of 32 experts, again top-4. Both run a 2880-dimension residual stream, grouped-query attention with 64 query heads against 8 key-value heads, plus a context window of 131,072 tokens via YaRN position interpolation.

Conditional capacity in gpt-oss-120b: each token consults 4 of 128 specialists.

Mixture-of-experts is conditional capacity. The quant analogy is a multi-strategy platform: the firm carries many specialist books, while any single decision draws on a handful of them. You store 116.8B parameters and pay inference on 5.1B, a roughly 4% activation ratio that is precisely why this fits on one GPU. The other half of the trick is quantization: the expert weights ship in MXFP4 at 4.25 bits per parameter, bringing checkpoints to 60.8GB and 12.8GB. That is engineering in service of a deployment target, chosen so the 120b lands under the 80GB ceiling of a single H100-class card.

Two deployment tiers, one recipe

Model-card specifications; the checkpoints target 80GB and 16GB hardware ceilings.

The models also expose an adjustable reasoning effort, low to medium to high. Where the router OpenAI shipped in GPT-5 the week before makes that choice for you, here you set the effort level per call. The trade shows up in your own latency and token numbers, where you can measure it.

What the benchmarks say, and what they do not

The model card reports results at high reasoning effort. They are strong for the size class. On GPQA Diamond, the 120b scores 80.9% against 74.2% for the 20b. With tools, AIME 2025 lands at 97.9% and 98.7% respectively. Codeforces Elo comes in at 2,622 and 2,516. OpenAI’s own framing is that the 120b surpasses o3-mini and approaches o4-mini on canonical reasoning benchmarks, with HealthBench at 57.6% roughly matching o3.

High reasoning effort, per the model card (%)

On AIME with tools the 20b edges past the 120b; GPQA restores the size ordering.

Two readings of that table deserve attention. First, the 20b beating the 120b on tool-assisted AIME is a tell: once a model can call a calculator and a Python interpreter, raw scale matters less than the post-training that teaches it when to reach for the tool. That is encouraging for anyone eyeing the 16GB deployment tier. Second, the usual benchmark skepticism applies with extra force to a release this prominent. Saturated competition-math benchmarks reward exactly the training distribution these models drilled, which is why contamination-resistant tests like FrontierMath were built. Treat the scores as evidence the models are in the o3-mini-to-o4-mini band, and run your own evaluation on your own tasks before believing anything finer-grained.

The card’s emphasis on agentic work is the more durable signal. These models were post-trained with large-scale distillation and reinforcement learning toward tool use: web research, Python execution, custom function calls. That matches where open-weight deployment is actually heading on a desk, which is rarely “chat with a model” and usually “run a loop that reads, computes, and writes.” The engineering hours belong in the tool scaffolding before they belong in model selection.

The on-prem case for a desk

The reason this release matters to a quant shop is not the leaderboard. It is the set of constraints it dissolves. A fund’s filings analysis, position-aware research, and mandate-sensitive workflows mostly cannot leave the building: compliance reviews of vendor APIs are slow, data-processing agreements are restrictive, and some mandates simply forbid third-party transmission. The standing options were open weights from Chinese and European labs or smaller US releases, each carrying its own approval friction. An Apache-2.0 model from OpenAI, at this quality, on one GPU, is the easiest such approval conversation yet.

The license does real work here. Apache 2.0 permits commercial use, modification, and fine-tuning without copyleft obligations, which means you can post-train the model on your own labeled research tasks and keep the result proprietary. Pin the weights and your reproducibility problem disappears at the model layer: the same checkpoint answers the same way next quarter, with no silent router or version bump in the path. For audit and model-risk purposes, a frozen artifact you control beats a vendor endpoint on almost every dimension except raw capability.

The build-vs-buy calculation has shifted: the question is no longer whether open weights are good enough to consider, it is whether your task needs the frontier delta enough to accept an API’s constraints.

For extraction, classification, summarization at scale, and most agentic tool-use loops, an o4-mini-adjacent model you own is plainly sufficient. For frontier reasoning on genuinely hard problems, the closed tier keeps its edge. Vendor dependencies age badly often enough that the durable position is to hold both: a pinned open model for everything that must be reproducible and private, plus a metered frontier call for the residual.

The serving economics are the last piece. A 5.1B-active model at MXFP4 is cheap to run at batch. The vLLM-class serving stack that makes self-hosting practical has matured in parallel. The marginal cost of an extra million tokens on your own GPU is electricity. That changes which research ideas are worth trying: sweeping a prompt variant across ten thousand filings stops being a budget request and becomes a Tuesday afternoon. The quiet, compounding benefit of owning the stack is that experimentation gets repriced along with inference.

gpt-oss puts OpenAI-grade reasoning weights on a single GPU under Apache 2.0: for every workflow that cannot leave the building, the default just flipped from API to artifact.

Working on AI that needs to ship?

I help funds, fintechs, and data teams take AI from prototype to production.

Get in touch Read the book →