Tim Frenzel

// Insight

DeepSeek-V4: a million tokens of context, on weights you can own

Tags: DeepSeek, open-weights, long-context, sparse-attention

DeepSeek shipped V4 on April 24. The number to anchor on is not the 1.6 trillion parameters. It is the context window: one million tokens, default, on MIT-licensed open weights, with a sparse-attention design that makes the length economically survivable. The release lands two models: V4-Pro at 1.6T total parameters with 49B active per token, and V4-Flash at 284B total with 13B active, both trained on the same 32-trillion-token corpus, both inheriting the same architectural line. For the document-heavy end of quantitative work, the filings, the transcripts, the credit agreements that cannot leave the building, this is the release the whole open-weights year was building toward.

The lineage matters for trusting the claims, which is why the foundational reference for this piece is the DeepSeek-V3 technical report, the December 2024 paper that established the recipe V4 scales: a sparse Mixture-of-Experts transformer with Multi-head Latent Attention, aggressive load balancing, plus a cost discipline that made 671B parameters trainable on a constrained budget. V4 is that recipe two generations on, with the experimental attention research this archive covered in early 2025 now graduated into the production line.

The architecture, and where it came from

The headline innovation is DeepSeek Sparse Attention paired with token-wise compression, the mechanism that makes the million-token default honest. Dense attention’s quadratic cost makes true long context an economic fiction: windows get advertised, then priced and slowed out of daily use. The sparse path compresses token representations and attends selectively, which the release describes as world-leading long context at drastically reduced compute and memory cost. Readers of this archive met the prototype fourteen months ago: Native Sparse Attention, DeepSeek’s February 2025 research paper on hardware-aligned sparsity that trains sparse rather than retrofitting it. V4 is that idea industrialized, the research-to-production arc completing in public.

How a million tokens stays affordable
Full document context: up to 1M tokens → Token-wise compression: condensed representations → DeepSeek Sparse Attention: selective attention over compressed + selected tokens → Decode at a fraction of dense cost
Sparsity trained in from the start, per the NSA lineage, rather than bolted on after.

The research lineage rewards a paragraph of detail, because it is the rare case of a lab publishing its architecture bets before shipping them. The February 2025 NSA paper laid out the three-part design: compress tokens hierarchically into coarse summaries, select the blocks that matter for fine-grained attention, and keep a sliding window for local context, with the whole mechanism trained natively rather than imposed on a dense-trained model after the fact. The native-training point was the paper’s hill to die on, since post-hoc sparsification keeps dense attention’s habits and loses accuracy where it prunes. Fourteen months later the production version pairs that selection machinery with token-wise compression at million-token scale. For anyone modeling lab behavior, the pattern is the tell: architecture research published openly, then industrialized two generations later, is a roadmap you can actually read, which closed labs structurally cannot offer.
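A back-of-envelope count makes the three-part design concrete. The sketch below tallies how many keys a single decoding query touches under dense attention versus a compress-select-window scheme of the kind the NSA paper describes. The block size, number of selected blocks, and window length are illustrative placeholders, not V4's published configuration, and the function names are hypothetical.

```python
import math

def dense_keys(context_len: int) -> int:
    """Keys one decoding query attends to under dense attention."""
    return context_len

def sparse_keys(context_len: int, block_size: int, top_k_blocks: int, window: int) -> int:
    """Upper bound on keys attended under a three-branch scheme (illustrative).

    Branches: one coarse summary per block, the full tokens of the top-k
    selected blocks, and a sliding window of recent tokens. Overlap between
    selected blocks and the window is not deducted, so this is a ceiling.
    """
    n_blocks = math.ceil(context_len / block_size)
    compressed = n_blocks                                   # one summary key per block
    selected = min(top_k_blocks, n_blocks) * block_size     # fine-grained keys in chosen blocks
    local = min(window, context_len)                        # sliding-window keys
    return compressed + selected + local

# Assumed settings, chosen only to show the shape of the saving.
context = 1_000_000
sparse = sparse_keys(context, block_size=64, top_k_blocks=16, window=512)
print(f"dense: {dense_keys(context):,} keys, sparse: {sparse:,} keys")
print(f"reduction factor: {dense_keys(context) / sparse:.1f}x")
```

Under these assumed settings the coarse summaries dominate the sparse count, which is why the compression ratio, more than the selection budget, sets the cost at million-token lengths.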

What the release does not include is the same thing the line has never included: training data, data recipes, and intermediate checkpoints stay private, keeping reproducibility at the inference layer rather than the science layer. The distinction matters for governance language. These are open weights, auditable in behavior and pinnable in deployment, rather than open science in the fully reproducible sense that lets a third party rebuild the artifact. For most desk purposes the weights are what governance needs; for anyone writing ‘open source’ in a model-risk file, the precise term is worth the precision.

The two-model split is a deployment statement. Pro at 1.6T/49B is the frontier bid, the model DeepSeek positions as beating all current open models on math, science, and coding, with the launch benchmarks reporting 80.6% on SWE-bench Verified, the highest open-weights score and level with Gemini 3.1 Pro. Flash at 284B/13B is the work model, reported within 1.6 points of Pro on coding benchmarks while Pro keeps a roughly 11-point edge on complex terminal-agent tasks. The gap pattern is informative: distillation-era training has made mid-size models nearly free on bounded tasks, while long-horizon agentic work still pays for scale, the same capability split K2 Thinking’s card showed from the other side.

V4 by the release notes
V4-Pro: 1.6T total; 49B active; Frontier math, code, agentic
V4-Flash: 284B total; 13B active; Within 1.6pp on coding
Shared: 1M context default; DSA + token-wise compression; 32T-token corpus
License: MIT, both models, weights on day one
One recipe, two operating points; the license is the part procurement reads first.
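The two operating points are easier to compare as active fractions, since per-token compute in a Mixture-of-Experts model tracks the active parameters rather than the total. A minimal sketch using only the figures from the release notes above; the dictionary layout and function name are illustrative.

```python
# Parameter counts as quoted in the release notes.
MODELS = {
    "V4-Pro": {"total_params": 1.6e12, "active_params": 49e9},
    "V4-Flash": {"total_params": 284e9, "active_params": 13e9},
}

def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the weights that participate in each token's forward pass."""
    return active_params / total_params

for name, spec in MODELS.items():
    print(f"{name}: {active_fraction(**spec):.1%} of weights active per token")
```

Both tiers run on a few percent of their weights per token; the total count still has to sit in memory, which is a serving cost rather than a compute cost.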

What a million tokens changes on a desk

Context windows stopped being a bragging metric the day retrieval got good. The real question is what a million tokens buys that a tuned retrieval stack does not. The answer is the class of tasks where the cross-references are the task. A credit agreement read against its amendments, a 10-K against three years of predecessors, a fund’s offering documents against the side letters: these are workloads where chunking severs exactly the connections the analysis needs, where the orphaned-number problem is not an artifact but the enemy. At a million tokens, a full filing history sits in context simultaneously, with the model’s attention, rather than a retriever’s top-k, deciding what connects to what.

The economics decide whether that capability is real or theoretical, which is why the sparse attention is the load-bearing feature rather than the parameter count. Long-context inference on dense models prices itself out of routine use; sparse attention at 13B active parameters puts whole-document-set analysis into the daily-driver cost range. The pattern this archive has tracked since gpt-oss made single-GPU reasoning routine holds: the deployment-relevant breakthroughs are economic, and here the architecture is the economics.

The open-weights desk stack, as this archive has assembled it over fourteen months, now reads as a complete platform. V3.1 put the deliberation switch in your hands. gpt-oss put auditable reasoning on one GPU. K2 Thinking added the long-horizon agent. V4 adds frontier-adjacent capability with a context window that swallows entire document sets, under the most permissive license in the lineup. There is no longer a workload class in document-heavy quantitative research that requires a closed API for capability reasons; what remains closed-API territory is closed for convenience, not necessity.

The open-weights desk stack, updated April 2026
DeepSeek-V3.1: Thinking switch; MIT; Aug 2025
gpt-oss-120b: One-GPU reasoning; Apache-2.0; Aug 2025
K2 Thinking: 200-300 tool calls; Modified MIT; Nov 2025
DeepSeek-V4: 1M context, frontier-adjacent; MIT; Apr 2026
Fourteen months from first credible piece to complete platform.

The week-one pilots

Three measurements turn the launch claims into your numbers, and none takes longer than a week. First, the context-quality curve: plant known facts and reasoning dependencies at controlled depths across 100K to 1M tokens of your own filings, then score retrieval and cross-reference accuracy by depth. Every long-context model to date has a cliff somewhere; the deployment question is whether V4’s sits past the document sizes you actually use. Second, the Flash displacement test: run your current extraction and summarization workhorse against Flash on a frozen batch with the blind pairwise grading your seniors already know how to do, because a 13B-active model that ties your incumbent at lower cost is the quiet win that funds everything else. Third, the agentic gap audit: if the 11-point Pro-versus-Flash spread on terminal-agent tasks is real on your workflows, it prices exactly which workloads justify multi-node Pro serving, and if it is not, Flash inherits the whole stack. The survey’s reliability lesson applies to all three: repeat every task enough times to see the variance, since a launch benchmark is one draw and production is the distribution.
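The first pilot can be sketched in a few dozen lines. The harness below plants known facts at controlled relative depths in filler text and scores recall by depth, repeating each query to expose variance. Everything here is an assumption for illustration: the function names, the needle format, and especially `stub_model`, which stands in for a real inference call by "seeing" only the first 2,000 characters so the depth cliff is visible in the output.

```python
from collections import defaultdict

def plant_needles(paragraphs, needles):
    """Return one context string with each needle inserted at its relative depth.

    paragraphs: list of filler strings, ideally drawn from your own filings.
    needles: list of dicts with keys "depth" (0.0 to 1.0), "text", "question", "answer".
    """
    out = list(paragraphs)
    # Insert deepest first so earlier insertions do not shift later positions.
    for needle in sorted(needles, key=lambda n: n["depth"], reverse=True):
        position = round(needle["depth"] * len(paragraphs))
        out.insert(position, needle["text"])
    return "\n\n".join(out)

def score_by_depth(needles, context, ask_model, repeats=3):
    """Ask each needle's question `repeats` times; return accuracy keyed by depth."""
    hits = defaultdict(list)
    for needle in needles:
        for _ in range(repeats):
            reply = ask_model(context, needle["question"])
            hits[needle["depth"]].append(needle["answer"] in reply)
    return {depth: sum(h) / len(h) for depth, h in sorted(hits.items())}

def stub_model(context, question, visible_chars=2000):
    """Placeholder for a real model call: echoes only the text it can 'see'."""
    return context[:visible_chars]

paragraphs = ["filler " * 20 for _ in range(100)]
needles = [
    {"depth": 0.1, "text": "The leverage covenant ceiling is 4.25x.",
     "question": "What is the leverage covenant ceiling?", "answer": "4.25x"},
    {"depth": 0.9, "text": "The amended coupon is 7.5%.",
     "question": "What is the amended coupon?", "answer": "7.5%"},
]
context = plant_needles(paragraphs, needles)
accuracy = score_by_depth(needles, context, stub_model)
print(accuracy)
```

Swapping `stub_model` for a call to your own serving endpoint, and substring matching for a stricter grader, turns the sketch into the measurement; reasoning dependencies need needles whose answers require combining two planted facts rather than recalling one.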

Which tier carries which workload
V4-Flash, default: Node-scale serving; Within 1.6pp on coding; Daily extraction, drafting, long-context reads
V4-Pro, by exception: Multi-node serving; ~11pp agentic edge; Long-horizon agent runs that prove the need
Default to Flash, promote workloads to Pro only when your own evals show the gap.
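One way to make that promotion rule mechanical is to require that Pro's advantage clear both a minimum useful gap and the run-to-run noise in your own evals. The sketch below is a heuristic under stated assumptions, not a statistical test: the threshold values and the function name are hypothetical, and each score list needs at least two repeated runs for the standard deviation to be defined.

```python
from statistics import mean, stdev

def should_promote_to_pro(flash_scores, pro_scores, min_gap=0.05, noise_multiple=2.0):
    """Promote a workload to Pro only when the measured gap is both useful and real.

    flash_scores, pro_scores: per-run success rates (0.0 to 1.0) from repeated
    runs of the same task batch on each tier.
    """
    gap = mean(pro_scores) - mean(flash_scores)
    noise = max(stdev(flash_scores), stdev(pro_scores))
    return gap > min_gap and gap > noise_multiple * noise

# Illustrative numbers only, not measurements.
agentic_runs = should_promote_to_pro([0.60, 0.62, 0.58, 0.61], [0.72, 0.70, 0.73, 0.71])
extraction_runs = should_promote_to_pro([0.60, 0.70, 0.55, 0.68], [0.66, 0.62, 0.71, 0.64])
print(agentic_runs, extraction_runs)
```

The point of the noise term is the one made above about launch benchmarks: a single run on each tier cannot distinguish a real gap from a lucky draw.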

The calibrated skepticism

Launch benchmarks earn launch discounts, and three specifics keep this one grounded. The headline numbers are vendor-reported with the usual harness caveats this archive has flagged for every agentic score: a SWE-bench figure from one sandbox does not transfer to yours; the only number that prices a deployment is your own eval on your own tasks. The million-token claim needs its quality curve measured, because every long-context model to date shows retrieval-quality degradation somewhere inside its advertised window, and whether V4’s sparse attention holds attention quality at 800K tokens is precisely the kind of thing a needle-and-reasoning test on your own documents answers in a week. And hardware reality prices the tiers: 1.6T total parameters is multi-node territory regardless of sparsity, which makes Flash at 284B the model most desks will actually run, with Pro reserved for the workloads that demonstrably need the 11-point agentic edge.
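The hardware point reduces to arithmetic on weight memory. The sketch below assumes one byte per parameter (8-bit weights) and 80 GB accelerators; both are placeholders chosen for illustration rather than anything the release specifies, and the count covers weights only, before KV cache and activations.

```python
import math

def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9

def min_gpus_for_weights(total_params: float, bytes_per_param: float, gpu_memory_gb: float) -> int:
    """Smallest accelerator count whose combined memory fits the weights alone."""
    return math.ceil(weight_memory_gb(total_params, bytes_per_param) / gpu_memory_gb)

# Assumed: 8-bit weights, 80 GB per accelerator.
for name, params in [("V4-Pro", 1.6e12), ("V4-Flash", 284e9)]:
    gb = weight_memory_gb(params, bytes_per_param=1)
    gpus = min_gpus_for_weights(params, bytes_per_param=1, gpu_memory_gb=80)
    print(f"{name}: {gb:,.0f} GB of weights, at least {gpus} accelerators")
```

Under these assumptions Flash fits inside a single eight-accelerator node with room to spare, while Pro needs more than two such nodes before a single token of context is cached.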

The serving math behind the million tokens merits its own paragraph, because the constraint at long context is memory before it is compute. Dense attention’s real killer in production is the KV cache, which grows linearly with context and holds GPU memory hostage for the whole session; at a million tokens it dwarfs the weights for small models. Compression-based sparse attention attacks exactly that line item, shrinking what must be cached as well as what must be attended to, which is why the 13B-active Flash at long context is a plausible daily driver rather than a benchmark stunt. Desks budgeting a deployment should price the memory footprint per concurrent long-context session, since that, more than tokens per second, decides how many analysts one node serves.
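The per-session budgeting can be sketched with the standard KV-cache formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The layer count, head count, head dimension, node size, and the 16x compression ratio below are hypothetical placeholders, since the release notes quoted here do not give V4's configuration; substitute the real values from the model card.

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2, compression=1.0):
    """KV cache for one session, in gigabytes (1 GB = 1e9 bytes).

    The leading 2 covers keys and values. `compression` is the assumed factor
    by which a compression-based scheme shrinks what must be cached.
    """
    dense_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return dense_bytes / compression / 1e9

def sessions_per_node(node_memory_gb, weights_gb, cache_gb_per_session):
    """Concurrent long-context sessions that fit after the weights are loaded."""
    free = node_memory_gb - weights_gb
    return max(0, int(free // cache_gb_per_session))

# Placeholder architecture and hardware, for illustration only.
dense = kv_cache_gb(1_000_000, layers=60, kv_heads=8, head_dim=128)
compressed = kv_cache_gb(1_000_000, layers=60, kv_heads=8, head_dim=128, compression=16)
print(f"uncompressed cache: {dense:.2f} GB, compressed: {compressed:.2f} GB per session")
print("sessions per 640 GB node with 284 GB of weights:",
      sessions_per_node(640, 284, dense), "vs", sessions_per_node(640, 284, compressed))
```

With these placeholder numbers the uncompressed cache leaves room for a single million-token session per node, while the compressed cache leaves room for a couple of dozen, which is the line item the paragraph above says to price.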

The MIT license also reopens the rent-versus-own decision at the post-training layer with unusually favorable terms. Weights this permissive can be fine-tuned on your own research trajectories and kept proprietary, turning the foundational capability into owned, desk-specific behavior, the capital move the survey’s framework prices against perpetual orchestration rent. A Flash variant post-trained on a year of your filings workflows is an asset on your books in a way no API relationship ever is.

The geopolitical dimension deserves one sober paragraph rather than zero or ten. A Chinese lab shipping the most permissively licensed frontier-adjacent model continues the pattern that has defined the open-weights era, and for regulated firms the calculus is unchanged from the V3.1 discussion: weights you run on your own hardware carry no data-residency exposure, whatever their origin, while model provenance belongs in the same third-party-risk review as any vendor dependency. The MIT license simplifies the legal half of that review to near-zero. The behavioral half, what the model does on your tasks, your red-team prompts, your edge cases, is yours to test, which is true of every model on the stack figure above.

The bottom line

Eighteen years of watching infrastructure cycles says the moment to mark is when the constraint stops being capability and becomes integration. Those transitions never announce themselves with a benchmark; they show up as procurement questions getting shorter and pilot approvals getting faster, which is what the MIT license and the two-tier sizing are engineered to produce. V4 is that moment for document-heavy research: a million tokens of context at sparse-attention economics, frontier-adjacent benchmarks, MIT weights, plus a two-tier deployment that matches how desks actually budget. The numbers in this piece will age; the stack figure will not, because each layer was an economic threshold rather than a benchmark. The pilots write themselves: the long-context quality curve on your own filings first, Flash against your current workhorse second, after which the verdicts will be yours rather than the leaderboard’s. The closed frontier keeps its edge at the absolute top end, but the gap now buys convenience rather than capability. For the workloads this blog has spent two years mapping (filings, transcripts, agreements, the full document memory of a research process), the default architecture is now open, owned, and a million tokens wide.

DeepSeek-V4 puts a million tokens of sparse-attention context on MIT-licensed weights: the open stack is complete, the capability gap is now a convenience gap, and the document-memory workloads this archive has tracked for two years finally have their native model.
