Tim Frenzel

// Insight

GDPval: measuring models against working professionals

10 min read
evaluation, economic-tasks, benchmark

For two years the standard answer to “can a model do my analysts’ work” has been a shrug at a leaderboard that measures nothing an analyst does. GDPval, OpenAI’s new benchmark, finally attacks the question directly: real work products, graded blind by the professionals who do that work for a living. The headline result is that the best frontier model wins or ties against human-expert deliverables in 47.6% of blind comparisons; the model is Claude Opus 4.1, not one of OpenAI’s own. A lab publishing a rival’s win is a small fact that buys the methodology a lot of credibility.

The number is striking. The construction is the part a desk should study, because it is the template for the evaluation your firm keeps not building.

What did they actually build?

GDPval is 1,320 tasks spanning 44 occupations across the nine sectors that contribute most to US GDP. The tasks are not synthetic puzzles. Each was drafted from real work by professionals averaging 14 years of experience, arrives with the reference files a practitioner would actually have, and demands a deliverable: a memo, a model, a brief, a plan, a design. A 220-task gold subset is open-sourced alongside a public automated grading service. The grading protocol is the expensive, load-bearing choice: blind pairwise comparison, in which an expert from the relevant occupation receives the request plus reference files and ranks unlabeled deliverables without knowing which came from a machine. Each comparison costs more than an hour of expert time.

Figure: How GDPval is constructed and graded
9 top-GDP sectors → 44 occupations → Professionals, avg 14 years: real tasks + reference files → 1,320 deliverable tasks; 220-task open gold subset → Blind pairwise expert grading, an hour-plus per comparison
Real work products, graded blind by people who do the occupation for a living.

The lineage here matters. The foundational predecessor is GPTs are GPTs, the 2023 Eloundou, Manning, Mishkin, and Rock study that mapped LLM capability descriptions onto occupational task lists to project labor-market exposure. That paper could only ask which tasks looked exposed on paper. GDPval closes the loop the 2023 work left open: instead of projecting exposure from task descriptions, it has models actually produce the deliverables and lets practitioners judge the output. Projection has become measurement. That step, from “this occupation looks automatable” to “here is the win rate against its professionals,” is the difference between a macro talking point and an operational input.

What do the numbers say?

The model spread is wide and recent. Claude Opus 4.1 leads at 47.6% wins-plus-ties on the gold subset. GPT-5 lands at 39.0%, o3 at 35.2%, o4-mini at 29.1%. GPT-4o, the frontier of eighteen months ago, manages 12.5%.
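For readers who want to replicate the metric on their own comparisons, here is a minimal sketch of how a wins-plus-ties rate falls out of a list of blind verdicts. The verdict labels and the sample data are illustrative assumptions, not GDPval's schema or results.

```python
from collections import Counter

def win_or_tie_rate(verdicts):
    """Share of comparisons where the model's deliverable won or tied.

    `verdicts` holds one label per blind comparison: "model", "tie", or "human".
    These labels are a hypothetical encoding, not GDPval's format.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts["model"] + counts["tie"]) / len(verdicts)

# Made-up verdicts, for illustration only.
sample = ["model", "human", "tie", "human", "model", "human", "human", "tie"]
rate = win_or_tie_rate(sample)
print(f"win-or-tie rate: {rate:.1%}")
```

Counting ties toward the model is the convention behind the figures quoted here; a strict win rate would drop the `"tie"` term.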

Figure: Win-plus-tie rate vs expert deliverables, gold subset (%)
Claude Opus 4.1: 47.6
GPT-5: 39.0
o3: 35.2
o4-mini: 29.1
GPT-4o: 12.5
The frontier of eighteen months ago wins one comparison in eight; today's wins almost half.

Two readings of that chart deserve equal weight. The first is the level: the best model is approaching parity on deliverables that take professionals hours or days, which is not a result any 2023-era benchmark would have predicted comfortably. The second is the slope. The paper reports performance improving roughly linearly over time. The 12.5-to-47.6 climb happened in roughly a year and a half of model generations. Anyone using today’s win rate as a planning constant is pricing an asset off a stale quote; the slope is the decision-relevant number.
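The slope can be made concrete with back-of-envelope arithmetic. The sketch below uses only the two endpoints quoted above and treats "roughly a year and a half" as exactly 18 months, which is an assumption; it is a two-point estimate, not a fit to the paper's data, and it presumes the linear trend holds.

```python
# Endpoints quoted in the text: GPT-4o and Claude Opus 4.1, win-or-tie %.
old_rate, new_rate = 12.5, 47.6
months_elapsed = 18  # assumption: "roughly a year and a half" taken literally

points_per_month = (new_rate - old_rate) / months_elapsed
points_per_quarter = 3 * points_per_month

def months_until_drift(tolerance_points):
    """Months before a frozen win rate is off by `tolerance_points`, if the trend holds."""
    return tolerance_points / points_per_month

print(f"implied slope: {points_per_month:.2f} points/month, "
      f"{points_per_quarter:.2f} points/quarter")
print(f"a frozen win rate drifts 10 points in about {months_until_drift(10):.1f} months")
```

On these assumptions the implied slope is about 1.95 points a month, so a win rate frozen as a planning constant is ten points stale in roughly five months.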

Performance is not uniform across the economy. Sector-level results vary substantially, with models approaching expert parity in Government, Retail Trade, and Wholesale Trade while other sectors lag well behind. The benchmark’s occupational grain is exactly what makes it useful: a single blended score across 44 occupations would hide the fact that some workflows are already at parity while others are nowhere close. The blended number is a headline. The per-occupation table is a plan.

How much should you trust the grading?

This is the section that separates GDPval from leaderboard culture, because the paper measures its own measurement. Expert graders, ranking deliverables blind, agree with each other 71% of the time. The automated grader OpenAI ships agrees with human experts 66% of the time, five points shy of the human-human ceiling.

Figure: Grader agreement on the same comparisons (%)
Human expert vs human expert: 71
Automated grader vs human: 66
Experts disagree on 29% of comparisons; that noise floor bounds every score above.

Sit with the 71% before quoting the 47.6%. When qualified professionals disagree on nearly a third of blind comparisons, “expert quality” is a distribution rather than a line, and any model win rate carries that noise inside it. This is not a flaw in GDPval; it is the honest texture of professional work, where two seasoned practitioners legitimately rank the same two memos differently. The implication for interpretation is direct: the gap between Opus at 47.6 and a hypothetical 50% coin-flip parity is smaller than the disagreement band of the judges themselves. The right conclusion is “approaching parity, measured against a noisy human standard,” nothing sharper. Anyone who has run a model-validation function will recognize the pattern: the benchmark’s precision is capped by inter-rater reliability, so improving the judges is as valuable as improving the model.
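An agreement figure of this kind is cheap to compute once two or more graders have scored the same comparisons. The sketch below uses simple percent agreement over grader pairs; the data layout and names are hypothetical, and the paper's exact agreement definition may differ.

```python
from itertools import combinations

def pairwise_agreement(ratings):
    """Fraction of (grader pair, comparison) cases where both graders gave the same verdict.

    `ratings` maps grader name -> {comparison_id: verdict}. Only comparisons
    scored by both graders in a pair are counted.
    """
    agree = total = 0
    for a, b in combinations(ratings, 2):
        shared = ratings[a].keys() & ratings[b].keys()
        for cid in shared:
            total += 1
            agree += ratings[a][cid] == ratings[b][cid]
    if total == 0:
        raise ValueError("no overlapping comparisons between graders")
    return agree / total

# Made-up verdicts, for illustration only.
ratings = {
    "grader_1": {"c1": "model", "c2": "human", "c3": "human", "c4": "tie"},
    "grader_2": {"c1": "model", "c2": "model", "c3": "human", "c4": "human"},
}
agreement = pairwise_agreement(ratings)
print(f"inter-rater agreement: {agreement:.0%}")
```

Percent agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa is a reasonable swap if verdicts are lopsided.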

The automated grader earns a separate caveat. At 66% agreement it is a competent screen and a cheap way to iterate, while remaining five points worse than the noisy human ceiling it approximates. Using it to rank adjacent models is asking a blunt instrument for a fine cut. Using it to track one model’s trajectory over months, or to triage which occupations deserve expensive human grading, is the calibrated use.

Releasing the 220-task gold subset does one more quiet job: it makes vendor claims checkable. Any lab can now run its model against the same deliverables and the same grading service, and any buyer can replicate a quoted win rate before believing it. Public benchmarks rot through contamination and teaching-to-the-test, which limits this one’s half-life. While it lasts, an auditable common yardstick for professional work is something the field has not had before.

Where does the economics land?

The paper’s most practically interesting experiment is the oversight workflow. Rather than asking whether the model replaces the expert, it prices the loop a real deployment would run: the model drafts, the expert reviews, then either accepts or fixes and resubmits. Under that “try once, then fix if needed” regime, GPT-5 comes out roughly 1.12x faster and 1.18x cheaper than an unaided expert producing the same deliverable, with review and correction time charged against the model.

Figure: The oversight loop GDPval prices
Model drafts deliverable → Expert reviews → Accept, or Fix and resubmit → Net economics vs unaided expert
GPT-5 in this loop: roughly 1.12x faster, 1.18x cheaper, including review and corrections.

Those are thin margins. That is precisely what makes them believable. A 1.12x speedup and a 1.18x cost advantage, which work out to roughly 11% less time and 15% less cost per deliverable with human review fully costed, are the plain current state of frontier-model economics on real professional work: positive, modest, and entirely dependent on the review step staying cheap. The margin disappears whenever review takes as long as redoing the work, which is exactly what happens when the model fails in ways that are expensive to detect. The win rates and the agreement numbers above tell you when that risk is highest: in occupations where the model is weakest and where even experts disagree about quality, review costs balloon and the economics invert. The deployment frontier is not where the model wins comparisons; it is where its failures are cheap for a reviewer to catch.
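The sensitivity to review cost is easy to see in a toy model of the loop. This is a simplified sketch, not the paper's formula: expected loop time is draft plus review plus the chance of rejection times the fix time, and every input below is a hypothetical placeholder.

```python
def loop_speedup(expert_hours, draft_hours, review_hours, fix_hours, p_accept):
    """Speedup of draft-review-fix over unaided work (values above 1 favor the loop).

    Expected loop time = model draft + expert review + P(reject) * fix time.
    """
    expected = draft_hours + review_hours + (1 - p_accept) * fix_hours
    return expert_hours / expected

# Hypothetical inputs: an 8-hour deliverable, 40% of drafts accepted as-is,
# and a rejected draft costing a full redo.
cheap_review = loop_speedup(expert_hours=8, draft_hours=0.1, review_hours=1.0,
                            fix_hours=8, p_accept=0.4)
costly_review = loop_speedup(expert_hours=8, draft_hours=0.1, review_hours=4.0,
                             fix_hours=8, p_accept=0.4)

print(f"cheap review:  {cheap_review:.2f}x")
print(f"costly review: {costly_review:.2f}x")
```

With the acceptance rate held fixed, moving review from one hour to four flips the loop from a gain to a loss, which is the inversion described above.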

The compounding view matters more than the snapshot. Hold the oversight workflow fixed and let the win rate climb its linear trend: the review-and-fix share shrinks, the margins widen, and the loop that barely clears break-even today turns structurally profitable. The desks that will capture that compounding are the ones already running the loop, with the review infrastructure, the task libraries, and the grading habits in place when the capability arrives.

What should a desk copy from this?

For a finance team, GDPval’s transferable asset is the recipe rather than the score: the benchmark contains finance-sector tasks, but your desk’s actual work is its own occupation. Every validation function I have built or audited converged on the same lesson: the evaluation no vendor can sell you is the one against your own work product. GDPval shows precisely how to build it, and none of the steps are exotic.

Collect real deliverables rather than invented exercises: the last two years of investment memos, risk reports, client letters, and model documentation, with the reference files that accompanied them. Have your current models produce the same deliverables from the same inputs. Then grade blind, pairwise, using your own senior staff, without telling them which version came from the machine. The blind pairwise structure is the part most internal evals skip and the part carrying most of GDPval’s credibility: it prevents both the skeptic’s discount and the enthusiast’s halo, while producing an inter-rater agreement number that tells you how much to trust your own verdicts. If your seniors agree less than GDPval’s 71%, your grading rubric needs work before your model does.
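The blinding step is mostly bookkeeping. A minimal sketch, assuming each task record carries one human and one model deliverable under hypothetical field names: positions are randomized per task, graders see only "A" and "B", and the answer key stays with whoever runs the evaluation.

```python
import random

def make_blind_pairs(tasks, seed=0):
    """Return (grading_sheet, answer_key) with deliverable sources hidden.

    `tasks` is a list of dicts with keys "task_id", "human", "model"
    (hypothetical field names). A/B position is randomized per task.
    """
    rng = random.Random(seed)
    sheet, key = [], {}
    for t in tasks:
        order = ["human", "model"]
        rng.shuffle(order)
        sheet.append({"task_id": t["task_id"], "A": t[order[0]], "B": t[order[1]]})
        key[t["task_id"]] = {"A": order[0], "B": order[1]}
    return sheet, key

def unblind(task_id, choice, key):
    """Map a grader's choice ("A", "B", or "tie") back to its source."""
    return "tie" if choice == "tie" else key[task_id][choice]

# Placeholder deliverables, for illustration only.
tasks = [
    {"task_id": "memo-01", "human": "senior analyst memo", "model": "model memo"},
    {"task_id": "risk-02", "human": "senior risk report", "model": "model risk report"},
]
sheet, key = make_blind_pairs(tasks, seed=42)
print(sheet[0].keys())  # graders see only task_id, A, B
```

Stripping author names, file metadata, and house formatting from the deliverables themselves matters as much as hiding the labels, and is not handled here.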

Run it per task family rather than blended, the same capability-decomposition logic that made XFinBench actionable a month ago: a desk that knows the model wins on first-draft commentary and loses on covenant analysis can route work accordingly, where a blended score routes nothing. Re-run on every model change, because the 12.5-to-47.6 slope says any gating decision more than a couple of quarters old is stale. And price the oversight loop without flattery, charging review time against the model the way GDPval does, because the sell-side analyst evidence and the filings-extraction evidence agree on where the trap lies: outputs that look right and cost real review time to verify.
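A small sketch of why the per-family cut routes work where the blended number cannot. The family names and verdicts are invented for illustration.

```python
from collections import defaultdict

def win_rate_by_family(results):
    """Win-or-tie rate per task family from (family, verdict) records."""
    tally = defaultdict(lambda: [0, 0])  # family -> [wins_or_ties, total]
    for family, verdict in results:
        tally[family][1] += 1
        if verdict in ("model", "tie"):
            tally[family][0] += 1
    return {family: won / total for family, (won, total) in tally.items()}

# Made-up verdicts, for illustration only.
results = [
    ("commentary", "model"), ("commentary", "tie"),
    ("commentary", "human"), ("commentary", "model"),
    ("covenant_analysis", "human"), ("covenant_analysis", "human"),
    ("covenant_analysis", "model"), ("covenant_analysis", "human"),
]
by_family = win_rate_by_family(results)
blended = sum(v in ("model", "tie") for _, v in results) / len(results)

print(by_family)
print(f"blended: {blended:.0%}")
```

In this toy data the blended rate sits at 50% while the families split 75% and 25%, so the blended figure describes neither workflow.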

Scale is the objection that usually kills this proposal internally, so price it upfront. Thirty tasks across three deliverable families is enough for a first usable signal. The binding cost is senior time rather than engineering: GDPval’s hour-plus per blind comparison is the realistic budget line. Two seniors each grading thirty pairs is thirty-plus hours apiece at that rate, less if your deliverables are shorter than GDPval’s: budget a week or two of part-time effort, once. The task library compounds from there, each quarter’s real deliverables feeding the next round, while an automated screen takes over the triage only after it has been calibrated against your own seniors, the way GDPval calibrated its grader and published the five-point gap.
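The budget line is simple enough to sanity-check before the proposal goes to committee. The sketch assumes every grader scores every pair; the half-hour case is a hypothetical for shorter internal deliverables, not a GDPval figure.

```python
def grading_hours(n_pairs, n_graders, hours_per_comparison=1.0):
    """Return (total_hours, hours_per_grader) when every grader scores every pair."""
    per_grader = n_pairs * hours_per_comparison
    return per_grader * n_graders, per_grader

# One hour per comparison, the floor of GDPval's "hour-plus" figure.
total_gdpval, each_gdpval = grading_hours(n_pairs=30, n_graders=2, hours_per_comparison=1.0)

# Hypothetical: shorter deliverables that take half an hour to compare.
total_short, each_short = grading_hours(n_pairs=30, n_graders=2, hours_per_comparison=0.5)

print(f"at 1.0 h/pair: {total_gdpval:.0f} h total, {each_gdpval:.0f} h per senior")
print(f"at 0.5 h/pair: {total_short:.0f} h total, {each_short:.0f} h per senior")
```

Having both seniors grade every pair doubles the hours but is what produces the inter-rater agreement number; splitting the pairs between them halves the cost and loses it.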

The governance framing closes the loop. A model-risk file that says “the vendor’s model scores well on public benchmarks” documents nothing a regulator or an investment committee should accept. One that says “against our own deliverables, graded blind by our own seniors at 74% inter-rater agreement, the model wins 38% of comparisons on commentary and 11% on structuring work, re-measured quarterly” is an actual control. GDPval is the first public artifact that shows the full pattern at scale, with its own noise honestly measured.

The bottom line

GDPval converts the automation question from rhetoric to measurement: real deliverables, blind expert grading, win rates with the grader noise published alongside. The levels say frontier models approach parity on a meaningful slice of professional work while remaining far from it on the rest. The slope says the levels are perishable. The grading section says every number above carries a 29% expert-disagreement band, which is the benchmark being honest rather than weak. For a desk, the actionable move is not to quote any of these numbers. It is to steal the method, point it at your own work product, and let the win rate against your own seniors, re-measured every model generation, decide what gets delegated next.

GDPval’s real contribution is the protocol, blind pairwise grading of real deliverables with the noise published: steal it, run it on your own desk’s work, and re-measure every model generation, because the slope is the number that decides.
