// Insight

FrontierMath: the math benchmark that is not saturated, and what that tells a quant

November 2, 2024 · 6 min read

benchmark · math-reasoning · evaluation

The best-known math benchmarks are effectively solved. The leading models score near-perfect on GSM8K and MATH, which is why a new headline arrives every month claiming math is done. FrontierMath, from Epoch AI, is the exception that puts those headlines in perspective. The best current models solve under 2% of its problems. That gap is the most useful number a quant can hold onto when deciding how far to trust an LLM on real quantitative work.

FrontierMath is a few hundred original, unpublished problems, authored and peer-reviewed by expert mathematicians, spanning undergraduate difficulty up to genuine research-level mathematics. Two design choices make it credible. The problems are new. A model cannot have memorized them from training data. And the answers are automatically verifiable, so scoring is not a matter of a generous judge.
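To make "automatically verifiable" concrete, here is a minimal sketch of exact-match scoring. It assumes every answer is a single integer or rational number, which is a simplification, and the function names are hypothetical. It is not Epoch AI's actual harness, only an illustration of why a machine-checked answer leaves no room for a generous judge.

```python
from fractions import Fraction


def parse_answer(text: str) -> Fraction:
    """Parse an integer, decimal, or a/b rational exactly; raises on anything else."""
    return Fraction(text.strip())


def is_correct(submitted: str, reference: str) -> bool:
    """Exact equality only: no tolerance, no partial credit."""
    try:
        return parse_answer(submitted) == parse_answer(reference)
    except (ValueError, ZeroDivisionError):
        # Unparseable or malformed answers are simply wrong.
        return False


def accuracy(submissions: dict[str, str], references: dict[str, str]) -> float:
    """Share of reference problems answered exactly; missing answers count as wrong."""
    if not references:
        raise ValueError("no reference answers supplied")
    solved = sum(
        is_correct(submissions.get(problem_id, ""), reference)
        for problem_id, reference in references.items()
    )
    return solved / len(references)
```

An answer of `0.3333` against a reference of `1/3` scores zero here, as does a free-text answer like `about 7`. That strictness is the point.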

[Figure: Best-model accuracy on solved benchmarks vs FrontierMath (%). Representative frontier-model scores: on GSM8K and MATH the strongest models are near-perfect; on FrontierMath no model exceeds 2%.]

The same six models that ace the easy sets (o1-preview, o1-mini, GPT-4o, Claude 3.5 Sonnet, Grok 2 Beta, and Gemini 1.5 Pro) collapse here. None of them clears 2%. The contrast is the whole message: saturation on GSM8K and MATH measures how good models are at problems that look like their training data, and tells you almost nothing about genuinely hard, novel reasoning.

Why a quant should care about a math benchmark

Because the question behind it is the one a model-risk review actually asks: how much do I trust this model on a derivation I cannot easily check myself? GSM8K saturation invites a dangerous answer. If the model is near-perfect on math, surely it can handle my pricing derivation, my risk calculation, my factor algebra. FrontierMath says the confident answer is the wrong one. The model is excellent at math it has effectively seen before, and close to useless on math that is genuinely new and hard.

That distinction maps directly onto a desk’s use of these tools. Routine, well-trodden calculations, the kind that resemble a million worked examples online, are where an LLM is genuinely strong and genuinely time-saving. A novel multi-step derivation, an unusual stochastic-calculus argument, a bespoke proof of a constraint, is exactly the territory FrontierMath shows models cannot yet handle. Knowing which side of that line your task sits on is the difference between a useful assistant and a confident, wrong one.

The contamination point, which is the real one

The deeper value of FrontierMath is methodological. It is a lesson a quant already knows in a different language. A benchmark a model may have trained on measures memorization rather than capability, just as an in-sample backtest measures fit rather than edge. FrontierMath is the out-of-sample test for reasoning: problems that were unpublished, so the model has almost certainly not seen them, scored by a machine that cannot be charmed. The under-2% result is what genuine out-of-sample performance on hard reasoning looks like right now.

This is why a saturated benchmark should make you more skeptical, not less. When a model scores 95% on a public math set, the first question is how much of that set leaked into its training rather than how smart the model is. The benchmarks that stay hard are the ones telling you the truth.

Drawing the line for a desk

The practical question FrontierMath forces is where exactly an LLM’s math ability stops, because that line governs how you use one in a pipeline. The benchmark gives the shape of the answer. Models are strong on problems that resemble their vast training data and weak on problems that are genuinely novel and hard. The work is to map your own tasks onto that line.

Some quant tasks sit safely on the strong side. Reformatting a calculation, applying a standard option-pricing formula, writing the code for a well-known estimator, summarizing a derivation that appears in every textbook: these resemble a million worked examples. An LLM handles them well, with a verification check as a backstop. The model is doing pattern completion on familiar material, which is its strength.
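The "verification check as a backstop" can be as cheap as an identity the output must satisfy. Below is a minimal sketch, assuming a European option on a non-dividend-paying asset with continuous compounding. `bs_price` stands in for code a model might write, and `passes_parity` tests put-call parity, which must hold however the prices were produced. Both names and the tolerance are illustrative assumptions.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_price(spot, strike, rate, vol, maturity, kind="call"):
    """Black-Scholes price of a European call or put (no dividends)."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol**2) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discounted_strike = strike * math.exp(-rate * maturity)
    if kind == "call":
        return spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    if kind == "put":
        return discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)
    raise ValueError(f"unknown option kind: {kind!r}")


def passes_parity(pricer, spot, strike, rate, vol, maturity, tol=1e-9) -> bool:
    """Check put-call parity: C - P == S - K * exp(-r * T), within tol."""
    call = pricer(spot, strike, rate, vol, maturity, "call")
    put = pricer(spot, strike, rate, vol, maturity, "put")
    forward_value = spot - strike * math.exp(-rate * maturity)
    return abs((call - put) - forward_value) <= tol
```

Parity catches inconsistencies between the call and put branches, but not an error common to both, such as a wrong `d1`. So it complements a comparison against a trusted reference value and does not replace one.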

Other tasks sit on the weak side. They are the ones where the confident answer is most dangerous. A bespoke proof that a risk constraint holds, an unusual stochastic-calculus argument, a novel derivation with no close precedent: these are FrontierMath-shaped. The under-2% result says the model will often be wrong while sounding completely sure. That combination, wrong and confident, is the worst case for a number that feeds a decision.

The discipline that follows is the one a model-risk function already runs. Classify the task before you route it. Send the well-precedented work to the model with a checker, and keep a human on the novel, hard work until the contamination-resistant benchmarks say otherwise. The mistake is to read a 95% on a public math set as permission to trust the model everywhere. FrontierMath is the correction. The public score measures the easy half. The hard half is where your most consequential derivations actually live.
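The classify-then-route rule is simple enough to write down. This is a sketch under stated assumptions: the `Task` fields and the `Route` names are hypothetical, and the choice to send well-precedented work that has no automatic check to a human is my own conservative default, not something the benchmark dictates.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    MODEL_WITH_CHECKER = "model_with_checker"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class Task:
    description: str
    has_close_precedent: bool  # resembles standard, widely published material
    has_automatic_check: bool  # an independent check can verify the output


def route(task: Task) -> Route:
    """Send a task to the model only if it is well-precedented AND checkable."""
    if task.has_close_precedent and task.has_automatic_check:
        return Route.MODEL_WITH_CHECKER
    # Novel work, or work nobody can verify automatically, stays with a human.
    return Route.HUMAN_REVIEW
```

The judgment lives in how honestly the two flags are set. The function only makes sure the classification happens before the routing, not after a wrong number has already shipped.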

Add one more lens, because the result reframes how to read every other model announcement. When a lab reports a new high on a public benchmark, the FrontierMath gap tells you which question to ask first: is this a benchmark the model could have trained on? If yes, the score measures how well the lab covered that benchmark’s distribution, which is a real engineering achievement and a poor guide to novel reasoning. If the benchmark is genuinely held out and hard, the score means much more. The skill a quant needs is to read benchmarks with this distinction in mind rather than dismiss them.

The same lens applies inside your own shop. If you build an internal eval to decide whether to trust an LLM on a class of quant tasks, the contamination question is the first one to settle. An eval drawn from public problems the model may have seen tells you almost nothing. An eval built from your own proprietary problems, or from genuinely novel ones, tells you what you actually need to know. FrontierMath is the public proof of a principle you should apply privately: the only score worth trusting is the one the model could not have studied for.
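For an internal eval, the first step is to keep the two populations apart in the scoring, so a strong public-set score cannot mask a weak held-out one. The sketch below assumes each graded result is tagged with its source. `EvalResult`, the `"public"` and `"held_out"` labels, and `contamination_gap` are hypothetical names for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    problem_id: str
    source: str    # assumed labels: "public" or "held_out"
    correct: bool


def accuracy_by_source(results: list[EvalResult]) -> dict[str, float]:
    """Accuracy computed separately for each problem source."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for result in results:
        totals[result.source] += 1
        hits[result.source] += int(result.correct)
    return {source: hits[source] / totals[source] for source in totals}


def contamination_gap(results: list[EvalResult]) -> float:
    """Public accuracy minus held-out accuracy; a large gap is a warning sign."""
    scores = accuracy_by_source(results)
    if "public" not in scores or "held_out" not in scores:
        raise ValueError("need results from both 'public' and 'held_out' sources")
    return scores["public"] - scores["held_out"]
```

A large positive gap does not prove leakage, since public problems may also just be easier. It does tell you which of the two numbers to base a trust decision on.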

How I would use this

As a calibration tool rather than a leaderboard. Before you let an LLM do quantitative work in a research or risk pipeline, classify the task honestly: is it routine and well-precedented, or novel and hard? Route the first kind to the model with a verification check, and keep a human on the second kind until the benchmarks that resist contamination say otherwise. FrontierMath does not tell you LLMs are bad at math. It tells you precisely where their math ability stops, which is exactly what you need to know before you trust one with a number. That single piece of knowledge, where the ability stops, is worth more to a risk function than another point on a leaderboard it was never measuring honestly. Calibration like that is the whole job of an evaluation.

A saturated benchmark measures memorization; a hard one measures reasoning. FrontierMath’s under-2% result is the sober gauge of how far to trust an LLM on math it has not seen before.
