Tim Frenzel

// Insight

AlphaEvolve: automated discovery, and why the evaluator is the whole game

Tags: evolutionary-search, code-generation, optimization, agents

Every so often a result arrives that is less interesting for what it did than for what its method implies. AlphaEvolve is one of those. It is a coding agent from Google DeepMind that pairs Gemini with automated evaluators in an evolutionary loop: the model proposes program variants, the evaluators score them, the best survive and become the parents of the next generation. Pointed at real problems, it found things people had missed, including a way to multiply 4x4 complex matrices with fewer scalar multiplications than the best method known since 1969. For a quant, the headline is not the algorithms. It is the template, because an LLM-plus-evaluator search loop is automated strategy discovery, and everything good and dangerous about it lives in the evaluator.

The mechanism

The idea is older than the language-model era, and seeing it work at this level is the news. You take a population of candidate programs, score each one against an objective, keep the best, mutate and recombine them to make the next generation, and repeat. That is evolutionary search, and its weakness was always the mutation step: random changes to code are almost always garbage, which means the search wastes most of its time on nonsense. AlphaEvolve replaces random mutation with a language model. Gemini proposes changes that are at least plausible, which makes each generation far more productive than blind variation ever was.

AlphaEvolve: the evolutionary loop
1. A program that solves the problem, plus an automated evaluator
2. Gemini proposes variants of the program
3. Run each variant through the evaluator, get a score
4. Keep the best, discard the rest
5. The survivors become parents of the next generation
6. Repeat for thousands of generations
The language model supplies plausible mutations instead of random ones, so each generation is productive. The evaluator is the only judge of fitness. Whatever it rewards is what the population evolves toward.
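The loop above can be sketched in a few lines. This is a toy under loud assumptions, not AlphaEvolve's implementation: candidates are lists of numbers rather than programs, `propose_variant` is a random perturbation standing in for the language model, and `evaluate` is a made-up objective. All names are hypothetical.

```python
import random

def evaluate(candidate):
    """Toy evaluator (assumption): higher is better, maximal at the target."""
    target = [3.0, -1.0, 2.0]
    return -sum((c - t) ** 2 for c, t in zip(candidate, target))

def propose_variant(parent, rng):
    """Stand-in for the LLM proposer: a small random perturbation."""
    return [c + rng.gauss(0.0, 0.3) for c in parent]

def evolve(seed, generations=200, population=20, survivors=5, rng_seed=0):
    rng = random.Random(rng_seed)
    parents = [seed]
    for _ in range(generations):
        # Propose variants from the current parents.
        children = [propose_variant(rng.choice(parents), rng)
                    for _ in range(population)]
        # Score parents and children together, so the best score never regresses.
        pool = parents + children
        pool.sort(key=evaluate, reverse=True)
        # Keep the best, discard the rest.
        parents = pool[:survivors]
    return parents[0]

seed = [0.0, 0.0, 0.0]
best = evolve(seed)
```

Note that `evaluate` is the only thing the loop ever consults. Swap in a different scoring function and the same code evolves toward whatever that function rewards.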

This is the direct descendant of FunSearch, DeepMind’s earlier system that paired an LLM with a systematic evaluator to make genuine mathematical discoveries, including new constructions for the cap set problem. FunSearch proved the shape worked on narrow mathematical objects. AlphaEvolve generalizes it: it evolves whole programs rather than small functions, across domains from pure math to production infrastructure. The lineage matters because it tells you this is not a one-off stunt. It is a maturing method, and a second working version is the point at which a practitioner should start asking what it means for their own field.

The results that matter

The results span pure mathematics and Google’s own infrastructure. The infrastructure ones are the proof that this is more than a demo. Start with the one a mathematician would single out.

Scalar multiplications to multiply two 4x4 complex matrices (lower is better)
Strassen (1969), applied recursively: 49
AlphaEvolve (2025): 48
One fewer multiplication, and the first improvement on this case in over five decades. The point is not the single operation saved. It is that a search loop found a better algorithm for a problem mathematicians had not improved since 1969.

Matrix multiplication is the most-optimized operation in computing, because nearly everything expensive eventually reduces to it. Strassen showed in 1969 that you could multiply matrices with fewer scalar multiplications than the schoolbook method: 7 instead of 8 for a 2x2 product. Applied recursively to a 4x4 matrix treated as a 2x2 grid of 2x2 blocks, his approach needs 7 x 7 = 49 multiplications, against 64 for the schoolbook method. AlphaEvolve found a way to do the complex-valued 4x4 case in 48. A single multiplication saved sounds trivial. It is not the saving that matters. It is that a fifty-six-year-old human result, in the single most studied operation in the field, was improved by an automated search. When the most-optimized corner of computing still has room a machine can find, the implication for less-optimized corners is hard to ignore.
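To make the 49 baseline concrete, here is a small sketch that applies Strassen's 2x2 scheme recursively to a 4x4 product and counts scalar multiplications. It reproduces only the baseline, not AlphaEvolve's 48-multiplication algorithm, and it uses integer test matrices (my choice, for exact comparison) rather than complex ones. The helper names are illustrative.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(M):
    """Split a square matrix into four equal quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B, counter):
    """Recursive Strassen for sizes that are powers of two; counter[0] counts scalar multiplications."""
    if len(A) == 1:
        counter[0] += 1
        return [[A[0][0] * B[0][0]]]
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    # Seven block products instead of eight.
    m1 = strassen(add(a11, a22), add(b11, b22), counter)
    m2 = strassen(add(a21, a22), b11, counter)
    m3 = strassen(a11, sub(b12, b22), counter)
    m4 = strassen(a22, sub(b21, b11), counter)
    m5 = strassen(add(a11, a12), b22, counter)
    m6 = strassen(sub(a21, a11), add(b11, b12), counter)
    m7 = strassen(sub(a12, a22), add(b21, b22), counter)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

def schoolbook(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[2, 0, 1, 3], [1, 4, 0, 2], [3, 1, 2, 0], [0, 2, 5, 1]]
count = [0]
C = strassen(A, B, count)
```

The comparison against `schoolbook` is itself a miniature evaluator: a candidate algorithm is accepted only if it returns the right product, and is scored by how few multiplications it used.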

The production results are where the method earns its keep. AlphaEvolve discovered a scheduling heuristic for Google’s data centers that recovers, on a continuous basis, 0.7% of the company’s worldwide compute. At Google’s scale that fraction is enormous in absolute terms. The heuristic has run in production for over a year. It sped up a matrix-multiplication kernel inside Gemini’s own training stack by 23%, which translated into roughly a 1% reduction in Gemini’s total training time. It found an implementation of a FlashAttention kernel that ran up to 32.5% faster. And on a battery of open mathematical problems, it rediscovered the best known result in roughly 75% of cases and improved on the best known in about 20%, including a better configuration for the kissing-number problem in eleven dimensions.

That spread of numbers carries a lesson on its own. A 23% kernel speedup becoming a 1% training-time reduction is a reminder that local wins dilute as they propagate through a system. The place to apply the search is the bottleneck rather than whatever is easiest to measure. And the 0.7% compute figure, trivial-sounding and actually vast, is a reminder that the value of an optimization lies in its absolute scale rather than its percentage. There is a self-reinforcing loop hiding in these results too. The kernels AlphaEvolve speeds up are the kernels that train the models that run AlphaEvolve. A system that improves its own substrate is the kind of thing worth watching carefully, in both the hopeful and the wary sense.
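The dilution is ordinary Amdahl's-law arithmetic, and it can be run backwards to see how small the kernel's share of total time must have been. The sketch below assumes that "23% faster" means the kernel runs at 1.23 times its previous speed. That reading is mine, not something the source states, and the implied share is an inference rather than a reported figure.

```python
def total_time_reduction(component_share, component_speedup):
    """Fraction of total runtime saved when a component that takes
    `component_share` of the time runs `component_speedup` times faster."""
    return component_share * (1.0 - 1.0 / component_speedup)

def implied_share(total_reduction, component_speedup):
    """Invert the relation: what share of runtime must the component have had?"""
    return total_reduction / (1.0 - 1.0 / component_speedup)

# Assumption: 23% faster is read as a 1.23x speedup of the kernel.
share = implied_share(0.01, 1.23)
print(f"implied kernel share of training time: {share:.1%}")
```

Under that assumption the kernel accounts for roughly 5% of training time, which is why a large local win shows up as a small global one.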

The constraint that defines everything

Now the part that matters more than any single result. AlphaEvolve only works where progress is machine-measurable. Every one of its wins has the same shape: a problem with a clean, automatic evaluator that returns a trustworthy score for any candidate. The matmul algorithm is verified by checking it computes the right product. The kernel speedup is measured by running it. The math results are checked against the problem’s formal definition. The evaluator is not a detail of the method. It is the method.

This puts AlphaEvolve in the same family as DeepSeek-Prover-V2, which trains against a proof checker that cannot be charmed. Both systems work because correctness is decided by a machine rather than estimated by a judge. The difference is that a proof checker is sound by construction, while an AlphaEvolve evaluator is something you build, and what you build is where the risk enters.

Where the method applies, and where it cannot
A candidate solution
Is there an automatic evaluator that scores it correctly and completely?
- Yes, sound and complete: AlphaEvolve thrives, it optimizes against a true objective
- Yes, but the evaluator has gaps or leaks: AlphaEvolve finds the gaps, it optimizes the measurement instead of the goal
- No machine-checkable objective: the method does not apply
The quality of the evaluator is the ceiling on the result
AlphaEvolve is only as good as the evaluator it optimizes against. A sound, complete evaluator yields real discoveries. A leaky one yields a program that games the leak with superhuman efficiency.

The middle branch is the one to fear, being the default rather than the exception. An evaluator with a gap is not a neutral imperfection. It is an attack surface, and AlphaEvolve is the most patient, tireless attacker imaginable. If there is a way to score well without solving the problem, a search running thousands of generations will find it. This is the old reward-hacking story, with the volume turned up by the efficiency of the search. The cleaner your evaluator, the more miraculous the result. The leakier your evaluator, the more elaborately you will be fooled.

The quant translation

Here is why a quant should read this closely. Strip AlphaEvolve to its frame and you have a system that proposes strategies as code and keeps the ones that score best against an objective. Swap in a backtest as the objective and you have described automated strategy discovery: point the loop at a clean performance metric and it could evolve execution heuristics, factor expressions, or whole trading rules, with the same superhuman efficiency it brought to matmul kernels.

Point the loop at a backtest
1. A trading strategy expressed as code
2. Gemini proposes variants
3. Score each variant on a backtest, the evaluator
4. Keep the best, evolve the next generation
5. A strategy that maximizes the backtest
But the backtest IS the evaluator, so any leak in it is what the strategy actually learns
The same loop that found a better matmul algorithm will evolve a strategy against your backtest. If the backtest leaks future information, ignores transaction costs, or overfits the sample, the loop optimizes the leak. It overfits the simulator with the same efficiency it optimizes anything else.

This is the alpha-jungle warning from LLM-guided factor search, sharpened to a point. There, the hazard was a search trying thousands of formulas and selecting the luckiest. Here it is worse, because the search is not just selecting on luck. It is actively shaping programs to exploit whatever your evaluator measures. If your backtest has any flaw (lookahead leakage, survivorship bias, unmodeled costs, an unrealistic fill assumption), AlphaEvolve will not stumble into it. It will engineer toward it and hand you a strategy with a glorious backtest and no future. The efficiency that makes the method valuable on a clean problem makes it dangerous on a dirty one.

The failure is concrete enough to picture. Suppose the backtest fills orders at the closing price and ignores the spread, a common shortcut. A human researcher might lean on that gently and move on. AlphaEvolve will discover that trading the most illiquid names, where the real spread is widest and the printed close is least achievable, scores best of all, and will evolve a strategy that lives entirely in the gap between the simulator and the market. The backtest will look superb. Every basis point of the edge will be a cost the simulator forgot to charge. Nothing in the output announces the problem, which is what makes it lethal: you get a beautiful equity curve whose beauty is the artifact, discovered only when real money meets the spread the backtest waved away.
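A toy version of that failure, with invented numbers: each trade is a round trip with a gross return measured at the printed close and a half-spread that a real fill would pay on each side. The figures and names are hypothetical, chosen only to show how the ranking flips once the simulator charges the spread.

```python
def backtest_pnl_bps(trades, charge_spread):
    """Sum round-trip PnL in basis points.
    trades: list of (gross_return_bps, half_spread_bps).
    A round trip crosses the spread twice, once in and once out."""
    total = 0
    for gross_bps, half_spread_bps in trades:
        cost = 2 * half_spread_bps if charge_spread else 0
        total += gross_bps - cost
    return total

# Hypothetical books: modest edge in liquid names, large apparent edge in illiquid ones.
books = {
    "liquid": [(5, 1)] * 100,
    "illiquid": [(40, 25)] * 100,
}

naive = {name: backtest_pnl_bps(t, charge_spread=False) for name, t in books.items()}
honest = {name: backtest_pnl_bps(t, charge_spread=True) for name, t in books.items()}

print("fills at close, no spread:", naive)
print("spread charged:           ", honest)
```

A loop scored by the naive evaluator is pushed toward the illiquid book. Scored by the honest one, the same book is a steady loss, and the search has no reason to go there.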

The discipline this demands is the one a desk should already have, applied with new severity. Treat the evaluator as the deliverable. Before you let any automated loop optimize against a backtest, the backtest has to be leakage-proof, cost-aware, and tested out of sample, because the loop will find every shortcut you left in it. The work is no longer in proposing strategies, which the loop now does for free. The work is entirely in building a simulator honest enough that maximizing it means maximizing real performance. That was always the hard part of quant research. AlphaEvolve makes it the only part.
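One cheap check on the leakage side can be sketched as a perturbation test: compute a signal at time t, corrupt every price after t, and recompute. A signal that uses only past data cannot change. This is a minimal sketch with hypothetical names, and it catches only leaks that respond to this particular perturbation, so passing it is evidence rather than proof.

```python
def find_lookahead(signal_fn, prices):
    """Return the time indices where the signal changes when future prices are altered."""
    flagged = []
    for t in range(1, len(prices) - 1):
        baseline = signal_fn(prices, t)
        # Keep history up to and including t, multiply everything after by 10.
        tampered = prices[: t + 1] + [p * 10.0 for p in prices[t + 1:]]
        if signal_fn(tampered, t) != baseline:
            flagged.append(t)
    return flagged

def honest_signal(prices, t):
    """Momentum on information available at t."""
    return 1 if prices[t] > prices[t - 1] else -1

def leaky_signal(prices, t):
    """Same idea, but it peeks at the next bar."""
    return 1 if prices[t + 1] > prices[t] else -1

prices = [100.0, 101.0, 100.5, 99.0, 98.0]
honest_flags = find_lookahead(honest_signal, prices)
leaky_flags = find_lookahead(leaky_signal, prices)
```

Run as a gate before the search starts, a check like this turns one class of leak from something the loop can exploit into something the evaluator rejects outright.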

It helps to name the cases where the evaluator is clean by construction, because those are the places the tool earns its keep with no catch. Code with a machine-checkable answer is the clearest. A pricing or risk function can be optimized against analytic benchmarks: a put-call parity check, a known closed-form limit, a Greek that must carry a fixed sign. These are the same kind of machine-graded target that makes formal verification trustworthy. Execution is another, where a simulator that honestly models queue position, latency, and impact makes beating the simulator the same as beating the real venue. The pattern is that the evaluator is sound when it measures a mechanical, physically grounded quantity rather than a forecast of the future. Alpha discovery sits at the dangerous end of that spectrum, because the backtest is a forecast wearing the costume of a measurement. Knowing which end you are on is the first decision, before you let the loop run at all.
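As an illustration of such a machine-graded target, the sketch below checks any candidate European option pricer against put-call parity and the sign of the call delta. The Black-Scholes reference, the test cases, and the deliberately broken pricer are my own illustrative choices. Passing these checks is necessary for correctness, not sufficient.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_price(spot, strike, rate, vol, expiry, kind):
    """Black-Scholes price of a European call or put (no dividends)."""
    sqrt_t = math.sqrt(expiry)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * expiry) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    disc = math.exp(-rate * expiry)
    if kind == "call":
        return spot * norm_cdf(d1) - strike * disc * norm_cdf(d2)
    return strike * disc * norm_cdf(-d2) - spot * norm_cdf(-d1)

def broken_price(spot, strike, rate, vol, expiry, kind):
    """Deliberately wrong candidate: inflates the put by 1%."""
    price = bs_price(spot, strike, rate, vol, expiry, kind)
    return price * 1.01 if kind == "put" else price

def passes_checks(pricer, tol=1e-9):
    """Evaluator: put-call parity must hold and the call delta must be positive."""
    cases = [(100.0, 90.0, 0.03, 0.2, 1.0), (100.0, 110.0, 0.01, 0.35, 0.5)]
    for spot, strike, rate, vol, expiry in cases:
        call = pricer(spot, strike, rate, vol, expiry, "call")
        put = pricer(spot, strike, rate, vol, expiry, "put")
        # Parity: C - P = S - K * exp(-rT)
        parity_gap = (call - put) - (spot - strike * math.exp(-rate * expiry))
        if abs(parity_gap) > tol:
            return False
        # Sign check: a call must gain value when the spot rises.
        if pricer(spot * 1.01, strike, rate, vol, expiry, "call") <= call:
            return False
    return True

reference_ok = passes_checks(bs_price)
broken_ok = passes_checks(broken_price)
```

Neither check involves a forecast. Both are identities or inequalities that hold regardless of what the market does next, which is what makes this end of the spectrum safe to optimize against.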

What to take from it

Three things to carry out of this. First, the method is real and it is maturing. FunSearch was the proof of concept and AlphaEvolve is the production system. An LLM-plus-evaluator loop that discovers genuine improvements in matrix multiplication while shaving real compute off a hyperscaler is not a thing to wave away. The frame of automated discovery against a machine-checkable objective is going to spread to every field that can write one.

Second, the value is bounded entirely by the evaluator. A sound, complete, leak-free evaluator turns the loop into a discovery engine. A flawed one turns it into a machine for producing convincing artifacts. The intelligence of the proposer matters far less than the integrity of the judge, which is a humbling inversion of where most people think the magic lives.

Third, for quant work the message is specific and a little uncomfortable. The barrier to automated strategy discovery was never the generation of ideas. It was, and remains, the validation. AlphaEvolve removes the generation barrier and leaves the validation barrier exactly where it was, while raising the stakes, because a tireless optimizer will exploit a weak backtest faster and more thoroughly than any human researcher ever could. The same scrutiny I would bring to any backtest that suddenly looks better is now the precondition for using the tool at all. Build the leakage-proof evaluator first. Only then let the loop loose. The method rewards a clean objective with discovery and punishes a dirty one with expensive self-deception, and which one you get is decided before the search ever runs.

AlphaEvolve pairs Gemini with an automated evaluator and evolves real discoveries, including a 4x4 matmul algorithm better than any since 1969. The template is automated strategy discovery. The lesson is severe: the loop optimizes whatever the evaluator measures with superhuman efficiency. Point it at a leaky backtest and it engineers toward the leak. The evaluator is the whole game.
