GEPA: improving an agent by reading its traces, not its gradients
The expensive way to make an LLM system better is to fine-tune it with reinforcement learning: run it thousands of times, score the runs, and nudge the weights toward the ones that scored well. GEPA proposes a cheaper path that is also, on its benchmarks, a better one. Instead of nudging weights from scalar rewards, it reads the system’s own execution traces, reflects on them in natural language to work out what went wrong, and edits the prompt accordingly. GEPA beats a reinforcement-learning baseline by up to 20% while using up to 35 times fewer rollouts, by treating each failed run as something to be understood rather than just scored.
How it works
The loop is the idea. GEPA runs the system on a task and collects the full trajectory: the reasoning steps, the tool calls, the intermediate outputs, the final answer. It then feeds that trace back to a language model and asks it to reflect, in words, on where the run went wrong and why. That reflection becomes a proposed edit to the prompt, which is tested empirically. Edits that help are kept, edits that do not are discarded, and the process repeats.
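A minimal sketch of that loop, under loud assumptions: `optimize_prompt`, `run_system`, and `reflect` are hypothetical names of mine, not GEPA's API, and the toy stand-ins replace what would be real LLM calls. For brevity the sketch holds a single best prompt, which is simpler than what GEPA itself keeps between rounds.

```python
from typing import Callable, List, Tuple

# All names here are hypothetical stand-ins, not GEPA's actual API.
Trace = str

def optimize_prompt(
    seed_prompt: str,
    tasks: List[str],
    run_system: Callable[[str, str], Tuple[Trace, float]],  # (prompt, task) -> (trace, score)
    reflect: Callable[[str, List[Trace]], str],             # (prompt, traces) -> proposed prompt
    rounds: int = 5,
) -> Tuple[str, float]:
    """Keep a proposed prompt edit only if it raises the mean score."""

    def evaluate(prompt: str) -> Tuple[List[Trace], float]:
        results = [run_system(prompt, task) for task in tasks]
        traces = [trace for trace, _ in results]
        mean = sum(score for _, score in results) / len(results)
        return traces, mean

    best_prompt = seed_prompt
    traces, best_score = evaluate(best_prompt)
    for _ in range(rounds):
        candidate = reflect(best_prompt, traces)          # diagnosis -> proposed edit
        cand_traces, cand_score = evaluate(candidate)     # test the edit empirically
        if cand_score > best_score:                       # keep it only if it helps
            best_prompt, best_score, traces = candidate, cand_score, cand_traces
    return best_prompt, best_score

# Toy stand-ins so the sketch runs without a model: the "system" only
# handles dates correctly when the prompt names the format.
def toy_run(prompt: str, task: str) -> Tuple[Trace, float]:
    if "ISO 8601" in prompt:
        return f"{task}: parsed date as YYYY-MM-DD", 1.0
    return f"{task}: misread the date format", 0.0

def toy_reflect(prompt: str, traces: List[Trace]) -> str:
    # A real reflector would be an LLM call that reads the traces.
    if any("misread the date format" in t for t in traces):
        return prompt + " Dates are ISO 8601 (YYYY-MM-DD)."
    return prompt

best, score = optimize_prompt(
    "Extract the filing date.", ["doc-1", "doc-2"], toy_run, toy_reflect
)
print(best)
print(score)
```

In the toy run the first reflection proposes the date-format instruction, the edit is kept because the score rises, and later rounds propose nothing new.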
The reason this is sample-efficient is the information content of the signal. A reinforcement-learning reward is one number per run: better or worse, by this much. A natural-language reflection on the same run is a paragraph: it failed because it misread the date format in step two, it confused the parent company with the subsidiary, it stopped before checking the second source. That diagnosis carries orders of magnitude more information than a scalar, and more information per run means fewer runs to converge. The 35x rollout reduction is not a trick. It is what you get when each trial teaches a sentence instead of a number.
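To make the contrast concrete, here is a small sketch of a grader that returns both signals for a date-extraction task. `Feedback` and `grade_date_extraction` are hypothetical names, and the US-style date check is one assumed failure mode chosen for illustration, not something taken from GEPA.

```python
from datetime import datetime
from typing import NamedTuple

class Feedback(NamedTuple):
    score: float  # what a scalar reward would report
    text: str     # what a reflective optimizer can additionally read

def grade_date_extraction(predicted: str, expected_iso: str) -> Feedback:
    """Hypothetical grader: same score for every miss, different diagnosis."""
    if predicted == expected_iso:
        return Feedback(1.0, "Correct.")
    # Diagnose one assumed failure: right date, written US-style (MM/DD/YYYY).
    try:
        as_us = datetime.strptime(predicted, "%m/%d/%Y").strftime("%Y-%m-%d")
    except ValueError:
        as_us = None
    if as_us == expected_iso:
        return Feedback(
            0.0,
            f"Right date, wrong format: got {predicted!r}, expected ISO {expected_iso!r}.",
        )
    return Feedback(0.0, f"Wrong date: got {predicted!r}, expected {expected_iso!r}.")

a = grade_date_extraction("03/04/2024", "2024-03-04")
b = grade_date_extraction("2023-03-04", "2024-03-04")
print(a.score, "|", a.text)
print(b.score, "|", b.text)
```

Both runs score zero, so a scalar reward cannot tell them apart. The text can: one failure calls for a formatting instruction, the other for a fix to the extraction itself.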
A race engineer rather than a dyno sweep
The distinction that makes GEPA click for me is one I know from motorsport. There are two ways to improve a car. One is to sweep the setup space: try a thousand combinations of wing, camber, and tire pressure, lap each one, and keep whatever is fastest. The other is to bring the car in, read the telemetry with the engineer, and reason about why it is understeering into the slow corners, then change the one thing the data says to change. The first is reinforcement learning. The second is GEPA. A good race engineer does not randomly perturb the car and time the result. They read the trace of the lap, form a hypothesis about the cause, and make a directed change. That approach converges in a handful of runs because each run is diagnosed rather than merely scored. The diagnosis is what tells you what to change next. GEPA is that engineer, applied to a prompt.
The other piece worth naming is what GEPA keeps between rounds. Rather than greedily holding the single best prompt, it maintains a Pareto frontier of candidates, the ones that are best at different parts of the task. A prompt that nails the numeric questions and one that nails the textual ones are both kept, and GEPA synthesizes their complementary strengths instead of forcing a choice. That is what stops the search collapsing onto a prompt that is good on average and excellent at nothing.
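The selection idea can be sketched with textbook Pareto dominance over per-instance scores. This is an illustration under assumptions: `pareto_frontier`, the candidate names, and the scores are all made up, and GEPA's exact selection rule may differ from the plain dominance check used here.

```python
from typing import Dict, List

def pareto_frontier(scores: Dict[str, List[float]]) -> List[str]:
    """Return the candidates that no other candidate dominates.

    scores[name][i] is that candidate's score on task instance i.
    A dominates B if A is at least as good on every instance and
    strictly better on at least one.
    """
    def dominates(a: List[float], b: List[float]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [
        name
        for name, own in scores.items()
        if not any(
            dominates(other, own)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]

# Made-up scores: instances 0-1 are numeric questions, 2-3 are textual.
candidates = {
    "numeric_specialist": [0.9, 0.9, 0.3, 0.2],
    "text_specialist":    [0.3, 0.2, 0.9, 0.8],
    "generalist":         [0.6, 0.6, 0.6, 0.6],
    "weak_variant":       [0.5, 0.6, 0.3, 0.2],
}

frontier = pareto_frontier(candidates)
best_by_mean = max(candidates, key=lambda n: sum(candidates[n]) / len(candidates[n]))
print(frontier)
print(best_by_mean)
```

A greedy search on the mean would keep only the generalist and throw both specialists away. The frontier keeps all three and drops only the variant that another candidate matches or beats on every instance.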
The results
Across six tasks, GEPA beats the GRPO reinforcement-learning baseline by 6% on average and as much as 20% at the high end, and beats MIPROv2, a strong prompt optimizer, by more than 10%, including a 12-point gain on the AIME-2025 math problems. The same reflective search also works as an inference-time strategy for optimizing code. The accuracy gains are real. The headline I would put on it is the 35x, because efficiency is what decides whether a method is usable on the data a desk actually has.
Why a desk should care
Here is the regime where this matters. Reinforcement-learning fine-tuning needs a lot of labeled examples and a lot of rollouts, and most financial-analysis tasks have neither. You are tuning a pipeline to spread a metric, classify a disclosure, or extract a figure, and your labeled set is a few dozen carefully checked examples, not the tens of thousands RL wants. In that regime RL is not a practical option, and GEPA’s sample efficiency is exactly what makes optimization feasible at all. When you have dozens of examples rather than thousands, a method that learns from a written diagnosis per run rather than a scalar reward per run is the difference between being able to tune the system and not.
There is a governance benefit that comes free with the approach. What GEPA produces is a prompt, in plain language, that you can read, diff, and put in front of a reviewer. An RL-fine-tuned model hides its improvement in weight updates nobody can inspect. A GEPA-optimized system hides nothing: the change is text, the reasoning behind it is text, and so is the trace that motivated it. For a desk that has to defend why its pipeline behaves the way it does, an optimizer whose entire output is auditable is worth more than a few points of benchmark score.
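Because the artifact is text, the review step can use ordinary tooling. A small sketch with Python's standard `difflib`; the two prompt versions and the labels `prompt_v1` and `prompt_v2` are invented for illustration.

```python
import difflib

# Two invented prompt versions: before and after one accepted edit.
before = """Extract the filing date from the document.
Return only the date."""

after = """Extract the filing date from the document.
Dates in the source are ISO 8601 (YYYY-MM-DD); do not reinterpret them.
Return only the date."""

# unified_diff yields the same format a reviewer sees in a code review.
diff = list(
    difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="prompt_v1",
        tofile="prompt_v2",
        lineterm="",
    )
)
print("\n".join(diff))
```

The diff shows a single added instruction, which is the whole of the change a reviewer has to sign off on.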
The honest limits
GEPA optimizes the prompt rather than the model. It cannot add a capability the base model lacks, in the same way that reinforcement learning sharpens rather than expands what a model can do. It moves the model’s existing ability to where the task needs it. It also leans on the quality of the reflection, which means a weak model writing the diagnoses produces weak edits, with the method inheriting whatever blind spots the reflecting model has. And it still needs an evaluation signal to test candidates against. The same discipline applies as everywhere else: a prompt tuned against a flawed metric is tuned to the flaw. None of that dents the core result. Reading the trace beats scoring the run, and on the small-data tasks that fill a research desk’s actual workload, reading the trace is often the only option that works.
GEPA improves an LLM system by reflecting on its execution traces in natural language and editing the prompt, beating a reinforcement-learning baseline by up to 20% with up to 35x fewer rollouts. A written diagnosis per run carries far more information than a scalar reward, which is why it converges on the dozens of examples a desk actually has. The optimized prompt is plain text a reviewer can read.