// Insight

AutoResearch: a hundred experiments while you sleep

March 21, 2026 · 6 min read

Tags: agents, automated-research, Karpathy

Two weeks ago Karpathy released autoresearch. The star count, past 21,000 within days, measured the recognition: this is the automated-researcher idea reduced to its irreducible core. An agent reads the training code, proposes one change, runs a training job capped at exactly five minutes of wall-clock, keeps the change if the evaluation metric improved, reverts it if not, and repeats, indefinitely, unattended. Roughly twelve experiments an hour; a hundred while the researcher sleeps. No paper, no benchmark suite, a few hundred lines of Python wrapped around nanochat, plus a sharper statement of when autonomous research loops work than most of the literature that studies them.

The three-file architecture is where the design thinking lives. It is a permission model wearing a directory listing. prepare.py holds data preparation and utilities; the agent never touches it. train.py is the experimental surface; the agent edits it freely, architecture, optimizer, hyperparameters, anything. program.md is the human’s instruction file, the research-program specification that steers what kinds of changes the agent explores.
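The permission model can be made mechanical. A minimal sketch, assuming a harness that inspects the agent's patch before running it: the three file names come from the repository, while `Trust`, `PERMISSIONS`, `check_edit`, and `rejected_paths` are hypothetical names introduced here for illustration and are not part of autoresearch.

```python
from enum import Enum
from pathlib import PurePosixPath


class Trust(Enum):
    FROZEN = "frozen"  # agent may read, never write
    AGENT = "agent"    # agent edits freely
    HUMAN = "human"    # only the human edits; it steers the agent


# File names are from the article; the mapping is an illustrative assumption.
PERMISSIONS = {
    "prepare.py": Trust.FROZEN,
    "train.py": Trust.AGENT,
    "program.md": Trust.HUMAN,
}


def check_edit(path: str) -> bool:
    """Return True only if the agent is allowed to modify `path`."""
    normalized = str(PurePosixPath(path))
    # Unknown files are denied by default.
    return PERMISSIONS.get(normalized) is Trust.AGENT


def rejected_paths(changed: list[str]) -> list[str]:
    """Paths in a proposed patch that the agent had no right to touch."""
    return [p for p in changed if not check_edit(p)]


print(rejected_paths(["train.py", "prepare.py"]))
```

Deny-by-default is the design choice worth copying: a file the mapping has never heard of is treated like the frozen one, so adding a new evaluator script cannot silently widen the agent's write access.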

[Figure: Three files, three trust levels. A permission model expressed as a directory: what varies, what holds still, who steers.]

The five-minute budget is the most underrated decision in the repository. Every experiment gets the same wall-clock budget, regardless of what the agent changed, which makes every comparison a level contest at fixed compute: a bigger model that trains slower must beat the incumbent within the same five minutes, not in some flattering asymptote. Practitioners will recognize the move: it is the fixed-budget evaluation discipline that separates clean method comparisons from compute-laundering, hard-coded so the agent cannot negotiate with it.
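What "fixed wall-clock" means in practice can be shown with a toy. The sketch below is not the repository's training loop; `train_with_budget`, `FakeClock`, and `simulated_steps` are hypothetical helpers, and the per-step costs are invented. The point is only that the deadline, not a step count, ends training, so a slower model gets fewer updates.

```python
import time


def train_with_budget(step_fn, budget_seconds: float, clock=time.monotonic) -> int:
    """Call step_fn until the wall-clock budget is spent; return steps completed.

    The check happens between steps, so the final step may overrun slightly.
    """
    deadline = clock() + budget_seconds
    steps = 0
    while clock() < deadline:
        step_fn()
        steps += 1
    return steps


class FakeClock:
    """Deterministic stand-in for time.monotonic, so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def simulated_steps(step_seconds: float, budget_seconds: float = 300.0) -> int:
    """Steps a model gets when each step costs `step_seconds` of wall-clock."""
    clock = FakeClock()
    return train_with_budget(lambda: clock.advance(step_seconds), budget_seconds, clock)


print(simulated_steps(1.0))  # small model: 300 steps in five minutes
print(simulated_steps(4.0))  # model four times slower per step: 75 steps
```

A model that is four times slower per step has to earn its keep with a quarter of the updates, which is exactly the contest the budget is designed to force.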

[Figure: The loop, in full. Hill-climbing with an incorruptible referee; the budget is the part the agent cannot negotiate.]

[Figure: The same 630 lines, two referees. The loop is identical; whether the evaluator regenerates or depletes decides tool versus hazard.]

Why this loop is safe, and where its twin is not

Strip the novelty and autoresearch is greedy hill-climbing with an LLM proposing the steps, which is to say it is the evaluator-is-everything pattern at its minimum viable size. The loop works because its reward is unimpeachable: a real metric, on a real held-out evaluation, recomputed fresh each run, with a fixed budget that forbids buying improvement with compute. The agent can be wrong a hundred times a night at almost no cost, because the referee never is. This is the self-evolution quadrant where autonomy is cheap precisely because verification is.
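The keep-or-revert loop described above fits in a dozen lines. This is a sketch of the pattern, not the repository's code: `propose` stands in for the LLM editing train.py, `evaluate` stands in for the five-minute training run plus held-out evaluation, and lower scores are assumed to be better, as with a validation loss.

```python
import random
from typing import Callable, TypeVar

State = TypeVar("State")


def hill_climb(
    state: State,
    propose: Callable[[State], State],
    evaluate: Callable[[State], float],
    n_experiments: int,
) -> tuple[State, float, int]:
    """Greedy keep-or-revert loop; lower score is better."""
    best_score = evaluate(state)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose(state)   # exactly one change
        score = evaluate(candidate)  # fresh run at fixed budget
        if score < best_score:       # keep only strict improvements
            state, best_score = candidate, score
            kept += 1
        # otherwise revert: the candidate is simply discarded
    return state, best_score, kept


# Toy stand-ins: the "model" is one number and the "loss" is its distance from 3.
rng = random.Random(0)
final, loss, kept = hill_climb(
    0.0,
    propose=lambda x: x + rng.gauss(0.0, 0.5),
    evaluate=lambda x: (x - 3.0) ** 2,
    n_experiments=100,
)
print(f"kept {kept} of 100 proposals, final loss {loss:.4f}")
```

Nothing in the loop checks whether a proposal is sensible. All of the safety lives in `evaluate`, which is the article's point.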

Every quant reader has already seen the dangerous twin. Point this exact loop at a strategy instead of a model (propose a change, backtest, keep it if the Sharpe improved) and you have built the overfitting machine this blog keeps warning about, now running unattended at a hundred iterations a night. The mechanical difference is one property of the evaluator: a fresh training run cannot be mined by repetition, while a fixed historical record can, and every kept change spends test-set information the backtest never replenishes. AutoResearch and the p-hacking machine are the same 630 lines of code; the entire difference between a tool and a hazard is whether the evaluator regenerates or depletes.
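The depletion effect is easy to reproduce with pure noise. In the sketch below every "strategy" is a random long/short sequence with no true edge, mean daily return stands in for Sharpe, and the loop keeps whichever candidate scores best on one fixed history. `mined_vs_fresh` is a hypothetical helper and all numbers are simulated; nothing here is market data.

```python
import random
from statistics import mean


def mined_vs_fresh(n_candidates: int, n_days: int, rng: random.Random) -> tuple[float, float]:
    """Keep the best of n zero-edge strategies on a fixed history,
    then score the survivor on freshly drawn returns."""
    history = [rng.gauss(0.0, 1.0) for _ in range(n_days)]  # the fixed record
    best_rule, best_score = None, float("-inf")
    for _ in range(n_candidates):
        # Returns are independent of positions, so the true edge is zero.
        rule = [rng.choice((-1, 1)) for _ in range(n_days)]
        score = mean(p * r for p, r in zip(rule, history))
        if score > best_score:  # "keep if the backtest improved"
            best_rule, best_score = rule, score
    fresh = [rng.gauss(0.0, 1.0) for _ in range(n_days)]  # regenerated evaluator
    fresh_score = mean(p * r for p, r in zip(best_rule, fresh))
    return best_score, fresh_score


rng = random.Random(0)
trials = [mined_vs_fresh(n_candidates=50, n_days=100, rng=rng) for _ in range(40)]
avg_backtest = mean(b for b, _ in trials)
avg_fresh = mean(f for _, f in trials)
print(f"average kept backtest score: {avg_backtest:+.3f}")
print(f"average score on fresh data: {avg_fresh:+.3f}")
```

Averaged over trials, the kept backtest score sits well above zero while the fresh-data score hovers around it. The gap is what selection on a fixed record manufactures, and it grows with the number of candidates tried.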

The desk translation is therefore an enablement with a boundary, and the boundary fits in one sentence. Run this loop wherever your objective is freshly recomputable (execution-cost models scored on tomorrow’s fills, code optimized against profiled runtime, simulators with fresh draws), and never against a finite history without an embargoed holdout and a testing budget the loop cannot see. The program.md file is the desk’s contract surface: the place to write what the agent may vary, what it must never touch, and what counts as better. That makes it less a context file than a research mandate, the genre that measurably works precisely because it specifies an optimization rather than describing a repository.

What the loop does not do is as instructive as what it does; the repository’s own discussion threads are already mapping the edges. Greedy hill-climbing keeps only monotone improvements, polishing a local optimum while structurally unable to make the temporarily-worse move a redesign requires; nobody should expect architecture discoveries from a process that reverts every regression. The objective is single-metric, which works at nanochat scale, where one loss summarizes progress, but degrades wherever quality is a vector: the exact reason multi-dimension grading exists everywhere deliverables matter. And the five-minute budget that keeps comparisons clean also bounds the hypothesis class to changes whose effects surface in five minutes, a horizon bias every quant will recognize from short-backtest pathologies.
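The local-optimum limitation shows up even on an eight-point toy landscape. In this sketch higher is better, the scores are invented, and `greedy_climb` is a hypothetical helper that, like the loop, refuses any move that does not strictly improve.

```python
def greedy_climb(scores: list[float], start: int) -> int:
    """Step to a neighbouring index only while it strictly improves the score."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(scores)]
        best = max(neighbours, key=lambda j: scores[j])
        if scores[best] <= scores[i]:
            return i  # every available move is a regression, so stop here
        i = best


# A small peak at index 2 and a taller one at index 7, separated by a valley.
landscape = [1, 3, 5, 4, 2, 6, 9, 12]

print(greedy_climb(landscape, start=0))  # stops at index 2, score 5
print(greedy_climb(landscape, start=4))  # reaches index 7, score 12
```

Starting from index 0, the climber settles at 5 and never sees 12, because reaching it means accepting 4 and then 2 on the way. A redesign is that valley.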

For a desk adopting the pattern, the mandate file is where the engineering judgment concentrates. It reads like a compliance document because it is one. The clauses write themselves from the failure modes: the objective and its exact computation; the never-touch list, with the evaluator and data preparation at the top for the same reason the untouchable file exists in the original; the per-night experiment and compute budget; the reporting format for what changed and why. An agent optimizing under a written mandate with an incorruptible referee is automation a review committee can reason about. The same agent with edit rights to its own scoreboard is an incident report with a timestamp in the future.
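The clauses listed above translate naturally into a checkable object. The sketch below is one way a desk might encode them; `Mandate`, its field names, the `evaluate.py` file name, and the budget numbers are all hypothetical and chosen only to mirror the four clauses in the paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mandate:
    objective: str            # what counts as better, stated exactly
    never_touch: frozenset    # evaluator and data preparation go first
    max_experiments: int      # per-night experiment budget
    max_gpu_minutes: float    # per-night compute budget
    report_fields: tuple = ("change", "rationale", "metric_before", "metric_after")

    def violations(self, touched, experiments_run, gpu_minutes_used, report) -> list:
        """Return every clause a night's run broke; an empty list means compliant."""
        problems = []
        for path in sorted(set(touched) & self.never_touch):
            problems.append(f"touched protected file: {path}")
        if experiments_run > self.max_experiments:
            problems.append("experiment budget exceeded")
        if gpu_minutes_used > self.max_gpu_minutes:
            problems.append("compute budget exceeded")
        missing = [f for f in self.report_fields if f not in report]
        if missing:
            problems.append("report missing: " + ", ".join(missing))
        return problems


DESK_MANDATE = Mandate(
    objective="held-out validation loss, lower is better",
    never_touch=frozenset({"evaluate.py", "prepare.py"}),
    max_experiments=100,
    max_gpu_minutes=500.0,
)

print(DESK_MANDATE.violations(
    touched=["train.py", "prepare.py"],
    experiments_run=101,
    gpu_minutes_used=10.0,
    report={"change": "wider MLP"},
))
```

The dataclass is frozen so that the mandate cannot be mutated at runtime by the process it governs.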

Race teams have run this exact protocol for decades; they call it the overnight dyno program. The engine goes on the test bench at midnight with a change list, each variant gets an identical run cycle, the telemetry decides, and the engineers arrive at dawn to a ranked list of what survived. Nobody calls the dyno an autonomous engineer. The dyno works because the dyno cannot be argued with, and autoresearch is the first ML tool candid enough to make that the entire design.

The economics of the pattern deserve one sentence before the culture: a hundred five-minute single-GPU runs cost a few dollars of electricity against a researcher-day of iteration, a price asymmetry so steep that the only question left is referee integrity, never compute.
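The arithmetic behind that sentence is short enough to write down. The run count and run length come from the article; the power draw and electricity price below are placeholder assumptions, not measurements, and the calculation ignores any overhead between runs (agent thinking time, evaluation, checkpointing).

```python
def overnight_budget(n_runs: int, minutes_per_run: float,
                     node_kw: float, usd_per_kwh: float) -> dict:
    """Back-of-envelope throughput and electricity cost for a night of runs."""
    gpu_hours = n_runs * minutes_per_run / 60.0
    return {
        "runs_per_hour": 60.0 / minutes_per_run,
        "gpu_hours": gpu_hours,
        "electricity_usd": gpu_hours * node_kw * usd_per_kwh,
    }


# node_kw and usd_per_kwh are hypothetical; substitute your own figures.
estimate = overnight_budget(n_runs=100, minutes_per_run=5.0,
                            node_kw=1.0, usd_per_kwh=0.30)
for name, value in estimate.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions a hundred runs take a little over eight hours of GPU time, which is where "while you sleep" comes from, and the electricity bill scales linearly with whatever draw and tariff apply.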

The cultural read, briefly, because the star count is itself a datum. The field’s reference implementation of the automated researcher turned out to be a few hundred readable lines with a hard budget and an honest metric, released by the person whose minimal artifacts repeatedly become the field’s teaching materials. Against the agentic-research platforms raising on the same promise, the lesson lands the way it usually does: the loop was never the hard part. The referee was, and is, and on a desk the referee has a name, the evaluator your research process can defend. Build that referee and the hundred overnight experiments are a gift. Skip it and they are a hundred ways to fool yourself before breakfast.

AutoResearch is hill-climbing with an incorruptible referee: a fixed five-minute budget and a regenerating metric make a hundred unattended experiments safe, and one substitution, a backtest for the evaluator, turns the same loop into an overfitting machine.
