// Insight
AGENTS.md, measured: the convention is a tax
AGENTS.md went from convention to institution in about a year: every coding agent reads it, tooling generates it, and in December it was donated to the Linux Foundation alongside MCP as part of the agentic commons. What nobody had done was measure it. Gloaguen and colleagues at ETH Zurich now have, in the first controlled evaluation of repository context files. The result lands squarely on this blog’s favorite theme. LLM-generated context files make coding agents slightly worse (0.5 to 2 percent fewer tasks solved) while raising inference costs 20 to 23 percent and adding 2.5 to 4 steps per task. Even developer-written files buy only about 4 percent accuracy for up to 19 percent more spend.
The design is clean enough to trust the direction. Agents run SWE-bench Lite and AGENTbench tasks under three conditions: with no context file, with LLM-generated files of the kind tooling auto-produces, and, where repositories actually ship them, with the developers’ own hand-written files (138 instances across 12 repositories for that arm). The generated files are the damning arm: they read plausibly, describe the repository accurately, and still degrade outcomes while inflating the bill. The agent dutifully attends to paragraphs of description that rarely bear on the task at hand, then takes extra steps reconciling that context with what it discovers in the code.
The result is the static twin of what the retrieval literature established for dynamic context, where feeding agents curated, minimal, high-signal evidence beat volume every time it was measured. A repository description written for no particular task is the lowest-signal context there is, which is why the human-written files that help are the minimal ones, terse build commands and genuine gotchas, closer to a checklist than a README.
The timing makes the institutional point sharper than the technical one. The convention was standardized, tooled, and donated to a foundation before anyone ran the experiment. This archive has watched that sequence repeatedly, and the agentic-reasoning survey names it, under the heading of the evaluation gap, as the field’s standing defect: adoption travels on plausibility while measurement trails by quarters. Nothing in the ETH result says the convention should die; a standardized location for agent instructions remains obviously right. What dies is the assumption that filling it is free, and especially the practice of auto-generating the filling, which the data says means paying about a fifth more per task to perform slightly worse.
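The headline figures can be folded into a single number, cost per solved task. What follows is a minimal sketch of that arithmetic, not the paper's own metric. It assumes the reported accuracy changes are percentage points, and it needs a baseline solve rate that this post does not report: the 0.50 below is a placeholder, and the function name is illustrative.

```python
def cost_per_solved_ratio(
    cost_increase: float,
    solve_rate_delta: float,
    baseline_solve_rate: float,
) -> float:
    """Cost per solved task relative to the no-context-file baseline.

    cost_increase: fractional change in cost per attempted task (0.20 = +20%).
    solve_rate_delta: change in solve rate in absolute points (-0.02 = 2 points lower).
    baseline_solve_rate: solve rate with no context file. ASSUMPTION: supplied
        by the reader; it is not a figure taken from the study.
    """
    new_rate = baseline_solve_rate + solve_rate_delta
    if baseline_solve_rate <= 0 or new_rate <= 0:
        raise ValueError("solve rates must stay positive")
    # (cost' / rate') / (cost / rate) = (1 + cost_increase) * rate / rate'
    return (1 + cost_increase) * baseline_solve_rate / new_rate


if __name__ == "__main__":
    BASELINE = 0.50  # placeholder, not from the paper
    scenarios = {
        "generated, mild end": (0.20, -0.005),
        "generated, harsh end": (0.23, -0.02),
        "developer-written": (0.19, 0.04),
    }
    for label, (cost_up, delta) in scenarios.items():
        ratio = cost_per_solved_ratio(cost_up, delta, BASELINE)
        print(f"{label}: {ratio:.2f}x cost per solved task")
```

The ratio is sensitive to the baseline: the lower the starting solve rate, the more a fixed loss of a point or two inflates the cost of each success.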
The desk translation extends past coding agents, because every agent deployment now carries an equivalent artifact: the system prompt nobody has A/B tested, the standing instructions block that grew by accretion, the auto-generated tool descriptions. Each is a context file by another name, and each deserves the ETH treatment: run the agent with it, without it, and with a human-pruned minimal version, on your own task set, with cost on the scoreboard next to accuracy. The likely finding, if the pattern holds, is uncomfortable in a familiar way: the cheapest accuracy gain available to most agent deployments in 2026 is deleting context, and almost nobody is measuring in the direction where the gain lives. Configuration earned change control in every other system finance runs; the prompt layer keeps escaping it mostly because it looks like prose instead of code.
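The three-arm comparison described above can be sketched in a few lines. This is a hedged outline, not the ETH harness: `RunResult`, `evaluate_arms`, and the `run_agent(task, context)` signature are hypothetical names, and `fake_agent` is a deterministic stand-in that exists only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class RunResult:
    solved: bool
    cost_usd: float
    steps: int


@dataclass
class ArmSummary:
    name: str
    tasks: int
    solve_rate: float
    mean_cost: float
    mean_steps: float


# Hypothetical agent interface: takes a task and an optional context blob.
AgentFn = Callable[[str, Optional[str]], RunResult]


def evaluate_arms(
    run_agent: AgentFn,
    tasks: Iterable[str],
    arms: dict[str, Optional[str]],
) -> list[ArmSummary]:
    """Run every task under every context arm; report accuracy next to cost."""
    task_list = list(tasks)
    if not task_list:
        raise ValueError("need at least one task")
    summaries = []
    for name, context in arms.items():
        results = [run_agent(task, context) for task in task_list]
        n = len(results)
        summaries.append(
            ArmSummary(
                name=name,
                tasks=n,
                solve_rate=sum(r.solved for r in results) / n,
                mean_cost=sum(r.cost_usd for r in results) / n,
                mean_steps=sum(r.steps for r in results) / n,
            )
        )
    return summaries


def fake_agent(task: str, context: Optional[str]) -> RunResult:
    """Stand-in agent: cost and steps grow with context length. Replace it."""
    words = len((context or "").split())
    return RunResult(
        solved=len(task) % 2 == 0,
        cost_usd=0.10 + 0.001 * words,
        steps=10 + words // 50,
    )


if __name__ == "__main__":
    arms = {
        "no_context": None,
        "full_context": " ".join(["description"] * 200),
        "pruned_context": "run tests with make test",
    }
    for s in evaluate_arms(fake_agent, ["task-a", "task-bb", "task-ccc"], arms):
        print(f"{s.name}: solved {s.solve_rate:.0%}, "
              f"${s.mean_cost:.3f}/task, {s.mean_steps:.1f} steps")
```

Keeping the task set fixed across arms is the point of the design: the only thing that varies is the context, so any difference in solve rate or spend is attributable to it. A real run would also want repeated trials per task, since agent outcomes are noisy.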
The first controlled test of AGENTS.md files: auto-generated ones cost 20-23% more to perform worse, while human-written ones buy about 4% accuracy for up to 19% more spend. The general lesson is that context is a bill, and deletion is the most undermeasured optimization in agent deployment.