Skip to content
Tim Frenzel

// Insight

Structured Outputs: the unglamorous feature that makes LLM extraction safe to ship

4 min read
structured-outputJSON-schemareliability
The most useful model update this month is not a benchmark score, it is a guarantee.

OpenAI’s Structured Outputs constrains the model so its response matches a JSON Schema you supply, exactly, every time. For anyone wiring a language model into a pipeline, that quiet guarantee matters more than another point on a leaderboard.

It works through constrained decoding. You pass a schema. At each step the model can only emit tokens that keep the output valid against it. That is the difference between a request and a contract. Before this, you asked the model to “reply in JSON, please,” parsed the text, caught the occasional missing field or trailing comma, retried, and wrote a validator anyway. Now the structure is enforced as the tokens are generated. OpenAI reports that on its evaluation of complex schema following, gpt-4o-2024-08-06 reaches 100% adherence with Structured Outputs, against under 40% for an older model relying on prompting alone. The model on its own gets to 93%. Constrained decoding closes the last gap deterministically.

Complex JSON schema adherence (%)
gpt-4-0613 (prompting)40gpt-4o (model only)93gpt-4o + Structured Outputs100

That last seven points is the whole story. A method that works 93% of the time is a demo. A method that works 100% of the time is infrastructure, because the failure you are buying out of is real. A malformed field in a batch job means a failed run at six in the morning. Someone has to rerun it by hand before anyone trusts the number.

What it changes on a desk

On a desk, this decides whether extraction can ship at all. The prototype-to-production gap I keep running into is rarely the model, it is exactly this kind of plumbing. Pulling counterparty, notional, and maturity out of a trade confirmation, or sentiment and named entities out of a filing, only becomes safe to automate when the output is guaranteed to parse. Nothing about the model got smarter. The integration changed: it is no longer probabilistic. You can delete the retry loop, the regex salvage, and every defensive try-except wrapped around a parse. One declared schema replaces them.

You have two entry points. Set a response format to a JSON Schema for direct extraction, or mark a function-calling tool as strict so its arguments always conform. Same mechanism, either way. For most extraction work the response-format path is the one you want.

It helps to be precise about what changed, because an earlier feature, JSON mode, already promised JSON. JSON mode guaranteed the output was syntactically valid JSON. It did not guarantee the output matched your schema. You could still get well-formed JSON with the wrong fields, a missing key, or a string where you needed a number. Structured Outputs is the stronger promise: valid JSON that also matches the exact shape you declared. “It parses” becomes “it parses into the object my code expects.” Only the second one lets you safely delete the validation layer.

The trap to keep in mind

A valid shape is not the same as a true value.

The JSON will always be valid. It can still be wrong. The model can place a hallucinated figure into a perfectly well-formed field. A clean parse makes that error harder to spot, because the output looks trustworthy. A schema floors the format. Checking the values against the source is still your job.

This maps onto something quants already know. Passing a type check is not the same as being correct, in the same way that a backtest that runs cleanly is not the same as a strategy that makes money. Reliable plumbing is a precondition for trust, nothing more.

So treat it as a floor and build the next layer on top. Use Structured Outputs everywhere you extract, because guaranteed-valid output is strictly better than hoping. Then keep a separate verification step that checks the extracted numbers against the document they came from. Add a confidence gate that can abstain when the source is ambiguous. The discipline is the same one that turns a research notebook into a production model: reliable plumbing lets you spend your skepticism on the content rather than the format.

Structured Outputs guarantees the shape of the output, never its truth. Use it everywhere you extract, then verify the values against the source.

Working on AI that needs to ship?

I help funds, fintechs, and data teams take AI from prototype to production.