
Tülu 3: an open recipe for post-training your own model

November 30, 2024 · 3 min read

Tags: post-training, RLVR, open-recipe

Open weights tell you what a model is. They do not tell you how to make one behave the way you need. Tülu 3, from Ai2, fills that gap by releasing the full post-training recipe (the data, the code, the hyperparameters, and the evaluation) on top of Llama 3.1. For a quant team, the valuable part is not the finished model but the reproducible recipe for fine-tuning one on your own tasks. The release ships in 8B and 70B sizes.

[Figure: Tülu 3, the open post-training pipeline. Every stage ships with its data, code, and recipe. RLVR trains directly on problems with a checkable answer, with no reward model in the loop.]

The new ingredient is the last stage, which follows supervised fine-tuning and preference tuning. Reinforcement Learning with Verifiable Rewards (RLVR) trains the model directly on problems whose answers can be checked, with no learned reward model standing in between. Most reinforcement learning for language models trains a separate reward model to approximate what a good answer looks like, then optimizes against that approximation. RLVR skips the approximation. When the answer is checkable, the check is the reward.
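To make "the check is the reward" concrete, here is a minimal sketch of a verifiable reward function in Python. It is not the Tülu 3 implementation: the function names, the "last number in the completion" answer format, and the default reward value of 1.0 are all assumptions chosen for illustration.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number in a completion, or None if there is none.

    Assumption: the task's answer is numeric and the model states it last.
    Thousands separators are stripped before matching.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None


def verifiable_reward(
    completion: str, ground_truth: str, reward_value: float = 1.0
) -> float:
    """Binary reward: reward_value if the extracted answer matches, else 0.0.

    No learned reward model is involved; the comparison itself is the signal.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return reward_value if float(answer) == float(ground_truth) else 0.0


if __name__ == "__main__":
    print(verifiable_reward("The total is 1,250.", "1250"))
    print(verifiable_reward("I get 12", "13"))
```

The reward is deliberately all-or-nothing. A reinforcement learning trainer would call a function like this once per sampled completion and optimize the policy against the result.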

Why verifiable rewards fit a quant desk

This is the part worth dwelling on, because a quant’s world is full of checkable answers. A pricing identity has to hold. A risk constraint has to be satisfied. A reconciliation has to balance. A piece of generated code has to compile and pass its tests. Each of these is a verifiable reward by construction, which means RLVR is a natural fit for tuning a model on the exact tasks a desk needs it to get right.
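The checks listed above can be written down as plain predicates. The sketch below is illustrative only: put-call parity stands in for "a pricing identity", a gross exposure cap stands in for "a risk constraint", and every function name, tolerance, and limit is a hypothetical choice rather than anything taken from the Tülu 3 release.

```python
import math
from typing import Sequence


def parity_holds(call: float, put: float, spot: float, strike: float,
                 rate: float, maturity: float, tol: float = 1e-6) -> bool:
    """European put-call parity, no dividends: C - P = S - K * exp(-r * T)."""
    rhs = spot - strike * math.exp(-rate * maturity)
    return abs((call - put) - rhs) <= tol


def reconciles(ledger: Sequence[float], statement_total: float,
               tol: float = 0.005) -> bool:
    """A reconciliation balances if ledger entries sum to the statement total."""
    return abs(math.fsum(ledger) - statement_total) <= tol


def within_gross_limit(weights: Sequence[float], max_gross: float = 1.0,
                       tol: float = 1e-9) -> bool:
    """Risk constraint: the sum of absolute weights stays under the cap."""
    return sum(abs(w) for w in weights) <= max_gross + tol


def desk_reward(checks: Sequence[bool]) -> float:
    """Reward 1.0 only when every check passes, otherwise 0.0."""
    return 1.0 if all(checks) else 0.0


if __name__ == "__main__":
    checks = [
        parity_holds(call=5.0, put=5.0, spot=100.0, strike=100.0,
                     rate=0.0, maturity=1.0),
        reconciles([10.25, -3.10, 7.85], 15.00),
        within_gross_limit([0.5, -0.5]),
    ]
    print(desk_reward(checks))
```

A code-generation task would add a fourth predicate that runs the generated code against its test suite in a sandbox. That predicate is omitted here because it depends on the execution environment.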

The difference from the usual approach matters. Training against a learned reward model means training against a model of correctness, with all the drift and gaming that implies. Training against a verifiable check means training against correctness itself. For a regulated desk, that distinction is the difference between a model tuned to look right and one tuned to be right on the problems where right is defined. The recipe being fully open means you can run that loop on your own verifiable tasks rather than treating alignment as a vendor black box.

The honest scope

Tülu 3 is a post-training recipe rather than a frontier model. It makes Llama 3.1 behave better on the targeted skills. It does not change the ceiling that the base model sets. RLVR also only applies where a verifiable reward exists. For open-ended judgment, where correctness is not checkable, you are back to preference tuning and human review. The technique is powerful exactly where the answer is checkable, and silent everywhere else.

That scope is a feature for a quant rather than a limitation. The tasks worth automating on a desk are disproportionately the checkable ones, because those are the tasks you can trust a model to do without a human reading every output. RLVR is aimed at precisely that set.

How I would use it

I would use it as the blueprint for an in-house tuning loop on verifiable tasks. Start from an open base model, follow the released recipe, and define your rewards as the checks you already run: does the number reconcile, does the constraint hold, does the code pass. Keep the recipe and the data versioned, because a reproducible post-training pipeline is itself an auditable artifact. The lesson of Tülu 3 is that alignment does not have to be a mystery handed down from a vendor. On the tasks where correctness is checkable, you can train for it yourself, and prove how you did it. For a desk that has to defend every automated decision, that combination, training for correctness and proving how, is worth more than a few points on a public leaderboard.
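One way to make "keep the recipe and the data versioned" auditable is to fingerprint each training run. The sketch below hashes the hyperparameters together with the content of each data file, so a later reviewer can confirm exactly which recipe produced a model. The function name, the manifest layout, and the example keys are hypothetical and not part of the Tülu 3 release.

```python
import hashlib
import json
from typing import Dict


def fingerprint_recipe(config: Dict[str, object],
                       data_blobs: Dict[str, bytes]) -> str:
    """Return a SHA-256 fingerprint of a training recipe.

    config:     hyperparameters and settings (must be JSON-serializable).
    data_blobs: raw bytes of each data file, keyed by file name.
    """
    # Hash each data file separately so the manifest stays small.
    data_hashes = {
        name: hashlib.sha256(blob).hexdigest()
        for name, blob in data_blobs.items()
    }
    manifest = {"config": config, "data": data_hashes}
    # sort_keys gives a canonical form, so key order cannot change the hash.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    config = {"base_model": "hypothetical-base", "learning_rate": 1e-6}
    data = {"prompts.jsonl": b"example prompts", "answers.jsonl": b"example answers"}
    print(fingerprint_recipe(config, data))
```

Storing the fingerprint next to the trained checkpoint ties the model to one exact combination of settings and data. Any change to a hyperparameter or a data file yields a different fingerprint.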

When the answer is checkable, the check is the reward. Tülu 3 opens the recipe for training a model on the verifiable tasks a quant desk already defines.
