Tim Frenzel

// Insight

Lost in the middle: a position bias that predates training

6 min read
context-engineering, long-context, RAG

Give a language model a long context and it reads the beginning and the end well, the middle poorly. That is the lost-in-the-middle curve. Most teams treat it as a quirk that better models or more training will smooth out. A new paper argues the reverse: the U-shaped curve is already present at random initialization, before any training and regardless of positional encoding. It is an inherent geometric property of the causal decoder. If the bias is born with the architecture, where you place the load-bearing passage in a prompt is a design decision rather than a detail you can train away.

The phenomenon is well documented. In Liu et al.’s 2023 study, a model answering questions over many retrieved documents does best when the relevant document sits first or last and worst when it sits in the middle. The numbers are blunt. Given the answer document alone, GPT-3.5-Turbo scored 88.3%; given no documents at all, 56.1%; bury the answer in the middle of twenty documents and accuracy fell below that 56.1% no-document floor. Retrieving from the middle of a long context was, in the worst case, worse than not retrieving at all.

GPT-3.5-Turbo accuracy by answer position, 20 documents (%)
1st (start): 75.8
5th: 57.2
10th (middle): 53.8
15th: 55.4
20th (end): 63.2
Liu et al. 2023, Table 6: the U-shape. Accuracy is highest when the answer sits at the start, bottoms out at 53.8% in the middle, and recovers only partly at the end. With no documents at all the model scores 56.1%, so an answer buried in the middle scores below giving the model nothing.
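The arithmetic behind that comparison is easy to check. The sketch below only restates the figures quoted above (the five position scores, the 56.1% no-document score, and the 88.3% answer-document-alone score); the variable names are illustrative, not taken from the paper.

```python
# Figures quoted above from Liu et al. 2023 (GPT-3.5-Turbo, 20 documents).
accuracy_by_position = {1: 75.8, 5: 57.2, 10: 53.8, 15: 55.4, 20: 63.2}
closed_book = 56.1  # accuracy with no documents at all
oracle = 88.3       # accuracy with the answer document alone

# Gap between each position and the no-document score, in percentage points.
gaps = {pos: round(acc - closed_book, 1) for pos, acc in accuracy_by_position.items()}

# Position with the lowest accuracy, and every position that trails the no-document score.
worst_position = min(accuracy_by_position, key=accuracy_by_position.get)
below_closed_book = [pos for pos, gap in gaps.items() if gap < 0]

for pos, acc in accuracy_by_position.items():
    print(f"position {pos:>2}: {acc:.1f}%  "
          f"({gaps[pos]:+.1f} vs no documents, {acc - oracle:+.1f} vs answer document alone)")
print("worst position:", worst_position)
print("positions below the no-document score:", below_closed_book)
```

On these numbers both the 10th and the 15th positions fall under the no-document score, and every position sits well below the score for the answer document alone.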

What the new work adds is a proof that the curve is not learned. Three structural forces create it at initialization. The causal mask lets early tokens influence everything downstream, a primacy tail. The residual stream hands the final token an undiluted path to the output, a recency anchor. Between them, signal from the middle must survive the most layers of mixing. Its influence shrinks factorially with depth, leaving a dead zone of order 1/(H-1)! for a network of depth H. Untrained Qwen2 and GPT-2 show the U-shape at step zero, identical with or without rotary position embeddings (RoPE). The usual RoPE-decay story misses the root cause.

Three forces bend the curve, all present at initialization
Primacy tail (causal mask): Early tokens influence everything downstream
Recency delta (residual stream): The final token keeps an undiluted path to the output
Factorial dead zone (depth, order 1/(H-1)!): Middle signal must survive the most layers of mixing
All three forces are present at random initialization: a geometric property of the causal decoder, identical with or without RoPE, shown on untrained Qwen2 and GPT-2 at step 0.
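To get a feel for how fast a term of order 1/(H-1)! shrinks, here is a minimal sketch that evaluates only that scaling factor. The function name and the depths are illustrative choices, and the paper's constants are not reproduced, so read the output as orders of magnitude rather than measured influence.

```python
from math import factorial

def dead_zone_order(depth: int) -> float:
    """Scaling term 1/(H-1)! for a network of depth H (constants omitted)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 1.0 / factorial(depth - 1)

# Illustrative depths only; they are not tied to any particular model.
for depth in (2, 4, 8, 12, 24):
    print(f"H = {depth:>2}: 1/(H-1)! = {dead_zone_order(depth):.3e}")
```

Each added layer divides the term by one more integer, which is why the scaling falls off so much faster than a fixed exponential rate.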

The practical weight is in what that rules out. If the bias appeared during training, you could hope to fine-tune it flat or swap the positional encoding. The theory says the starting geometry already encodes it. Those fixes work against the grain rather than with it. The authors are careful. They call the dead zone an architectural prior rather than a wall. They do not claim it cannot be mitigated. The honest reading is that mitigation is a recurring tax rather than a one-time patch.

The two ends are not symmetric. The recency anchor is an undiluted path of order one, while the primacy tail is a weaker signal that the residual highway dilutes. The end of the context is therefore the single strongest position, which is the concrete reason to place the most critical passage last rather than first. Depth makes the trough worse. Because the middle influence shrinks factorially in the number of layers, a deeper model carries a more pronounced dead zone. Scaling up the stack deepens the very trough you are fighting. The bias is not confined to hard reasoning either. Liu et al. found it on a trivial synthetic key-value lookup: some models returned the right value at every position, while others still sagged in the middle on a task that needs no reasoning at all. When a model cannot reliably copy a value it can plainly see, the failure is positional rather than cognitive.

For a desk doing retrieval over filings, this turns an abstract curve into an operating rule. Order matters as much as relevance. Rerank so the single most load-bearing passage lands at an edge of the context, ideally last, where the recency anchor is strongest. Keep the context short. RULER found that of the models advertising context windows of 32K tokens or more, only about half held performance out to 32K, and the gap grew with task complexity as well as length. A bigger window does not rescue the middle. Liu et al. compared GPT-3.5-Turbo against its own extended-context 16K variant and found their position curves nearly superimposed. The 16K version was no better at using the middle than the standard 4K one. So when the answer to a covenant question sits in the eleventh of twenty retrieved credit-agreement clauses, a longer window will not surface it. Reranking it into the last slot will. The context window is not uniform. Its least reliable real estate is the middle.
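One way to apply the reranking rule is sketched below. It assumes the passages already arrive sorted from most to least relevant. The alternating scheme, which puts the top passage last, the runner-up first, and the weakest in the middle, is an illustrative design choice rather than something the papers prescribe, and the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def order_for_edges(ranked: Sequence[T]) -> List[T]:
    """Reorder passages ranked most-relevant-first so the strongest sit at the edges.

    Rank 1 lands in the last slot, rank 2 in the first slot, and the
    remaining ranks alternate inward, leaving the weakest in the middle.
    """
    front: List[T] = []  # fills the context from the start
    back: List[T] = []   # fills the context from the end, collected in reverse
    for index, passage in enumerate(ranked):
        if index % 2 == 0:
            back.append(passage)   # ranks 1, 3, 5, ... go toward the end
        else:
            front.append(passage)  # ranks 2, 4, 6, ... go toward the start
    return front + back[::-1]

# Usage: five clauses, "A" most relevant and "E" least relevant.
print(order_for_edges(["A", "B", "C", "D", "E"]))
```

In the usage line the most relevant clause ends up in the final slot and the least relevant one in the center. If only one passage truly matters, a simpler variant that moves it to the last slot and leaves the rest untouched follows the same rule.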

The same geometry shapes agents, well beyond retrieval prompts. A long-horizon agent piles tool outputs, observations, and prior steps into a growing context. The middle of that transcript is where its own earlier reasoning goes to be forgotten. That is part of why pruning and summarizing the history beats carrying all of it: a shorter transcript keeps the load-bearing facts out of the dead zone. The cost case for trimming context and the accuracy case now point the same way.
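A rough sketch of that pruning step, under stated assumptions: the transcript is a plain list of strings, the opening step holds the task, and `summarize` is a hypothetical callable you supply (for example a cheaper model call). None of these names come from a specific agent framework.

```python
from typing import Callable, List

def compact_history(
    steps: List[str],
    summarize: Callable[[List[str]], str],  # hypothetical summarizer supplied by the caller
    keep_head: int = 1,
    keep_tail: int = 3,
) -> List[str]:
    """Keep the first and last steps verbatim and collapse the middle into one summary."""
    if len(steps) <= keep_head + keep_tail:
        return list(steps)  # short enough already, nothing to collapse
    split = len(steps) - keep_tail
    head = steps[:keep_head]
    middle = steps[keep_head:split]
    tail = steps[split:]
    return head + [summarize(middle)] + tail

# Usage with a stand-in summarizer that only counts what it replaced.
history = [f"step {i}" for i in range(1, 11)]
print(compact_history(history, lambda chunk: f"[summary of {len(chunk)} steps]"))
```

The summary still sits in the middle of the transcript, but the transcript is now short, and the task statement and the most recent steps stay at the edges.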

None of this argues against long context. It argues for engineering it. Context engineering, the discipline of deciding what fills the window and in what order, stops being cosmetic once you accept that the window itself has a shape and a blind spot you can predict in advance. The models will keep improving in the middle. The architectural prior says they begin every training run with the deck stacked against the middle. This is the same conclusion the agent-memory cost work and learned context curation reached from the cost side, now with a geometry underneath it. Treat the context window like a measurement instrument with a known blind spot in the middle of its range. You do not discard the instrument. You put the signal where the sensor can see it.

Lost in the middle is not a training artifact you can patch. It is a geometric property of the causal decoder, present before a single step of training: the model reads the edges of its context best and the middle worst, which is why the load-bearing passage belongs at an edge.
