The news. On June 16, 2026, researchers posted LoopCoder-v2, which asks a narrow but surprising question: when you make a 7B model "think deeper" by looping one shared block instead of adding layers, how many loops work best? The answer is not "as many as you can afford." Two passes through the block lift SWE-bench Verified from 43.0 to 64.4 and Multi-SWE from 14.0 to 31.0, but three or more passes regress, a strongly non-monotonic loop-count effect the authors trace to a positional-offset mismatch.

Picture re-reading a dense paragraph you didn't get the first time. The second read is where understanding jumps — your eyes are the same eyes, your brain the same brain, but the second pass lets you connect things the first pass only registered. A weight-tied looped transformer does exactly this: it feeds the hidden state back through the same block for a second pass, reusing every weight. The remarkable part of LoopCoder-v2 is what happens when you keep going. A third re-read, a fourth — at some point you stop gaining and start losing track of which line said what, mixing up the order. That confusion is the model's failure mode too, and it is the whole story of the paper.

Normally, "deeper" means more parameters. To add reasoning depth, a transformer stacks more distinct layers, and each new layer brings its own weights — more memory, more model. Weight-tied looping breaks that link: it gives a token the effective depth of many layers while storing the weights of just one. The block's output is wired back into its own input, so two loops cost two block-passes of compute but zero extra parameters. That makes looping a pure test-time-compute knob — you decide at inference how hard to think, without retraining a bigger model.
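To make the parameter accounting concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not LoopCoder-v2's architecture: the "block" is a single square weight matrix with a residual tanh update, and every name (`make_block`, `apply_block`, `looped_forward`, `stacked_forward`, `param_count`) is hypothetical. It only illustrates that looping multiplies block-passes while the stored weights stay at one block's worth, whereas stacking distinct layers grows both.

```python
import math
import random


def make_block(dim, seed=0):
    """Build one toy block: a dim x dim weight matrix (stand-in for a real transformer block)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, hidden):
    """One block-pass: residual update h_i + tanh(sum_j W_ij * h_j)."""
    return [
        h + math.tanh(sum(w * x for w, x in zip(row, hidden)))
        for row, h in zip(weights, hidden)
    ]


def looped_forward(weights, hidden, num_loops):
    """Weight-tied looping: feed the output back into the SAME block num_loops times."""
    passes = 0
    for _ in range(num_loops):
        hidden = apply_block(weights, hidden)
        passes += 1
    return hidden, passes


def stacked_forward(layers, hidden):
    """Conventional depth: each layer has its own, distinct weights."""
    for weights in layers:
        hidden = apply_block(weights, hidden)
    return hidden


def param_count(*weight_matrices):
    """Total number of stored weights across the given matrices."""
    return sum(len(w) * len(w[0]) for w in weight_matrices)


dim = 8
shared = make_block(dim)
h0 = [0.1 * i for i in range(dim)]

h2, passes = looped_forward(shared, h0, num_loops=2)
looped_params = param_count(shared)  # one block stored, however many loops run
stacked_params = param_count(make_block(dim, 1), make_block(dim, 2))  # two distinct layers
```

In this toy, two loops and a two-layer stack both cost two block-passes, but only the stack doubles the stored weights. Changing `num_loops` at call time is the "test-time-compute knob" the paragraph describes.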

The catch is that the loops are not interchangeable, and the model has to tell them apart. To keep the second pass from being a literal repeat of the first, each loop tags tokens with cross-loop position offsets — slightly shifted positional information so the block knows "this is pass two, not pass one." Two passes fit this scheme cleanly. By the third, the offsets mismatch: tokens get attended to at the wrong relative positions, and the distortion outweighs whatever the extra compute buys. That is why the curve rises, peaks at two, then falls — the defining shape of a non-monotonic result, where adding more of a good thing makes the output worse.

Where the +21 points actually come from

Hold the model fixed at 7B parameters. Run its block once and it scores 43.0 on SWE-bench Verified, the no-loop baseline. Now loop the block one more time: the hidden state makes a second pass through the exact same weights, costing one extra block-pass of compute per token and zero added parameters. The score climbs to 64.4, a jump of 21.4 points (64.4 − 43.0), and Multi-SWE rises alongside it from 14.0 to 31.0, a gain of 17.0 points. What matters is how those gains were paid for: with compute, not capacity. The same 7B weights were simply run one extra time. Add a third loop and you spend another full block-pass of compute, but the score falls back below the two-loop peak, so the marginal third pass has a negative return. More thinking, less accuracy.

| Loops through the block | SWE-bench Verified (per the paper) | Added parameters | What happens |
| --- | --- | --- | --- |
| 1 (baseline) | 43.0 |  | a single pass through the block |
| 2 (the sweet spot) | 64.4 | 0 | same weights, reused one more pass |
| 3 or more | regresses below 64.4 | 0 | cross-loop position offsets mismatch |
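The arithmetic behind the two-loop gain can be checked in a few lines. This is a sketch that uses only the four scores quoted above; the helper name `gain_summary` and the dictionary layout are illustrative choices, and no three-loop figure is included because the text gives none beyond "below the two-loop peak."

```python
# Scores as quoted in this article (reported from the paper), keyed by loop count.
SCORES = {
    "SWE-bench Verified": {1: 43.0, 2: 64.4},
    "Multi-SWE": {1: 14.0, 2: 31.0},
}


def gain_summary(by_loops):
    """Absolute and relative gain from the 1-loop baseline to 2 loops."""
    base, looped = by_loops[1], by_loops[2]
    extra_passes = 2 - 1  # one additional block-pass per token
    absolute = round(looped - base, 1)
    return {
        "absolute": absolute,
        "relative_pct": round(100.0 * (looped - base) / base, 1),
        "points_per_extra_pass": round(absolute / extra_passes, 1),
    }


swe = gain_summary(SCORES["SWE-bench Verified"])
multi = gain_summary(SCORES["Multi-SWE"])
print(swe)    # absolute gain of 21.4 points, about 49.8% over baseline
print(multi)  # absolute gain of 17.0 points, about 121.4% over baseline
```

The relative figures are derived here, not taken from the paper. They show that the same single extra pass roughly halves again the SWE-bench Verified score and more than doubles the Multi-SWE score.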

The deeper point is that loop count is a tunable dial, not a "more is better" slider, and the default intuition had it pointed the wrong way. That rhymes with a thread running through several recent results: variable-width transformers found that another fixed transformer default (every layer the same width) was actually a knob worth turning, and non-monotonic teacher strength is the same shape of surprise in distillation, where a stronger signal eventually hurts. The honest caveat the authors keep: this is a loop-count study on 7B coding models, and the two-loop optimum comes specifically from their cross-loop positional-offset scheme. A different way of distinguishing the loops might move the sweet spot; the reason two is the ceiling is tied to that scheme and is not a universal law of looping.


Frequently Asked Questions