What is the cross-language generalization gap?

It is the drop in a model's coding score when the same problem is moved from Python to another language, with the judging protocol held identical. Multi-LCB measures it by porting each LiveCodeBench Python problem into eleven other languages (twelve in total) and re-scoring 24 LLMs, so any score difference reflects the language change alone, not a harder problem or a different grader. A large gap means the model's apparent Python skill does not generalize.

Why is Python overfitting a problem for code benchmarks?

Python dominates public code, so models become disproportionately fluent in it, and a Python-only leaderboard reports that fluency as general coding ability. Multi-LCB shows that gap directly: many models that score high on Python fall sharply on other languages. If your team ships in Rust, Go, or TypeScript, the Python number is a weak predictor of how the model will actually perform.

How does Multi-LCB control for contamination across languages?

LiveCodeBench already filters problems by contest release date, so problems published after a model's training cutoff cannot have been memorized. Multi-LCB keeps that control and adds a per-language check, because a problem and its solution can leak into training data in one language but not another. Holding the task and judge fixed lets it separate two failure modes a single Python score blends: contamination (memorized the answer) and language-specific overfitting (only fluent in Python).

Multi-LCB extends LiveCodeBench to 12 languages — Cross-language generalization gap

TL;DR

What is it: A new benchmark, Multi-LCB, ports every LiveCodeBench Python coding problem into eleven other programming languages (twelve in total) and re-scores 24 LLMs. It isolates the cross-language generalization gap: how far a model's coding skill falls when only the language changes.
Why it’s needed: A model can sit at the top of a Python coding leaderboard and still fail in the language your team actually ships. That gap is exactly what an aggregate, Python-centric score hides — which is the core lesson of Evals & Diagnostics and Production Evals.
vs previous: Earlier multilingual benchmarks use different problems for each language, so a score drop confounds the language with the problem's difficulty. Multi-LCB holds the task and the judge fixed and varies only the language, so the drop is pure generalization signal — not a harder problem and not a friendlier grader.

Jargon

LiveCodeBench: A contamination-controlled coding benchmark: it pulls problems from recent competitive-programming contests with known release dates, so anything a model could have memorized before its training cutoff can be filtered out. Multi-LCB is built on top of it.
Data contamination: When a model saw the exact problem (and often its solution) during training, so a high score reflects memorization rather than skill. LiveCodeBench's date filter limits it for Python; Multi-LCB checks for it per language, since a problem can leak in one language and not another.
Python overfitting: A model is disproportionately good at Python because Python dominates public code, so the skill measured on Python does not transfer to languages the model saw far less of, like Rust or Kotlin.
Cross-language generalization: Whether coding ability learned mostly from one language transfers to others. The gap between a model's Python score and its score elsewhere is the quantity Multi-LCB is designed to measure cleanly.
Judging protocol: The fixed procedure that runs a candidate solution against hidden test cases and scores it pass or fail. Keeping it identical across all twelve languages is what makes the scores comparable.
Task porting (transpilation): Rewriting a problem's statement and its tests into another language while preserving the underlying logic, so the only thing that changed between two runs is the language itself.
pass@1: The fraction of problems a model solves on its first attempt — the headline number a code benchmark reports per model.
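
To make two of these terms concrete, here is a minimal sketch of a release-date filter followed by a pass@1 calculation. The record fields (`released`, `first_attempt_passed`), the sample data, and the cutoff date are hypothetical and chosen only for illustration; they are not LiveCodeBench's actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    problem_id: str
    released: date               # contest release date (hypothetical field)
    first_attempt_passed: bool   # did the model's first sample pass every test?


def pass_at_1(results: list[Result], training_cutoff: date) -> float:
    """Fraction of post-cutoff problems solved on the first attempt."""
    # Keep only problems released after the cutoff, so the model
    # cannot have seen them during training.
    fresh = [r for r in results if r.released > training_cutoff]
    if not fresh:
        raise ValueError("no problems released after the cutoff")
    return sum(r.first_attempt_passed for r in fresh) / len(fresh)


# Made-up records for illustration only.
results = [
    Result("p1", date(2025, 1, 10), True),   # released before the cutoff: dropped
    Result("p2", date(2025, 9, 1), True),
    Result("p3", date(2025, 10, 5), False),
]
score = pass_at_1(results, training_cutoff=date(2025, 6, 1))
print(f"pass@1 = {score:.2f}")  # pass@1 = 0.50
```

Note that the pre-cutoff problem was solved, yet it does not count: the filter removes it before scoring, which is the point of the date control.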

The news. On June 18, 2026, researchers released Multi-LCB, an extension of LiveCodeBench from Python alone to twelve programming languages. They port each existing Python problem into the other languages while keeping LiveCodeBench's contamination controls and judging protocol intact, then evaluate 24 LLMs. The headline finding: many models that look strong on Python do not carry that skill across languages. Aggregate, Python-centric leaderboards were hiding three distinct gaps at once: Python overfitting, language-specific contamination, and large multilingual disparities.

Picture a celebrated chef whose signature dish wins every competition — but every competition is held in their own kitchen, with their own knives, their own oven, their own pantry. On paper they look like a master. Now move them to eleven unfamiliar kitchens, hand them the same recipe, and keep the same head judge with the same scorecard. Some chefs reproduce the dish anywhere; others fall apart the moment the layout changes. A chef judged only in their home kitchen can look like a master and still be lost everywhere else — and that is precisely the blind spot a Python-only coding leaderboard has about a model.

Multi-LCB swaps the kitchens without touching the dish. It takes the contamination-controlled Python problems LiveCodeBench already curates and ports each one into the other eleven languages — same logic, same hidden tests, same pass/fail judge — then re-runs all 24 models. Because the task and the judge never change, any score that drops when the language changes is pure generalization signal, not an artifact of a harder problem set or a more forgiving grader. That is the whole trick: hold everything constant except the one variable you want to study.
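
One way to picture "same hidden tests, same pass/fail judge" is a judge that only ever sees a command to run plus stdin/stdout test pairs, so nothing in it depends on the language. The sketch below is an assumption about how such a harness could look, not Multi-LCB's actual implementation; the `judge` function, the test cases, and the toy solutions are all hypothetical.

```python
import subprocess
import sys

# (stdin, expected stdout) pairs: the same hidden tests for every language.
TESTS = [("2 3\n", "5"), ("10 -4\n", "6")]


def judge(command: list[str], tests=TESTS, timeout_s: float = 5.0) -> bool:
    """Return True only if the program passes every test case."""
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                command, input=stdin_text, capture_output=True,
                text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        # Identical pass/fail rule regardless of the source language.
        if proc.returncode != 0 or proc.stdout.strip() != expected:
            return False
    return True


# Only the command differs per language. A compiled Go or Rust solution
# would be passed as something like ["./solution"] (hypothetical path).
python_solution = [sys.executable, "-c", "print(sum(map(int, input().split())))"]
python_buggy = [sys.executable, "-c", "a, b = map(int, input().split()); print(a - b)"]

print(judge(python_solution))  # True
print(judge(python_buggy))     # False
```

Because the tests and the comparison rule live outside the solution, swapping the command is the only change between two runs, which mirrors the "one variable" design described above.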

Holding the task fixed lets Multi-LCB pull apart two failure modes that a single Python number blends together. One is contamination — the model memorized this exact problem, so its "skill" is recall; LiveCodeBench's date filter catches this in Python, but a solution can leak in one language and not another, so the check has to run per language. The other is language-specific overfitting — the model genuinely reasons about the problem, but only fluently in Python. Separating "it memorized the answer" from "it only speaks Python" is the move that turns a leaderboard back into a diagnostic, and it maps directly onto the four eval failure modes you learn to name in the agent track.
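
A rough way to see the two failure modes as separate numbers is to split each language's problems at the release-date cutoff. The heuristic below is an assumption made for illustration, not the paper's method: a score that is high on older problems and falls on fresh ones hints at memorization, while a drop relative to Python measured on fresh problems only hints at language-specific overfitting. All scores are made up.

```python
def diagnose(scores: dict[str, dict[str, float]], reference: str = "python") -> dict:
    """scores[lang] = {"pre": pass@1 before the cutoff, "post": pass@1 after it}."""
    ref_post = scores[reference]["post"]
    report = {}
    for lang, s in scores.items():
        report[lang] = {
            # Old-problem score minus fresh-problem score, within one language.
            "contamination_signal": round(s["pre"] - s["post"], 3),
            # Python minus this language, on fresh problems only.
            "language_gap": round(ref_post - s["post"], 3),
        }
    return report


# Illustrative, invented scores (fractions of problems solved).
scores = {
    "python": {"pre": 0.85, "post": 0.80},
    "rust": {"pre": 0.50, "post": 0.48},
    "java": {"pre": 0.78, "post": 0.60},
}
report = diagnose(scores)
for lang, row in report.items():
    print(lang, row)
```

In this invented example, Rust shows a large language gap with almost no contamination signal (the "only speaks Python" pattern), while Java shows a noticeable contamination signal on top of its gap.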

How you benchmark multilingual coding	What it reports	What it can't isolate
Python-only LiveCodeBench	one pass@1 number, measured on Python	whether the skill survives a change of language at all
Separate per-language benchmarks (different problems)	a score for each language, on its own problem set	the language effect — it is tangled with problem difficulty and per-set contamination
Multi-LCB (same task, ported; fixed judge)	per-language scores on the identical problems	— (by construction, the score gap isolates Python overfitting from contamination; 12 languages, 24 LLMs)

Where the gap shows up

Take an illustrative model that scores ~80% pass@1 on the Python problems. Port the very same problems to other languages and re-judge: say it lands at ~62% in C++, ~55% in Go, and ~40% in a rarer target like Kotlin (illustrative numbers; the source reports 12 languages and 24 models, not these per-language figures). Average the three non-Python languages and you get ~52%. The cross-language generalization gap is the difference: ~80% minus ~52%, about 28 points that vanish the instant you stop grading on home turf. Because the problems and the judge were held identical, that 28-point drop cannot be explained by harder questions or a stricter grader; it is the model failing to generalize. A single Python number would have reported one confident "80%" and quietly hidden the 28-point cliff underneath it.
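
The arithmetic above can be reproduced in a few lines. The scores are the same illustrative figures used in the paragraph, not results from the paper, and the variable names are hypothetical.

```python
from statistics import mean

# Illustrative pass@1 percentages from the worked example (not reported results).
python_score = 80.0
other = {"C++": 62.0, "Go": 55.0, "Kotlin": 40.0}

avg_other = mean(other.values())
gap = python_score - avg_other
worst = min(other, key=other.get)

print(f"non-Python average: {avg_other:.1f}%")   # non-Python average: 52.3%
print(f"cross-language gap: {gap:.1f} points")   # cross-language gap: 27.7 points
print(f"weakest language: {worst}")              # weakest language: Kotlin
```

Reporting the weakest language alongside the average is a design choice worth considering: an average can hide a single language that falls much further than the rest.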

Goes deeper in: AI Agents → Evals & Diagnostics → The 4 Eval Failure Modes

Related explainers

This explainer covers a single concept from its news item, so its closest neighbors are other results about how a single evaluation number can quietly mislead:

Agent leaderboards mislead under distribution shift (IBM) — predictive validity — the sibling failure: predictive validity shows a ranking fails to transfer across conditions; Multi-LCB shows a single model's skill fails to transfer across languages
WeaveBench — trajectory-aware vs outcome-only grading — another case where a tidy final score hides what actually happened
FutureSim — harness-level agent eval — evaluating the process rather than trusting one aggregate number

Continue in track: Production Evals (spot when a model's skill drops on a slice your leaderboard never measured)

