The news. On June 18, 2026, researchers released Multi-LCB, an extension of LiveCodeBench from Python alone to twelve programming languages. They port each existing Python problem into the other languages while keeping LiveCodeBench's contamination controls and judging protocol intact, then evaluate 24 LLMs. The headline finding: many models that look strong on Python do not carry that skill across languages. Aggregate, Python-centric leaderboards were hiding three distinct gaps at once: Python overfitting, language-specific contamination, and large multilingual disparities.

Picture a celebrated chef whose signature dish wins every competition — but every competition is held in their own kitchen, with their own knives, their own oven, their own pantry. On paper they look like a master. Now move them to eleven unfamiliar kitchens, hand them the same recipe, and keep the same head judge with the same scorecard. Some chefs reproduce the dish anywhere; others fall apart the moment the layout changes. A chef judged only in their home kitchen can look like a master and still be lost everywhere else — and that is precisely the blind spot a Python-only coding leaderboard has about a model.

Multi-LCB swaps the kitchens without touching the dish. It takes the contamination-controlled Python problems LiveCodeBench already curates and ports each one into the other eleven languages — same logic, same hidden tests, same pass/fail judge — then re-runs all 24 models. Because the task and the judge never change, any score that drops when the language changes is pure generalization signal, not an artifact of a harder problem set or a more forgiving grader. That is the whole trick: hold everything constant except the one variable you want to study.
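To make the "hold everything constant" idea concrete, here is a minimal sketch of a per-language pass@1 tally that refuses to compare languages unless they were scored on the identical problems. The function name, the record layout, and the toy data are all assumptions for illustration; this is not Multi-LCB's actual harness.

```python
from collections import defaultdict


def pass_at_1_by_language(results):
    """Compute per-language pass@1 from (problem_id, language, passed) records.

    Assumes one generated solution per (problem, language) pair, judged by the
    same hidden tests in every language.
    """
    verdicts = defaultdict(dict)  # language -> {problem_id: passed}
    for problem_id, language, passed in results:
        verdicts[language][problem_id] = bool(passed)

    # The comparison is only controlled if every language saw the same problems.
    problem_sets = {frozenset(v) for v in verdicts.values()}
    if len(problem_sets) > 1:
        raise ValueError("languages were scored on different problem sets")

    return {lang: sum(v.values()) / len(v) for lang, v in verdicts.items()}


# Hypothetical toy records, not results from the paper.
records = [
    ("p1", "python", True), ("p2", "python", True),
    ("p3", "python", True), ("p4", "python", False),
    ("p1", "go", True), ("p2", "go", False),
    ("p3", "go", True), ("p4", "go", False),
]
scores = pass_at_1_by_language(records)
print(scores)
```

The guard on `problem_sets` is the design choice that matters: once two languages are scored on different problems, a score difference can no longer be attributed to the language alone.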

Holding the task fixed lets Multi-LCB pull apart two failure modes that a single Python number blends together. One is contamination — the model memorized this exact problem, so its "skill" is recall; LiveCodeBench's date filter catches this in Python, but a solution can leak in one language and not another, so the check has to run per language. The other is language-specific overfitting — the model genuinely reasons about the problem, but only fluently in Python. Separating "it memorized the answer" from "it only speaks Python" is the move that turns a leaderboard back into a diagnostic, and it maps directly onto the four eval failure modes you learn to name in the agent track.
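One way to picture the per-language check is to split each language's results at a release-date cutoff and compare pass rates on either side. The sketch below assumes a hypothetical cutoff date and made-up records; the source does not spell out the exact procedure, so treat this as a heuristic illustration rather than the paper's method.

```python
from datetime import date


def split_pass_rate(results, cutoff):
    """Per-language pass rate for problems released before vs. after `cutoff`.

    `results` is an iterable of (language, release_date, passed) records.
    A bucket with no problems is reported as None.
    """
    counts = {}  # language -> {"before": [passed, total], "after": [passed, total]}
    for language, released, passed in results:
        bucket = "before" if released < cutoff else "after"
        per_lang = counts.setdefault(language, {"before": [0, 0], "after": [0, 0]})
        per_lang[bucket][0] += int(passed)
        per_lang[bucket][1] += 1
    return {
        lang: {b: (p / n if n else None) for b, (p, n) in buckets.items()}
        for lang, buckets in counts.items()
    }


# Hypothetical cutoff and toy records, chosen only to illustrate the split.
cutoff = date(2025, 1, 1)
records = [
    ("python", date(2024, 6, 1), True), ("python", date(2024, 7, 1), True),
    ("python", date(2025, 3, 1), True), ("python", date(2025, 4, 1), False),
    ("go", date(2024, 6, 1), True), ("go", date(2024, 7, 1), False),
    ("go", date(2025, 3, 1), False), ("go", date(2025, 4, 1), False),
]
rates = split_pass_rate(records, cutoff)
print(rates)
```

Read loosely, a sharp before-to-after drop within one language hints at contamination in that language, while a low after-cutoff score relative to Python on the same problems hints at language-specific overfitting. Real data would need far more problems per bucket than this toy example before either reading is trustworthy.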

| How you benchmark multilingual coding | What it reports | What it can't isolate |
| --- | --- | --- |
| Python-only LiveCodeBench | one pass@1 number, measured on Python | whether the skill survives a change of language at all |
| Separate per-language benchmarks (different problems) | a score for each language, on its own problem set | the language effect: it is tangled with problem difficulty and per-set contamination |
| Multi-LCB (same task, ported; fixed judge) | per-language scores on the identical problems | (by construction, the score gap isolates Python overfitting from contamination; 12 languages, 24 LLMs) |

Where the gap shows up

Take an illustrative model that scores about 80% pass@1 on the Python problems. Port the very same problems to other languages and re-judge: say it lands around 62% in C++, 55% in Go, and 40% in a rarer target like Kotlin. (These numbers are illustrative; the source reports 12 languages and 24 models, not these per-language figures.) Average those three non-Python scores and you get roughly 52%. The cross-language generalization gap is the difference: 80% minus 52%, about 28 points that vanish the instant you stop grading on home turf. Because the problems and the judge were held identical, that 28-point drop cannot be explained by harder questions or a stricter grader; it reflects the model failing to generalize. A single Python number would have reported one confident "80%" and quietly hidden the 28-point cliff underneath it.
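The arithmetic in that example is small enough to check directly. This sketch uses the same illustrative scores; the `generalization_gap` helper is a hypothetical name, and defining the gap as "home score minus the mean of the other languages" is one reasonable choice, not necessarily the metric the paper reports.

```python
from statistics import mean

# Illustrative pass@1 scores (percent) from the example, not reported results.
scores = {"Python": 80.0, "C++": 62.0, "Go": 55.0, "Kotlin": 40.0}


def generalization_gap(scores, home="Python"):
    """Home-language score minus the mean score over all other languages."""
    others = [value for lang, value in scores.items() if lang != home]
    return scores[home] - mean(others)


non_python_mean = mean(v for lang, v in scores.items() if lang != "Python")
gap = generalization_gap(scores)
print(f"non-Python mean: {non_python_mean:.1f}%")
print(f"gap: {gap:.1f} points")
```

Averaging treats every non-Python language equally; reporting the worst-case language instead (here, Kotlin) would give a larger and arguably more cautious gap.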
