The news. On July 1, 2026, researchers released Orca ("The World is in Your Mind"), a model that learns one unified world latent and reads it out for three tasks (text generation, image prediction, and embodied action) through lightweight modality-specific decoders on a frozen backbone. It is trained on 125,000 hours of video and 160 million event annotations, and the authors report that it outperforms similarly sized specialized baselines and gets stronger as the world latent scales (the gains are stated qualitatively, without a single headline number).

Picture a national weather service. It does not build one machine that spits out temperatures, a second that draws rain maps, and a third that decides whether to sound a storm siren — it runs one physical model of the atmosphere, then reads that single model out three different ways. Orca does the same thing for an AI's understanding of the world: instead of three separate predictors, it learns one shared "world latent" and taps it for words, for pictures, and for actions.

Why does that unification matter? Because the usual recipe trains three genuinely different objectives. A text model learns to predict the next token; a video model learns to predict the next frame; a control policy learns to predict the next action — and each ends up with its own, siloed picture of the world. Orca's claim is that all three are really the same job dressed in different clothes: predict the next state. Fold them into one objective — Next-State-Prediction — and you get a single coherent world model instead of three partial ones. The shared latent space is what the three readout heads all draw from.

How does that single model get trained? Two signals feed the same latent: an "unconscious" stream of dense transitions from raw video — the way the world usually flows — and a "conscious" stream of sparse, human-flagged events described in language. The dense stream teaches broad physical intuition; the sparse stream injects explicit meaning where it matters. A frozen backbone holds that combined understanding fixed while small decoders learn to read it out — and because the readout for embodied action shares the exact same latent as the readout for text, the model's "sense" of the world is consistent across everything it does.
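The readout idea in the paragraph above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's architecture: the dimensions, the random projection standing in for the frozen backbone, and the names `world_latent`, `heads`, and `read_out` are all hypothetical. It shows only the structural point that every head consumes the same latent vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the real model's dimensions are not given in the text.
OBS_DIM = 32      # size of an already-encoded observation
LATENT_DIM = 16   # size of the shared world latent

# Stand-in for the frozen backbone: a fixed projection that is never updated.
W_backbone = rng.standard_normal((OBS_DIM, LATENT_DIM)) / np.sqrt(OBS_DIM)
W_backbone.setflags(write=False)  # "frozen": in-place writes now raise ValueError


def world_latent(obs: np.ndarray) -> np.ndarray:
    """Map an observation to the single shared latent."""
    return np.tanh(obs @ W_backbone)


# Lightweight modality-specific heads: one small linear layer each.
# Only these would be trained in this sketch; output sizes are illustrative.
heads = {
    "text": rng.standard_normal((LATENT_DIM, 100)) * 0.01,   # logits over a toy vocabulary
    "image": rng.standard_normal((LATENT_DIM, 64)) * 0.01,   # a flattened 8x8 toy frame
    "action": rng.standard_normal((LATENT_DIM, 4)) * 0.01,   # a 4-dimensional toy control
}


def read_out(latent: np.ndarray) -> dict:
    """Read the same latent out through every head."""
    return {name: latent @ W for name, W in heads.items()}


obs = rng.standard_normal(OBS_DIM)
z = world_latent(obs)          # computed once
outputs = read_out(z)          # consumed three ways
for name, value in outputs.items():
    print(name, value.shape)
```

Because `z` is computed once and passed to all three heads, any disagreement between the text, image, and action outputs has to come from the heads themselves rather than from three separately learned pictures of the world.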

| Objective | What it predicts | What it produces |
| --- | --- | --- |
| Next-token (text) | the next word | a text-only model, not trained to share a representation with pixels or actions |
| Next-frame (video) | the next image | a vision-only model, not trained to share a representation with language or actions |
| Next-action (control) | the next move | a control-only policy, not trained to share a representation with words or pixels |
| Next-State-Prediction (Orca) | the next state of the world [paper] | one shared latent, read out to all three |
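One way to see why the four objectives are the same job is to write them in a common notation. The symbols here are assumptions made for illustration and are not taken from the paper: $w$ is a token, $x$ a frame, $a$ an action, $o$ an observation, $s$ the world state (latent), and $g_{\text{text}}$, $g_{\text{image}}$, $g_{\text{action}}$ are hypothetical readout heads.

```latex
\begin{align*}
\text{Next-token:}  \quad & p_\theta(w_{t+1} \mid w_{\le t}) \\
\text{Next-frame:}  \quad & p_\theta(x_{t+1} \mid x_{\le t}) \\
\text{Next-action:} \quad & p_\theta(a_{t+1} \mid o_{\le t}) \\
\text{Next-state:}  \quad & p_\theta(s_{t+1} \mid s_{\le t}),
  \qquad \hat{w}_{t+1} = g_{\text{text}}(s_{t+1}),\;
         \hat{x}_{t+1} = g_{\text{image}}(s_{t+1}),\;
         \hat{a}_{t+1} = g_{\text{action}}(s_{t+1})
\end{align*}
```

In this reading, the first three lines each predict one modality from its own history, while the last line predicts a single state and derives all three outputs from it.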

Ground the two training signals in numbers. Orca learns from 125,000 hours of video and 160 million labeled events. At a rough 30 frames per second (illustrative frame rate), 125,000 hours is about 13.5 billion frames — so there is only about one labeled "conscious" event for every ~84 frames of raw "unconscious" video the model absorbs on its own. That lopsided ratio is the point: most of what a world model needs to know is learned for free from watching the world move, and the sparse labels only steer the meaning.
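The ratio above is easy to reproduce. A minimal check, using the two figures from the text and the same illustrative 30 frames per second (the frame rate is an assumption, not a number from the paper):

```python
HOURS_OF_VIDEO = 125_000        # from the text
LABELED_EVENTS = 160_000_000    # from the text
FPS = 30                        # illustrative frame rate, an assumption

# hours -> seconds -> frames
frames = HOURS_OF_VIDEO * 3600 * FPS
frames_per_event = frames / LABELED_EVENTS

print(f"{frames:,} frames")                                  # 13,500,000,000 frames
print(f"{frames_per_event:.1f} frames per labeled event")    # 84.4 frames per labeled event
```

A different frame rate scales the ratio linearly: at 15 frames per second it would be about 42 frames per labeled event, so the exact figure matters less than the order of magnitude.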

