What is Next-State-Prediction?

Next-State-Prediction (NSP) is Orca's single training objective: from the current state of the world, predict the next state. It generalizes next-token prediction — the objective behind text LLMs — from words to video frames and embodied actions, so one model can be trained on all three at once instead of using a separate objective for each. Orca (arXiv 2606.30534, "The World is in Your Mind") centers its whole design on this one state-transition target.

Why unify text, image, and action into one objective?

Training three separate objectives (next-token for text, next-frame for video, next-action for control) leaves each model with its own siloed picture of the world. Orca's argument is that all three are the same job in disguise: predict the next state. Folding them into one Next-State-Prediction objective builds a single coherent world latent that every task reads from. The authors report that the resulting model outperforms similar-sized specialists and keeps improving as the shared latent gets stronger.

Two signals feed one shared latent. An "unconscious" stream of dense transitions from 125,000 hours of raw video teaches broad physical intuition, and a "conscious" stream of 160 million sparse, language-described events plus visual question-answering (VQA) supervision injects explicit meaning. A frozen backbone holds that combined understanding fixed while lightweight modality-specific decoders learn to read it out as text, images, or actions.

Orca learns a unified world model over video and events — Next-State-Prediction

TL;DR

What is it: The Orca paper (arXiv 2606.30534, "The World is in Your Mind") trains one shared world latent and reads it out to three tasks — text, image, and embodied action. The idea it makes concrete is Next-State-Prediction: a single state-transition objective that replaces separate next-token, next-frame, and next-action training.
Why it’s needed: An agent that understands, predicts, and acts in the world needs one coherent picture of that world, not three siloed skills. Folding every modality into a single prediction objective builds that shared picture — and Orca reports it keeps getting better as the world latent gets stronger.
Versus previous work: The usual recipe optimizes isolated next-token (text), next-frame (video), and next-action (control) objectives in separate models. Orca centers everything on Next-State-Prediction over one frozen backbone, targeting the fragmented, modality-siloed representations that recipe produces.

Jargon

Next-State-Prediction (NSP): Orca's core objective. From the current state, predict the next state of the world: a single state-transition target that generalizes next-token prediction from text to video and action.
World latent: A shared internal representation of the whole scene — one latent space that all three tasks read from, rather than a separate representation per modality.
Unconscious learning: Orca's name for training on dense, continuous transitions in raw video — the steady stream the model soaks up without labels, like watching how the world usually flows.
Conscious learning: Training on sparse, meaningful transitions supplied as language-described events and visual question-answering (VQA) supervision — the few flagged moments that carry explicit meaning.
Frozen backbone: The shared world model's weights are held fixed while small per-task decoders are trained on top — so one representation serves every readout.
Modality-specific decoders: Lightweight heads that turn the shared latent into a concrete output: words, a predicted image, or an action. Three small readouts, one big brain.

The news. On July 1, 2026, researchers released Orca ("The World is in Your Mind"), a model that learns one unified world latent and reads it out for three tasks (text generation, image prediction, and embodied action) through lightweight modality-specific decoders on a frozen backbone. It is trained on 125,000 hours of video and 160 million event annotations, and the authors report it outperforms similar-sized specialized baselines while getting stronger as the world latent scales (the gains are stated qualitatively, without a single headline number).

Picture a national weather service. It does not build one machine that spits out temperatures, a second that draws rain maps, and a third that decides whether to sound a storm siren — it runs one physical model of the atmosphere, then reads that single model out three different ways. Orca does the same thing for an AI's understanding of the world: instead of three separate predictors, it learns one shared "world latent" and taps it for words, for pictures, and for actions.

Why does that unification matter? Because the usual recipe trains three genuinely different objectives. A text model learns to predict the next token; a video model learns to predict the next frame; a control policy learns to predict the next action — and each ends up with its own, siloed picture of the world. Orca's claim is that all three are really the same job dressed in different clothes: predict the next state. Fold them into one objective — Next-State-Prediction — and you get a single coherent world model instead of three partial ones. The shared latent space is what the three readout heads all draw from.

How does that single model get trained? Two signals feed the same latent: an "unconscious" stream of dense transitions from raw video — the way the world usually flows — and a "conscious" stream of sparse, human-flagged events described in language. The dense stream teaches broad physical intuition; the sparse stream injects explicit meaning where it matters. A frozen backbone holds that combined understanding fixed while small decoders learn to read it out — and because the readout for embodied action shares the exact same latent as the readout for text, the model's "sense" of the world is consistent across everything it does.
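
The frozen-backbone arrangement described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not Orca's architecture: `FrozenBackbone`, `LinearDecoder`, the 4-dimensional latent, the linear layers, and the squared-error update are all hypothetical stand-ins chosen to keep the example dependency-free. What it shows is the training pattern only: the latent is computed once by fixed weights, and only the small per-modality heads are updated.

```python
import random

random.seed(0)
LATENT_DIM = 4  # hypothetical size of the shared world latent


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class FrozenBackbone:
    """Stand-in for the shared world model; its weights are never updated."""

    def __init__(self, in_dim):
        self.weights = [
            [random.uniform(-1, 1) for _ in range(in_dim)]
            for _ in range(LATENT_DIM)
        ]

    def encode(self, observation):
        return matvec(self.weights, observation)


class LinearDecoder:
    """Lightweight readout head trained on top of the fixed latent."""

    def __init__(self, out_dim):
        self.weights = [[0.0] * LATENT_DIM for _ in range(out_dim)]

    def decode(self, latent):
        return matvec(self.weights, latent)

    def sgd_step(self, latent, target, lr=0.2):
        """One squared-error gradient step; returns the loss before the update."""
        errors = [p - t for p, t in zip(self.decode(latent), target)]
        for row, err in zip(self.weights, errors):
            for j in range(LATENT_DIM):
                row[j] -= lr * 2 * err * latent[j]
        return sum(e * e for e in errors)


backbone = FrozenBackbone(in_dim=3)
decoders = {
    "text": LinearDecoder(out_dim=2),
    "image": LinearDecoder(out_dim=3),
    "action": LinearDecoder(out_dim=1),
}
backbone_before = [row[:] for row in backbone.weights]

# Toy observation and per-modality targets (made-up numbers).
observation = [0.5, -0.2, 0.1]
targets = {"text": [1.0, 0.0], "image": [0.2, 0.4, 0.6], "action": [-1.0]}

latent = backbone.encode(observation)  # computed once, shared by every head
first_loss, last_loss = {}, {}
for name, head in decoders.items():
    for step in range(200):
        loss = head.sgd_step(latent, targets[name])
        if step == 0:
            first_loss[name] = loss
    last_loss[name] = loss

print("backbone unchanged:", backbone.weights == backbone_before)
for name in decoders:
    print(name, "loss fell:", last_loss[name] < first_loss[name])
```

Because every head reads the same `latent`, any improvement to the backbone would reach all three readouts at once, which is the consistency property the paragraph above describes.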

Objective	What it predicts	What it produces
Next-token (text)	the next word	a text-only model, not trained to share a representation with pixels or actions
Next-frame (video)	the next image	a vision-only model, not trained to share a representation with language or actions
Next-action (control)	the next move	a control-only policy, not trained to share a representation with words or pixels
Next-State-Prediction (Orca)	the next state of the world [paper]	one shared latent, read out to all three
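
One way to read the table: each row has the same shape, "given the current element of a sequence, predict the next one", and only the element type changes. The sketch below makes that concrete. It is illustrative only: `transition_pairs` and the toy sequences are hypothetical, and treating a single element as the whole "state" is a simplification (a real model would condition on a summary of the history).

```python
from typing import List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # a "state": a token, a frame, an action, or a bundle of these


def transition_pairs(states: Sequence[S]) -> List[Tuple[S, S]]:
    """Turn any ordered sequence of states into (current, next) training pairs."""
    return [(states[i], states[i + 1]) for i in range(len(states) - 1)]


# Toy stand-ins for the three modalities (made-up data).
tokens = ["the", "cup", "falls"]
frames = ["frame_0", "frame_1", "frame_2"]
actions = ["reach", "grasp", "lift"]

# The three "separate" objectives are the same function applied to different data.
text_pairs = transition_pairs(tokens)
video_pairs = transition_pairs(frames)
action_pairs = transition_pairs(actions)

# A unified state bundles whatever was observed at the same step.
world_states = list(zip(tokens, frames, actions))
world_pairs = transition_pairs(world_states)

print(text_pairs)      # [('the', 'cup'), ('cup', 'falls')]
print(world_pairs[0])  # (('the', 'frame_0', 'reach'), ('cup', 'frame_1', 'grasp'))
```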

Ground the two training signals in numbers. Orca learns from 125,000 hours of video and 160 million labeled events. At a rough 30 frames per second (illustrative frame rate), 125,000 hours is about 13.5 billion frames — so there is only about one labeled "conscious" event for every ~84 frames of raw "unconscious" video the model absorbs on its own. That lopsided ratio is the point: most of what a world model needs to know is learned for free from watching the world move, and the sparse labels only steer the meaning.
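
The arithmetic above is easy to check. The snippet below reproduces it; the 30 frames-per-second rate is the same illustrative assumption used in the text, not a figure from the paper, and the variable names are made up for this example.

```python
VIDEO_HOURS = 125_000         # hours of raw video, as stated above
LABELED_EVENTS = 160_000_000  # labeled events, as stated above
ASSUMED_FPS = 30              # illustrative frame rate (assumption)

total_seconds = VIDEO_HOURS * 3600
total_frames = total_seconds * ASSUMED_FPS
frames_per_event = total_frames / LABELED_EVENTS
seconds_per_event = total_seconds / LABELED_EVENTS  # independent of frame rate

print(f"{total_frames:,} frames")                             # 13,500,000,000 frames
print(f"{frames_per_event:.1f} frames per labeled event")     # 84.4 frames per labeled event
print(f"{seconds_per_event:.1f} seconds of video per event")  # 2.8 seconds of video per event
```

The last figure does not depend on the assumed frame rate: on average there is one labeled event for roughly every 2.8 seconds of video.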


Related explainers

Qwen-AgentWorld — world model as a decoupled RL simulator — another "predict the next state" world model, but aimed at simulating environments for RL rather than unifying modalities.
Gemma 4 12B — encoder-free multimodal — a different route to one model for many modalities: fold vision straight into the token stream.
ViQ — text-aligned quantized visual tokens — how images become tokens a single model can read alongside text.

