Gemini Omni — One shared token space for every modality

[Figure: tokenizers for text, image, audio, and video feed one shared token space, read by one model (LLM)]
learnaivisually.com/ai-explained/gemini-omni-shared-token-space

The news. On May 25, 2026, Google DeepMind introduced Gemini Omni, the first model in a new "Omni" family that accepts image, audio, video, and text and — for now — outputs video, with physics-aware motion, multi-shot scene consistency, and SynthID watermarking. Google frames it as a model that can "create anything from any input — starting with video." Notably, the announcement is a capabilities-and-rollout post: it does not disclose the architecture, parameter count, tokenizer design, or training recipe. So this explainer teaches the established idea that makes any-to-any models possible — and is careful to label where Gemini Omni's specifics are simply unknown.

Picture the airport again. Hire one cashier per currency — one for yen, one for euros, one for pesos — and the place is a maze of booths, each fluent in exactly one money. That is the bolt-on approach to multimodality: a text model with a vision booth taped to its side, an audio booth taped next to that, each speaking only its own dialect. Now quote every price in one shared currency. A single teller takes in any money, counts it in that one currency, and can pay you back in whichever currency you ask for. That shared currency is what a shared token space is for a model: every modality gets converted at the door into the same kind of token, so one network reads them all — and can emit them too.

The conversion at the door is the load-bearing trick. Text already becomes tokens through ordinary tokenization. The other modalities need their own exchange desks. An image is cut into patches, and each patch is mapped to a discrete code, typically by vector quantization against a learned codebook: the patch is replaced by the index of its nearest codebook entry. Audio is sliced into short windows and video into frames, and each slice is quantized the same way; some designs go further and tokenize raw bytes. Once everything is a sequence of token IDs whose embeddings live in one shared space, the transformer's attention stops caring whether a token came from a pixel, a waveform, or a word. It just attends over the sequence. Crucially, Google has not published how Gemini Omni does any of this; the patch-to-code recipe above is the general, well-documented mechanism behind any-to-any models, not a quoted Gemini detail.
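The patch-to-code step can be sketched in a few lines. This is a toy illustration only, not Gemini Omni's tokenizer (which is undisclosed): the function names `patchify`, `nearest_code`, and `tokenize_image` are hypothetical, the image is a tiny grayscale grid, and the three-entry codebook is hand-picked, whereas a real codebook is learned during training.

```python
def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def tokenize_image(image, codebook, patch):
    """Vector-quantize every patch: one discrete token ID per patch."""
    return [nearest_code(p, codebook) for p in patchify(image, patch)]


# Assumption: a hand-picked codebook with "dark", "mid", and "bright" 2x2 patches.
codebook = [[0.0] * 4, [0.5] * 4, [1.0] * 4]

# A 4x4 toy image: dark top-left, bright top-right, mid bottom-left, dark bottom-right.
image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 1.0, 0.9],
    [0.5, 0.4, 0.1, 0.0],
    [0.6, 0.5, 0.0, 0.2],
]

image_tokens = tokenize_image(image, codebook, patch=2)
print(image_tokens)  # [0, 2, 1, 0]
```

The output is just a list of integers, which is the point: after this step an image looks to the model like any other token sequence.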

[Figure: one-hot vectors vs. embeddings for three tokens]

One-hot (12 dims shown)
cat: 0 0 1 0 0 0 0 0 0 0 0 0
dog: 0 0 0 1 0 0 0 0 0 0 0 0
king: 0 0 0 0 0 0 0 1 0 0 0 0
Every token looks the same distance apart

Embedding (8 dims shown)
cat: 0.5 0.8 -0.1 0.3 -0.6 0.2 0.7 -0.9
dog: 0.4 0.8 -0.1 0.3 -0.5 0.3 0.6 -0.8
king: -0.7 0.1 0.9 -0.5 0.8 -0.3 -0.2 0.6
cat & dog look similar; king looks different
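The figure's two captions can be checked numerically. The sketch below uses the illustrative vectors from the figure and cosine similarity as the comparison; the choice of cosine similarity and the helper names `cosine` and `one_hot` are assumptions made for this example, not something the figure specifies.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def one_hot(index, size=12):
    """A vector of zeros with a single 1.0 at the given position."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Positions of the single 1 in each one-hot row of the figure (0-based).
one_hots = {"cat": one_hot(2), "dog": one_hot(3), "king": one_hot(7)}

# The 8 illustrative embedding dimensions shown in the figure.
embeddings = {
    "cat": [0.5, 0.8, -0.1, 0.3, -0.6, 0.2, 0.7, -0.9],
    "dog": [0.4, 0.8, -0.1, 0.3, -0.5, 0.3, 0.6, -0.8],
    "king": [-0.7, 0.1, 0.9, -0.5, 0.8, -0.3, -0.2, 0.6],
}

# One-hot: every distinct pair is orthogonal, so no pair is "closer" than another.
print(cosine(one_hots["cat"], one_hots["dog"]))   # 0.0
print(cosine(one_hots["cat"], one_hots["king"]))  # 0.0

# Embeddings: cat and dog point the same way; king points elsewhere.
print(round(cosine(embeddings["cat"], embeddings["dog"]), 2))   # 0.99
print(round(cosine(embeddings["cat"], embeddings["king"]), 2))  # -0.64
```

One-hot IDs carry no notion of similarity; the learned embedding is where "cat is near dog" lives, and in a shared token space the same holds for image, audio, and video tokens.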

The under-appreciated half is generation. A bolted-on vision encoder is a one-way street: it lets the model read an image, but the model can still only write text. When every modality lives in the same token vocabulary, the model can also predict tokens that a decoder turns back into a picture or a video — the teller paying back out. That is why Gemini Omni's headline is output ("starting with video"), not just understanding, and it is the cleanest line between this and an understanding-only system like GLM-5V.
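The "paying back out" direction can be sketched the same way. The example assumes one common layout for a shared vocabulary, in which text IDs come first and image codes are offset past them; that layout, the constant `TEXT_VOCAB_SIZE`, the toy codebook, and every function name here are hypothetical choices for illustration, not Gemini Omni's design. A real system would also use a learned decoder network to turn codes into pixels, where this sketch uses a plain codebook lookup.

```python
TEXT_VOCAB_SIZE = 1000  # assumption: text IDs occupy 0..999 in the shared vocabulary

# Assumption: a toy image codebook of flattened 2x2 patches (dark, mid, bright).
IMAGE_CODEBOOK = [[0.0] * 4, [0.5] * 4, [1.0] * 4]


def to_shared_id(modality, local_id):
    """Map a per-modality ID into the single shared vocabulary."""
    if modality == "text":
        return local_id
    if modality == "image":
        return TEXT_VOCAB_SIZE + local_id
    raise ValueError(f"unknown modality: {modality}")


def from_shared_id(shared_id):
    """Recover (modality, local ID) from a shared-vocabulary ID."""
    if shared_id < TEXT_VOCAB_SIZE:
        return "text", shared_id
    return "image", shared_id - TEXT_VOCAB_SIZE


def split_by_modality(shared_ids):
    """Route a generated token sequence to the decoder for each modality."""
    streams = {"text": [], "image": []}
    for shared_id in shared_ids:
        modality, local_id = from_shared_id(shared_id)
        streams[modality].append(local_id)
    return streams


def decode_image_codes(codes, codebook):
    """Stand-in for a learned image decoder: look each code up in the codebook."""
    return [list(codebook[code]) for code in codes]


# Pretend the model emitted two text tokens followed by three image tokens.
generated = [17, 42,
             to_shared_id("image", 0),
             to_shared_id("image", 2),
             to_shared_id("image", 1)]

streams = split_by_modality(generated)
patches = decode_image_codes(streams["image"], IMAGE_CODEBOOK)
print(streams)  # {'text': [17, 42], 'image': [0, 2, 1]}
print(patches[1])  # [1.0, 1.0, 1.0, 1.0]
```

A bolt-on encoder has no counterpart to `decode_image_codes`: its output vocabulary contains only text IDs, so there is nothing for an image decoder to consume.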

Bolt-on encoder vs one shared token space

Capability | Encoder bolted onto a text LLM | One shared token space (Omni-style)
Inputs | Text, plus a vision/audio encoder whose features are projected in | Every modality tokenized into one shared vocabulary
Outputs | Text only | Any modality the model has decode tokens for (Gemini Omni: video first)
What attention sees | Text tokens + a separate block of image features | One uniform sequence: all tokens attend to all tokens
Adding a new modality | Bolt on another encoder + projector and re-tune | Extend the shared codebook with new codes

General architectural contrast, not a Gemini-Omni-specific claim — Google has not disclosed which design it uses.

A back-of-envelope token count

Walk one clip through the exchange desk (illustrative numbers; the real settings are undisclosed). Take a short clip at 256×256 pixels and 8 frames. Tokenize each frame into a 16×16 grid of patch codes → 256 tokens per frame × 8 frames = 2,048 video tokens. Add a 12-word prompt at roughly 16 text tokens. Both streams become one sequence of about 2,064 tokens drawn from the same vocabulary, and the model attends over all of them uniformly — no separate image-model handoff, no projector glue. Change the resolution or frame count and the token bill moves, but the shape of the idea does not: more signal simply means more tokens in the same space.
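The same arithmetic as a small calculator, using the illustrative settings from the paragraph above. The function name `clip_token_count` and the fixed 16-pixel patch size are assumptions for this sketch (a 256-pixel side divided into a 16-patch grid implies 16-pixel patches); none of these values are disclosed Gemini Omni settings.

```python
def clip_token_count(width, height, frames, patch_px, text_tokens):
    """Total sequence length when each frame becomes a grid of patch codes."""
    grid_w = width // patch_px
    grid_h = height // patch_px
    tokens_per_frame = grid_w * grid_h
    video_tokens = tokens_per_frame * frames
    return {
        "tokens_per_frame": tokens_per_frame,
        "video_tokens": video_tokens,
        "total_tokens": video_tokens + text_tokens,
    }


# The worked example: 256x256 pixels, 8 frames, 16-pixel patches, ~16 text tokens.
base = clip_token_count(256, 256, frames=8, patch_px=16, text_tokens=16)
print(base)
# {'tokens_per_frame': 256, 'video_tokens': 2048, 'total_tokens': 2064}

# Doubling the resolution at the same patch size quadruples the video tokens.
bigger = clip_token_count(512, 512, frames=8, patch_px=16, text_tokens=16)
print(bigger["total_tokens"])  # 8208

# Full self-attention scores every token against every token: n * n pairs.
print(base["total_tokens"] ** 2)  # 4260096
```

The last line shows why the token bill matters: with full self-attention, the number of token pairs grows with the square of the sequence length, so resolution and frame count are not free.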



Frequently Asked Questions