What is a shared multimodal token space?

It is one vocabulary and one embedding space in which tokens from every modality - text, image, audio, video - live together. Each modality is converted at the input by a tokenizer (subword splitting for text, patch-to-code vector quantization for pixels and audio), so the transformer reads a single uniform sequence instead of a separate sub-model per modality. Google's Gemini Omni, announced May 25, 2026, is a recent any-input model of this kind, though its specific tokenizer is not publicly documented.

Why does modality unification matter?

It collapses a pile of per-modality machinery - one encoder, projector, and decoder per modality - into a single network that reads and writes everything. That is what lets a model both ingest a video and generate one, not just describe it in text. Adding a modality also becomes 'extend the shared codebook' instead of 'bolt on and re-tune another encoder', which is in principle a cheaper way to grow a model's senses, though the new tokenizer and its codes still have to be trained.

How is Gemini Omni different from a vision-bolted multimodal model like GLM-5V?

A vision-bolted model attaches an image encoder to a finished text LLM so it can understand pictures, but it still only outputs text. Gemini Omni's headline is generation - it emits video - which requires every modality to live in the same token vocabulary so the model can predict media tokens, not just read them. The contrast is understanding-only versus read-and-write. Google has not published Gemini Omni's architecture, so the comparison is at the level of design families, not disclosed internals.

Google's Gemini Omni — Modality unification in a shared token space

Gemini Omni — One shared token space for every modality


learnaivisually.com/ai-explained/gemini-omni-shared-token-space

Jargon

Token: The discrete unit a transformer actually reads. For text it is a subword piece; for other modalities it is a learned discrete code. Everything inside the model is a sequence of token IDs. See Tokenization → intro.
Tokenizer: The component that converts raw input into a sequence of discrete IDs before the model sees it; in the airport metaphor below, it is the exchange desk. Text tokenizers split text into subword pieces; modality tokenizers turn pixels or audio into codes.
Modality: A type of input or output — text, image, audio, video. "Multimodal" means a model handles more than one.
Shared token space: One vocabulary and one embedding space in which tokens from every modality live together, so the model treats a pixel-derived token and a word-derived token the same way.
Vector quantization (VQ) / codebook: The standard way to make a continuous signal (pixels, audio) discrete: map each patch to its nearest entry in a learned codebook, and use that entry's index as a token. This is the general recipe; Google has not disclosed Gemini Omni's specific scheme.
Native multimodal: Trained from the start on all modalities together, versus bolting a separate vision or audio encoder onto a finished text LLM. See the contrast in GLM-5V — native multimodal.
SynthID: Google's watermark embedded into generated media so downstream tools can flag it as AI-produced. It is a provenance feature, not part of the tokenization mechanism.

The news. On May 25, 2026, Google DeepMind introduced Gemini Omni, the first model in a new "Omni" family that accepts image, audio, video, and text and — for now — outputs video, with physics-aware motion, multi-shot scene consistency, and SynthID watermarking. Google frames it as a model that can "create anything from any input — starting with video." Notably, the announcement is a capabilities-and-rollout post: it does not disclose the architecture, parameter count, tokenizer design, or training recipe. So this explainer teaches the established idea that makes any-to-any models possible — and is careful to label where Gemini Omni's specifics are simply unknown.

Picture an airport currency exchange. Hire one cashier per currency (one for yen, one for euros, one for pesos) and the place is a maze of booths, each fluent in exactly one money. That is the bolt-on approach to multimodality: a text model with a vision booth taped to its side, an audio booth taped next to that, each speaking only its own dialect. Now quote every price in one shared currency. A single teller takes in any money, counts it in that one currency, and can pay you back in whichever currency you ask for. That shared currency is what a shared token space is for a model: every modality gets converted at the door into the same kind of token, so one network reads them all and can emit them too.

The conversion at the door is the load-bearing trick. Text already becomes tokens through ordinary tokenization. The other modalities need their own exchange desks: an image is cut into patches and each patch is mapped, typically by vector quantization against a learned codebook, to a discrete code; audio and video frames get the same treatment, and some designs push further and tokenize raw bytes directly. Once everything is a sequence of token IDs in one shared embedding space, the transformer's attention stops caring whether a token came from a pixel, a waveform, or a word; it just attends over the sequence. Crucially, Google has not published how Gemini Omni does any of this; the patch-to-code recipe above is the general, well-documented mechanism behind any-to-any models, not a quoted Gemini detail.
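The patch-to-code step can be sketched in a few lines. This is a minimal illustration of generic vector quantization, not Gemini Omni's tokenizer (which is undisclosed): the vocabulary sizes, the 4×4 patch size, the random stand-in codebook, and the helper names `split_into_patches`, `nearest_code`, and `image_to_tokens` are all assumptions made for the example. Placing image codes directly after the text ID range is one simple way to share a vocabulary, assumed here for clarity.

```python
import random

def split_into_patches(image, patch):
    """Cut a 2D grayscale image (list of rows) into flattened patches, row-major.
    Assumes height and width are multiples of `patch`."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def image_to_tokens(image, codebook, patch, id_offset):
    """Quantize every patch, then shift the code indices past the text ID range."""
    return [id_offset + nearest_code(p, codebook)
            for p in split_into_patches(image, patch)]

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000   # text tokens occupy IDs 0..999
PATCH = 4                # 4x4 pixel patches
CODEBOOK_SIZE = 8        # image codes occupy IDs 1000..1007

rng = random.Random(0)
# A random codebook stands in for a learned one.
codebook = [[rng.random() for _ in range(PATCH * PATCH)]
            for _ in range(CODEBOOK_SIZE)]
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 image

text_tokens = [17, 256, 42]   # made-up text token IDs
image_tokens = image_to_tokens(image, codebook, PATCH, TEXT_VOCAB_SIZE)
sequence = text_tokens + image_tokens   # one uniform sequence of token IDs
```

The 8×8 toy image yields four patch tokens, so `sequence` holds seven IDs that a transformer could embed and attend over without knowing which came from text and which from pixels.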

Figure: one-hot vs embedding representations of "cat", "dog", and "king".

One-hot (12 dims shown): every token looks the same distance apart.

Embedding (8 dims shown):

cat	0.5	0.8	-0.1	0.3	-0.6	0.2	0.7	-0.9
dog	0.4	0.8	-0.1	0.3	-0.5	0.3	0.6	-0.8
king	-0.7	0.1	0.9	-0.5	0.8	-0.3	-0.2	0.6

cat & dog look similar; king looks different.
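The contrast in the figure can be checked numerically. The sketch below uses the figure's 8-dimensional values and measures similarity with cosine similarity, which is a common choice but an assumption here, since the figure does not name a metric. The one-hot positions (0, 1, 2) are arbitrary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def pairwise(vectors):
    """Cosine similarity for every unordered pair of names, keyed alphabetically."""
    names = sorted(vectors)
    return {(a, b): cosine(vectors[a], vectors[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# One-hot: 12 dims as in the figure; which slot each word gets is arbitrary.
one_hot_vecs = {"cat": one_hot(0, 12), "dog": one_hot(1, 12), "king": one_hot(2, 12)}

# Embedding values copied from the figure.
embeddings = {
    "cat":  [0.5, 0.8, -0.1, 0.3, -0.6, 0.2, 0.7, -0.9],
    "dog":  [0.4, 0.8, -0.1, 0.3, -0.5, 0.3, 0.6, -0.8],
    "king": [-0.7, 0.1, 0.9, -0.5, 0.8, -0.3, -0.2, 0.6],
}

one_hot_sims = pairwise(one_hot_vecs)      # every pair is equally unrelated
embedding_sims = pairwise(embeddings)      # pairs differ in similarity

for pair, sim in embedding_sims.items():
    print(pair, round(sim, 2))
```

Every one-hot pair scores exactly 0, so no word is closer to any other. With the embedding values, cat and dog score about 0.99 while king scores negative against both (about -0.64 against cat).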

The under-appreciated half is generation. A bolted-on vision encoder is a one-way street: it lets the model read an image, but the model can still only write text. When every modality lives in the same token vocabulary, the model can also predict tokens that a decoder turns back into a picture or a video — the teller paying back out. That is why Gemini Omni's headline is output ("starting with video"), not just understanding, and it is the cleanest line between this and an understanding-only system like GLM-5V.
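The write side can be sketched as routing: once the model emits IDs from a shared vocabulary, each run of IDs is handed to the right detokenizer. The ID layout (text first, video codes after), the sizes, and the function names below are hypothetical. The actual step that turns codes back into pixels is a learned decoder network, which this sketch leaves out.

```python
from itertools import groupby

# Hypothetical layout of a shared vocabulary, for illustration only.
TEXT_VOCAB_SIZE = 1000      # text IDs are 0..999
VIDEO_CODEBOOK_SIZE = 8     # video code IDs are 1000..1007

def modality_of(token_id):
    """Name the modality a shared-vocabulary ID belongs to."""
    if 0 <= token_id < TEXT_VOCAB_SIZE:
        return "text"
    if TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + VIDEO_CODEBOOK_SIZE:
        return "video"
    raise ValueError(f"token id {token_id} is outside the shared vocabulary")

def split_by_modality(token_ids):
    """Group consecutive same-modality tokens into (modality, ids) runs."""
    return [(modality, list(run))
            for modality, run in groupby(token_ids, key=modality_of)]

def to_codebook_indices(video_ids):
    """Undo the ID offset so a video decoder can look the codes up."""
    return [t - TEXT_VOCAB_SIZE for t in video_ids]

# A made-up generated sequence: two text tokens, three video codes, one text token.
generated = [17, 42, 1003, 1000, 1007, 5]
runs = split_by_modality(generated)
video_codes = [to_codebook_indices(ids) for modality, ids in runs if modality == "video"]
```

Here `runs` separates the sequence into a text run, a video run, and a second text run, and `video_codes` holds the codebook indices `[3, 0, 7]` that a video decoder would consume. A bolt-on, text-only model has no equivalent of the video branch because its output vocabulary contains no media codes.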

Bolt-on encoder vs one shared token space

Capability	Encoder bolted onto a text LLM	One shared token space (Omni-style)
Inputs	Text, plus a vision/audio encoder whose features are projected in	Every modality tokenized into one shared vocabulary
Outputs	Text only	Any modality whose tokens the model can generate and decode (Gemini Omni: video first)
What attention sees	Text tokens + a separate block of image features	One uniform sequence — all tokens attend to all tokens
Adding a new modality	Bolt on another encoder + projector and re-tune	Extend the shared codebook with new codes

General architectural contrast, not a Gemini-Omni-specific claim — Google has not disclosed which design it uses.
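The "extend the shared codebook" row can be made concrete with a little ID bookkeeping. The `SharedVocab` class below is a hypothetical helper, not any library's API, and the sizes are invented. It shows only that appending a new modality leaves existing token IDs untouched; in a real model the embedding table and output layer would also need new rows, and those rows would have to be trained.

```python
class SharedVocab:
    """Toy registry that gives each modality a contiguous block of token IDs."""

    def __init__(self):
        self.ranges = {}   # modality name -> range of shared IDs
        self.size = 0      # total number of IDs allocated so far

    def add_modality(self, name, num_codes):
        """Append a block of `num_codes` new IDs at the end of the vocabulary."""
        if name in self.ranges:
            raise ValueError(f"modality {name!r} is already registered")
        self.ranges[name] = range(self.size, self.size + num_codes)
        self.size += num_codes
        return self.ranges[name]

    def to_shared_id(self, name, local_index):
        """Map a modality-local code index to its shared-vocabulary ID."""
        block = self.ranges[name]
        if not 0 <= local_index < len(block):
            raise IndexError(f"{name} has no code {local_index}")
        return block.start + local_index

vocab = SharedVocab()
vocab.add_modality("text", 1000)    # hypothetical sizes
vocab.add_modality("image", 8)
image_id_before = vocab.to_shared_id("image", 3)

audio_block = vocab.add_modality("audio", 16)   # grow the model's senses
image_id_after = vocab.to_shared_id("image", 3)
```

Because new codes are appended at the end, `image_id_before` and `image_id_after` are the same ID, so sequences tokenized before the extension stay valid.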

A back-of-envelope token count

Walk one clip through the exchange desk (illustrative numbers; the real settings are undisclosed). Take a short clip at 256×256 pixels and 8 frames. Tokenize each frame into a 16×16 grid of patch codes → 256 tokens per frame × 8 frames = 2,048 video tokens. Add a 12-word prompt at roughly 16 text tokens. Both streams become one sequence of about 2,064 tokens drawn from the same vocabulary, and the model attends over all of them uniformly — no separate image-model handoff, no projector glue. Change the resolution or frame count and the token bill moves, but the shape of the idea does not: more signal simply means more tokens in the same space.
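The same arithmetic as a small calculator, so the settings can be varied. The function names are made up for this sketch, and the numbers are the illustrative ones from the paragraph above, not Gemini Omni's real settings. The last line counts token pairs under full self-attention, a rough proxy for why long sequences get expensive.

```python
def patch_size(resolution, grid):
    """Pixels per patch side when a square frame is cut into a grid x grid layout."""
    if resolution % grid != 0:
        raise ValueError("resolution must be a multiple of the grid size")
    return resolution // grid

def video_token_count(frames, grid):
    """One token per patch: grid * grid tokens per frame, times the frame count."""
    return frames * grid * grid

# Illustrative settings from the text: 256x256 pixels, 8 frames, 16x16 grid.
frames, resolution, grid = 8, 256, 16
text_tokens = 16                                   # ~12-word prompt

pixels_per_patch_side = patch_size(resolution, grid)   # 16
video_tokens = video_token_count(frames, grid)         # 2048
total_tokens = video_tokens + text_tokens              # 2064

# Same 16-pixel patches at 512x512 give a 32x32 grid per frame.
hi_res_video_tokens = video_token_count(frames, 512 // pixels_per_patch_side)  # 8192

# Full self-attention compares every token with every token.
attention_pairs = total_tokens ** 2                    # 4260096
```

Doubling the resolution at a fixed patch size quadruples the video tokens, from 2,048 to 8,192, while the 16 text tokens stay the same.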

Goes deeper in: LLM Internals → Embeddings → The embedding space

Related explainers

GLM-5V — native multimodal vs vision-bolted — the understanding side of this story: training jointly vs taping an encoder on
NVIDIA Nemotron 3 Nano Omni — multimodal MoE — what a multimodal model looks like when sparsity (mixture-of-experts) meets a shared input pool
DeepSeek V4 — long-context cost — why piling thousands of video tokens into one sequence makes long-context efficiency the next bottleneck
