Gemini Omni — One shared token space for every modality
The news. On May 25, 2026, Google DeepMind introduced Gemini Omni, the first model in a new "Omni" family that accepts image, audio, video, and text and, for now, outputs video, with physics-aware motion, multi-shot scene consistency, and SynthID watermarking. Google frames it as a model that can "create anything from any input — starting with video." Notably, the announcement is a capabilities-and-rollout post: it does not disclose the architecture, parameter count, tokenizer design, or training recipe. So this explainer teaches the established idea that makes any-to-any models possible, and is careful to label where Gemini Omni's specifics are simply unknown.
Picture the airport again. Hire one cashier per currency — one for yen, one for euros, one for pesos — and the place is a maze of booths, each fluent in exactly one money. That is the bolt-on approach to multimodality: a text model with a vision booth taped to its side, an audio booth taped next to that, each speaking only its own dialect. Now quote every price in one shared currency. A single teller takes in any money, counts it in that one currency, and can pay you back in whichever currency you ask for. That shared currency is what a shared token space is for a model: every modality gets converted at the door into the same kind of token, so one network reads them all — and can emit them too.
The conversion at the door is the load-bearing trick. Text already becomes tokens through ordinary tokenization. The other modalities need their own exchange desks: an image is cut into patches and each patch is mapped, typically by vector quantization against a learned codebook, to a discrete code. Audio (sliced into short frames) and video (frame by frame, or in small space-time blocks) can be quantized the same way, and some research designs skip the learned codebook and model raw bytes directly. Once everything is a sequence of token IDs in one shared embedding space, the transformer's attention stops caring whether a token came from a pixel, a waveform, or a word; it just attends over the sequence. Crucially, Google has not published how Gemini Omni does any of this; the patch-to-code recipe above is the general, well-documented mechanism behind any-to-any models, not a quoted Gemini detail.
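To make the patch-to-code step concrete, here is a minimal sketch of nearest-neighbour vector quantization. It is an illustration under loud assumptions, not Gemini Omni's tokenizer: the function names `patchify` and `quantize`, the 4-pixel patch size, and the 32-entry codebook are all hypothetical, the codebook is random rather than learned, and real tokenizers usually quantize encoder features rather than raw pixels.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened
    patch vectors, in row-major patch order."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry
    (squared Euclidean distance). The index is the discrete token."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]

rng = random.Random(0)
PATCH = 4            # hypothetical patch size
CODEBOOK_SIZE = 32   # hypothetical; a random stand-in for a learned codebook

image = [[rng.random() for _ in range(16)] for _ in range(16)]
codebook = [[rng.random() for _ in range(PATCH * PATCH)]
            for _ in range(CODEBOOK_SIZE)]

tokens = quantize(patchify(image, PATCH), codebook)
print(len(tokens), "image tokens:", tokens)
```

A 16×16 image with 4×4 patches yields a 4×4 grid, so 16 token IDs. Whatever the real settings are, the output has the same shape as tokenized text: a flat list of integers.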
The under-appreciated half is generation. A bolted-on vision encoder is a one-way street: it lets the model read an image, but the model can still only write text. When every modality lives in the same token vocabulary, the model can also predict tokens that a decoder turns back into a picture or a video — the teller paying back out. That is why Gemini Omni's headline is output ("starting with video"), not just understanding, and it is the cleanest line between this and an understanding-only system like GLM-5V.
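One simple way to picture "the same token vocabulary" is a single ID range carved into disjoint blocks, one per modality, so that a generated ID can be routed back to the right decoder. The sketch below assumes that layout; the block sizes and the helper names (`to_shared`, `from_shared`, `split_runs`) are hypothetical, and nothing here is a disclosed Gemini Omni detail.

```python
# Hypothetical vocabulary layout: (modality, number of codes).
VOCAB_LAYOUT = [("text", 32000), ("image", 8192), ("audio", 4096)]

def build_offsets(layout):
    """Assign each modality a half-open [start, end) block of shared IDs."""
    offsets, start = {}, 0
    for name, size in layout:
        offsets[name] = (start, start + size)
        start += size
    return offsets, start

OFFSETS, VOCAB_SIZE = build_offsets(VOCAB_LAYOUT)

def to_shared(modality, local_id):
    """Convert a tokenizer-local code into its shared-vocabulary ID."""
    lo, hi = OFFSETS[modality]
    if not 0 <= local_id < hi - lo:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return lo + local_id

def from_shared(token_id):
    """Recover (modality, local code) from a shared-vocabulary ID."""
    for name, (lo, hi) in OFFSETS.items():
        if lo <= token_id < hi:
            return name, token_id - lo
    raise ValueError(f"{token_id} is outside the shared vocabulary")

def split_runs(sequence):
    """Group a generated sequence into contiguous same-modality runs,
    each of which would be handed to that modality's decoder."""
    runs = []
    for token_id in sequence:
        name, local = from_shared(token_id)
        if runs and runs[-1][0] == name:
            runs[-1][1].append(local)
        else:
            runs.append((name, [local]))
    return runs

generated = [5, to_shared("image", 1), to_shared("image", 2), 7]
print(split_runs(generated))
```

In this layout the model's output layer predicts over all `VOCAB_SIZE` IDs at once, which is what lets the same network "pay out" in more than one currency.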
Bolt-on encoder vs one shared token space
| Capability | Encoder bolted onto a text LLM | One shared token space (Omni-style) |
|---|---|---|
| Inputs | Text, plus a vision/audio encoder whose features are projected in | Every modality tokenized into one shared vocabulary |
| Outputs | Text only | Any modality with a decoder that turns its tokens back into media (Gemini Omni: video first) |
| What attention sees | Text tokens + a separate block of image features | One uniform sequence — all tokens attend to all tokens |
| Adding a new modality | Bolt on another encoder + projector and re-tune | Add a tokenizer for it, extend the shared vocabulary with its codes, and train on the new tokens |
General architectural contrast, not a Gemini-Omni-specific claim — Google has not disclosed which design it uses.
A back-of-envelope token count
Walk one clip through the exchange desk (illustrative numbers; the real settings are undisclosed). Take a short clip at 256×256 pixels and 8 frames. Tokenize each frame into a 16×16 grid of patch codes → 256 tokens per frame × 8 frames = 2,048 video tokens. Add a 12-word prompt at roughly 16 text tokens. Both streams become one sequence of about 2,064 tokens drawn from the same vocabulary, and the model attends over all of them uniformly — no separate image-model handoff, no projector glue. Change the resolution or frame count and the token bill moves, but the shape of the idea does not: more signal simply means more tokens in the same space.
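The arithmetic above fits in a few lines. This sketch uses the same illustrative numbers (256×256 pixels, 8 frames, a 16×16 token grid, which implies 16-pixel patches, and about 16 text tokens); the function name `video_tokens` is hypothetical and the settings are not Gemini Omni's.

```python
def video_tokens(height, width, frames, patch):
    """Tokens for a clip when every frame is cut into patch x patch squares
    and each square becomes exactly one token."""
    if height % patch or width % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    per_frame = (height // patch) * (width // patch)
    return per_frame * frames

PROMPT_TOKENS = 16  # rough count for the 12-word prompt in the text

clip = video_tokens(height=256, width=256, frames=8, patch=16)
total = clip + PROMPT_TOKENS
print(clip, total)  # 2048 2064

# Doubling the resolution quadruples the per-frame grid, and so the bill.
print(video_tokens(height=512, width=512, frames=8, patch=16))  # 8192
```

Because full self-attention compares every token with every other, the cost grows roughly with the square of that total, which is why resolution and frame count are the levers that matter.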
Goes deeper in: LLM Internals → Embeddings → The embedding space
Related explainers
- GLM-5V — native multimodal vs vision-bolted — the understanding side of this story: training jointly vs taping an encoder on
- NVIDIA Nemotron 3 Nano Omni — multimodal MoE — what a multimodal model looks like when sparsity (mixture-of-experts) meets a shared input pool
- DeepSeek V4 — long-context cost — why piling thousands of video tokens into one sequence makes long-context efficiency the next bottleneck