The news. On June 3, 2026, Google released Gemma 4 12B, an Apache-2.0 model that drops the separate vision and audio encoders most multimodal models bolt on. Instead it projects both kinds of input straight into the language backbone: vision through a lightweight module (reportedly a single matrix multiply plus positional and normalization terms) and audio into the same dimensional space as text tokens. It is the first mid-sized Gemma to take native audio input, runs on 16 GB of VRAM or unified memory, and reportedly scores near Google's larger 26B mixture-of-experts model.
Picture the meeting. A text prompt is a guest who already speaks the room's language: it walks in and starts talking. A picture and a sound clip do not. The usual fix hires a separate translator for each, a whole extra staffer who listens, re-voices everything, and only then lets the guest join. Those translators are the model's vision and audio encoders, extra networks that run before the language model sees a thing. Gemma 4 12B fires the translators. It teaches pictures and sound to speak the room's language directly, in one quick step, so every guest (text, image, audio) sits at the same table as an ordinary token.
Underneath the metaphor, "speaking the room's language" means landing in the model's embedding space, the dense vectors a transformer actually consumes. A token ID becomes a vector by a lookup; an image patch becomes one by a projection. As a toy example, cut a 256×256 RGB image into 16×16-pixel patches and you get (256/16)² = 256 patches, each a flat list of 16·16·3 = 768 raw numbers. The encoder-based approach pushes patches like these through a vision transformer, tens of attention-and-MLP layers, before the LLM gets a single feature. By Google's description, Gemma's encoder-free path instead applies a single matrix multiply (plus a positional term and normalization) that turns each patch straight into a token with the same shape as a word's embedding. Audio is projected into that same space too. The whole pre-LLM encoder stack collapses to that one projection, and the backbone itself takes over the visual and acoustic processing.
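The toy example above can be sketched in a few lines of NumPy. This is a minimal illustration, not Gemma's actual implementation: the embedding width `D_MODEL = 64`, the random weights, the additive position table, and the choice of RMS normalization without a learned gain are all assumptions made for the sketch, and the helper names (`patchify`, `rms_norm`, `project_patches`) are hypothetical.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Scale each row to unit root-mean-square (no learned gain in this sketch)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def project_patches(patches: np.ndarray, w_proj: np.ndarray, pos_emb: np.ndarray) -> np.ndarray:
    """One matrix multiply, plus a positional term, then normalization."""
    return rms_norm(patches @ w_proj + pos_emb)


rng = np.random.default_rng(0)
D_MODEL = 64  # hypothetical embedding width, chosen small for illustration

image = rng.random((256, 256, 3), dtype=np.float32)   # stand-in for a real RGB image
patches = patchify(image)                             # shape (256, 768)

# Random stand-ins for weights that would be learned in a real model.
w_proj = 0.02 * rng.standard_normal((patches.shape[1], D_MODEL), dtype=np.float32)
pos_emb = 0.02 * rng.standard_normal((patches.shape[0], D_MODEL), dtype=np.float32)

tokens = project_patches(patches, w_proj, pos_emb)    # shape (256, 64)
print(patches.shape, tokens.shape)
```

Each of the 256 rows in `tokens` has the same width as a word embedding would in this toy setup, which is the whole point: nothing downstream needs to know the row started life as pixels.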
| Approach | How an image enters | Separate encoder? | Cost profile |
|---|---|---|---|
| Encoder-based (ViT + projector) | image → vision transformer (tens of layers) → projector → tokens | yes — a full vision network runs first | more parameters and latency before the first output token |
| Encoder-free (Gemma 4 12B) | patches → one matrix multiply (+ position/norm) → tokens | no separate encoder | runs in ~16 GB of VRAM or unified memory; lower pre-decode latency (Google, reported) |
Removing the encoder stack has consequences, but the wins are concrete. A separate vision tower is parameters you store, compute you run, and latency you pay before the first output token. Deleting it is a big reason a 12B model can field images and audio inside 16 GB rather than needing a datacenter card, and part of why Google can claim quality near its 26B mixture-of-experts model despite the smaller, simpler stack. The catch is that the backbone now has to learn visual and acoustic structure itself, with no pretrained encoder doing that work for it. That is plausibly why this ships as a 12B model trained for it from the start rather than a vision adapter glued onto an existing text model. The architectural specifics beyond the single-matmul description are not yet fully documented.
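To put a rough scale on "parameters you store", the back-of-the-envelope count below compares a generic vision-transformer stack with a single patch projection. Every number is an assumption for illustration: the encoder uses a ViT-Large-like shape (24 layers, width 1024), the projection targets a hypothetical backbone width of 4096, and neither figure is a published Gemma 4 12B dimension. Biases, normalization weights, and embeddings are ignored.

```python
def vit_encoder_params(layers: int, width: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of a transformer encoder stack."""
    attention = 4 * width * width          # Q, K, V and output projections
    mlp = 2 * mlp_ratio * width * width    # up-projection and down-projection
    return layers * (attention + mlp)


def patch_projection_params(patch: int, channels: int, d_model: int) -> int:
    """Weight count of one matrix mapping a flattened patch to d_model."""
    return patch * patch * channels * d_model


# Illustrative shapes only; not Gemma's real dimensions.
encoder = vit_encoder_params(layers=24, width=1024)
projection = patch_projection_params(patch=16, channels=3, d_model=4096)

print(f"encoder stack:    {encoder:,}")      # 301,989,888
print(f"patch projection: {projection:,}")   # 3,145,728
print(f"ratio:            {encoder // projection}x")  # 96x
```

Under these assumed shapes the projection is about two orders of magnitude smaller than the encoder it replaces. The saving is real but modest next to a 12B backbone, so the latency avoided before decoding is likely as important as the memory.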
The payoff is a cleaner idea of what "multimodal" even requires. You don't strictly need a bespoke eye and ear bolted onto a language model; if every input can be projected into the same token space, one backbone can read all of them. Gemma 4 12B is a bet that for a small, open model meant to run on modest hardware, fewer moving parts beats a heavier, more specialized stack.
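As a sketch of what "the same token space" means in practice, the snippet below builds one input sequence from three modalities. It assumes a toy embedding width, a toy vocabulary, and a hypothetical 128 features per audio frame. How Gemma 4 12B actually frames and projects audio is not described in the source, so the audio path here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64      # hypothetical embedding width
VOCAB = 1000      # hypothetical vocabulary size
PATCH_DIM = 768   # 16 * 16 * 3, as in the toy image example
AUDIO_DIM = 128   # hypothetical features per audio frame

# Random stand-ins for learned parameters.
embedding_table = rng.standard_normal((VOCAB, D_MODEL))
w_image = rng.standard_normal((PATCH_DIM, D_MODEL))
w_audio = rng.standard_normal((AUDIO_DIM, D_MODEL))

token_ids = np.array([5, 42, 7])                 # three text tokens
image_patches = rng.random((256, PATCH_DIM))     # 256 flattened patches
audio_frames = rng.random((50, AUDIO_DIM))       # 50 audio frames

text_vecs = embedding_table[token_ids]   # lookup: (3, 64)
image_vecs = image_patches @ w_image     # projection: (256, 64)
audio_vecs = audio_frames @ w_audio      # projection: (50, 64)

# One sequence, one width: the backbone sees 309 vectors of the same shape.
sequence = np.concatenate([text_vecs, image_vecs, audio_vecs], axis=0)
print(sequence.shape)  # (309, 64)
```

Once every input is a row of width `D_MODEL`, the backbone needs no modality-specific entry point. Whatever distinguishes a patch from a word has to be learned inside the shared layers.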
Related explainers
- GLM-5V — native multimodal vs vision-bolted — the neighboring question: training a model multimodal from the start versus adapting a text model, a different axis than removing the encoder
- Gemini Omni — modality unification in a shared token space — the same "one token space for every modality" idea, taken to full any-to-any generation
- Gemma 4 QAT — quantization-aware training — the other route to running a real model on modest hardware: shrink the bits, instead of removing the encoder