GLM-5V-Turbo — native multimodal vs text-first vision-bolted designs

[Hero animation: a bolted text-only LLM with a vision projector vs a native multimodal transformer trained on every modality from step 1]

The news. On April 30, 2026, Z.ai and Tsinghua released GLM-5V-Turbo, a foundation model the authors describe as natively multimodal — text, vision, and tool-trajectory signals trained jointly from the very first parameter update rather than bolted onto a finished text LLM. The paper reports the model is strong on multimodal coding, visual tool use, and agentic tasks, framing perception as a core component of reasoning, planning, and tool use rather than an adapter on top. The architectural specifics — encoder front-end, expert layout, routing scheme — are not yet documented in the public release; what is documented is the design philosophy. Read the paper →

Picture two ways to end up "fluent in pictures." A child who grows up in a household where adults speak, gesture, and point at things from day one builds one brain in which words and visual scenes are inseparable — every neuron that fires for "cup" also fires for the round shape on the table. An adult who became fluent in English first and only later took an art-history course works through a translator: the visual scene is described in words, those words enter the language brain that was already shaped, and the answer comes out in the language the brain was hardened around. Both can answer questions about pictures. But ask either to act on what they see — point at the right cup, plan three steps ahead based on a chart — and the bilingual child's response takes a different shape from the adult's translation pipeline.

Native multimodal training is the bilingual-child design. The intent is that during pretraining the objective combines text loss, vision loss, and tool-trajectory loss inside the same training step. In that regime, gradients from a misclassified image patch reach the same FFN weights that gradients from a mispredicted word reach. The transformer's weights are shaped to carry one shared representation, and language is just one slice of it — alongside pixels and tool outputs. The expensive part is that this requires a full pretraining run over multimodal data, not the much cheaper "reuse a text LLM and tack a vision adapter on" pipeline that became the open-source default after Llama. Z.ai paid that cost on purpose, betting that workloads are shifting from "answer a question" to "drive a UI, read a chart, run a tool, decide a next step" — the kinds of tasks where having pixels in the LLM's formative pretraining matters most.
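
To make the "one objective, one step" idea concrete, here is a minimal PyTorch sketch of joint multimodal pretraining. Everything in it (the toy module names, the dimensions, the random stand-in batch) is an illustrative assumption, not GLM-5V-Turbo's actual architecture or recipe; the only point it demonstrates is that image patches and text targets contribute to the same backward pass, so the same FFN weights receive gradients from both modalities.

```python
# Illustrative sketch of joint multimodal pretraining: one loss, one backward pass.
# All names and sizes are toy assumptions, not GLM-5V-Turbo's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyNativeMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # word tokens
        self.patch_embed = nn.Linear(768, d_model)             # image patch features
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, patches):
        # Text tokens and image patches (and, in a real system, tool-trajectory
        # tokens) are concatenated into one sequence before the shared stack.
        seq = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
        return self.lm_head(self.backbone(seq))

model = ToyNativeMultimodalLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step with stand-in data: the vision-conditioned predictions and the
# text predictions feed a single loss, so FFN weights see gradients from both.
text_ids = torch.randint(0, 32000, (2, 16))
patches = torch.randn(2, 8, 768)            # stand-in for ViT-style patch features
targets = torch.randint(0, 32000, (2, 24))  # illustrative targets over the full mixed sequence

logits = model(text_ids, patches)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()   # gradients from the image patches reach the same weights as text gradients
opt.step()
```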

The bolted design is the adult-learner pipeline. The classical LLaVA-style recipe trains a strong text-only LLM first; then attaches a frozen Vision Transformer (ViT) on the side; then trains a small projector MLP that maps the ViT's image-patch embeddings into the LLM's word-embedding space. At inference the image becomes ~256 patch tokens, the projector maps them into that embedding space, and they get prepended to the text prompt — the LLM consumes them as if they were extra "words." The reason this approach won the open-source wave is brutal economics: text data is plentiful and cheap, image-text pair data is scarcer, tool-trajectory data is the rarest of all, and reusing a frozen text LLM amortized the most expensive part of training across many vision derivatives. The architectural cost is that the LLM's internal representations were never shaped by vision data during their formative pretraining. The model is asked, after the fact, to project pixel evidence into a representation that was hardened around language.
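
For contrast, here is a minimal PyTorch sketch of the bolted recipe just described. The dimensions, the 256-patch count, and the two-layer MLP are illustrative stand-ins; the structural point is that only the small projector is new and trainable, while the ViT and the text LLM stay frozen and the projected patch tokens are simply prepended to the text prompt.

```python
# Illustrative sketch of the LLaVA-style bolted pipeline: frozen ViT + frozen text LLM,
# with only a small projector trained to bridge them. Sizes are assumptions.
import torch
import torch.nn as nn

d_vit, d_llm, n_patches = 1024, 4096, 256

# Stand-ins for a pretrained, frozen vision encoder and text-only LLM embedding table.
frozen_vit = nn.Linear(3 * 14 * 14, d_vit).requires_grad_(False)
frozen_llm_embed = nn.Embedding(32000, d_llm).requires_grad_(False)

# The only new, trainable piece: a small MLP that maps ViT patch embeddings
# into the LLM's word-embedding space.
projector = nn.Sequential(nn.Linear(d_vit, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

def build_prompt(image_patches, text_ids):
    """Project image patches and prepend them to the text prompt embeddings."""
    vision_tokens = projector(frozen_vit(image_patches))   # (B, 256, d_llm)
    text_tokens = frozen_llm_embed(text_ids)                # (B, T, d_llm)
    # The frozen LLM consumes the image tokens as if they were extra "words".
    return torch.cat([vision_tokens, text_tokens], dim=1)

prompt = build_prompt(torch.randn(1, n_patches, 3 * 14 * 14),
                      torch.randint(0, 32000, (1, 32)))
print(prompt.shape)   # torch.Size([1, 288, 4096]) -- 256 image tokens + 32 text tokens
```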

[Interactive diagram: a transformer block processing "The cat sat on the mat", from input embeddings through attention (context mixing), residual add & norm, and the feed-forward transform, to the output passed to the next layer]
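
A minimal PyTorch sketch of the block flow in the diagram: attention mixes context across positions, a residual add and layer norm follow, and the feed-forward network transforms each position before the output moves to the next layer. The norm placement and dimensions here are generic illustrations, not any particular model's configuration.

```python
# Illustrative transformer block matching the diagram's flow; sizes are assumptions.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])   # attention + residual add & norm
        x = self.norm2(x + self.ffn(x))             # feed-forward + residual add & norm
        return x                                    # output to the next layer

tokens = torch.randn(1, 6, 512)            # e.g. embeddings for "The cat sat on the mat"
print(TransformerBlock()(tokens).shape)    # torch.Size([1, 6, 512])
```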

The contrast lines up roughly like this:

| Property | Text-first / vision-bolted (LLaVA-style) | Native multimodal (GLM-5V-Turbo-style) |
| --- | --- | --- |
| Training order | LLM first, vision adapter later | Text + vision + tool data jointly from step 1 |
| What the LLM weights see during pretraining | Text only | Every modality |
| Vision pathway | Frozen ViT + projector MLP, prepended to prompt | Visual tokens enter the same transformer stack as text |
| Where reasoning happens | In a language-shaped representation | In a shared representation grounded in all modalities |
| Training cost | Lower — reuses an existing text LLM | Higher — full pretraining over multimodal data |
| Best fit | Visual Q&A on top of an existing text model | Agentic tool use where perception must drive planning |

A worked example sharpens what changes. Imagine the model is asked: "Look at this code-review screenshot and decide whether to approve or request changes." Both pipelines turn the screenshot into roughly 256 patch tokens and feed them into a transformer stack alongside the text prompt. In the bolted case, the LLM's feed-forward weights were optimized to predict the next English word; the patches arrive as foreign vectors clipped onto the front of the prompt, and the model has to reason about pixels through a representation that never saw any. In the native case, the same patch tokens flow through weights that were optimized end-to-end to predict the next thing in any modality — including the next tool call — so a meaningful fraction of FFN capacity was trained to recognize patterns of pixels that drive subsequent tool calls. The same forward-pass FLOPs get spent very differently. The bolted model translates pixels into language; the native model thinks in pixels and language at once.
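
A back-of-the-envelope sketch of the "same FLOPs, spent differently" point, using the common rough estimate of about 2 x parameters x tokens for a dense transformer forward pass. The parameter count and prompt length below are purely hypothetical; only the ~256-patch figure comes from the text above.

```python
# Rough illustration: both pipelines pay the same forward-pass cost for the screenshot.
# The rule of thumb (~2 * params * tokens) and all numbers here are assumptions.
params = 9e9            # hypothetical dense parameter count
image_tokens = 256      # screenshot -> patch tokens (same in both pipelines)
text_tokens = 120       # illustrative prompt length

flops = 2 * params * (image_tokens + text_tokens)
print(f"~{flops:.2e} FLOPs either way")   # identical cost; only what shaped the weights differs
```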

Go deeper in: LLM Internals → Embeddings → Vector Space and LLM Internals → Transformer Block → The Feed-Forward Network

Frequently Asked Questions