GLM-5V-Turbo — native multimodal vs text-first vision-bolted designs

[Hero animation: a bolted text-only LLM with a vision projector vs a native multimodal transformer trained on every modality from step 1]

The news. On April 30, 2026, Z.ai and Tsinghua released GLM-5V-Turbo, a foundation model the authors describe as natively multimodal — text, vision, and tool-trajectory signals trained jointly from the very first parameter update rather than bolted onto a finished text LLM. The paper reports the model is strong on multimodal coding, visual tool use, and agentic tasks, framing perception as a core component of reasoning, planning, and tool use rather than an adapter on top. The architectural specifics — encoder front-end, expert layout, routing scheme — are not yet documented in the public release; what is documented is the design philosophy. Read the paper →

Picture two ways to end up "fluent in pictures." A child who grows up in a household where adults speak, gesture, and point at things from day one builds one brain in which words and visual scenes are inseparable — every neuron that fires for "cup" also fires for the round shape on the table. An adult who became fluent in English first and only later took an art-history course works through a translator: the visual scene is described in words, those words enter the language brain that was already shaped, and the answer comes out in the language the brain was hardened around. Both can answer questions about pictures. But ask either to act on what they see — point at the right cup, plan three steps ahead based on a chart — and the bilingual child's response takes a different shape from the adult's translation pipeline.

Native multimodal training is the bilingual-child design. The intent is that during pretraining the objective combines text loss, vision loss, and tool-trajectory loss inside the same training step. In that regime, gradients from a misclassified image patch reach the same FFN weights that gradients from a mispredicted word reach. The transformer's weights are shaped to carry one shared representation, and language is just one slice of it — alongside pixels and tool outputs. The expensive part is that this requires a full pretraining run over multimodal data, not the much cheaper "reuse a text LLM and tack a vision adapter on" pipeline that became the open-source default after Llama. Z.ai paid that cost on purpose, betting that workloads are shifting from "answer a question" to "drive a UI, read a chart, run a tool, decide a next step" — the kinds of tasks where having pixels in the LLM's formative pretraining matters most.
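
To make the "one objective, one step" idea concrete, here is a minimal PyTorch sketch of joint multimodal pretraining. Everything in it (the toy module names, the dimensions, the random stand-in batch) is an illustrative assumption, not GLM-5V-Turbo's actual architecture or recipe; the only point it demonstrates is that image patches and text targets contribute to the same backward pass, so the same FFN weights receive gradients from both modalities.

```python
# Illustrative sketch of joint multimodal pretraining: one loss, one backward pass.
# All names and sizes are toy assumptions, not GLM-5V-Turbo's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyNativeMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # word tokens
        self.patch_embed = nn.Linear(768, d_model)             # image patch features
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, patches):
        # Text tokens and image patches (and, in a real system, tool-trajectory
        # tokens) are concatenated into one sequence before the shared stack.
        seq = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
        return self.lm_head(self.backbone(seq))

model = ToyNativeMultimodalLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step with stand-in data: the vision-conditioned predictions and the
# text predictions feed a single loss, so FFN weights see gradients from both.
text_ids = torch.randint(0, 32000, (2, 16))
patches = torch.randn(2, 8, 768)            # stand-in for ViT-style patch features
targets = torch.randint(0, 32000, (2, 24))  # illustrative targets over the full mixed sequence

logits = model(text_ids, patches)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()   # gradients from the image patches reach the same weights as text gradients
opt.step()
```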

The bolted design is the adult-learner pipeline. The classical LLaVA-style recipe trains a strong text-only LLM first; then attaches a frozen Vision Transformer (ViT) on the side; then trains a small projector MLP that maps the ViT's image-patch embeddings into the LLM's word-embedding space. At inference the image becomes ~256 patch tokens, the projector maps them into that embedding space, and they get prepended to the text prompt — the LLM consumes them as if they were extra "words." The reason this approach won the open-source wave is brutal economics: text data is plentiful and cheap, image-text pair data is scarcer, tool-trajectory data is the rarest of all, and reusing a frozen text LLM amortized the most expensive part of training across many vision derivatives. The architectural cost is that the LLM's internal representations were never shaped by vision data during their formative pretraining. The model is asked, after the fact, to project pixel evidence into a representation that was hardened around language.
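
For contrast, here is a minimal PyTorch sketch of the bolted recipe just described. The dimensions, the 256-patch count, and the two-layer MLP are illustrative stand-ins; the structural point is that only the small projector is new and trainable, while the ViT and the text LLM stay frozen and the projected patch tokens are simply prepended to the text prompt.

```python
# Illustrative sketch of the LLaVA-style bolted pipeline: frozen ViT + frozen text LLM,
# with only a small projector trained to bridge them. Sizes are assumptions.
import torch
import torch.nn as nn

d_vit, d_llm, n_patches = 1024, 4096, 256

# Stand-ins for a pretrained, frozen vision encoder and text-only LLM embedding table.
frozen_vit = nn.Linear(3 * 14 * 14, d_vit).requires_grad_(False)
frozen_llm_embed = nn.Embedding(32000, d_llm).requires_grad_(False)

# The only new, trainable piece: a small MLP that maps ViT patch embeddings
# into the LLM's word-embedding space.
projector = nn.Sequential(nn.Linear(d_vit, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

def build_prompt(image_patches, text_ids):
    """Project image patches and prepend them to the text prompt embeddings."""
    vision_tokens = projector(frozen_vit(image_patches))   # (B, 256, d_llm)
    text_tokens = frozen_llm_embed(text_ids)                # (B, T, d_llm)
    # The frozen LLM consumes the image tokens as if they were extra "words".
    return torch.cat([vision_tokens, text_tokens], dim=1)

prompt = build_prompt(torch.randn(1, n_patches, 3 * 14 * 14),
                      torch.randint(0, 32000, (1, 32)))
print(prompt.shape)   # torch.Size([1, 288, 4096]) -- 256 image tokens + 32 text tokens
```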

[Interactive diagram: a transformer block processing "The cat sat on the mat", from input embeddings through attention (context mixing), residual add & norm, and the feed-forward transform, to the output passed to the next layer]
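
A minimal PyTorch sketch of the block flow in the diagram: attention mixes context across positions, a residual add and layer norm follow, and the feed-forward network transforms each position before the output moves to the next layer. The norm placement and dimensions here are generic illustrations, not any particular model's configuration.

```python
# Illustrative transformer block matching the diagram's flow; sizes are assumptions.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])   # attention + residual add & norm
        x = self.norm2(x + self.ffn(x))             # feed-forward + residual add & norm
        return x                                    # output to the next layer

tokens = torch.randn(1, 6, 512)            # e.g. embeddings for "The cat sat on the mat"
print(TransformerBlock()(tokens).shape)    # torch.Size([1, 6, 512])
```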

The contrast lines up roughly like this:

| Property | Text-first / vision-bolted (LLaVA-style) | Native multimodal (GLM-5V-Turbo-style) |
| --- | --- | --- |
| Training order | LLM first, vision adapter later | Text + vision + tool data jointly from step 1 |
| What the LLM weights see during pretraining | Text only | Every modality |
| Vision pathway | Frozen ViT + projector MLP, prepended to prompt | Visual tokens enter the same transformer stack as text |
| Where reasoning happens | In a language-shaped representation | In a shared representation grounded in all modalities |
| Training cost | Lower — reuses an existing text LLM | Higher — full pretraining over multimodal data |
| Best fit | Visual Q&A on top of an existing text model | Agentic tool use where perception must drive planning |

A worked example sharpens what changes. Imagine the model is asked: "Look at this code-review screenshot and decide whether to approve or request changes." Both pipelines turn the screenshot into roughly 256 patch tokens and feed them into a transformer stack alongside the text prompt. In the bolted case, the LLM's feed-forward weights were optimized to predict the next English word; the patches arrive as foreign vectors clipped onto the front of the prompt, and the model has to reason about pixels through a representation that never saw any. In the native case, the same patch tokens flow through weights that were optimized end-to-end to predict the next thing in any modality — including the next tool call — so a meaningful fraction of FFN capacity was trained to recognize patterns of pixels that drive subsequent tool calls. The same forward-pass FLOPs get spent very differently. The bolted model translates pixels into language; the native model thinks in pixels and language at once.
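
A back-of-the-envelope sketch of the "same FLOPs, spent differently" point, using the common rough estimate of about 2 x parameters x tokens for a dense transformer forward pass. The parameter count and prompt length below are purely hypothetical; only the ~256-patch figure comes from the text above.

```python
# Rough illustration: both pipelines pay the same forward-pass cost for the screenshot.
# The rule of thumb (~2 * params * tokens) and all numbers here are assumptions.
params = 9e9            # hypothetical dense parameter count
image_tokens = 256      # screenshot -> patch tokens (same in both pipelines)
text_tokens = 120       # illustrative prompt length

flops = 2 * params * (image_tokens + text_tokens)
print(f"~{flops:.2e} FLOPs either way")   # identical cost; only what shaped the weights differs
```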

Go deeper in: LLM Internals → Embeddings → Vector Space and LLM Internals → Transformer Block → The Feed-Forward Network

Frequently Asked Questions