What are text-aligned quantized visual tokens?

They are discrete visual codes — chosen from a fixed codebook — whose meaning is trained to line up with language, rather than continuous per-patch vectors. ViQ (Tencent Hunyuan, arXiv 2606.27313, June 2026) produces them with a quantized visual tokenizer trained on a text-alignment objective, so each visual token corresponds to a text-grounded concept. Because the image becomes a sequence of discrete tokens, a multimodal LLM can consume it the same way it consumes text tokens, and ViQ keeps those codes stable across image resolutions without retraining per scale.

How is this different from continuous patch embeddings?

A typical vision encoder gives the model a unique continuous vector for each image patch — expressive, but foreign to the model's discrete token vocabulary and usually tied to the resolution it was trained on. ViQ instead maps patches onto a fixed codebook of discrete, text-aligned tokens, so the picture arrives as token IDs that fit an LLM naturally, and the same codebook serves any resolution. It is the difference between painting every patch freehand and describing the image with a reusable set of labeled stamps.

Why does resolution-agnostic matter?

Because vision systems usually bake the input size into the model, so a new image resolution means new positional handling or fine-tuning. ViQ's codebook is the same regardless of scale: a higher-resolution image simply becomes more tokens drawn from the same set, not a different encoder. That makes one tokenizer usable across thumbnails and large images alike, which is valuable when a multimodal LLM has to handle pictures of wildly varying size.

ViQ: text-aligned visual tokens, quantized at any image resolution — Text-aligned quantized visual tokens vs continuous patches

Jargon

Visual tokenizer: The component that turns an image into the units an LLM consumes. ViQ's job is to make those units discrete tokens, the visual analogue of how a text tokenizer splits a sentence.
Vector quantization (VQ): Replacing a continuous vector with the nearest entry from a fixed codebook — like rounding a precise value to the closest mark on a ruler. It is the same discretize-to-a-grid move as weight quantization, applied to image features.
Codebook: The fixed vocabulary of visual tokens ViQ maps patches onto. Each entry is one reusable "stamp"; an image becomes a sequence of codebook IDs. ViQ does not disclose its codebook size.
Text-aligned: The codebook is trained so its entries correspond to text-grounded concepts: the visual token for a furry ear lives near the meaning of the word "ear" in the shared semantic space, instead of encoding raw pixels.
Continuous vs discrete representation: Continuous = an unbounded vector of real numbers per patch; discrete = a choice from a finite set. LLMs are built for discrete tokens, so discrete visual codes drop in more naturally.
Resolution-agnostic: The same codebook describes a thumbnail and a billboard — a higher-resolution image just becomes more tokens from the same set, with no retraining per scale.

The news. In June 2026, Tencent Hunyuan released ViQ (arXiv 2606.27313), a method for producing text-aligned visual quantized representations that stay stable across arbitrary image resolutions. Rather than encoding images into continuous features, ViQ learns discrete visual codes whose semantics stay aligned with text, so a multimodal LLM can consume images as quantized tokens at any resolution without retraining per scale. The paper trains a quantized visual tokenizer with a text-alignment objective; it does not disclose the codebook size or the exact quantization scheme.

Start with how a model usually "sees." A standard vision encoder cuts the image into patches and hands the LLM a continuous vector for each patch — a long list of real numbers, unique to that patch, freshly painted every time. It is freehand: expressive, but every smear is one-of-a-kind, foreign to the model's word vocabulary, and tied to the input size the encoder was trained on. That mismatch is why bolting vision onto a language model has always felt like gluing two different alphabets together.

ViQ swaps the paintbrush for a box of labeled rubber stamps. Instead of a unique vector per patch, it keeps a fixed codebook — a finite set of visual tokens — and describes each patch with the nearest stamp. This is vector quantization: the same "round a continuous value to the closest mark" move that weight quantization uses, except the marks here are whole visual concepts. The picture stops being a wall of fresh smears and becomes what an LLM was built for: a sequence of discrete tokens, exactly like the tokens of text.
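The nearest-stamp lookup can be sketched in a few lines. This is a minimal illustration of generic vector quantization, not ViQ's actual scheme (which the paper does not disclose); the 2-D toy codebook, the patch values, and the function names `squared_distance` and `quantize` are all assumptions made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patch_features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each continuous patch vector with the ID of its nearest codebook entry."""
    token_ids = []
    for feature in patch_features:
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(feature, codebook[i]),
        )
        token_ids.append(nearest)
    return token_ids


# Hypothetical toy codebook: four "stamps" in a 2-D feature space.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical continuous patch features from a vision encoder.
patches = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

token_ids = quantize(patches, codebook)
print(token_ids)  # [0, 3, 2]
```

The output is a list of integer IDs, which is the form an LLM already handles for text. A real tokenizer would use far higher-dimensional features and a much larger codebook, and would learn the codebook rather than fix it by hand.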

The "text-aligned" part is what makes the stamps worth having. ViQ trains the codebook so each stamp's meaning lives next to the matching word in the shared semantic space — the visual token for a furry ear sits near the idea of an ear, not near a patch of brown pixels. A plain image-only codebook (the classic VQ-VAE move) would give you discrete tokens, but tokens that mean nothing to the language side; aligning them to text is what lets the LLM treat a picture and a paragraph as the same kind of thing.
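To make "lives next to the matching word" concrete, here is a small sketch that looks up the closest word for a visual code by cosine similarity in a shared embedding space. It only illustrates what alignment means once training is done; it is not ViQ's training objective, and the 3-D vectors, the word list, and the names `cosine_similarity` and `nearest_concept` are hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_concept(code_vector: Vector, text_embeddings: Dict[str, Vector]) -> str:
    """Return the word whose embedding is most similar to the visual code."""
    return max(
        text_embeddings,
        key=lambda word: cosine_similarity(code_vector, text_embeddings[word]),
    )


# Hypothetical text embeddings in a shared 3-D semantic space.
text_embeddings = {
    "ear": (0.9, 0.1, 0.0),
    "wheel": (0.0, 0.2, 0.9),
    "sky": (0.1, 0.9, 0.1),
}

# Hypothetical codebook entry for a "furry ear" patch after text alignment.
furry_ear_code = (0.8, 0.2, 0.1)

closest_word = nearest_concept(furry_ear_code, text_embeddings)
print(closest_word)  # ear
```

In an image-only codebook there is no reason for this lookup to return anything meaningful, because the codes were optimized to reconstruct pixels rather than to sit near words.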

Walk through a quick illustrative example (ViQ does not publish its codebook size or token counts). Suppose the codebook holds 8,192 visual tokens and a 256×256 image becomes 256 of them. Double each side to 512×512, which is four times the pixels, and you simply get more tokens drawn from the same 8,192 (about 1,024 if each token still covers the same patch area), not a different encoder and not a retrain. That is the whole resolution-agnostic claim in one line: scale changes how many stamps you use, never which box you reach into. Continuous patch encoders, by contrast, usually bake the input size into the model, so a new resolution often means new positional handling or fine-tuning.
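The arithmetic behind that example can be checked with a short sketch. It assumes one token per non-overlapping 16×16-pixel patch, which is the patch size implied by the illustrative numbers (256 tokens from a 256×256 image), not a figure from the paper; `token_count` and `CODEBOOK_SIZE` are hypothetical names.

```python
CODEBOOK_SIZE = 8192  # illustrative value from the example, not disclosed by ViQ


def token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Number of tokens when each non-overlapping patch maps to one codebook ID."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


# The codebook stays the same size; only the sequence length grows with resolution.
for side in (256, 512, 1024):
    print(f"{side}x{side}: {token_count(side, side)} tokens "
          f"from a codebook of {CODEBOOK_SIZE}")
# 256x256: 256 tokens from a codebook of 8192
# 512x512: 1024 tokens from a codebook of 8192
# 1024x1024: 4096 tokens from a codebook of 8192
```

Token count grows with pixel area, so doubling each side quadruples the sequence length while the vocabulary the tokens are drawn from is unchanged.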

How the image is represented	What the LLM receives	Cost & fit
Continuous patch features (typical vision encoder)	a unique real-valued vector per patch	Expressive, but foreign to the token vocabulary and often size-locked
Image-only VQ (e.g. VQ-VAE)	discrete codes optimized to reconstruct pixels	Token-like, but the codes are not grounded in language
ViQ (text-aligned VQ)	discrete codes aligned to text concepts, any resolution [paper]	Drops into an LLM like text tokens; one codebook across scales


Related explainers

Gemma 4 12B — encoder-free multimodal — a different answer to the same question of how pixels should enter an LLM.
Gemini Omni — a shared token space — the broader goal ViQ serves: making every modality live in one token space.
Keye-VL 2.0 — DeepSeek sparse attention for video — what you do after images are tokens, when there are far too many of them.

