The news. In June 2026, Tencent Hunyuan released ViQ (arXiv 2606.27313), a method for producing text-aligned visual quantized representations that stay stable across arbitrary image resolutions. Rather than encoding images into continuous features, ViQ learns discrete visual codes whose semantics stay aligned with text, so a multimodal LLM can consume images as quantized tokens at any resolution without retraining per scale. The paper trains a quantized visual tokenizer with a text-alignment objective; it does not disclose the codebook size or the exact quantization scheme.
Start with how a model usually "sees." A standard vision encoder cuts the image into patches and hands the LLM a continuous vector for each patch — a long list of real numbers, unique to that patch, freshly painted every time. It is freehand: expressive, but every smear is one-of-a-kind, foreign to the model's word vocabulary, and tied to the input size the encoder was trained on. That mismatch is why bolting vision onto a language model has always felt like gluing two different alphabets together.
ViQ swaps the paintbrush for a box of labeled rubber stamps. Instead of a unique vector per patch, it keeps a fixed codebook — a finite set of visual tokens — and describes each patch with the nearest stamp. This is vector quantization: the same "round a continuous value to the closest mark" move that weight quantization uses, except the marks here are whole visual concepts. The picture stops being a wall of fresh smears and becomes what an LLM was built for: a sequence of discrete tokens, exactly like the tokens of text.
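The "nearest stamp" step can be sketched in a few lines. This is a minimal illustration of generic nearest-neighbor vector quantization, not ViQ's actual scheme (the paper does not disclose it); the function names, the Euclidean distance, and the toy sizes are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(patch_vectors, codebook):
    """Map each continuous patch vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(vec, codebook[k]))
        for vec in patch_vectors
    ]

# Toy sizes chosen only for illustration (assumptions, not from the paper).
rng = random.Random(0)
DIM, CODEBOOK_SIZE, NUM_PATCHES = 4, 8, 6

# A random stand-in for a learned codebook and for encoder outputs.
codebook = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
patches = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

token_ids = quantize(patches, codebook)
print(token_ids)  # one integer per patch, each in range(CODEBOOK_SIZE)
```

The output is a plain list of integers, which is the point: after this step the image is a sequence of token ids that an LLM can look up in an embedding table the same way it looks up text tokens.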
The "text-aligned" part is what makes the stamps worth having. ViQ trains the codebook so each stamp's meaning lives next to the matching word in the shared semantic space — the visual token for a furry ear sits near the idea of an ear, not near a patch of brown pixels. A plain image-only codebook (the classic VQ-VAE move) would give you discrete tokens, but tokens that mean nothing to the language side; aligning them to text is what lets the LLM treat a picture and a paragraph as the same kind of thing.
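One common way to pull a visual code toward its matching word is a contrastive objective over cosine similarities. The sketch below shows that generic idea only; it is not the objective ViQ uses, and the embeddings, the `alignment_loss` helper, and the temperature value are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(code_vec, text_vecs, target, temperature=0.1):
    """Cross-entropy over cosine similarities.

    The loss is small when code_vec is closer to text_vecs[target]
    than to any other text embedding.
    """
    logits = [cosine(code_vec, t) / temperature for t in text_vecs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

# Made-up 3-d embeddings for two words (illustrative values only).
text_vecs = [
    [1.0, 0.0, 0.0],  # "ear"
    [0.0, 1.0, 0.0],  # "brown"
]
aligned_code = [0.9, 0.1, 0.0]    # visual code that sits near "ear"
unaligned_code = [0.1, 0.9, 0.0]  # visual code that sits near "brown"

print(alignment_loss(aligned_code, text_vecs, target=0))    # small loss
print(alignment_loss(unaligned_code, text_vecs, target=0))  # much larger loss
```

Training against a loss of this shape would push the "furry ear" code toward the "ear" embedding and away from unrelated words, which is the behavior the paragraph above describes.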
Walk through a quick illustrative example (ViQ does not publish its codebook size or token counts). Suppose the codebook holds 8,192 visual tokens and a 256×256 image becomes 256 of them. Double each side to 512×512, which quadruples the pixel count, and you simply get about four times as many tokens (say ~1,024) drawn from the same 8,192, not a different encoder and not a retrain. That is the whole resolution-agnostic claim in one line: scale changes how many stamps you use, never which box you reach into. Continuous patch encoders, by contrast, usually bake the input size into the model, so a new resolution often means new positional handling or fine-tuning.
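The arithmetic behind that example is easy to check. The sketch below assumes each token covers a fixed 16×16 pixel patch, which is simply the value that makes a 256×256 image yield 256 tokens; the patch size, the `token_count` helper, and the 8,192 codebook size are illustrative assumptions, not figures from the paper.

```python
def token_count(height, width, patch=16):
    """Number of tokens when each token covers a patch x patch pixel block."""
    if height % patch or width % patch:
        raise ValueError("this sketch assumes sides divisible by the patch size")
    return (height // patch) * (width // patch)

CODEBOOK_SIZE = 8192  # illustrative, matching the example in the text

for side in (256, 512, 1024):
    n = token_count(side, side)
    # The token count grows with resolution; the codebook does not change.
    print(f"{side}x{side}: {n} tokens, ids drawn from range({CODEBOOK_SIZE})")
```

Under these assumptions the counts are 256, 1,024, and 4,096 tokens: the sequence grows with the pixel count while the vocabulary stays fixed.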
| How the image is represented | What the LLM receives | Cost & fit |
|---|---|---|
| Continuous patch features (typical vision encoder) | a unique real-valued vector per patch | Expressive, but foreign to the token vocabulary and often size-locked |
| Image-only VQ (e.g. VQ-VAE) | discrete codes optimized to reconstruct pixels | Token-like, but the codes are not grounded in language |
| ViQ (text-aligned VQ) | discrete codes aligned to text concepts, any resolution [paper] | Drops into an LLM like text tokens; one codebook across scales |
Related explainers
- Gemma 4 12B — encoder-free multimodal — a different answer to the same question of how pixels should enter an LLM.
- Gemini Omni — a shared token space — the broader goal ViQ serves: making every modality live in one token space.
- Keye-VL 2.0 — DeepSeek sparse attention for video — what you do after images are tokens, when there are far too many of them.