The news. In June 2026, Tencent Hunyuan released ViQ (arXiv 2606.27313), a method for producing text-aligned visual quantized representations that stay stable across arbitrary image resolutions. Rather than encoding images into continuous features, ViQ learns discrete visual codes whose semantics stay aligned with text, so a multimodal LLM can consume images as quantized tokens at any resolution without retraining per scale. The paper trains a quantized visual tokenizer with a text-alignment objective; it does not disclose the codebook size or the exact quantization scheme.

Start with how a model usually "sees." A standard vision encoder cuts the image into patches and hands the LLM a continuous vector for each patch — a long list of real numbers, unique to that patch, freshly painted every time. It is freehand: expressive, but every smear is one-of-a-kind, foreign to the model's word vocabulary, and tied to the input size the encoder was trained on. That mismatch is why bolting vision onto a language model has always felt like gluing two different alphabets together.

ViQ swaps the paintbrush for a box of labeled rubber stamps. Instead of a unique vector per patch, it keeps a fixed codebook — a finite set of visual tokens — and describes each patch with the nearest stamp. This is vector quantization: the same "round a continuous value to the closest mark" move that weight quantization uses, except the marks here are whole visual concepts. The picture stops being a wall of fresh smears and becomes what an LLM was built for: a sequence of discrete tokens, exactly like the tokens of text.
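The "nearest stamp" step can be sketched in a few lines. This is a minimal illustration of generic vector quantization, not ViQ's actual scheme (which the paper does not disclose); the function name `quantize`, the four-entry 2-D codebook, and the patch values are all made up for the example.

```python
import math

def quantize(patch_vectors, codebook):
    """Map each continuous patch vector to the index of its nearest codebook entry."""
    token_ids = []
    for vec in patch_vectors:
        # Euclidean distance from this patch to every stamp in the box.
        distances = [math.dist(vec, code) for code in codebook]
        token_ids.append(distances.index(min(distances)))
    return token_ids

# Toy codebook of four 2-D "stamps" (hypothetical values; real codebooks are far larger).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Three continuous patch features, standing in for what an encoder might emit.
patches = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

token_ids = quantize(patches, codebook)
print(token_ids)  # → [0, 3, 2]
```

The output is a list of integers, which is the point: once each patch is an index into a fixed table, the image is a token sequence of the same kind as text.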

The "text-aligned" part is what makes the stamps worth having. ViQ trains the codebook so each stamp's meaning lives next to the matching word in the shared semantic space — the visual token for a furry ear sits near the idea of an ear, not near a patch of brown pixels. A plain image-only codebook (the classic VQ-VAE move) would give you discrete tokens, but tokens that mean nothing to the language side; aligning them to text is what lets the LLM treat a picture and a paragraph as the same kind of thing.
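One way to picture "lives next to the matching word" is cosine similarity in a shared embedding space. The sketch below only illustrates that idea; it is not ViQ's training objective, and every vector and name in it (`text_embeddings`, `aligned_code`, `pixel_code`, `nearest_word`) is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical 3-D word embeddings in a shared image-text space.
text_embeddings = {
    "ear": (0.9, 0.1, 0.0),
    "wheel": (0.0, 0.2, 0.9),
    "brown": (0.1, 0.9, 0.1),
}

# Two made-up codebook entries for the same furry-ear patch.
aligned_code = (0.8, 0.2, 0.1)  # what a text-aligned codebook might learn
pixel_code = (0.2, 0.9, 0.0)    # what a reconstruction-only codebook might learn

def nearest_word(code):
    """Return the word whose embedding is most similar to the given code."""
    return max(text_embeddings, key=lambda w: cosine(code, text_embeddings[w]))

print(nearest_word(aligned_code))  # → ear
print(nearest_word(pixel_code))    # → brown
```

In this toy setup the aligned code lands beside the concept, while the image-only code lands beside a surface property of the pixels, which is the contrast the paragraph above draws.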

Walk through a quick illustrative example (ViQ does not publish its codebook size or token counts). Suppose the codebook holds 8,192 visual tokens and a 256×256 image becomes 256 of them. Double each side to 512×512, which quadruples the pixel count, and you simply get more tokens drawn from the same 8,192 (say ~1,024), not a different encoder and not a retrain. That is the whole resolution-agnostic claim in one line: scale changes how many stamps you use, never which box you reach into. Continuous patch encoders, by contrast, usually bake the input size into the model, so a new resolution often means new positional handling or fine-tuning.
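The arithmetic behind those illustrative numbers is easy to check. The sketch assumes one token per 16×16 patch, a patch size chosen only because it reproduces the 256-token figure above; ViQ does not publish its patch size, token counts, or codebook size, and `token_count` is a hypothetical helper.

```python
def token_count(height, width, patch_size=16):
    """Number of visual tokens if each patch_size x patch_size patch maps to one token."""
    return (height // patch_size) * (width // patch_size)

CODEBOOK_SIZE = 8192  # illustrative figure from the text, not a published ViQ value

for side in (256, 512, 1024):
    n = token_count(side, side)
    print(f"{side}x{side}: {n} tokens, all drawn from the same {CODEBOOK_SIZE} codes")
```

Under this assumption the token count grows with image area (256, then 1,024, then 4,096), while `CODEBOOK_SIZE` never changes.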

| How the image is represented | What the LLM receives | Cost & fit |
| --- | --- | --- |
| Continuous patch features (typical vision encoder) | a unique real-valued vector per patch | Expressive, but foreign to the token vocabulary and often size-locked |
| Image-only VQ (e.g. VQ-VAE) | discrete codes optimized to reconstruct pixels | Token-like, but the codes are not grounded in language |
| ViQ (text-aligned VQ) | discrete codes aligned to text concepts, any resolution [paper] | Drops into an LLM like text tokens; one codebook across scales |
