What is a fused INT8 GEMM kernel?

It is a single GPU kernel that runs a quantized matrix multiply entirely in 8-bit integers. The multiply executes as int8 × int8 → int32 directly on the tensor cores, and the dequantization back to bf16 — plus the per-token and per-channel scaling and the bias add — is folded into the kernel's epilogue, the final stage after the multiply-accumulate. In arXiv 2606.14598 (built for the Ideogram 4.0 diffusion transformer), this fused Triton kernel runs 2.8–4.2x faster per GEMM than the bf16 version, with cosine similarity 1.0 to the baseline.

Why is naive INT8 quantization sometimes slower than FP8?

Because the typical INT8 deployment dequantizes the 8-bit weights back to bf16 before it multiplies. That means the actual matrix multiply runs in bf16, at bf16 speed, and the fast INT8 tensor cores never switch on — and you have also paid for the conversion. On a consumer Ampere GPU that combination can land an INT8 model slower than an FP8 or NF4 one, even though INT8 was chosen for speed. The fix is to keep the multiply in INT8 on the tensor cores and convert only the final result.

How does folding dequantization into the epilogue help?

The epilogue is the last stage of a GEMM kernel, after the multiply-accumulate. By converting the int32 running total back to bf16 there — and applying the per-channel and per-token scales and the bias in the same pass — the expensive matrix multiply stays at INT8 tensor-core speed and the cheap conversion happens exactly once, on the final output, inside the same kernel. There is no round trip to bf16 before the multiply and no extra kernel launch, so the model keeps INT8's speed instead of throwing it away.

INT8 finally beats FP8 on consumer GPUs — Fused INT8 GEMM kernel

Jargon

INT8 / W8A8: INT8 is the 8-bit integer number format. W8A8 means both the weights and the activations are quantized to 8-bit integers, so the matrix multiply itself can run in integer math. See how numbers shrink.
GEMM: General Matrix Multiply — the matrix-times-matrix operation at the core of every transformer layer. It is where most of the compute goes, so making the GEMM faster makes the model faster. See matrix multiply on a GPU.
Tensor core: A dedicated GPU unit that multiplies a small matrix tile in a single shot. On Ampere (RTX 30-series) it can take int8 × int8 and accumulate into a 32-bit integer — much faster than doing the same multiply on the general CUDA cores. See beyond CUDA cores.
Dequantization: Converting a low-bit value back to a higher-precision number by multiplying it by a stored scale factor. The question is when you do it — before the matmul (the trap) or fused into the very end (the fix).
Epilogue: The final stage of a GEMM kernel, after the multiply-accumulate finishes, where you apply the output scale, add the bias, and run any activation before writing the result out. The fused kernel hides the dequantization here.
Kernel fusion: Combining several GPU operations into one kernel launch so intermediate values never round-trip to slow HBM memory. See operator fusion.
Per-token / per-channel scale: Each row (token) and each column (channel) of the matrix gets its own dequant scale, so a few outlier values don't blow up the error for the whole tensor.
Triton autotuning: Triton searches over tile sizes and kernel configurations for each GEMM shape to find the fastest one, instead of a human hand-picking it. See how Triton works.
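
To make the per-token scale entry concrete, here is a minimal sketch in plain Python. It assumes symmetric absmax quantization (scale = largest absolute value / 127), which is a common choice but is not confirmed as the paper's exact scheme. The function names and the sample activations are hypothetical.

```python
def quantize_rows(matrix):
    """Symmetric absmax INT8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric int8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def row_error(row, q_row, scale):
    """Largest absolute error after multiplying a quantized row back by its scale."""
    return max(abs(v - q * scale) for v, q in zip(row, q_row))


# Hypothetical activations: token 0 is small, token 1 carries an outlier.
x = [[0.1, -0.4, 0.3], [50.0, -20.0, 10.0]]

# Per-token: every row (token) gets its own scale.
q_tok, s_tok = quantize_rows(x)

# Per-tensor baseline: flatten so all six values share one scale.
flat = [v for row in x for v in row]
q_flat, s_flat = quantize_rows([flat])

# Compare the damage done to the small token under each scheme.
err_per_token = row_error(x[0], q_tok[0], s_tok[0])
err_per_tensor = row_error(x[0], q_flat[0][:3], s_flat[0])
print("token 0 error, per-token scale: ", err_per_token)
print("token 0 error, per-tensor scale:", err_per_tensor)
```

With a single shared scale, the outlier in token 1 sets the step size for everything, and the small token is rounded almost to nothing. The same function applied to the rows of a transposed weight matrix would give one scale per output channel.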

The news. On June 12, 2026, a paper (arXiv 2606.14598), written while building the Ideogram 4.0 diffusion transformer, introduced a single fused Triton INT8 GEMM kernel. Its starting observation is awkward: on consumer Ampere GPUs, W8A8 INT8 models, which are deployed for speed, frequently run slower than FP8 or NF4, because the typical kernel dequantizes the 8-bit weights back to bf16 before it multiplies, so it never touches the INT8 tensor cores. The kernel keeps the multiply in INT8 on the tensor cores, converts only the final result back to bf16, and reports 2.8–4.2× faster per GEMM than the bf16 version with no measurable quality loss (the paper puts the output at cosine similarity 1.0 to the bf16 baseline).

Picture the coin-counting machine. You switched the model's weights to coins — INT8, tiny denominations — for one reason: a coin-counting machine can rip through a giant pile of them fast. But the naive INT8 deployment does something strange before it counts. It melts every coin back into a paper bill — dequantizes each 8-bit weight back to bf16 — and then tallies the bills by hand. You paid to mint the coins, you paid to melt them, and you still counted at bill speed. The INT8 tensor cores never switch on, so the matmul runs at bf16 speed after a conversion you paid for. That is how a model you quantized to INT8 can land slower than the FP8 it was supposed to beat.

The reason coins are worth it is the machine. A tensor core is a dedicated unit that multiplies a whole tile of numbers in a single shot, and on Ampere a tensor core speaks INT8 directly — it takes int8 × int8 and piles the products into a 32-bit integer running total. Feed it coins and it is the fastest counter in the building. Melt them into bills first and you have locked the machine in a closet and gone back to counting by hand.
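
Why a 32-bit running total rather than something narrower? A quick worked calculation, assuming the standard signed ranges for int8, int16, and int32 (the variable names are illustrative only):

```python
INT8_MIN, INT8_MAX = -128, 127
INT16_MAX = 2**15 - 1
INT32_MAX = 2**31 - 1

# Largest magnitude a single int8 x int8 product can reach: (-128) * (-128).
max_product = INT8_MIN * INT8_MIN

# How many worst-case products each accumulator width can sum without overflow.
max_terms_int16 = INT16_MAX // max_product
max_terms_int32 = INT32_MAX // max_product

print("largest single product:", max_product)
print("worst-case terms that fit in int16:", max_terms_int16)
print("worst-case terms that fit in int32:", max_terms_int32)
```

A single worst-case product already uses up half of a 16-bit accumulator, while 32 bits leaves headroom for inner dimensions in the hundred-thousand range, far beyond typical transformer layer widths.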

[Figure: CUDA Core vs Tensor Core on the same 4×4 multiply. CUDA core: 1 result per cycle. Tensor core: 16 results per cycle, 16× the throughput.]

So the kernel keeps the coins as coins. It runs the whole matrix multiply in INT8 on the tensor cores, accumulating into a 32-bit integer — and only at the very end, in the GEMM epilogue, does it convert that one grand total back to bf16, apply the per-channel and per-token scale, and fold the bias add into the same pass. One kernel, one trip through memory. The expensive matmul stays at INT8 tensor-core speed; the cheap conversion happens once, on the final total, fused into the same kernel — no round trip to bf16.
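
The arithmetic of the two paths can be sketched in plain Python. This is an illustration of the math only, not the paper's Triton kernel: Python ints stand in for the int32 accumulator, Python floats stand in for bf16, and every name and value below is hypothetical.

```python
def int8_gemm_fused(xq, wq, x_scale, w_scale, bias):
    """Integer matmul, with dequant, scaling, and bias applied once per output."""
    k, n = len(wq), len(wq[0])
    out = []
    for i, x_row in enumerate(xq):
        out_row = []
        for j in range(n):
            acc = 0  # stands in for the int32 accumulator
            for p in range(k):
                acc += x_row[p] * wq[p][j]  # int8 x int8 product
            # Epilogue: one conversion on the final total, scales and bias folded in.
            out_row.append(acc * x_scale[i] * w_scale[j] + bias[j])
        out.append(out_row)
    return out


def gemm_dequant_first(xq, wq, x_scale, w_scale, bias):
    """The trap: convert every element to float first, then multiply in float."""
    x = [[q * x_scale[i] for q in row] for i, row in enumerate(xq)]
    w = [[q * w_scale[j] for j, q in enumerate(row)] for row in wq]
    k, n = len(w), len(w[0])
    return [
        [sum(row[p] * w[p][j] for p in range(k)) + bias[j] for j in range(n)]
        for row in x
    ]


# Hypothetical quantized inputs: 2 tokens x 3 features, 3 features x 2 channels.
xq = [[10, -3, 7], [127, 0, -50]]
wq = [[1, -2], [3, 4], [-5, 6]]
x_scale = [0.01, 0.5]   # one scale per token (row of xq)
w_scale = [0.1, 0.2]    # one scale per output channel (column of wq)
bias = [0.5, -1.0]

fused = int8_gemm_fused(xq, wq, x_scale, w_scale, bias)
naive = gemm_dequant_first(xq, wq, x_scale, w_scale, bias)
print(fused)
print(naive)
```

Both functions produce the same numbers up to floating-point rounding, because the scales factor out of the sum. The difference is where the work happens: the fused version does M×N conversions at the end and keeps the inner loop in integers, while the dequant-first version converts every input element and runs the inner loop in floating point.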

[Figure: Unfused vs fused. Unfused (3 kernels): HBM read → matmul → HBM write+read → bias add → HBM write+read → ReLU → HBM write, for 6 HBM accesses. Fused (1 kernel): HBM read → matmul + bias + ReLU → HBM write, for 2 HBM accesses. 3× fewer HBM accesses for the same computation.]

Two details keep it both safe and fast. The scale is per-token and per-channel — each row and each column gets its own exchange rate — so a few outlier values don't blow up the error for the whole tensor, and the paper reports the output stays at cosine similarity 1.0 against the bf16 baseline. And because the best tile shape depends on the matrix dimensions, the kernel is Triton-autotuned across GEMM shapes rather than hand-tuned for one.
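
For readers who have not met the metric, cosine similarity can be sketched as below. The vectors are hypothetical stand-ins for a flattened bf16 output and a slightly perturbed quantized output; they are not data from the paper.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical flattened outputs: a reference and a slightly perturbed copy.
reference = [0.466, -0.98, 19.35, -56.4]
quantized = [0.47, -0.97, 19.30, -56.5]

sim = cosine_similarity(reference, quantized)
print(f"similarity to perturbed output: {sim:.6f}")

# The metric ignores overall magnitude: a uniformly doubled output still scores ~1.
doubled = [2.0 * v for v in reference]
print(f"similarity to doubled output:   {cosine_similarity(reference, doubled):.6f}")
```

Two things follow from the definition. A score that rounds to 1.0 says the outputs point in nearly the same direction, not that every element is bit-identical. And because the metric is insensitive to a uniform rescale, it is usually read alongside the rendered images rather than on its own.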

Deployment path	What the matmul actually runs as	INT8 tensor cores used?	1024px render, RTX 3090
Naive INT8 (dequant → bf16)	bf16, after a conversion	No	— (the trap — slower than below)
FP8	FP8	No (FP8 path)	172.9s
NF4	dequant → bf16	No	164.5s
Fused INT8 (this kernel)	int8 × int8 → int32	Yes	156.5s

What "2.8–4.2× per GEMM" buys end to end

Hold the resolution at 1024px on an RTX 3090 and render one image three ways. NF4 takes 164.5s, FP8 takes 172.9s, and the fused INT8 kernel takes 156.5s, so INT8 comes in 8 seconds faster than NF4 and 16.4 seconds faster than FP8, the formats it usually loses to. (The 8-second NF4 margin is small enough to sit near run-to-run timing noise; the FP8 gap is the clearer win.) Zoom into a single matrix multiply and the gap is much wider: the fused INT8 kernel runs 2.8–4.2× faster than the bf16 version of that same GEMM. So why is the whole image only about 1.1× faster (172.9s / 156.5s ≈ 1.10 against FP8, roughly 10%), not 3×? Because a diffusion transformer spends its time on more than GEMMs: attention, normalization, and the image-decode stage all take a share, and only the GEMMs got faster. It is Amdahl's law in one line: you only speed up the fraction you actually touched. (All figures from arXiv 2606.14598.)
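
Amdahl's law can be written out and inverted to see what the numbers above imply. This is a back-of-the-envelope sketch: it treats the roughly 1.1× whole-image speedup and the 2.8–4.2× per-GEMM speedup as if they were measured against the same baseline, which the text does not confirm, so the resulting fraction is illustrative only. The function names are hypothetical.

```python
def amdahl_speedup(accelerated_fraction, local_speedup):
    """Overall speedup when only a fraction of the runtime gets faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / local_speedup)


def implied_fraction(overall_speedup, local_speedup):
    """Invert Amdahl's law: what share of runtime must have been accelerated?"""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / local_speedup)


# Render times quoted above (seconds, 1024px, RTX 3090).
fp8_s, nf4_s, int8_s = 172.9, 164.5, 156.5
vs_fp8 = fp8_s / int8_s
vs_nf4 = nf4_s / int8_s
print(f"INT8 vs FP8: {vs_fp8:.3f}x, INT8 vs NF4: {vs_nf4:.3f}x")

# Assumption: ~1.1x overall, with GEMMs sped up by 2.8x or 4.2x.
for gemm_speedup in (2.8, 4.2):
    share = implied_fraction(1.1, gemm_speedup)
    print(f"GEMM speedup {gemm_speedup}x -> implied accelerated share {share:.1%}")
```

Under that assumption, a per-GEMM gain in the 2.8–4.2× range only has to apply to a modest share of the total runtime to produce an overall gain near 1.1×. Even an infinitely fast GEMM could not push the whole render much further unless the rest of the pipeline sped up too.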


Related explainers

SOP — hardware-aware per-layer PTQ at FP6 — SOP picks which low-bit format each layer gets; this kernel makes a chosen INT8 format actually hit the tensor cores. Two halves of "quantization that pays off."
Mixed quantization — NVFP4 prefill, bf16 decode — another "the format only helps if the kernel cooperates" story, split by inference phase.
QCA — outlier injection PTQ — the outlier problem this kernel sidesteps with its per-token and per-channel scales.
