The news. On June 12, 2026, a paper (arXiv 2606.14598), written while building the Ideogram 4.0 diffusion transformer, introduced a single fused Triton INT8 GEMM kernel. Its starting observation is awkward: on consumer Ampere GPUs, W8A8 INT8 models, which are deployed for speed, frequently run slower than FP8 or NF4, because the typical kernel dequantizes the 8-bit weights back to bf16 before it multiplies, so it never touches the INT8 tensor cores. The kernel keeps the math in INT8 the whole way and reports 2.8–4.2× faster per GEMM with no measurable quality loss (the paper puts the output at cosine similarity 1.0 to the bf16 baseline).

Picture the coin-counting machine. You switched the model's weights to coins — INT8, tiny denominations — for one reason: a coin-counting machine can rip through a giant pile of them fast. But the naive INT8 deployment does something strange before it counts. It melts every coin back into a paper bill — dequantizes each 8-bit weight back to bf16 — and then tallies the bills by hand. You paid to mint the coins, you paid to melt them, and you still counted at bill speed. The INT8 tensor cores never switch on, so the matmul runs at bf16 speed after a conversion you paid for. That is how a model you quantized to INT8 can land slower than the FP8 it was supposed to beat.

The reason coins are worth it is the machine. A tensor core is a dedicated unit that multiplies a whole tile of numbers in a single shot, and on Ampere a tensor core speaks INT8 directly — it takes int8 × int8 and piles the products into a 32-bit integer running total. Feed it coins and it is the fastest counter in the building. Melt them into bills first and you have locked the machine in a closet and gone back to counting by hand.
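To make the "32-bit running total" concrete, here is a minimal NumPy sketch of the arithmetic only. It does not touch a GPU or a tensor core; the tile sizes and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two small int8 tiles standing in for one tile of A and one tile of B.
# The 16x32 and 32x16 shapes are arbitrary choices for illustration.
a = rng.integers(-128, 128, size=(16, 32), dtype=np.int8)
b = rng.integers(-128, 128, size=(32, 16), dtype=np.int8)

# int8 x int8 products summed into a 32-bit integer total. Widening first
# means neither the products nor the running sum can wrap around.
acc_int32 = a.astype(np.int32) @ b.astype(np.int32)

# For contrast: NumPy keeps int8 for int8 @ int8, so this total wraps.
acc_int8 = a @ b

# Largest magnitude one output can reach with K = 32: 32 * 128 * 128.
worst_case = 32 * 128 * 128

print(acc_int32.dtype, acc_int8.dtype)
print("largest |int32 total|:", np.abs(acc_int32).max(), "bound:", worst_case)
```

A single product of two int8 values can already reach 16384, which is why the running total needs a wider type than the inputs.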

[Figure: CUDA core vs. tensor core on the same 4×4 matrix multiply. A CUDA core produces 1 result per cycle; a tensor core produces 16 results per cycle, a 16× throughput advantage.]

So the kernel keeps the coins as coins. It runs the whole matrix multiply in INT8 on the tensor cores, accumulating into a 32-bit integer — and only at the very end, in the GEMM epilogue, does it convert that one grand total back to bf16, apply the per-channel and per-token scale, and fold the bias add into the same pass. One kernel, one trip through memory. The expensive matmul stays at INT8 tensor-core speed; the cheap conversion happens once, on the final total, fused into the same kernel — no round trip to bf16.
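The order of operations described above can be sketched in NumPy. This is a CPU illustration of the math under stated assumptions, not the paper's Triton kernel: `quantize_rows` and `w8a8_linear` are hypothetical helper names, symmetric max-based scaling is assumed, and float32 stands in for bf16 because NumPy has no bf16 type.

```python
import numpy as np

def quantize_rows(x):
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def w8a8_linear(x_q, x_scale, w_q, w_scale, bias):
    """int8 x int8 -> int32 matmul, then a single epilogue step."""
    # Main loop: integer-only multiply-accumulate, shape (tokens, out).
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    # Epilogue: convert the int32 total once, apply the per-token and
    # per-channel scales, and add the bias in the same pass.
    return acc.astype(np.float32) * x_scale * w_scale.T + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)     # (tokens, in)
w = rng.standard_normal((16, 64)).astype(np.float32)    # (out, in)
bias = rng.standard_normal(16).astype(np.float32)

x_q, x_scale = quantize_rows(x)   # one scale per token
w_q, w_scale = quantize_rows(w)   # one scale per output channel

y_int8 = w8a8_linear(x_q, x_scale, w_q, w_scale, bias)
y_ref = x @ w.T + bias            # float reference

cos = float(np.dot(y_int8.ravel(), y_ref.ravel())
            / (np.linalg.norm(y_int8) * np.linalg.norm(y_ref)))
print("cosine similarity vs float reference:", cos)
```

In a real fused kernel the epilogue runs on the accumulator tile before it is written out, so the int32 result never makes a separate trip through memory. Here the two steps are simply adjacent lines.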

[Figure: unfused vs. fused execution of the same computation. Unfused (3 kernels): HBM read → matmul → HBM write+read → bias add → HBM write+read → ReLU → HBM write, for 6 HBM accesses. Fused (1 kernel): HBM read → matmul + bias + ReLU → HBM write, for 2 HBM accesses. That is 3× fewer HBM accesses.]

Two details keep it both safe and fast. The scale is per-token and per-channel — each row and each column gets its own exchange rate — so a few outlier values don't blow up the error for the whole tensor, and the paper reports the output stays at cosine similarity 1.0 against the bf16 baseline. And because the best tile shape depends on the matrix dimensions, the kernel is Triton-autotuned across GEMM shapes rather than hand-tuned for one.
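Why a per-row "exchange rate" matters shows up quickly with one outlier. The sketch below is an illustration with made-up data, not the paper's experiment: it quantizes the same activations once with a single per-tensor scale and once with a per-token scale, then measures the error on the ordinary tokens. The helper names and the max-based symmetric scaling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)).astype(np.float32)
x[0] *= 100.0  # one outlier token with much larger values

def quant_dequant(x, scale):
    """Round to the int8 grid defined by `scale`, then map back to float."""
    q = np.clip(np.rint(x / scale), -127, 127)
    return q * scale

def rel_error(approx, exact):
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

# One scale for the whole tensor: the outlier sets it for everyone.
per_tensor_scale = np.abs(x).max() / 127.0
# One scale per token (row): the outlier only affects its own row.
per_token_scale = np.abs(x).max(axis=1, keepdims=True) / 127.0

ordinary = slice(1, None)  # every token except the outlier
err_tensor = rel_error(quant_dequant(x, per_tensor_scale)[ordinary], x[ordinary])
err_token = rel_error(quant_dequant(x, per_token_scale)[ordinary], x[ordinary])

print("relative error, per-tensor scale:", err_tensor)
print("relative error, per-token scale: ", err_token)
```

The same idea applies to weights with one scale per output channel. Only the axis of the `max` changes.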

| Deployment path | What the matmul actually runs as | INT8 tensor cores used? | 1024px render, RTX 3090 |
| --- | --- | --- | --- |
| Naive INT8 (dequant → bf16) | bf16, after a conversion | No | (the trap: slower than the rows below) |
| FP8 | FP8 | No (FP8 path) | 172.9s |
| NF4 | dequant → bf16 | No | 164.5s |
| Fused INT8 (this kernel) | int8 × int8 → int32 | Yes | 156.5s |

What "2.8–4.2× per GEMM" buys end to end

Hold the resolution at 1024px on an RTX 3090 and render one image three ways. NF4 takes 164.5s, FP8 takes 172.9s, and the fused INT8 kernel takes 156.5s — so INT8 comes in 8 seconds faster than NF4 and 16.4 seconds faster than FP8, the formats it usually loses to. (The 8-second NF4 margin is small enough to sit near run-to-run timing noise; the FP8 gap is the clearer win.) Zoom into a single matrix multiply and the gap is much wider: the fused INT8 kernel runs 2.8–4.2× faster than the bf16 version of that same GEMM. So why is the whole image only about 1.1× (~9–10%) faster at 768px, not 3×? Because a diffusion transformer spends its time on more than GEMMs — attention, normalization, and the image-decode stage all take a share — and only the GEMMs got faster. It is Amdahl's law in one line: you only speed up the fraction you actually touched. (All figures from arXiv 2606.14598.)
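The Amdahl's-law arithmetic can be run backwards to ask what share of the runtime the GEMMs would have to occupy for a 2.8–4.2× kernel speedup to produce a roughly 1.1× overall speedup. The sketch below assumes the 1.1× figure is measured against a baseline whose GEMMs run at bf16 speed. The source does not spell that out, so treat the result as an illustration rather than a number from the paper. Function names are hypothetical.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime gets `local_speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def implied_fraction(overall_speedup, local_speedup):
    """Invert Amdahl's law: the runtime share that must have been accelerated."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / local_speedup)

overall = 1.1  # end-to-end figure quoted in the text (baseline assumed, see above)
for gemm_speedup in (2.8, 4.2):
    p = implied_fraction(overall, gemm_speedup)
    print(f"GEMM speedup {gemm_speedup}x -> implied GEMM share of runtime: {p:.1%}")
    # Sanity check: plugging the share back in recovers the overall figure.
    print(f"  check: overall speedup = {amdahl_speedup(p, gemm_speedup):.3f}x")
```

Under that assumption the accelerated GEMMs would account for roughly 12–14% of the original runtime, which is why even a 4× kernel win moves the total by only about a tenth.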


Frequently Asked Questions