
LLM Inference Engine Internals Explained Visually

The Inference Engine

What is an Inference Engine?

An inference engine is the system that sits between your API request and the GPU. When you send a prompt to ChatGPT, Claude, or any LLM API, an inference engine handles everything: queuing your request alongside hundreds of others, allocating GPU memory for your KV cache, running the actual model computation, and streaming tokens back to you.

The most widely used open-source inference engine is vLLM, created at UC Berkeley. This module follows vLLM's architecture to explain how inference engines work.

Why Not Just Run the Model Directly?

The naive approach — process one request at a time — wastes the GPU. A single decode step takes ~10ms, but the GPU spends most of that time waiting for memory reads, not computing. Meanwhile, dozens of other requests sit idle in a queue.

An inference engine solves this by batching hundreds of requests together and processing them all in a single GPU forward pass. The GPU reads model weights from memory once and applies them to every request simultaneously.

Naive: One at a Time

  Step 1: only R1 runs; the other requests sit idle
  Step 2: only R2 runs
  Step 3: only R3 runs
  GPU reads the weights 3× for 3 requests

Engine: All at Once

  Step 1: R1, R2, R3, R4, R5 all run together
  GPU reads the weights 1× for all 5 requests

Idle slots are wasted GPU capacity. Batching fills the GPU on every step.

In the Batching module, you learned that processing multiple requests together is more efficient than one at a time. An inference engine automates this — and goes much further, dynamically managing which requests run, how memory is allocated, and when requests join or leave the batch.

The Engine Loop

The engine runs a tight loop, completing one iteration every ~10-50ms, tens of times per second:

Engine Iteration Loop

  Schedule (pick which requests run) → Allocate (assign KV cache blocks) → Execute (run GPU forward pass) → Sample (generate next tokens) ↻ repeat every ~10-50ms

Each iteration:

  1. Schedule — the scheduler picks which requests get GPU time this step
  2. Allocate — the memory manager assigns KV cache blocks
  3. Execute — the model executor runs one GPU forward pass on all active requests
  4. Sample — next tokens are generated and streamed back

This loop is the heartbeat of every LLM serving system. The rest of this module zooms into each stage.
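In code, one turn of that loop might look like the sketch below. The component and method names (scheduler, kv_cache_manager, model_executor, sampler) are illustrative stand-ins, not vLLM's actual API.

def engine_loop(scheduler, kv_cache_manager, model_executor, sampler,
                waiting_queue, running_set, token_budget):
    """One engine loop, repeated forever (hypothetical component names)."""
    while True:
        # 1. Schedule: pick which requests get GPU time this iteration.
        batch = scheduler.schedule(waiting_queue, running_set, token_budget)
        # 2. Allocate: assign KV cache blocks for new and growing requests.
        kv_cache_manager.allocate(batch)
        # 3. Execute: one GPU forward pass over all active requests at once.
        logits = model_executor.forward(batch)
        # 4. Sample: generate next tokens, stream them back, free finished requests.
        for request, next_token in sampler.sample(logits):
            request.stream(next_token)
            if request.is_finished():
                kv_cache_manager.free(request)
                running_set.remove(request)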

On the right panel: Click each stage of the pipeline to preview what it does. Notice how every iteration processes ALL active requests simultaneously — not one at a time.

The Scheduler

The scheduler is the traffic controller of the inference engine. Every iteration, it decides which requests get GPU time — and which ones wait.

Three terms before we go further:

  • Iteration — one GPU forward pass that processes all currently-active requests at once. Not "one decoded token per request" — one batched pass over the whole running set.
  • Token budget — the maximum tokens (prefill + decode combined) the engine will process in a single iteration. Set globally at engine startup, not per request.
  • GPU memory budget — the number of KV-cache blocks the memory manager is allowed to hold across all active requests. Also global.

Every admission, preemption, and gauge below is measured in these units.

Two Queues

The scheduler maintains two lists:

  • Waiting queue — new requests that haven't started yet. They're queued in order of arrival (FCFS — first come, first served).
  • Running set — requests that are actively generating tokens on the GPU. These get priority: the scheduler never starves a running request to admit a new one.

Each iteration, the scheduler tries to move requests from waiting → running, subject to two constraints.

Two Constraints: Token Budget and GPU Memory

The scheduler admits one new request per iteration, but only if it passes two checks:

1. Token budget — the maximum tokens the engine will process in one iteration (e.g., 4096 in production). A prefilling request consumes its entire prompt length from the budget (a 512-token prompt uses 512). A decoding request consumes just 1 token. So once requests are running, the engine can sustain dozens of them — each only costs 1 token per step. But admitting a new request with a long prompt temporarily eats a big chunk of the budget.

2. GPU memory — every active request needs KV cache blocks in GPU memory. A request with a 512-token prompt needs 32 blocks (at 16 tokens per block). If there aren't enough free blocks, the new request can't start — even if there's token budget left.

In practice, a GPU like the A100 (80 GB) might hold ~60,000 blocks. With typical request sizes, this supports hundreds of concurrent requests. The token budget is usually the tighter constraint.
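As a rough sketch of those two checks (simplified policy and invented names, not vLLM's real scheduler code):

BLOCK_SIZE = 16          # tokens per KV-cache block
TOKEN_BUDGET = 4096      # max tokens processed per iteration (global)

def try_admit(request, tokens_used_this_iter, free_blocks):
    """Admit a waiting request only if both constraints hold (simplified sketch)."""
    # Constraint 1: token budget. A new request must prefill its whole prompt.
    prompt_tokens = len(request.prompt_token_ids)
    if tokens_used_this_iter + prompt_tokens > TOKEN_BUDGET:
        return False
    # Constraint 2: GPU memory. Every prompt token needs a slot in a KV-cache block.
    blocks_needed = -(-prompt_tokens // BLOCK_SIZE)   # ceil(prompt_tokens / 16)
    if blocks_needed > free_blocks:
        return False
    return True

For the running example: a 512-token prompt consumes 512 tokens of the 4096-token budget and needs 32 free blocks.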

How Many Requests Can Run?

It depends on the phase:

  • During decode, each request costs only 1 token of budget. With a budget of 4096, you could have ~4000 requests decoding simultaneously (if memory allows).
  • During prefill, one long prompt (say 2048 tokens) eats half the budget alone, blocking other admissions for that iteration.

In vLLM V1, there's no separate "prefill phase" or "decode phase" — the scheduler treats all tokens uniformly, represented as {request_id: num_tokens}. This means a single iteration can mix prefills and decodes.

Preemption: When Memory Runs Out

GPU memory is finite. When the KV cache blocks are exhausted and a higher-priority request needs space, the scheduler preempts a running request:

  1. The lowest-priority running request is selected
  2. Its KV cache blocks are evicted (freed from GPU memory)
  3. The request re-enters the waiting queue
  4. When it's readmitted later, it must be re-prefilled (recomputed)

This is the serving equivalent of OS process swapping — temporarily evicting work to make room for other work. Preemption trades compute (re-prefilling) for fairness — no single long request can monopolize the GPU forever.
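A minimal sketch of that preemption path, with hypothetical names:

def preempt_for_memory(running_set, waiting_queue, kv_cache_manager):
    """Evict one running request to free KV-cache blocks (simplified sketch)."""
    # 1. Pick the lowest-priority running request (smallest priority value here).
    victim = min(running_set, key=lambda r: r.priority)
    # 2. Evict its KV cache blocks, freeing GPU memory.
    kv_cache_manager.free(victim)
    running_set.remove(victim)
    # 3. It re-enters the waiting queue (a deque here)...
    waiting_queue.appendleft(victim)
    # 4. ...and must be re-prefilled from scratch when readmitted.
    victim.num_computed_tokens = 0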

On the right panel: The simulator starts with 6 requests in the waiting queue. Click Play to watch the scheduler admit them one per iteration. Notice the two gauges: token budget (pink) and GPU memory (blue). Keep clicking "+ Request" until the memory gauge turns red — that's when preemption kicks in, evicting a running request (marked "preempted") back to waiting.

The Memory Manager

In the PagedAttention module, you learned that KV cache is split into fixed-size blocks that don't need to be contiguous in memory — like OS virtual memory pages. Now see how the engine manages those blocks at runtime, allocating and freeing them across hundreds of concurrent requests.

Block Tables

Every active request has a block table — a mapping from logical blocks to physical blocks.

Block Table: Logical → Physical

  L0 → P7  (16 tokens)
  L1 → P2  (16 tokens)
  L2 → P14 (16 tokens)
  L3 → P9  (16 tokens)

Contiguous logical blocks → scattered physical blocks in GPU memory

  • Logical blocks are what the request sees: contiguous positions (L0, L1, L2...) for its token sequence
  • Physical blocks are where the data actually lives in GPU memory — scattered across whatever blocks happen to be free

The request doesn't know or care that L0 maps to physical block 7 and L1 maps to physical block 2. The block table handles the translation, just like a page table in OS virtual memory.
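A tiny sketch of that translation, using the mapping from the figure above (illustrative code, not vLLM's data structures):

BLOCK_SIZE = 16
block_table = [7, 2, 14, 9]    # L0->P7, L1->P2, L2->P14, L3->P9

def physical_location(token_position):
    """Translate a token's logical position into (physical block, offset), page-table style."""
    logical_block = token_position // BLOCK_SIZE
    offset = token_position % BLOCK_SIZE
    return block_table[logical_block], offset

print(physical_location(37))   # (14, 5): token 37 lives at offset 5 of physical block P14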

Block Lifecycle

Blocks go through three phases:

Allocation on prefill — when a new request starts, the memory manager grabs enough blocks for all prompt tokens. A 512-token prompt with block size 16 needs 32 blocks (512 ÷ 16 = 32). These blocks are allocated from a free pool.

Growth during decode — as the model generates new tokens, they fill the current block. When a block is full (16 tokens written), the memory manager allocates one more block from the free pool. This happens automatically each iteration.

Deallocation on finish — when a request completes (hits the end-of-sequence token or max length), all its blocks return to the free pool immediately. This frees memory for new requests.
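Putting the three phases together, a toy memory manager might look like this (assumed class and method names; vLLM's real implementation adds prefix caching, reference counting, and more):

from collections import deque

BLOCK_SIZE = 16

class KVCacheManager:
    """Toy sketch: a free pool plus per-request block tables."""

    def __init__(self, num_blocks):
        self.free_pool = deque(range(num_blocks))   # O(1) pop/append at both ends
        self.block_tables = {}                      # request_id -> [physical block ids]

    def allocate_prefill(self, request_id, prompt_len):
        # Allocation on prefill: e.g. 512 tokens / 16 per block = 32 blocks.
        blocks_needed = -(-prompt_len // BLOCK_SIZE)
        self.block_tables[request_id] = [self.free_pool.popleft() for _ in range(blocks_needed)]

    def maybe_grow(self, request_id, total_tokens):
        # Growth during decode: grab one more block once the current blocks are full.
        capacity = len(self.block_tables[request_id]) * BLOCK_SIZE
        if total_tokens > capacity:
            self.block_tables[request_id].append(self.free_pool.popleft())

    def free(self, request_id):
        # Deallocation on finish: all blocks return to the pool immediately.
        self.free_pool.extend(self.block_tables.pop(request_id))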

Why This Matters

The free block pool uses a doubly-linked list for O(1) allocation and deallocation. Memory waste only occurs in the last block of each request (partially filled), keeping utilization above 96%.

This is what makes continuous batching possible: because blocks are allocated and freed dynamically per-request, the engine can smoothly handle requests arriving and completing at different times without fragmenting GPU memory.

On the right panel: Watch the memory grid as requests are processed. Each color is a different request's KV cache blocks. Click any colored block to see its block table — the logical → physical mapping. Notice how blocks are scattered (non-contiguous) but the request sees a clean sequence.

The Model Executor

The scheduler decided who runs. The memory manager allocated their KV cache blocks. Now the model executor does the actual GPU computation.

One Forward Pass for Everyone

Every iteration, the executor takes all active requests and flattens their tokens into a single long sequence. Position indices and attention masks ensure each request only attends to its own tokens — they don't interfere with each other.

Active Requests

  R1: t0 t1 t2 t3      R2: t0 t1 t2      R3: t0 t1 t2 t3 t4

  ↓ flatten into one sequence

GPU Batch Tensor

  tokens:    t0 t1 t2 t3 | t0 t1 t2 | t0 t1 t2 t3 t4
  positions:  0  1  2  3 |  0  1  2 |  0  1  2  3  4   (position indices reset per request)

Attention Mask

  each request only attends to its own tokens
Why flatten? Because the most expensive part of each iteration is loading model weights from GPU memory — a 7B-parameter model means reading ~14 GB of weights every single step. If you process one request at a time, you pay that 14 GB read cost per request. But if you flatten 50 requests into one batch, you read the weights once and apply them to all 50 requests in the same pass. The compute cost per request barely increases, but you've spread that massive memory read across 50x more useful work.
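Here is a small sketch of that flattening step for the three requests in the figure, with positions that reset per request (the token IDs are made up; a real executor builds GPU tensors plus block-table metadata):

# Flatten per-request token lists into one batch with per-request positions.
requests = {"R1": [101, 102, 103, 104],         # 4 tokens
            "R2": [201, 202, 203],              # 3 tokens
            "R3": [301, 302, 303, 304, 305]}    # 5 tokens

flat_tokens, positions, request_of = [], [], []
for rid, tokens in requests.items():
    flat_tokens.extend(tokens)
    positions.extend(range(len(tokens)))        # 0..n-1 restarts for every request
    request_of.extend([rid] * len(tokens))

# Attention rule: token i may attend to token j only if both belong to the same
# request and j comes no later than i (causal masking within each request).
def can_attend(i, j):
    return request_of[i] == request_of[j] and j <= i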

Prefill vs Decode: Two Very Different Workloads

Not all iterations are equal. The engine interleaves two types of work:

Prefill processes all prompt tokens at once. A 512-token prompt means 512 tokens flow through every transformer layer in parallel. This is compute-bound — the GPU's compute units are saturated. High arithmetic intensity (many FLOPs per byte of memory read).

Decode generates one new token per request. The model reads all cached KV values but only computes attention for one new token. This is memory-bandwidth-bound — the GPU spends most of its time reading KV cache from HBM. Low arithmetic intensity.

Memory-bandwidth-bound (GPU & CUDA → roofline-model)
A workload limited by how fast the GPU can read data from HBM, not by how fast it can compute. Decode is the canonical example — the model spends most of its time loading weights and KV cache.

If you completed the GPU & CUDA track's Roofline module, you'll recognize this:

Roofline (GPU & CUDA → roofline-model)
A plot of compute throughput vs arithmetic intensity. Operations either hit the compute roof (compute-bound) or sit on the memory-bandwidth slope (memory-bound).

Prefill vs Decode on the Roofline

  Decode ≈ 1 FLOP/byte (memory-bound)  ·  Prefill ≈ 100 FLOP/byte (compute-bound)

Same GPU, fundamentally different bottlenecks

Prefill sits near the compute roof — the GPU's math units are the bottleneck. Decode sits on the memory bandwidth slope — the GPU spends most time reading KV cache, not computing. Same GPU, same model, but fundamentally different bottlenecks depending on whether you're processing a prompt or generating tokens.
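A rough back-of-envelope makes the gap concrete. This sketch assumes a 7B-parameter fp16 model and counts only the weight read, ignoring KV-cache and activation traffic:

# Back-of-envelope arithmetic intensity (FLOP per byte read) -- rough estimates, not measurements.
params = 7e9
weight_bytes = params * 2            # fp16: ~14 GB of weights read per forward pass
flops_per_token = 2 * params         # ~14 GFLOPs of matrix-multiply work per token

decode_ai = (1 * flops_per_token) / weight_bytes      # 1 new token -> ~1 FLOP/byte (memory-bound)
prefill_ai = (512 * flops_per_token) / weight_bytes   # 512-token prompt -> hundreds of FLOP/byte (compute-bound)

print(decode_ai, prefill_ai)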

Sampling

After the forward pass, the executor has logits (raw scores) for each request's next token. These go through sampling — temperature scaling, top-p filtering — to produce the actual next token. Finished requests (hit end-of-sequence or max length) are removed from the running set, freeing their memory blocks.
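A minimal sketch of that sampling step for one request (simplified; production samplers batch this across all requests and add top-k, repetition penalties, and more):

import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Temperature scaling + nucleus (top-p) sampling over a 1-D array of logits."""
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))     # softmax with max-subtraction for stability
    probs /= probs.sum()

    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]

    keep_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=keep_probs))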

On the right panel: Toggle between "Prefill" and "Decode" to see how they differ. Watch the compute vs bandwidth gauges — prefill maxes out compute, decode maxes out bandwidth. The flattened batch tensor shows how all requests are concatenated for one forward pass.

Continuous Batching in Action

Now let's put all the pieces together and see why continuous batching is such a breakthrough.

The Problem with Static Batching

Traditional static batching groups requests together and processes them as a fixed batch. The problem: requests finish at different times. A short response (20 tokens) completes long before a long response (200 tokens), but the GPU slot sits idle until the entire batch finishes.

Static Batching

  Slot 1: A A A A A
  Slot 2: B B B · ·
  Slot 3: C C · · ·
  Slot 4: D D D D ·

Continuous Batching

  Slot 1: A A A A A
  Slot 2: B B B E E
  Slot 3: C C F F F
  Slot 4: D D D D G

Columns are iterations t1-t5. · = idle GPU slot. Continuous batching fills gaps immediately.

Those gray idle slots are wasted GPU cycles. The longer the variance in response lengths, the more waste.
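To put numbers on the toy diagram above: static batching fills 14 of the 20 slot-iterations (70% utilization), while continuous batching backfills E, F, and G into the freed slots and fills all 20.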

Continuous Batching: Fill the Gaps

Continuous batching — formally called iteration-level scheduling, introduced by the Orca paper (OSDI '22) — makes a simple but powerful change: at every iteration, the scheduler can add new requests to freed slots immediately.

The full engine iteration loop:

  1. Scheduler picks the batch — running requests first, then admits waiting requests within the token budget
  2. Memory manager allocates blocks for new requests, grows blocks for ongoing ones
  3. Executor runs one forward pass on all active tokens (flattened into a single batch)
  4. Tokens sampled for each request — finished requests freed
  5. Scheduler re-evaluates — new requests fill the gaps left by completed ones
  6. Repeat

No idle slots. No wasted GPU cycles. New requests start generating within one iteration (~10-50ms) of a slot opening up.

The Numbers

The throughput improvement is dramatic:

  • Static batching → baseline
  • Continuous batching → 8x throughput improvement
  • Continuous batching + PagedAttention (vLLM) → 23x throughput improvement

The combination of iteration-level scheduling (no idle slots) and PagedAttention (no memory fragmentation) is what makes modern LLM serving practical.

On the right panel: Watch the side-by-side comparison. Static batching (left) shows idle slots as requests finish. Continuous batching (right) immediately fills gaps with new requests. The throughput ratio at the bottom shows how much more work continuous batching gets done. Click Play to start the simulation.
