
LLM Text Generation — Sampling & Decoding


What is LLM Text Generation?

Large language models generate text one token at a time. At each step, the model outputs a probability distribution over its entire vocabulary (~100K tokens), and one token is selected. The selection method — called the decoding strategy — dramatically affects output quality. Greedy decoding always picks the highest-probability token. Temperature controls randomness. Top-k limits choices to the k most likely tokens. Top-p (nucleus sampling) keeps tokens whose cumulative probability reaches a threshold. Most production systems use a combination.

One Token at a Time

Every response you get from an LLM — every word, every sentence — is produced one token at a time. There is no master plan, no grammar check, no lookahead. Just a probability distribution over the vocabulary, repeated until the response is done.

The Autoregressive Loop

At each step the model runs the same short loop:

  1. Context tokens: all tokens so far
  2. Model forward pass: transformer layers
  3. Probability distribution: one score per vocabulary token
  4. Sample one token: pick from the distribution
  5. Append: add to the sequence
  6. Repeat: until done

This is called autoregressive generation — each output becomes part of the next input. The model never revises what it wrote. It never goes back and changes "sat" after seeing what came next. Left to right, one token at a time, forever forward.
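A minimal sketch of this loop in Python, assuming a hypothetical model(tokens) that returns a probability distribution over the vocabulary (both names are stand-ins, not a real API):

import numpy as np

def generate(model, prompt_tokens, max_new_tokens, seed=None):
    # Autoregressive loop: each sampled token becomes part of the next input
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)                      # forward pass: distribution over vocab
        next_id = rng.choice(len(probs), p=probs)  # sample one token id
        tokens.append(int(next_id))                # append; never revised afterwards
    return tokens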

Context Window

The model can only see a fixed number of past tokens at once. This limit is called the context window.

  • GPT-3: 4,096 tokens
  • GPT-4 (original): 8,192 tokens
  • GPT-4 Turbo / GPT-4o: 128,000 tokens

Andrej Karpathy's nanoGPT states this plainly: "if the sequence context is growing too long we must crop it at block_size." Tokens that fall outside the window are dropped — the model simply cannot see them.
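The crop itself is a one-line slice. A sketch with a plain Python list of token ids:

tokens = list(range(10_000))   # pretend we have 10,000 token ids so far
block_size = 4096              # the model's context window

# Keep only the most recent block_size tokens; older ones are invisible to the model
context = tokens if len(tokens) <= block_size else tokens[-block_size:]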

Longer context windows are expensive: attention cost scales quadratically with sequence length. A 128K context window requires roughly 1,000× more attention compute than a 4K window.

What the Simulator Shows

Press Step on the right panel to generate one token at a time. The indigo pills are the prompt. Each new violet pill is a sampled token appended to the sequence. The probability bars show the model's distribution at that step — a different context produces a different distribution.

The model has no understanding of the whole response. It has no plan. It cannot see what it wrote two steps ago unless those tokens are still in the context window. Generation is local, left to right, never going back to revise.

From Logits to Probabilities

The model's final layer doesn't output probabilities. It outputs logits — raw, unbounded scores, one per vocabulary token. They can be positive, negative, large, or small. They mean nothing on their own until transformed.

What a Logit Is

A logit is just a score. For a vocabulary of 50,000 tokens, the model outputs 50,000 numbers like:

"sat"    →  2.1
"is"     →  1.5
"walked" →  0.8
"the"    → -0.3
...

These aren't percentages. They aren't probabilities. They're raw activations from the output projection layer — the last linear transformation in the model.

The Six-Step Pipeline

Getting from raw logits to a sampled token requires six steps. This is the complete generation algorithm — the same order our simulator uses:

  1. Raw logits: one per vocab token
  2. ÷ temperature: reshape the distribution
  3. Softmax: convert to probabilities
  4. Top-k filter: keep the K highest
  5. Top-p filter: keep the nucleus ≥ p
  6. Sample: pick one token

Step 1 — Raw logits: The model's output. Arbitrary real numbers.

Step 2 — Divide by temperature: logits = logits / temperature, a simple rescaling before softmax. Temperature 1.0 leaves the logits unchanged. We'll cover this in detail in the next step.

Step 3 — Softmax: Converts any set of numbers into a valid probability distribution — all positive, summing to 1.0. The formula: exp(logit_i) / sum(exp(logit_j)). Softmax can handle any input numbers, no matter how extreme.

Step 4 — Top-k filter: Zero out every token except the K most probable. If K=5, only the top 5 survive.

Step 5 — Top-p filter: Zero out low-probability tokens until the remaining tokens collectively cover at least probability p. This adapts to the shape of the distribution.

Step 6 — Sample: Draw one token randomly from the surviving probabilities (renormalized to sum to 1.0).
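Here is the whole pipeline as one function: a minimal numpy sketch in the simulator's order (filters applied to probabilities, after softmax), not production code:

import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    rng = np.random.default_rng(seed)

    # Step 2: divide by temperature (step 1, the raw logits, is the input)
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Step 3: softmax (subtracting the max is a standard stability trick)
    z = np.exp(logits - logits.max())
    probs = z / z.sum()

    # Step 4: top-k, zero out everything below the K-th highest probability
    if top_k is not None:
        kth = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth, probs, 0.0)

    # Step 5: top-p, keep the smallest set whose cumulative probability >= p
    if top_p is not None:
        order = np.argsort(probs)[::-1]                    # sort descending
        csum = np.cumsum(probs[order])
        n_keep = np.searchsorted(csum, top_p - 1e-12) + 1  # tolerance for float rounding
        probs[order[n_keep:]] = 0.0

    # Step 6: renormalize the survivors and sample one token id
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))

# The example logits from above
print(sample_next_token([2.1, 1.5, 0.8, -0.3], temperature=0.7, top_k=3, top_p=0.9))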

Why Softmax?

Softmax solves a specific problem: turning arbitrary numbers into a valid probability distribution. Two properties make it the right choice:

  1. exp(x) is always positive — so every token gets a non-negative probability
  2. Dividing by the sum guarantees they add to exactly 1.0

Any set of real numbers goes in. A clean probability distribution comes out.
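A quick check of both properties, using deliberately extreme inputs (subtracting the max first is a standard overflow guard; it does not change the result):

import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))   # exp(x) > 0, so every entry is positive
    return z / z.sum()                    # dividing by the sum makes them total 1.0

p = softmax(np.array([1000.0, -1000.0, 0.0]))
print(p, p.sum())   # a valid distribution, even from extreme inputs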

A Note on Karpathy's nanoGPT

Karpathy's nanoGPT applies the top-k filter before softmax — directly on the raw logits, setting non-top-k values to -inf so softmax zeroes them out. Our simulator applies top-k and top-p after softmax, on probabilities, which is more intuitive to visualize. The two approaches are mathematically equivalent: filtering before or after softmax produces the same final distribution.
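A quick numerical check of that equivalence, on the example logits from earlier:

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.array([2.1, 1.5, 0.8, -0.3])
k = 2

# nanoGPT's order: mask non-top-k logits to -inf, then softmax
masked = np.where(logits >= np.sort(logits)[-k], logits, -np.inf)
filter_first = softmax(masked)

# the simulator's order: softmax, zero out non-top-k, renormalize
p = softmax(logits)
p = np.where(p >= np.sort(p)[-k], p, 0.0)
filter_after = p / p.sum()

print(np.allclose(filter_first, filter_after))   # True: same final distribution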

This is the entire generation algorithm. Six steps. About ten lines of code in nanoGPT. Every API parameter you've ever seen — temperature, top_p, top_k, max_tokens — maps directly to this pipeline.

Temperature

Temperature is the simplest parameter in the generation pipeline — one line of code that completely controls how decisive or exploratory the model is.

The Formula

logits = logits / temperature

That's it. Before softmax, divide every logit by the temperature value. Everything else — softmax, filtering, sampling — runs unchanged.

What It Does

Temperature controls how spread out or peaked the final probability distribution is.

Low temperature (0.1): Dividing by 0.1 is the same as multiplying by 10 — it amplifies the differences between logits. If "sat" had a logit of 2.1 and "is" had 1.5, at temperature 0.1 those become 21 and 15. After softmax, "sat" dominates. The distribution spikes sharply around the top token. Nearly deterministic.

Temperature 1.0: No change. The logits pass through unmodified. You get the model's raw confidence.

High temperature (2.0): Dividing by 2.0 shrinks all logits toward zero. The gap between "sat" (2.1 → 1.05) and "walked" (0.8 → 0.4) narrows. After softmax, the distribution flattens. Low-probability tokens get a real chance.
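A sketch with the example logits, showing the same ranking at every temperature but very different spreads:

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.array([2.1, 1.5, 0.8, -0.3])   # "sat", "is", "walked", "the"
for t in (0.1, 1.0, 2.0):
    print(t, np.round(softmax(logits / t), 3))
# t=0.1: "sat" takes nearly all the mass; t=2.0: the distribution flattens out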

The Decisiveness Knob

Think of temperature as a decisiveness knob:

  • Low temperature — decisive, picks the obvious choice. Good for factual answers, code completion, structured output.
  • High temperature — considers many candidates roughly equally. Good for creative writing, brainstorming, varied responses.

Try dragging the Temperature slider on the right panel. Watch the probability bars flatten at high temperature and spike at low temperature.

What Temperature Does Not Do

Temperature cannot add candidates that weren't there, and it cannot remove the ones that were. The full vocabulary is still there — it just gets different probabilities. The ranking of tokens by probability never changes. The token ranked #1 stays ranked #1 regardless of temperature.

Filtering is the job of top-k and top-p, covered in the next step.

Temperature doesn't add or remove candidates — it only reshapes the distribution. The ranking stays the same. You're not changing what the model knows; you're changing how decisive it is about applying that knowledge.

Top-K and Top-P Sampling

Temperature reshapes the distribution but doesn't cut anything out. Top-k and top-p are the actual filters — they eliminate low-probability tokens so only viable candidates can be sampled.

Top-K: Fixed Count

Top-k keeps exactly K tokens and zeros out the rest.

If K=5, the model keeps the 5 highest-probability tokens. Everything else gets probability 0. The 5 survivors are renormalized to sum to 1.0, and sampling proceeds.

Top-k is simple but rigid. The same count is kept regardless of whether the distribution is peaked (where maybe 2 tokens dominate) or flat (where maybe 20 tokens are roughly equal). K=5 keeps 5 candidates in both cases.

Top-P: Adaptive Nucleus

Top-p (also called nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches threshold p.

The algorithm: sort tokens by probability, descending. Keep adding tokens until their cumulative sum reaches p. Cut everything else.

The key advantage: top-p adapts to the distribution shape.

  • Peaked distribution (one token has 90% probability, p=0.9): only one token survives — the top token already clears the threshold.
  • Flat distribution (20 tokens each have 5%, p=0.9): 18 tokens survive — the model needs to accumulate 18 × 5% = 90% to reach the threshold.

This adaptive behavior is exactly what you want. When the model is confident, only the likely candidates survive. When the model is uncertain, many candidates get a chance.
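A sketch of that behavior, counting how many tokens survive the same p = 0.9 cut on each distribution (with a tiny tolerance so float rounding can't flip the count):

import numpy as np

def nucleus_size(probs, p, eps=1e-12):
    # Sort descending, accumulate, count tokens needed to reach p
    csum = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(csum, p - eps) + 1)

peaked = np.array([0.90, 0.05, 0.03, 0.02])   # one dominant token
flat = np.full(20, 0.05)                      # 20 equally likely tokens

print(nucleus_size(peaked, 0.9))   # 1: the top token already clears the threshold
print(nucleus_size(flat, 0.9))     # 18: needs 18 tokens at 5% to accumulate 90%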

How They Work Together

  1. All candidates: full vocabulary after softmax
  2. Top-k filter: keep a fixed count K
  3. Top-p filter: keep the adaptive nucleus
  4. Survivors: renormalized and sampled

Top-k runs first, capping the candidate pool at K tokens. Top-p runs second, further narrowing it based on cumulative probability. Having both lets you set an upper bound (top-k) while still getting the adaptive behavior of nucleus sampling (top-p).

Experiment with the Top-K and Top-P sliders on the right panel. At top-p=0.5 with a flat distribution, notice how the cutoff moves left — fewer tokens are needed to reach 50% of the probability mass. At top-p=0.5 with a peaked distribution, just one or two tokens cover the threshold.

Most production LLMs combine temperature and top-p — together, these two parameters handle the vast majority of use cases. Top-k is often left at a high default or disabled entirely.

Greedy vs Creative

The same model, the same weights, the same prompt — completely different outputs depending on one choice: how you sample.

Greedy Decoding

Set temperature to 0.1 and top-k to 1 on the right panel, then step through generation. Every time: "sat on the mat ." Grammatically correct. Completely deterministic. And after a few sentences — repetitive enough to be useless.

Greedy decoding always picks the single most likely token. It is the most "rational" approach and the worst for open-ended generation.

The Problem with High Probability

High probability does not mean good writing. Consider how humans actually talk:

"The cat sat on the mat."

The most likely next word after "on the" is "mat" — but in real text, humans write "on the table", "on the floor", "on the roof", "on the edge of the windowsill". Human language is full of choices that are not the statistically most likely option. Greedy output sounds like a template, not a person.

This is why stochastic sampling exists. Sometimes picking the 3rd or 5th most probable token produces something that reads better — more specific, more surprising, more human.

Beam Search (Brief Mention)

Beam search is a middle ground: instead of committing to one token at each step, it keeps the N best partial sequences in parallel and returns the highest-scoring complete sequence.

It is good for tasks with a single correct answer — machine translation, summarization where fidelity matters, structured output. It is bad for open-ended chat, creative writing, and anything where variety is desirable. Beam search still tends toward repetitive, generic responses. Most chatbots and assistants don't use it.

Karpathy's nanoGPT makes the tradeoff explicit with a single flag:

do_sample = True   # stochastic — temperature + top-k apply
do_sample = False  # greedy — always pick the argmax

One boolean captures the fundamental choice: deterministic precision or stochastic variety.
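The same boolean in a standalone numpy sketch:

import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])   # distribution over four candidate tokens

for do_sample in (False, True):
    if do_sample:
        next_id = int(rng.choice(len(probs), p=probs))   # stochastic: any candidate can win
    else:
        next_id = int(np.argmax(probs))                  # greedy: always index 0
    print(do_sample, next_id)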

Intuition

Greedy: "I will always choose the most expected word." Output is safe, predictable, repetitive.

Stochastic: "I will sometimes choose a less expected word." Output has texture, surprise, personality.

The model doesn't change — only the sampling strategy does. The weights are identical. The "personality" of the output — cautious or creative, predictable or surprising — is controlled entirely by temperature, top-k, and top-p.

Generation in Practice

You've seen the core algorithm. Here's what gets layered on top in real production systems.

Stop Tokens

Generation doesn't stop automatically at a period. Models use special stop tokens — reserved tokens that signal end-of-output.

Common examples:

  • <|endoftext|> — GPT-2/GPT-3 end of document
  • <|im_end|> — ChatML format (used by GPT-3.5/4 in chat mode)
  • </s> — Llama/Mistral

When a stop token is sampled, generation halts immediately. You can also pass custom stop strings via API — generation stops when any of them appear in the output.
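A sketch of the halt logic, written against a hypothetical sample_next (one pass through the pipeline) and decode (detokenizer), both stand-ins for whatever your stack provides:

def generate_with_stops(sample_next, decode, prompt_tokens,
                        max_tokens, stop_token_id, stop_strings=()):
    # Halt on a reserved stop token, a custom stop string, or the token budget
    generated = []
    while len(generated) < max_tokens:
        next_id = sample_next(prompt_tokens + generated)
        if next_id == stop_token_id:                  # stop token sampled: halt now
            break
        generated.append(next_id)
        if any(s in decode(generated) for s in stop_strings):   # stop string appeared
            break
    return generated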

Repetition Penalty

Greedy and even stochastic generation can produce repetitive loops: "the the the the" or "I think I think I think." Repetition penalty is a post-hoc fix: before temperature is applied, the logits of tokens that have already appeared in the generated sequence are reduced by a penalty factor.

This runs in the pipeline before temperature: if a token appeared recently, its logit is discounted (for example logit[token] /= penalty), making already-used tokens less likely to appear again. One subtlety: dividing a negative logit by a penalty above 1 would raise it, so practical implementations treat the two signs differently, as in the sketch below.
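A minimal sketch of that sign-aware rule (the variant used by HuggingFace's repetition penalty processor):

import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # Discount every token that has already appeared in the output.
    # Positive logits are divided, negative logits multiplied, so the
    # penalty always pushes the token's score down.
    logits = np.array(logits, dtype=np.float64)
    for t in set(generated_ids):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    return logits

print(apply_repetition_penalty([2.1, 1.5, 0.8, -0.3], generated_ids=[0, 3]))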

Max Tokens

Every API call includes a max_tokens parameter — a hard upper limit on how many tokens can be generated. This is the primary cost control lever. A 1,000-token response costs roughly 10× as much as a 100-token response. Setting max_tokens prevents runaway generation and keeps costs predictable.

Production Defaults

Most production LLM deployments use roughly these defaults:

  • Temperature: 0.7 – 1.0
  • Top-p: 0.9 – 0.95
  • Repetition penalty: 1.1 – 1.3
  • Top-k: 40 – 100 (or disabled)

Lower temperature and top-p for factual tasks (code, Q&A). Higher for creative tasks (writing, brainstorming).

Streaming

LLMs send tokens to the user as they are generated — you don't wait for the full response. This is why ChatGPT's responses appear word by word in real time.

Under the hood, each token goes through the full pipeline: logits → temperature → softmax → top-k → top-p → sample. That token is streamed immediately to the client. The next forward pass starts. Streaming doesn't change the algorithm — it just surfaces each result immediately rather than batching all tokens before sending.
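As a sketch, streaming is just a generator that yields each token the moment step 6 finishes (same hypothetical sample_next and decode as above):

def stream_tokens(sample_next, decode, prompt_tokens, max_tokens, stop_token_id):
    # Yield each token's text immediately instead of batching the full response
    generated = []
    for _ in range(max_tokens):
        next_id = sample_next(prompt_tokens + generated)
        if next_id == stop_token_id:
            break
        generated.append(next_id)
        yield decode([next_id])           # surfaced to the client right away

# Client side, printing word by word like a chat UI:
# for piece in stream_tokens(...):
#     print(piece, end="", flush=True)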

Every API parameter you've seen — temperature, top_p, max_tokens, stop, repetition_penalty — maps directly to one step in the six-step pipeline. There is no magic. It's the same algorithm, configured by these few numbers.

Further Reading

  • Andrej Karpathy — nanoGPT — the complete generation loop in sample.py, including temperature, top-k, and context cropping, all in ~50 lines
  • HuggingFace — How to generate text — detailed comparison of greedy, beam search, top-k, and top-p with concrete examples and perplexity analysis