How LLMs Work: Transformers Explained Step-by-Step
Learn how LLMs work step by step. Build an inference simulator in Python — tokenize, embed, compute attention, sample, and decode with runnable code at every stage.
⚡ This post has interactive code — click ▶ Run or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
Build a mini inference simulator in Python and watch how a language model turns your text into a reply.
You type “The capital of France is” into ChatGPT. A second later, it says “Paris.” What happened in that second? Your text went through six stages — tokenize, embed, attend, feed-forward, sample, decode — before a single letter came back.
Most guides skip the guts or bury you in math. We’ll take a different route. You’ll build a working simulator in pure Python and NumPy. At each stage, you’ll run code, see the numbers, and know what the model does.
Here’s how the pieces fit together.
You start with raw text like "The cat sat". The model can’t read letters. So we split the text into tokens and give each one a number. Now the model has integers it can work with.
But a bare integer has no meaning. We swap each ID for a short list of numbers — a vector that holds what the token means. That’s the embedding step. Its output feeds into the transformer blocks.
Inside each block, two things happen. First, every token looks at every other token through attention. “sat” checks how it relates to “The” and “cat.” Second, a small network deepens the patterns attention found.
After all blocks finish, the model scores every token in its vocab. “Paris” gets the top score. We turn that score into a chance, pick a token, and decode it back to text. That’s one step. The model repeats this loop for each new token.
We’ll build every piece from scratch. By the end, you’ll get not just the code — but the why behind each stage.
What Happens When You Send a Prompt to an LLM?
An LLM is a function. It takes token IDs in and returns a score for every token that could come next. That’s it. Every chat you’ve had with GPT, Claude, or Llama is this loop running thousands of times.
Here’s the full pipeline in one line:
Text → Tokenize → Embed → [Attention → FFN] × N → Score → Sample → Decode → Text
Each arrow is a change. Your text goes from string to integers, to vectors, to richer vectors, to odds, and back to a string.
Let’s set things up and build each stage.
import numpy as np
np.random.seed(42)
# Model config — small enough to trace by hand
VOCAB_SIZE = 32
EMBED_DIM = 8
NUM_HEADS = 2
HEAD_DIM = EMBED_DIM // NUM_HEADS # 4
FFN_HIDDEN = 16
print(f"Vocab size: {VOCAB_SIZE}")
print(f"Embedding dim: {EMBED_DIM}")
print(f"Attention heads: {NUM_HEADS}, each with dim {HEAD_DIM}")
Vocab size: 32
Embedding dim: 8
Attention heads: 2, each with dim 4
These numbers are tiny next to real models. GPT-4’s tokenizer has a vocab of roughly 100K tokens, and GPT-3 uses 12,288-dim vectors. But the math is the same. Small numbers just let you trace every step.
Step 1: Tokenization — Turning Text into Numbers
The model can’t read text. So the first step turns your words into a list of integers. Each integer is a token ID — an index into the model’s vocab.
Real tokenizers like BPE use complex merge rules. We’ll build a simple character-level one instead. The idea is the same: map each piece of text to a unique number.
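For contrast, here is a toy sketch of a single BPE merge step. This is an illustration of the idea only; the helper names (`most_frequent_pair`, `merge_pair`) are made up here, and real tokenizers apply thousands of such merges learned from huge corpora.

```python
from collections import Counter

# Toy sketch of one BPE merge step. Illustrative only; real BPE
# tokenizers learn thousands of merges from huge corpora.
def most_frequent_pair(tokens):
    """Return the most common adjacent symbol pair."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("the cat sat on the mat")
pair = most_frequent_pair(tokens)
print(f"Most frequent pair: {pair}")  # ('a', 't') appears 3 times
tokens = merge_pair(tokens, pair)
print(f"'at' is now a single token: {tokens.count('at')} occurrences")
```

Run this repeatedly and frequent fragments like "at" or "the" become single tokens, which is why real vocabularies mix characters, word pieces, and whole words.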
# Build a character-level tokenizer
corpus = "the cat sat on the mat"
chars = sorted(set(corpus))
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}
print("Vocabulary:")
for ch, idx in char_to_id.items():
    label = repr(ch)
    print(f" {label:6s} -> {idx}")
Vocabulary:
' ' -> 0
'a' -> 1
'c' -> 2
'e' -> 3
'h' -> 4
'm' -> 5
'n' -> 6
'o' -> 7
's' -> 8
't' -> 9
Ten unique chars, ten IDs. Let’s tokenize a short prompt.
def tokenize(text, char_to_id):
    """Turn text into a list of token IDs."""
    return [char_to_id[ch] for ch in text if ch in char_to_id]

prompt = "the cat"
token_ids = tokenize(prompt, char_to_id)
print(f"Text: '{prompt}'")
print(f"Token IDs: {token_ids}")
print(f"Decoded: {''.join(id_to_char[i] for i in token_ids)}")
Text: 'the cat'
Token IDs: [9, 4, 3, 0, 2, 1, 9]
Decoded: the cat
Each char maps to its ID. The space gets ID 0. Encode then decode gives back the original. That round-trip must work — or the model loses data before it even starts.
KEY INSIGHT: Tokenization is a two-way map. Every tokenizer must ensure that
decode(encode(text)) == text. If it can’t, the model loses info before it does any work.
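That rule is easy to check mechanically. Here is a small, self-contained property test over the same character vocabulary (it only samples in-vocab characters, since our encode silently drops anything outside the vocab):

```python
import random

# Rebuild the same character tokenizer so this block runs on its own
corpus = "the cat sat on the mat"
chars = sorted(set(corpus))
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(text):
    return [char_to_id[ch] for ch in text if ch in char_to_id]

def decode(ids):
    return ''.join(id_to_char[i] for i in ids)

random.seed(0)
for _ in range(100):
    # Random strings drawn only from in-vocab characters
    text = ''.join(random.choice(chars)
                   for _ in range(random.randint(0, 20)))
    assert decode(encode(text)) == text
print("Round-trip holds for 100 random strings")
```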
Step 2: Embedding — From Integers to Vectors
A token ID like 9 tells the model which token this is. But it says nothing about what the token means. The embedding layer fixes this. It’s a lookup table: each ID maps to a vector of numbers.
In a trained model, these vectors are learned. Tokens with similar meanings end up close together. For our sim, we’ll use random vectors. The math stays the same.
The lookup grabs rows from a matrix. Token ID 9 grabs row 9. We get one vector per token.
# Each row in this matrix is one token's vector
embedding_matrix = np.random.randn(VOCAB_SIZE, EMBED_DIM) * 0.1
token_ids_array = np.array(token_ids)
embeddings = embedding_matrix[token_ids_array]
print(f"Token IDs shape: {token_ids_array.shape}")
print(f"Embeddings shape: {embeddings.shape}")
print(f"\nFirst token (ID={token_ids[0]}, char='{id_to_char[token_ids[0]]}'):")
print(f" Vector: [{', '.join(f'{v:.3f}' for v in embeddings[0])}]")
Token IDs shape: (7,)
Embeddings shape: (7, 8)
First token (ID=9, char='t'):
Vector: [0.050, -0.014, 0.065, 0.015, -0.023, -0.023, -0.032, 0.015]
Seven tokens, each now a vector of 8 numbers. The model never sees the raw text again. From here on, it’s all matrix math.
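A quick aside on what “close together” means: distance between embedding vectors is usually measured with cosine similarity. Here is a self-contained sketch; the vectors below are random stand-ins, so the score is meaningless noise, but in a trained model related tokens would score near +1.

```python
import numpy as np

np.random.seed(42)
# Stand-in for our embedding_matrix: 32 random 8-dim vectors
emb = np.random.randn(32, 8) * 0.1

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: +1 = same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random vectors give noise; trained embeddings would score
# related tokens (say, 'cat' and 'dog') close to +1
print(f"sim(row 2, row 8) = {cosine_similarity(emb[2], emb[8]):.3f}")
```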
Adding Positional Info
Here’s the catch. The vector for 't' is the same no matter where it sits. The model needs to know position. We add positional encodings — a unique pattern for each spot in the sequence.
The original Transformer uses sine and cosine waves. Position 0 gets one pattern, position 1 gets another. The model learns to read these patterns.
def positional_encoding(seq_len, embed_dim):
    """Sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, np.newaxis]
    dim = np.arange(embed_dim)[np.newaxis, :]
    angle = pos / (10000 ** (2 * (dim // 2) / embed_dim))
    pe = np.zeros((seq_len, embed_dim))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

pos_enc = positional_encoding(len(token_ids), EMBED_DIM)
x = embeddings + pos_enc
print("Pos encoding for position 0:")
print(f" [{', '.join(f'{v:.3f}' for v in pos_enc[0])}]")
Pos encoding for position 0:
[0.000, 1.000, 0.000, 1.000, 0.000, 1.000, 0.000, 1.000]
Position 0 always starts with [0, 1, 0, 1, ...] because sin(0)=0 and cos(0)=1. Each later position shifts these waves. The result is a unique fingerprint for each spot.
TIP: Newer models like Llama use Rotary Position Embeddings (RoPE) instead of sine waves. RoPE rotates the query and key vectors by an angle tied to position. It handles long texts better. But the goal is the same — let the model know token order.
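As a rough sketch of the rotation idea (simplified from the RoFormer paper, not production RoPE code):

```python
import numpy as np

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by an angle that
    grows with `position`. Simplified sketch of the RoPE idea."""
    half = vec.shape[-1] // 2
    # One frequency per pair, falling geometrically as in the paper
    freqs = base ** (-np.arange(half) / half)
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(vec)
    out[0::2] = vec[0::2] * cos - vec[1::2] * sin
    out[1::2] = vec[0::2] * sin + vec[1::2] * cos
    return out

q = np.ones(8)
print(rope_rotate(q, position=0))  # position 0: vector unchanged
print(rope_rotate(q, position=3))  # later positions: pairs rotated
```

Two properties make this attractive: rotation preserves vector length, and the dot product between a rotated query and a rotated key depends only on their relative positions, which is exactly the signal attention needs.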
Step 3: Self-Attention — How Tokens Talk to Each Other
This is the heart of the transformer. Attention lets every token look at every other token and decide how much to care.
Why does this matter? Think of “The bank by the river was steep.” The word “bank” could mean money or a hillside. Attention lets “bank” look at “river” and shift its meaning. Without it, each token is on its own — no context.
Queries, Keys, and Values
Attention uses three projections. Each token’s vector gets hit by three weight matrices to produce:
- Query (Q): “What am I looking for?”
- Key (K): “What do I have?”
- Value (V): “What info do I carry?”
The dot product of Q and K tells you how related two tokens are. High dot product means strong link. The result weights a sum of V vectors.
Here’s the formula:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

Where:
- \(Q\) = query matrix (what each token seeks)
- \(K\) = key matrix (what each token offers)
- \(V\) = value matrix (the content each token carries)
- \(d_k\) = key vector size (used for scaling)
The \(\sqrt{d_k}\) stops dot products from growing too large. Big values push softmax toward near-one-hot outputs, where gradients vanish and learning stalls.
We’ll code a single head first, then scale to multi-head.
def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def single_head_attention(x, W_q, W_k, W_v):
    """Scaled dot-product attention for one head."""
    Q = x @ W_q  # (seq_len, head_dim)
    K = x @ W_k
    V = x @ W_v
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)
    weights = softmax(scores)
    output = weights @ V
    return output, weights

# Random weight matrices for one head
W_q = np.random.randn(EMBED_DIM, HEAD_DIM) * 0.1
W_k = np.random.randn(EMBED_DIM, HEAD_DIM) * 0.1
W_v = np.random.randn(EMBED_DIM, HEAD_DIM) * 0.1

attn_out, attn_weights = single_head_attention(x, W_q, W_k, W_v)
print(f"Input shape: {x.shape}")
print(f"Output shape: {attn_out.shape}")
print(f"Weights shape: {attn_weights.shape}")
Input shape: (7, 8)
Output shape: (7, 4)
Weights shape: (7, 7)
The weights matrix is 7×7 — one weight for every token pair. Each row sums to 1.0 (softmax does this). It tells you how much each token attends to the others.
Reading the Attention Heatmap
The weights show which tokens the model finds important. Let’s print them.
tokens_str = [id_to_char[i] for i in token_ids]
print("Attention heatmap (row = what this token attends to):\n")
header = " " + " ".join(f"'{t}'" for t in tokens_str)
print(header)
for i, row in enumerate(attn_weights):
    vals = " ".join(f"{w:.2f}" for w in row)
    print(f" '{tokens_str[i]}' | {vals}")
Attention heatmap (row = what this token attends to):
't' 'h' 'e' ' ' 'c' 'a' 't'
't' | 0.14 0.14 0.14 0.14 0.14 0.14 0.14
'h' | 0.14 0.14 0.14 0.14 0.14 0.14 0.14
'e' | 0.14 0.14 0.14 0.14 0.14 0.14 0.14
' ' | 0.14 0.14 0.14 0.14 0.14 0.14 0.14
'c' | 0.14 0.14 0.14 0.14 0.14 0.14 0.14
'a' | 0.14 0.14 0.14 0.14 0.14 0.14 0.14
't' | 0.14 0.14 0.14 0.14 0.14 0.14 0.14
With random weights, attention is roughly even — each token pays the same amount to all others. In a trained model, you’d see sharp peaks. “cat” might attend strongly to “the” and ignore “sat.”
UNDER THE HOOD: Why is it near-uniform? Our weights are random and small (scaled by 0.1). The Q-K dot products land close to zero, so softmax gives near-equal values. Training adjusts these weights so the model learns which tokens matter. That’s the whole point of training.
Causal Masking — No Peeking Ahead
There’s a key rule for text generation. When the model predicts the next token, it can only look backwards. Seeing future tokens would be cheating.
We enforce this with a causal mask. It’s a triangle that sets future spots to minus infinity. Softmax turns -inf into zero — so those tokens get ignored.
def causal_attention(x, W_q, W_k, W_v):
    """Attention with causal mask — no future peeking."""
    Q = x @ W_q
    K = x @ W_k
    V = x @ W_v
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)
    scores = scores + mask
    weights = softmax(scores)
    output = weights @ V
    return output, weights

causal_out, causal_wts = causal_attention(x, W_q, W_k, W_v)
print("Causal weights (upper triangle = 0):\n")
for i, row in enumerate(causal_wts):
    vals = " ".join(f"{w:.2f}" for w in row)
    print(f" '{tokens_str[i]}' | {vals}")
Causal weights (upper triangle = 0):
't' | 1.00 0.00 0.00 0.00 0.00 0.00 0.00
'h' | 0.50 0.50 0.00 0.00 0.00 0.00 0.00
'e' | 0.33 0.33 0.33 0.00 0.00 0.00 0.00
' ' | 0.25 0.25 0.25 0.25 0.00 0.00 0.00
'c' | 0.20 0.20 0.20 0.20 0.20 0.00 0.00
'a' | 0.17 0.17 0.17 0.17 0.17 0.17 0.00
't' | 0.14 0.14 0.14 0.14 0.14 0.14 0.14
The first token can only see itself — weight 1.0. The second splits between positions 0 and 1. By the last token, weight spreads across all seven spots. No token ever peeks ahead.
KEY INSIGHT: Causal masking is what makes LLMs go left to right. Each spot can only attend to itself and what came before. That’s why LLMs build text one token at a time.
Step 4: Multi-Head Attention — Many Views at Once
One head captures one kind of link. But language has many kinds at once — grammar, meaning, position. Multi-head attention runs several heads side by side. Each has its own Q, K, V weights. Their outputs get joined and squeezed back to the original size.
Think of it like this. One head might track which noun owns which verb. Another might track which adjective goes with which noun. Together, they catch richer patterns than one head ever could.
def multi_head_attention(x, W_qs, W_ks, W_vs, W_o, causal=True):
    """Multi-head attention with optional causal mask."""
    head_outs = []
    all_wts = []
    for W_q, W_k, W_v in zip(W_qs, W_ks, W_vs):
        if causal:
            out, w = causal_attention(x, W_q, W_k, W_v)
        else:
            out, w = single_head_attention(x, W_q, W_k, W_v)
        head_outs.append(out)
        all_wts.append(w)
    concat = np.concatenate(head_outs, axis=-1)
    output = concat @ W_o
    return output, all_wts

# Weights for all heads
W_qs = [np.random.randn(EMBED_DIM, HEAD_DIM) * 0.1
        for _ in range(NUM_HEADS)]
W_ks = [np.random.randn(EMBED_DIM, HEAD_DIM) * 0.1
        for _ in range(NUM_HEADS)]
W_vs = [np.random.randn(EMBED_DIM, HEAD_DIM) * 0.1
        for _ in range(NUM_HEADS)]
W_o = np.random.randn(NUM_HEADS * HEAD_DIM, EMBED_DIM) * 0.1

mha_out, head_wts = multi_head_attention(x, W_qs, W_ks, W_vs, W_o)
print(f"Multi-head output shape: {mha_out.shape}")
print(f"Number of heads: {len(head_wts)}")
Multi-head output shape: (7, 8)
Number of heads: 2
Each head gives a (7, 4) output. Joined, that’s (7, 8). The W_o matrix maps it back to (7, 8) — same shape as the input. This matters because we add it back as a residual.
Step 5: Feed-Forward Network — Going Deeper
After attention mixes context across tokens, a feed-forward network (FFN) works on each token alone. It’s a two-layer net: expand to a bigger size, apply ReLU, then shrink back.
Why do we need this? Attention finds links between tokens. The FFN adds depth — it learns patterns that attention can’t express on its own. They work as a team.
Real FFNs usually expand by about 4x. To keep the numbers traceable, ours expands by 2x (8 → 16), then squeezes back down.
def relu(x):
    """ReLU: keep positives, zero out negatives."""
    return np.maximum(0, x)

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer FFN with ReLU."""
    hidden = relu(x @ W1 + b1)  # Expand
    output = hidden @ W2 + b2   # Shrink
    return output

W1 = np.random.randn(EMBED_DIM, FFN_HIDDEN) * 0.1
b1 = np.zeros(FFN_HIDDEN)
W2 = np.random.randn(FFN_HIDDEN, EMBED_DIM) * 0.1
b2 = np.zeros(EMBED_DIM)

ffn_out = feed_forward(mha_out, W1, b1, W2, b2)
print(f"FFN input shape: {mha_out.shape}")
print(f"FFN output shape: {ffn_out.shape}")
print(f"FFN hidden dim: {FFN_HIDDEN} (2x expansion)")
FFN input shape: (7, 8)
FFN output shape: (7, 8)
FFN hidden dim: 16 (2x expansion)
Same shape in, same shape out. The stretch to 16 dims happens inside. This expand-then-shrink shape lets the FFN learn richer functions than a single layer could.
Step 6: One Transformer Block — Putting It Together
A transformer block wraps attention and FFN with two extras: residual connections and layer norm. The residual adds the input back to the output. Layer norm keeps the values stable.
The flow is: x → Norm → Attention → Add x → Norm → FFN → Add x.
Newer LLMs like Llama put the norm before each step (pre-norm). We’ll use that pattern here.
def layer_norm(x, eps=1e-5):
    """Normalize across the last dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_block(x, W_qs, W_ks, W_vs, W_o,
                      W1, b1, W2, b2):
    """One block: attention + FFN with residuals."""
    normed = layer_norm(x)
    attn_out, wts = multi_head_attention(
        normed, W_qs, W_ks, W_vs, W_o, causal=True
    )
    x = x + attn_out
    normed = layer_norm(x)
    ffn_result = feed_forward(normed, W1, b1, W2, b2)
    x = x + ffn_result
    return x, wts

block_out, block_wts = transformer_block(
    x, W_qs, W_ks, W_vs, W_o, W1, b1, W2, b2
)
print(f"Block input shape: {x.shape}")
print(f"Block output shape: {block_out.shape}")
print(f"Values changed: {not np.allclose(x, block_out)}")
Block input shape: (7, 8)
Block output shape: (7, 8)
Values changed: True
Same shape in, same shape out. That’s the magic of transformer blocks — they stack. You can chain 32, 80, or 126 blocks, and the shapes never change. Each block refines the vectors a little more.
WARNING: Real LLMs stack many of these blocks. GPT-3 has 96. Llama 3 70B has 80. Each block has its own weights. The total count comes from all those weight matrices times the number of blocks. That’s why “large” language models are large.
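You can see where the count comes from by tallying our toy model's own matrices. This is pure arithmetic on the config values defined earlier:

```python
# Tally the toy model's weight matrices (same config as above)
EMBED_DIM, HEAD_DIM, NUM_HEADS = 8, 4, 2
FFN_HIDDEN, VOCAB_SIZE = 16, 32

attn = NUM_HEADS * 3 * EMBED_DIM * HEAD_DIM  # W_q, W_k, W_v per head
attn += (NUM_HEADS * HEAD_DIM) * EMBED_DIM   # W_o
ffn = EMBED_DIM * FFN_HIDDEN + FFN_HIDDEN    # W1 + b1
ffn += FFN_HIDDEN * EMBED_DIM + EMBED_DIM    # W2 + b2
per_block = attn + ffn
embed_params = VOCAB_SIZE * EMBED_DIM        # embedding table
unembed = EMBED_DIM * VOCAB_SIZE             # W_vocab projection

print(f"One block: {per_block} params")                  # 536
print(f"Embedding + output: {embed_params + unembed}")   # 512
print(f"96 such blocks: {96 * per_block + embed_params + unembed}")
```

Swap in real dims (12,288 instead of 8, a 100K vocab instead of 32) and multiply by 96 blocks, and these same four terms explode into billions.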
Step 7: Scoring the Vocabulary
After the final block, we have a rich vector for each position. To predict the next token, we grab the last position’s vector and project it to vocab size. This gives a raw score — a logit — for every token.
The projection is one matrix multiply: an 8-dim vector times an (8, 32) matrix gives 32 logits. Each logit says how likely that token is to come next.
W_vocab = np.random.randn(EMBED_DIM, VOCAB_SIZE) * 0.1
last_hidden = block_out[-1]
logits = last_hidden @ W_vocab
print(f"Last hidden shape: {last_hidden.shape}")
print(f"Logits shape: {logits.shape}")

top_5 = np.argsort(logits)[-5:][::-1]
print(f"\nTop 5 token scores:")
for idx in top_5:
    ch = id_to_char.get(idx, f"[ID {idx}]")
    print(f" ID {idx:2d} ('{ch}'): {logits[idx]:.4f}")
Last hidden shape: (8,)
Logits shape: (32,)
Top 5 token scores:
ID 1 ('a'): 0.0312
ID 9 ('t'): 0.0287
ID 3 ('e'): 0.0245
ID 7 ('o'): 0.0198
ID 0 (' '): 0.0156
With random weights, the scores are all near zero and the ranking is noise. A trained model would push the right token far above the rest.
Step 8: Sampling — Picking the Next Token
Logits aren’t probabilities. They can be any number. We turn them into a probability distribution using softmax, then pick a token from that spread.
Two knobs control this step.
Temperature tunes randomness. At 1.0, you get the raw odds. Below 1.0 (like 0.3), the model gets pickier — it almost always grabs the top choice. Above 1.0, the odds flatten and wilder picks happen.
Top-k limits the choices to the k best tokens. This stops the model from picking a token with almost zero chance.
Let’s code both and see how they shift the output.
def sample_token(logits, temperature=1.0, top_k=None):
    """Pick a token from logits with temperature and top-k."""
    scaled = logits / temperature
    if top_k is not None:
        top_indices = np.argsort(scaled)[-top_k:]
        mask = np.full_like(scaled, -1e9)
        mask[top_indices] = scaled[top_indices]
        scaled = mask
    probs = softmax(scaled)
    token_id = np.random.choice(len(probs), p=probs)
    return token_id, probs

print("Sampling at different temperatures:\n")
for temp in [0.3, 1.0, 1.5]:
    np.random.seed(99)
    tid, probs = sample_token(logits, temperature=temp)
    ch = id_to_char.get(tid, f"[{tid}]")
    top_p = probs.max()
    print(f" temp={temp}: picked '{ch}' "
          f"(ID {tid}, top prob={top_p:.3f})")
Sampling at different temperatures:
temp=0.3: picked 'a' (ID 1, top prob=0.085)
temp=1.0: picked 't' (ID 9, top prob=0.038)
temp=1.5: picked 'o' (ID 7, top prob=0.034)
Watch the top probability. At 0.3, the model is more sure — the best token stands out. At 1.5, the odds flatten and less likely tokens get picked.
TIP: For factual Q&A, set temperature to 0 or 0.3. For creative writing, try 0.7 to 1.0. A temperature of 0 is greedy — the model always picks the top token. Most APIs default to 1.0, but prod apps almost always lower it.
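One practical wrinkle: temperature 0 can’t literally be a division (it would divide by zero). Implementations special-case it as greedy argmax. Here is a minimal sketch with a hypothetical helper, separate from the simulator’s own `sample_token`:

```python
import numpy as np

# Hypothetical helper: shows how temperature 0 is handled in practice,
# by taking argmax instead of dividing by zero.
def pick_token(logits, temperature=1.0):
    """Greedy at temperature 0, temperature sampling otherwise."""
    if temperature == 0:
        return int(np.argmax(logits))   # always the single top token
    scaled = logits / temperature
    e = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs = e / e.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5])
print(pick_token(logits, temperature=0))  # 0, on every run
```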
Step 9: Decoding — Building a Full Reply
One token at a time. That’s how every LLM writes. It samples one token, adds it to the input, runs the whole model again, and samples the next. This loop goes until a stop token appears or a length cap hits.
Our generate function takes a prompt, runs all stages, and outputs max_tokens new chars.
def generate(prompt_text, max_tokens=10, temperature=0.8):
    """Generate text token by token."""
    ids = tokenize(prompt_text, char_to_id)
    generated = list(ids)
    for step in range(max_tokens):
        emb = embedding_matrix[np.array(generated)]
        pe = positional_encoding(len(generated), EMBED_DIM)
        x_in = emb + pe
        out, _ = transformer_block(
            x_in, W_qs, W_ks, W_vs, W_o,
            W1, b1, W2, b2
        )
        logits_step = out[-1] @ W_vocab
        new_id, _ = sample_token(logits_step, temperature=temperature)
        # Our vocab (32) is larger than our char set (10), so fold
        # sampled IDs back into the decodable range
        new_id = new_id % len(id_to_char)
        generated.append(new_id)
    return ''.join(id_to_char.get(i, '?') for i in generated)

np.random.seed(42)
result = generate("the ", max_tokens=12, temperature=0.8)
print(f"Generated: '{result}'")
Generated: 'the atsmncoe ta '
Random gibberish — and that’s correct! An untrained model has no clue which tokens follow which. The structure is right though. Tokenize, embed, attend, project, sample, repeat. A trained model with real weights would write proper text using this exact same loop.
KEY INSIGHT: The “intelligence” of an LLM isn’t in the loop. It’s in the weights. Training tunes millions of weight matrices so attention patterns and projections land on the right next token. The code you wrote here is the same code that runs inside GPT-4 — just with different (and much larger) numbers.
The KV Cache — Making Inference Faster
Did you spot the waste in our generate function? Each time we add a new token, we redo Q, K, and V for the whole sequence. But earlier tokens don’t change — causal masking means they only look back. Their K and V stay the same.
The KV cache stores past K and V values. On each new step, we only compute K and V for the fresh token. The first pass (called prefill) handles the full prompt. Each later pass (called decode) handles just one new token.
Here’s the optimized version. It stores K and V per head and reuses them across steps.
def generate_cached(prompt_text, max_tokens=10, temperature=0.8):
    """Generate with KV caching — skip past work."""
    ids = tokenize(prompt_text, char_to_id)
    generated = list(ids)
    k_cache = [np.zeros((0, HEAD_DIM)) for _ in range(NUM_HEADS)]
    v_cache = [np.zeros((0, HEAD_DIM)) for _ in range(NUM_HEADS)]
    for step in range(len(ids) + max_tokens - 1):
        # Prompt tokens first (prefill), then each fresh sampled token
        cur_id = generated[step]
        emb = embedding_matrix[cur_id:cur_id+1]
        pe = positional_encoding(step + 1, EMBED_DIM)
        x_cur = emb + pe[step:step+1]
        normed = layer_norm(x_cur)
        head_outs = []
        for h in range(NUM_HEADS):
            q = normed @ W_qs[h]
            k = normed @ W_ks[h]
            v = normed @ W_vs[h]
            # Append this token's K and V; reuse all earlier ones
            k_cache[h] = np.vstack([k_cache[h], k])
            v_cache[h] = np.vstack([v_cache[h], v])
            scores = (q @ k_cache[h].T) / np.sqrt(HEAD_DIM)
            wts = softmax(scores)
            out = wts @ v_cache[h]
            head_outs.append(out)
        concat = np.concatenate(head_outs, axis=-1)
        attn_out = concat @ W_o
        x_out = x_cur + attn_out
        normed2 = layer_norm(x_out)
        ffn_result = feed_forward(normed2, W1, b1, W2, b2)
        x_out = x_out + ffn_result
        # Sample once the full prompt is in the cache
        if step >= len(ids) - 1:
            logits_step = x_out[-1] @ W_vocab
            new_id, _ = sample_token(
                logits_step, temperature=temperature
            )
            new_id = new_id % len(id_to_char)
            generated.append(new_id)
    text = ''.join(id_to_char.get(i, '?') for i in generated)
    return text, k_cache

np.random.seed(42)
cached, k_cache = generate_cached("the ", max_tokens=8, temperature=0.8)
print(f"With KV cache: '{cached}'")
print(f"Cache size per head: {k_cache[0].shape}")
With KV cache: 'the catstmnt'
Cache size per head: (11, 4)
Same idea, much less work. Without caching, each new token costs O(N^2) work — you recompute attention for every pair from scratch. With caching, each step is O(N): the new token only checks against the stored keys.
WARNING: The KV cache trades memory for speed. For long chats, the cache eats a lot of GPU RAM. A 128K context on a 70B model can use 40+ GB just for the cache. Tricks like quantized KV cache and sliding window attention help keep memory in check.
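The 40+ GB figure can be reproduced with back-of-envelope arithmetic. The config below is an assumption — a Llama-3-70B-style layout with 80 layers, grouped-query attention with 8 KV heads of dim 128, and fp16 values:

```python
# Assumed config: Llama-3-70B-style layout with grouped-query attention
layers, kv_heads, head_dim = 80, 8, 128
seq_len, bytes_per_val = 128_000, 2   # 128K context, fp16

# Two cached tensors (K and V) per layer,
# each of shape (seq_len, kv_heads * head_dim)
cache_bytes = 2 * layers * seq_len * kv_heads * head_dim * bytes_per_val
print(f"KV cache at full context: {cache_bytes / 1e9:.1f} GB")  # 41.9 GB
```

Note how much grouped-query attention already saves: with full multi-head attention (64 query-sized KV heads instead of 8) the same cache would be 8x larger.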
{
type: 'exercise',
id: 'attention-exercise',
title: 'Exercise 1: Apply Causal Masking and Softmax',
difficulty: 'beginner',
exerciseType: 'write',
instructions: 'Given a 3×3 scores matrix, apply causal masking and softmax. Set future positions (above the diagonal) to -1e9, then run softmax on each row.',
starterCode: 'import numpy as np\n\ndef softmax(x, axis=-1):\n    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))\n    return e_x / e_x.sum(axis=axis, keepdims=True)\n\ndef apply_causal_mask_and_softmax(scores):\n    """Mask future positions and normalize.\n\n    Args:\n        scores: (seq_len, seq_len) array\n    Returns:\n        weights: masked attention weights\n    """\n    seq_len = scores.shape[0]\n    # TODO: Create upper triangular mask\n    # TODO: Add mask to scores\n    # TODO: Apply softmax\n    pass\n\nscores = np.array([[1.0, 2.0, 3.0],\n                   [4.0, 5.0, 6.0],\n                   [7.0, 8.0, 9.0]])\nweights = apply_causal_mask_and_softmax(scores)\nprint(weights.round(3))',
testCases: [
{ id: 'tc1', input: 'scores = np.array([[1.0, 2.0, 3.0],[4.0, 5.0, 6.0],[7.0, 8.0, 9.0]])\nw = apply_causal_mask_and_softmax(scores)\nprint(round(w[0][0], 3))', expectedOutput: '1.0', description: 'First token attends only to itself' },
{ id: 'tc2', input: 'scores = np.array([[1.0, 2.0, 3.0],[4.0, 5.0, 6.0],[7.0, 8.0, 9.0]])\nw = apply_causal_mask_and_softmax(scores)\nprint(round(w[1][2], 3))', expectedOutput: '0.0', description: 'Second token cannot see third' },
{ id: 'tc3', input: 'scores = np.array([[1.0, 2.0, 3.0],[4.0, 5.0, 6.0],[7.0, 8.0, 9.0]])\nw = apply_causal_mask_and_softmax(scores)\nprint(round(w[2].sum(), 1))', expectedOutput: '1.0', description: 'Rows sum to 1.0' }
],
hints: [
'Use np.triu(np.ones((seq_len, seq_len)), k=1) to get ones above the diagonal. Multiply by -1e9.',
'Full solution: mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9); return softmax(scores + mask)'
],
solution: 'def apply_causal_mask_and_softmax(scores):\n    seq_len = scores.shape[0]\n    mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)\n    return softmax(scores + mask)',
solutionExplanation: 'np.triu with k=1 gives ones above the main diagonal — the future spots. Multiplying by -1e9 makes them huge negatives. Softmax turns those into zero. Each token can only attend to itself and earlier tokens.',
xpReward: 15,
}
Common Mistakes with LLM Inference
Mistake 1: Thinking the model reads your whole prompt at once like a human
It doesn’t. During prefill, all tokens run in parallel through the blocks. But output comes one token at a time. Each new token needs a full pass through every block.
Mistake 2: Forgetting the sqrt(d_k) scaling
❌ Wrong:
scores = Q @ K.T           # Dot products grow with dim
weights = softmax(scores)  # Softmax saturates
Why it breaks: without scaling, dot products grow with the key size. For d_k=128, values hit 10-15. Softmax on those gives near-one-hot outputs. Gradients vanish.
✅ Correct:
scores = (Q @ K.T) / np.sqrt(d_k)  # Scaled to unit variance
weights = softmax(scores)
Mistake 3: Not caching K and V during generation
Recomputing K and V for the full sequence at each step turns O(N) work into O(N^2). For a 4,000-token reply, that’s 16 million wasted ops. Always use a KV cache in production.
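The arithmetic behind that 16 million figure, counting the K and V projections separately:

```python
N = 4000  # tokens generated

# Without a cache, step n recomputes K and V for all n tokens so far
no_cache = sum(2 * n for n in range(1, N + 1))
# With a cache, each step projects K and V for one new token only
with_cache = 2 * N

print(f"Without cache: {no_cache:,} K/V projections")  # 16,004,000
print(f"With cache:    {with_cache:,}")
```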
{
type: 'exercise',
id: 'sampling-exercise',
title: 'Exercise 2: Implement Temperature Sampling',
difficulty: 'beginner',
exerciseType: 'write',
instructions: 'Write a function that scales logits by temperature and returns softmax probabilities. Low temperature = sharper. High temperature = flatter.',
starterCode: 'import numpy as np\n\ndef softmax(x):\n    e_x = np.exp(x - np.max(x))\n    return e_x / e_x.sum()\n\ndef temperature_probs(logits, temperature):\n    """Scale logits and return probabilities.\n\n    Args:\n        logits: array of raw scores\n        temperature: float > 0\n    Returns:\n        probs: array that sums to 1\n    """\n    # TODO: Scale logits by temperature\n    # TODO: Apply softmax\n    pass\n\nlogits = np.array([2.0, 1.0, 0.5])\nprint(temperature_probs(logits, 1.0).round(3))\nprint(temperature_probs(logits, 0.5).round(3))',
testCases: [
{ id: 'tc1', input: 'logits = np.array([2.0, 1.0, 0.5])\np = temperature_probs(logits, 1.0)\nprint(round(p.sum(), 1))', expectedOutput: '1.0', description: 'Sums to 1.0' },
{ id: 'tc2', input: 'logits = np.array([2.0, 1.0, 0.5])\np = temperature_probs(logits, 0.1)\nprint(round(float(p[0]), 2))', expectedOutput: '1.0', description: 'Low temp concentrates on top' },
{ id: 'tc3', input: 'logits = np.array([2.0, 1.0, 0.5])\np = temperature_probs(logits, 100.0)\nprint(round(float(p[0]), 2))', expectedOutput: '0.34', description: 'High temp nears uniform' }
],
hints: [
'Divide the logits by temperature before softmax.',
'Full solution: return softmax(logits / temperature)'
],
solution: 'def temperature_probs(logits, temperature):\n    return softmax(logits / temperature)',
solutionExplanation: 'Dividing by a small temperature makes gaps between logits bigger. Softmax then gives a spikier output. Dividing by a big temperature shrinks the gaps. Softmax gives a flatter spread.',
xpReward: 15,
}
Where Our Simulator Differs from Real LLMs
Our code nails the core pipeline. But prod LLMs have extras we skipped:
Weight precision. We used float64. Real models use float16, bfloat16, or 4-bit quantized weights. A 70B model in float32 would need 280 GB — way more than any one GPU.
Parallelism. Real models split work across many GPUs. Tensor parallelism splits matrix multiplies. Pipeline parallelism sends layers to different chips. Our NumPy code runs on one thread.
Vocab size. We used 10 chars. GPT-4’s tokenizer has ~100K tokens. At GPT-3’s 12,288 dims, the embedding table alone is (100000, 12288) — over a billion params.
Activation functions. We used ReLU. Newer LLMs prefer SwiGLU or GELU. They’re smoother and tend to train better.
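The 280 GB figure above is just parameters times bytes per value, and the same arithmetic shows why quantization matters:

```python
# Memory to hold 70B parameters at different precisions.
# Pure arithmetic — no model is loaded here.
params = 70e9
for name, bytes_per in [("float32", 4), ("float16", 2), ("4-bit", 0.5)]:
    print(f"{name:8s}: {params * bytes_per / 1e9:,.0f} GB")
```

Dropping from float32 to 4-bit shrinks the weights from 280 GB to 35 GB, which is the difference between a GPU cluster and a single large card.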
PRODUCTION NOTE: Don’t write inference from scratch for real apps. Use vLLM, TGI, or llama.cpp. They handle caching, batching, quantization, and GPU memory. Our sim is for learning — those tools are for speed.
{
type: 'exercise',
id: 'pipeline-exercise',
title: 'Exercise 3: Build a Mini Forward Pass',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Put the pieces together. Given token IDs and weight matrices, run a simplified forward pass: (1) look up embeddings, (2) compute Q, K, V, (3) apply causal attention, (4) return the last position output.',
starterCode: 'import numpy as np\n\ndef softmax(x, axis=-1):\n    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))\n    return e_x / e_x.sum(axis=axis, keepdims=True)\n\ndef mini_forward(token_ids, embed_matrix, W_q, W_k, W_v):\n    """Simplified forward pass.\n\n    1. Look up embeddings\n    2. Compute Q, K, V\n    3. Causal attention\n    4. Return last position vector\n    """\n    # Step 1: Embed\n    # TODO\n\n    # Step 2: Q, K, V\n    # TODO\n\n    # Step 3: Causal attention\n    # TODO\n\n    # Step 4: Return last position\n    # TODO\n    pass\n\nnp.random.seed(0)\nembed = np.random.randn(10, 4) * 0.1\nWq = np.random.randn(4, 4) * 0.1\nWk = np.random.randn(4, 4) * 0.1\nWv = np.random.randn(4, 4) * 0.1\nids = np.array([1, 3, 5])\nresult = mini_forward(ids, embed, Wq, Wk, Wv)\nprint(result.shape)',
testCases: [
{ id: 'tc1', input: 'np.random.seed(0)\nembed = np.random.randn(10, 4) * 0.1\nWq = np.random.randn(4, 4) * 0.1\nWk = np.random.randn(4, 4) * 0.1\nWv = np.random.randn(4, 4) * 0.1\nids = np.array([1, 3, 5])\nresult = mini_forward(ids, embed, Wq, Wk, Wv)\nprint(result.shape)', expectedOutput: '(4,)', description: 'Output shape is (embed_dim,)' },
{ id: 'tc2', input: 'np.random.seed(0)\nembed = np.random.randn(10, 4) * 0.1\nWq = np.random.randn(4, 4) * 0.1\nWk = np.random.randn(4, 4) * 0.1\nWv = np.random.randn(4, 4) * 0.1\nids = np.array([2])\nresult = mini_forward(ids, embed, Wq, Wk, Wv)\nprint(result.shape)', expectedOutput: '(4,)', description: 'Works with one token too' }
],
hints: [
'Follow the pattern: x = embed_matrix[token_ids], Q = x @ W_q, scores = Q @ K.T / sqrt(d_k), mask, softmax, output = weights @ V, return output[-1].',
'mask = np.triu(np.ones((n, n)), k=1) * -1e9 where n = len(token_ids). Add to scores before softmax.'
],
solution: 'def mini_forward(token_ids, embed_matrix, W_q, W_k, W_v):\n    x = embed_matrix[token_ids]\n    Q = x @ W_q\n    K = x @ W_k\n    V = x @ W_v\n    d_k = Q.shape[-1]\n    scores = (Q @ K.T) / np.sqrt(d_k)\n    n = len(token_ids)\n    mask = np.triu(np.ones((n, n)), k=1) * (-1e9)\n    weights = softmax(scores + mask)\n    output = weights @ V\n    return output[-1]',
solutionExplanation: 'This is the whole pipeline in small form. Embed each token, project to Q/K/V, score with causal masking, weight the values, and grab the last position. That last vector is what you\'d feed into a vocab projection to predict the next token.',
xpReward: 20,
}
Summary
You built a full LLM inference loop. Here’s what each stage does:
- Tokenize — split text into integer IDs.
- Embed — swap each ID for a dense vector. Add position info.
- Self-attention — each token checks past tokens for context. Causal mask stops it from seeing ahead.
- FFN — a small network deepens each token’s features.
- Transformer block — stack attention + FFN with residuals and norm.
- Score — project the last vector to vocab size. Get one logit per token.
- Sample — turn logits into odds. Pick a token (with temperature / top-k).
- KV cache — store past keys and values. Skip old work.
The loop — embed, attend, project, sample — repeats for every single token.
Practice exercise: Add top-k to the generate function. Before softmax, keep only the top 5 logits and set the rest to -inf. Try k = 3, 5, and 10. See how it changes the output.
Frequently Asked Questions
How long does one forward pass take in a real LLM?
On an A100 GPU, a 7B model takes about 10-30ms per pass for a short prompt. A 70B model takes 100-200ms. Time grows with the number of blocks and with the square of sequence length. That’s why KV caching matters — it cuts the quadratic cost.
Why does the same prompt sometimes give different answers?
Temperature. When it’s above zero, the model samples from a spread of odds. Even at 0.3, you can get different tokens across runs. Set temperature=0 for fully repeatable output. Some providers add a tiny epsilon even at zero, though.
What’s the difference between prefill and decode?
Prefill runs your whole prompt through the model in one go — all tokens at once. Decode makes new tokens one at a time. Prefill is compute-heavy (lots of matrix math). Decode is memory-heavy (loading weights for just one token). Most inference tuning targets decode speed.
Can I extend this simulator to train a real model?
Yes, but you’d need backprop, a loss function (cross-entropy on next-token odds), and much more data. PyTorch or JAX would replace NumPy for auto-diff and GPU support. The model structure is the same — training just adjusts the weights while inference holds them fixed.
References
- Vaswani, A., et al. “Attention Is All You Need.” NeurIPS 2017.
- Alammar, J. “The Illustrated Transformer.”
- Transformer Explainer — interactive GPT-2 visualization.
- Hugging Face. “Behind the Pipeline: LLM Inference.”
- Karpathy, A. “Let’s Build GPT: From Scratch, in Code.”
- Su, J., et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” 2021.
- Dao, T., et al. “FlashAttention.” NeurIPS 2022.
- vLLM — fast LLM serving.
- Bhayani, A. “How LLM Inference Works.”
- DeepLearning.AI. “How Transformer LLMs Work.”
Reviewed: March 2026 | NumPy: 1.24+ | Python: 3.9+
