
How LLMs Work in Python – Transformer Architecture Explained

Written by Selva Prabhakaran | 28 min read

You type a question and an LLM writes a thoughtful reply in half a second. Understanding how LLMs really work, at the level of matrix multiplications and attention weights, is one of those things that transforms how you think about AI entirely. Most “how LLMs work” explanations are either handwavy diagrams or dense academic papers. This one is neither: every concept comes with runnable Python code so you can see the numbers for yourself.

We’ll build the transformer architecture layer by layer, from tokenization all the way to text generation.

Prerequisites: Python 3.9+, NumPy 1.24+. For the later sections: PyTorch 2.0+ and the transformers library.

bash
pip install numpy torch transformers tiktoken

What Is an LLM, Really?

Strip away the marketing and an LLM does one thing: given a sequence of tokens, it predicts the probability of the next token.

That’s it. The apparent “intelligence” emerges from doing this one thing billions of times during training on internet-scale text. By the time training finishes, the model has compressed an enormous amount of information about language, facts, and reasoning into its weights.

The architecture that makes this work is the transformer, introduced in the 2017 paper “Attention Is All You Need.” Every major LLM (GPT-4, Claude, Llama, Gemini) is built on this architecture. Understanding the transformer means understanding all of them.

Here’s the full pipeline, from text to prediction:

text
Raw text
  → Tokenizer        (text → integer IDs)
  → Embedding table  (IDs → vectors)
  → + Positional encoding
  → N × Transformer blocks:
      Self-attention  (tokens look at each other)
      Feed-forward    (each token processed independently)
      LayerNorm + Residual connections
  → Linear projection
  → Softmax → Probability over next token

Each step transforms your input into a richer representation. Let’s walk through each one.


Step 1 – Tokenization: Turning Text into Numbers

Before the model touches any text, the tokenizer converts it into integers. You might expect one number per word: “cat” → 1, “the” → 2. But that approach breaks on rare words, typos, and other languages. Vocabulary sizes would balloon into the millions.

The solution most modern LLMs use is Byte Pair Encoding (BPE), a subword tokenizer. BPE learns to split text into frequent subword units, so “tokenization” might become ["token", "ization"] and “unbelievably” might become ["un", "believably"]. Common words stay as single tokens; rare words get split into recognizable parts. I prefer BPE over character-level tokenization for this reason: it gives the model meaningful units to work with, rather than forcing it to learn everything from individual letters.
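To make “learns to split” concrete, here's a toy sketch of the BPE training loop: repeatedly find the most frequent adjacent pair and fuse it into a new token. Real tokenizers run this at the byte level over huge corpora; the three-word corpus here is purely illustrative.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent token pairs across all words; return the most common."""
    pairs = Counter()
    for tokens in words:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single token."""
    merged = []
    for tokens in words:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
for _ in range(2):                        # two merge steps
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair, "->", corpus)
# After two merges, "low" is a single token while the rarer suffixes stay split
```

Two merges are enough here: "l"+"o" fuses first, then "lo"+"w", leaving `low` as one token in every word.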

Here’s what that looks like using tiktoken, the same tokenizer GPT-4 uses. You’ll see the token IDs as integers, the string form of each subword (spaces typically attach to the following word), and the total count:

python
import tiktoken

# cl100k_base is the encoding used by GPT-3.5 and GPT-4
enc = tiktoken.encoding_for_model("gpt-4")

text = "Transformers power modern LLMs."
token_ids = enc.encode(text)
token_strings = [enc.decode([t]) for t in token_ids]

print("Token IDs:  ", token_ids)
print("Tokens:     ", token_strings)
print("Count:      ", len(token_ids))

Notice a few things when you run this. Spaces are typically attached to the following word. Punctuation gets its own token. And “LLMs” might be one token or two depending on the vocabulary.

The total number of distinct tokens is the vocabulary size โ€” about 100,000 for GPT-4. Every token in the vocabulary gets a unique integer ID. Those IDs are what the model actually sees.

Tip: **Token count drives costs and context limits.** LLMs have a context window: a maximum number of tokens they can process at once. GPT-4 Turbo supports 128K tokens. As a rough guide, 1 token ≈ ¾ of an English word, so 128K tokens ≈ a 300-page novel. Every API call is billed by token count, not word count; always count tokens directly with a tokenizer, not from word estimates.

Step 2 – Embeddings: From Token IDs to Meaning

Token IDs are just integers. You can’t do useful math on them directly: the model can’t reason that “token 47” is semantically related to “token 48” just because the numbers are close.

What you need is a vector for each token: a list of floating-point numbers that encodes meaning in a way the model can manipulate with linear algebra. That’s what the embedding table does.

Think of it as a giant lookup table: for each of the ~100,000 token IDs, there’s a row of, say, 768 numbers. Those numbers start random and get trained alongside everything else. By the end of training, similar tokens end up with similar vectors: “cat” and “kitten” end up close together, “cat” and “democracy” end up far apart.

python
import numpy as np

# A tiny vocabulary: 5 tokens, 4-dimensional embeddings
# (GPT-2 uses 768 dims; GPT-3 175B uses 12,288)
embedding_table = np.array([
    [ 0.1,  0.2,  0.3,  0.1],   # token 0: "the"
    [ 0.8,  0.1, -0.3,  0.5],   # token 1: "cat"
    [ 0.7,  0.2, -0.2,  0.4],   # token 2: "dog"
    [-0.1,  0.9,  0.2, -0.3],   # token 3: "sat"
    [-0.5,  0.1,  0.8,  0.2],   # token 4: "on"
])

# Sentence: "the cat sat" โ†’ token IDs [0, 1, 3]
token_ids = [0, 1, 3]
embeddings = embedding_table[token_ids]

print("Shape:", embeddings.shape)   # (3 tokens, 4 dims)
print(embeddings)

which gives us:

python
Shape: (3, 4)
[[ 0.1  0.2  0.3  0.1]
 [ 0.8  0.1 -0.3  0.5]
 [-0.1  0.9  0.2 -0.3]]

Each row is one token’s current representation. Three tokens in, three vectors out. The embedding lookup is literally just indexing into a table, so it’s extremely fast.

Key Insight: **Embeddings are the model’s knowledge before it reads the context.** The embedding for “bank” encodes everything the model learned about that word across billions of training documents: both the financial sense and the river sense. The attention mechanism (next section) is what uses surrounding words to figure out which sense applies.
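You can check the “similar tokens, similar vectors” claim on the toy table above with cosine similarity. The numbers are hand-picked, so treat this as an illustration of the idea rather than a trained result:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: +1 = same direction, -1 = opposite
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.8, 0.1, -0.3, 0.5])    # token 1 in the table above
dog = np.array([0.7, 0.2, -0.2, 0.4])    # token 2
on  = np.array([-0.5, 0.1, 0.8, 0.2])    # token 4

print("cat vs dog:", round(cosine_similarity(cat, dog), 3))   # high, ~0.988
print("cat vs on: ", round(cosine_similarity(cat, on), 3))    # low,  ~-0.549
```

In a trained model with hundreds of dimensions, this same measurement is what puts “cat” near “kitten” and far from “democracy”.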

Step 3 – Positional Encoding: Where Am I in the Sequence?

After the embedding lookup, you have a set of vectors, one per token. But sets have no order. “The cat sat” and “Sat the cat” would produce the same three vectors. Word order matters, and the model needs to know it.

The transformer fixes this by adding a positional encoding vector to each token’s embedding. Position 0 gets a different “signature” than position 1, position 2, and so on. The original transformer uses sinusoidal functions for this, a mathematical choice that lets the model generalise to sequence lengths it hasn’t seen during training.

python
def positional_encoding(seq_len, d_model):
    """Generates sinusoidal positional encodings."""
    PE = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len).reshape(-1, 1)       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)                     # [0, 2, 4, ...]
    div_term = 10000 ** (dims / d_model)

    PE[:, 0::2] = np.sin(positions / div_term)          # even dims: sin
    PE[:, 1::2] = np.cos(positions / div_term)          # odd dims:  cos
    return PE

pe = positional_encoding(seq_len=3, d_model=4)
print(pe.round(4))

the result:

python
[[ 0.      1.      0.      1.    ]
 [ 0.8415  0.5403  0.01    1.    ]
 [ 0.9093 -0.4161  0.02    0.9998]]

Each row is a unique position fingerprint. You add this to the token embeddings. Now every vector encodes both what the token means and where it sits in the sequence.
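The addition itself is one line. A quick sketch, reusing the formula above and the “cat” embedding from Step 2, shows that the same token now gets a different vector depending on where it appears:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Same sinusoidal formula as above
    PE = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len).reshape(-1, 1)
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)
    PE[:, 0::2] = np.sin(positions / div_term)
    PE[:, 1::2] = np.cos(positions / div_term)
    return PE

cat = np.array([0.8, 0.1, -0.3, 0.5])     # "cat" embedding from Step 2
pe  = positional_encoding(seq_len=3, d_model=4)

cat_at_1 = cat + pe[1]    # "cat" as the second token
cat_at_2 = cat + pe[2]    # "cat" as the third token

print("position 1:", cat_at_1.round(3))
print("position 2:", cat_at_2.round(3))
# Same raw embedding, different position signature -> different final vector
```

That difference is what lets the attention layers treat “the cat sat” and “sat the cat” as different inputs.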

Quick Check: Why does position 0 get all zeros in the sine columns and all ones in the cosine columns? It’s because sin(0) = 0 and cos(0) = 1 for any frequency. Position 0 is always the same, regardless of d_model.

Note: **Sinusoidal vs. learned vs. RoPE.** The original transformer uses the fixed formula above. GPT-2 uses learned positional embeddings: a trainable lookup table, one row per position. Most modern LLMs (Llama, Mistral, Gemma) use **RoPE** (Rotary Position Embedding), which encodes position directly into the Q and K vectors inside attention. The core idea, giving each position a unique signature, is the same across all three.
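For intuition, here's a minimal numpy sketch of the RoPE idea: rotate each (even, odd) pair of dimensions by an angle proportional to the position. Real implementations differ in how they pair dimensions and cache the angles, so treat this as a sketch of the principle, not a drop-in:

```python
import numpy as np

def rope(x, pos, base=10000):
    """Rotate consecutive (even, odd) dimension pairs of x by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * theta
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * np.cos(angles) - x_odd * np.sin(angles)
    out[1::2] = x_even * np.sin(angles) + x_odd * np.cos(angles)
    return out

q = np.array([1.0, 0.0, 0.5, 0.2])
k = np.array([0.3, 0.7, -0.1, 0.4])

# The payoff: the q.k attention score depends only on the *relative* offset
# between positions, because the two rotations compose by their difference
a = rope(q, pos=5) @ rope(k, pos=7)        # offset 2
b = rope(q, pos=105) @ rope(k, pos=107)    # offset 2, much later in the sequence
print(np.isclose(a, b))  # True
```

That relative-offset property is why RoPE extrapolates to long sequences better than an absolute position table.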

Step 4 – Self-Attention: The Core Mechanism

This is where the transformer earns its reputation. Everything else is scaffolding.

Self-attention answers one question for every token: which other tokens in this sequence should I focus on, and how much?

Consider “The animal didn’t cross the street because it was too tired.” What does “it” refer to: “animal” or “street”? You know it’s “animal” because of “too tired.” Self-attention is the mechanism that resolves this kind of ambiguity. The word “it” learns to attend strongly to “animal” and weakly to “street.” No rules, no hand-coding; just learned from data.

The Q, K, V Projections

Self-attention uses three learned matrices to project each token’s embedding into three different spaces:

  • Query (Q): “What information am I looking for?”
  • Key (K): “What information do I contain?”
  • Value (V): “What do I contribute if you attend to me?”

For every token, you compute Q, K, and V by multiplying the embedding by three learned weight matrices. Then you measure how well each Query matches each Key; that gives you attention scores. You convert scores to weights with softmax, then take a weighted sum of Values.

The formula: Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ V

The √d_k scaling prevents the dot products from getting too large in high dimensions, which would saturate softmax into very peaked, near-one-hot distributions.

Let’s build it from scratch:

python
def scaled_dot_product_attention(Q, K, V):
    """Core attention computation."""
    d_k = Q.shape[-1]

    # Score: how well does each query match each key?
    scores = Q @ K.T                                    # (seq_len, seq_len)

    # Scale to stabilise softmax
    scores = scores / np.sqrt(d_k)

    # Numerical stability: subtract row max before exp
    scores -= scores.max(axis=-1, keepdims=True)

    # Softmax: convert scores to weights (each row sums to 1)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of values
    output = weights @ V                                # (seq_len, d_k)
    return output, weights


# Three token embeddings, d_model = 4
X = np.array([
    [1.0, 0.0, 0.5, 0.2],   # token 0
    [0.0, 1.0, 0.3, 0.8],   # token 1
    [0.5, 0.5, 1.0, 0.0],   # token 2
])

# Use identity projection matrices for clarity
# (in a real model, these are learned during training)
W_Q = W_K = W_V = np.eye(4)
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (each row sums to 1):")
print(weights.round(3))
print("\nOutput shape:", output.shape)

Output:

python
Attention weights (each row sums to 1):
[[0.404 0.247 0.349]
 [0.232 0.472 0.296]
 [0.314 0.284 0.403]]

Output shape: (3, 4)

Read that weight matrix row by row. Token 0 pays 40.4% attention to itself, 24.7% to token 1, and 34.9% to token 2. Token 1 pays most attention to itself (47.2%). These weights are the model’s learned relevance scores: which tokens matter for understanding each other.

With identity projections, this is measuring raw embedding similarity. In a trained model, the learned W_Q, W_K, W_V matrices transform embeddings into a space where semantically relevant tokens score high, not just similar-looking ones. That’s the magic that training instills.

Key Insight: **Attention weights are interpretable.** In a trained model, looking at the weights for “it” in “The cat sat on the mat because it was tired”, you’ll often see “it” attending strongly to “cat.” The model didn’t learn this by rule; it learned it from billions of examples of pronoun resolution. That’s remarkable.

Causal Masking – Why Decoders Don’t Look Ahead

The attention mechanism above lets every token attend to every other token, including future ones. For an encoder (like BERT), that’s fine and actually useful for tasks like text classification.

But a decoder-only LLM (GPT, Llama, Claude) is trained to predict the next token autoregressively: one token at a time, each prediction depending only on the tokens before it. When predicting token 5, the model can’t see token 6, because token 6 is the answer. Letting it peek would be like giving a student the answer key during the exam.

“Autoregressive” just means the model generates one token, appends it to the sequence, then generates the next. Each output feeds back as input. That’s the entire generation loop.
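In code, that loop is tiny. The sketch below uses a stand-in `next_token_logits` function (hypothetical; it just favours last-token + 1) where a real model would run the full forward pass:

```python
import numpy as np

def next_token_logits(token_ids):
    # Stand-in for a real forward pass over a 10-token vocabulary:
    # it simply favours (last token + 1) mod 10
    logits = np.full(10, -1.0)
    logits[(token_ids[-1] + 1) % 10] = 5.0
    return logits

def generate(prompt_ids, n_new):
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = next_token_logits(ids)       # forward pass on everything so far
        ids.append(int(np.argmax(logits)))    # greedy pick; output becomes input
    return ids

print(generate([3], n_new=4))  # [3, 4, 5, 6, 7]
```

Swap the stand-in for a real model and `np.argmax` for a sampler, and this is the same loop every decoder-only LLM runs at inference time.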

The fix is a causal mask: an upper-triangular matrix of −∞ values that forces all future positions to zero after softmax:

python
def causal_attention(Q, K, V):
    """Self-attention with causal (autoregressive) masking."""
    seq_len, d_k = Q.shape

    scores = Q @ K.T / np.sqrt(d_k)

    # Upper triangle = -inf, diagonal and below = visible
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf

    # Mask-aware stable softmax: take the row max over the visible entries
    # only, and give masked (-inf) positions weight 0 directly
    finite  = np.isfinite(scores)
    row_max = np.where(finite, scores, -np.inf).max(axis=-1, keepdims=True)
    scores  = np.where(finite, scores - row_max, -np.inf)
    weights = np.where(finite, np.exp(scores), 0.0)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V, weights


output, weights = causal_attention(Q, K, V)
print("Causal attention weights:")
print(weights.round(3))

Output:

python
Causal attention weights:
[[1.000 0.000 0.000]
 [0.330 0.670 0.000]
 [0.314 0.284 0.403]]

Token 0 can only see itself (100%). Token 1 sees tokens 0 and 1. Token 2 sees all three. This is exactly how generation works: each new token attends to everything before it, nothing after.

Encoder vs. decoder: the fork in the road.

Whether a model uses causal or bidirectional attention determines what it can do. Here’s the comparison you’ll see referenced constantly:

|                    | Encoder-only                    | Decoder-only            | Encoder-decoder       |
|--------------------|---------------------------------|-------------------------|-----------------------|
| Attention          | Bidirectional (sees all tokens) | Causal (sees past only) | Both                  |
| Trained to         | Predict masked tokens           | Predict next token      | Translate / summarize |
| Examples           | BERT, RoBERTa                   | GPT, Llama, Claude      | T5, BART              |
| Good for           | Classification, NER, embeddings | Text generation, chatbots | Seq-to-seq tasks    |
| Context visibility | Both directions                 | Left-to-right only      | Mixed                 |

When you hear “decoder-only LLM,” the middle column is the one that matters for this article: causal masking, autoregressive generation, and everything that follows.


[TRY IT YOURSELF – Exercise 1]

The scaled_dot_product_attention function has no masking. Extend it to accept an optional causal parameter. When causal=True, apply the upper-triangular mask. When causal=False (default), run standard attention.

python
def attention_with_mask(Q, K, V, causal=False):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)

    if causal:
        # YOUR CODE HERE: apply causal mask
        pass

    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Test: causal=True should match causal_attention output
out, w = attention_with_mask(Q, K, V, causal=True)
print(w.round(3))
# Expected: [[1. 0. 0.], [0.330 0.670 0.], [0.314 0.284 0.403]]

Hint 1: np.triu(np.ones(...), k=1) creates the upper-triangular boolean mask. k=1 means start above the diagonal.

Hint 2: After setting scores[mask] = -np.inf, you need a stable softmax that treats −∞ as weight 0. See the finite variable pattern in causal_attention above.

Show solution
python
def attention_with_mask(Q, K, V, causal=False):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)

    if causal:
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores[mask] = -np.inf

    finite  = np.isfinite(scores)
    row_max = np.where(finite, scores, -np.inf).max(axis=-1, keepdims=True)
    scores  = np.where(finite, scores - row_max, -np.inf)
    weights = np.where(finite, np.exp(scores), 0.0)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

The `k=1` in `np.triu` starts the mask one position above the diagonal, so each token can still attend to itself. After softmax, the −∞ positions become exactly 0.


Multi-Head Attention – Looking from Multiple Angles

You might be wondering: if one attention head already lets every token look at every other token, why add more? Because a single head learns one way to measure relevance. A sentence has multiple kinds of structure simultaneously: syntactic dependencies, coreference chains, topic relationships, sentiment. Multi-head attention runs several attention operations in parallel, each with different learned projections, so each head can specialise in a different kind of relationship.

The implementation splits the embedding dimension across heads. With d_model = 512 and n_heads = 8, each head works in d_head = 64 dimensions. After all heads run, you concatenate their outputs back to d_model.

python
def multi_head_attention(X, n_heads=2):
    """Multi-head attention with per-slice projections (for illustration)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads     # 4 // 2 = 2

    head_outputs = []
    for h in range(n_heads):
        # Each head operates over a different slice of the embedding
        start = h * d_head
        X_h   = X[:, start:start + d_head]             # (seq_len, d_head)

        # In a real model, each head has its own W_Q, W_K, W_V
        out_h, _ = scaled_dot_product_attention(X_h, X_h, X_h)
        head_outputs.append(out_h)

    # Concatenate all head outputs back to d_model
    return np.concatenate(head_outputs, axis=-1)        # (seq_len, d_model)


mha_output = multi_head_attention(X, n_heads=2)
print("Multi-head output shape:", mha_output.shape)
print(mha_output.round(3))

Output:

python
Multi-head output shape: (3, 4)
[[0.615 0.385 0.619 0.319]
 [0.385 0.615 0.568 0.382]
 [0.500 0.500 0.664 0.272]]

The output is the same shape as the input, (seq_len, d_model). That’s a hard requirement for the residual connections you’ll see next.

I’ve found the head-pruning research genuinely surprising: you can remove 60–80% of attention heads from trained large models with minimal performance loss. The heads are highly redundant. But training with all of them seems to find better solutions than training with fewer heads from the start.

Tip: **Head count is a hyperparameter, not a magic number.** GPT-2 (117M) uses 12 heads. GPT-3 (175B) uses 96. More heads → more parallel perspectives on the data → better representations, in practice. But the marginal gain from each additional head diminishes quickly.

[TRY IT YOURSELF – Exercise 2]

In the multi_head_attention above, each head operates on a slice of the embedding. In real transformers, each head projects the full d_model-dimensional embedding down to d_head dimensions using its own W_Q, W_K, W_V matrices.

Implement a single attention head that does this full projection:

python
def single_head(X, d_head, seed=0):
    """One attention head with learned projections from full d_model."""
    seq_len, d_model = X.shape
    rng = np.random.default_rng(seed)

    # YOUR CODE: initialise W_Q, W_K, W_V of shape (d_model, d_head)
    # Project X โ†’ Q, K, V of shape (seq_len, d_head)
    # Run scaled_dot_product_attention and return the output
    pass

out = single_head(X, d_head=2)
print(out.shape)   # Should be (3, 2)

Hint 1: Initialise with small random values: rng.normal(0, scale, (d_model, d_head)) where scale = np.sqrt(1.0 / d_model). This is Xavier-style initialisation; it keeps activations in a reasonable range.

Hint 2: After initialising W_Q, W_K, W_V, compute Q = X @ W_Q, then pass Q, K, V into scaled_dot_product_attention. Return only the output, not the weights.

Show solution
python
def single_head(X, d_head, seed=0):
    seq_len, d_model = X.shape
    rng   = np.random.default_rng(seed)
    scale = np.sqrt(1.0 / d_model)

    W_Q = rng.normal(0, scale, (d_model, d_head))
    W_K = rng.normal(0, scale, (d_model, d_head))
    W_V = rng.normal(0, scale, (d_model, d_head))

    Q = X @ W_Q                         # (seq_len, d_head)
    K = X @ W_K
    V = X @ W_V

    output, _ = scaled_dot_product_attention(Q, K, V)
    return output                        # (seq_len, d_head)

Each head projects from `d_model` → `d_head` in a different direction, runs attention in that smaller space, and produces a `d_head`-dimensional output. The full MHA concatenates all heads back to `d_model`.


Feed-Forward Layers and LayerNorm

After attention mixes information between tokens, something needs to process each token’s updated vector on its own. That’s the feed-forward network (FFN), and when you look at the code, you’ll see it never touches two tokens at once. Each token gets its own independent transformation. The FFN follows a simple expand-then-contract pattern:

  • Expand: project from d_model to d_ff (typically 4 × d_model)
  • Activate: ReLU or GELU nonlinearity
  • Contract: project back from d_ff to d_model

python
def feed_forward(X, W1, b1, W2, b2):
    """Two-layer FFN with ReLU activation."""
    hidden = np.maximum(0, X @ W1 + b1)    # expand + ReLU
    return hidden @ W2 + b2                # contract

np.random.seed(0)
d_model, d_ff = 4, 8
W1 = np.random.randn(d_model, d_ff) * 0.1
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.1
b2 = np.zeros(d_model)

ffn_output = feed_forward(X, W1, b1, W2, b2)
print("FFN output shape:", ffn_output.shape)    # (3, 4)

Output:

python
FFN output shape: (3, 4)

Why the expand-then-contract? The wider hidden layer gives the model capacity for complex, non-linear transformations of each token. Research by Geva et al. (2021) suggests that factual associations (“Paris is the capital of France”) are stored primarily in the FFN weights; the attention layers route information, but the FFN is where much of the “knowledge” lives.

Layer normalisation stabilises training by rescaling each token’s vector to mean 0 and standard deviation 1. You apply it before each sub-layer, the “pre-norm” pattern most modern LLMs use:

python
def layer_norm(X, eps=1e-6):
    """Layer normalisation along the last dimension."""
    mean = X.mean(axis=-1, keepdims=True)
    std  = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)

normed = layer_norm(X)
print("Before norm:", X[0])
print("After norm: ", normed[0].round(4))
print("Mean after: ", normed[0].mean().round(6))    # 0.0
print("Std after:  ", normed[0].std().round(4))     # 1.0

Output:

python
Before norm:  [1.  0.  0.5 0.2]
After norm:   [ 1.5266 -1.1283  0.1991 -0.5973]
Mean after:   0.0
Std after:    1.0

Without LayerNorm, gradients explode or vanish as you stack many layers. With it, training 96-layer stacks becomes feasible.

Quick Check: After layer normalisation, the mean is 0 and std is 1. If you normalise a vector and then apply LayerNorm again, do you get the same vector back? Yes โ€” normalising an already-normalised vector is a no-op (within floating-point precision).
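You can verify that claim directly:

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    # Same function as above
    mean = X.mean(axis=-1, keepdims=True)
    std  = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)

X = np.array([[1.0, 0.0, 0.5, 0.2]])
once  = layer_norm(X)
twice = layer_norm(once)
print(np.allclose(once, twice))  # True: normalising twice changes nothing
```

After the first pass the vector already has mean 0 and std 1, so the second pass subtracts 0 and divides by (almost exactly) 1.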


The Full Transformer Block

One transformer block combines everything above into a clean unit:

  1. LayerNorm โ†’ Multi-Head Self-Attention โ†’ Residual Add
  2. LayerNorm โ†’ Feed-Forward โ†’ Residual Add

The residual connection (adding the input back to the output at each sub-layer) is critical. It gives gradients a direct path through the network during backpropagation, preventing the vanishing gradient problem that plagued deep networks before residuals. Without them, training a 12-layer transformer would be nearly impossible.
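A toy numpy sketch of that gradient argument, using a depth-20 stack of tanh layers and a hand-rolled chain rule (the 0.5 factor and the depth are arbitrary choices for illustration):

```python
import numpy as np

def input_gradient(use_residual, depth=20):
    """Gradient of the stack's output w.r.t. its input, via the chain rule."""
    x, grad = 1.0, 1.0
    for _ in range(depth):
        local = 0.5 * (1 - np.tanh(0.5 * x) ** 2)   # d tanh(0.5x)/dx, always <= 0.5
        # Residual layer x + tanh(0.5x): derivative 1 + local >= 1
        # Plain layer        tanh(0.5x): derivative     local <= 0.5
        grad *= (1 + local) if use_residual else local
        x = x + np.tanh(0.5 * x) if use_residual else np.tanh(0.5 * x)
    return grad

print("with residuals:   ", input_gradient(True))    # stays well above 1
print("without residuals:", input_gradient(False))   # vanishes toward zero
```

Without residuals, every layer multiplies the gradient by a factor of at most 0.5, so 20 layers crush it by orders of magnitude; with residuals, every factor is at least 1.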

Notice that this block takes an input of shape (batch, seq_len, d_model) and returns output of the exact same shape. That’s no coincidence: you need this shape guarantee to stack blocks and to add residuals correctly. Here’s the PyTorch implementation:

python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn   = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop  = nn.Dropout(dropout)

    def forward(self, x, causal_mask=None):
        # Sub-layer 1: pre-norm self-attention with residual
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + self.drop(attn_out)

        # Sub-layer 2: pre-norm feed-forward with residual
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.drop(ffn_out)
        return x


block  = TransformerBlock(d_model=4, n_heads=2, d_ff=16)
x_in   = torch.tensor(X, dtype=torch.float32).unsqueeze(0)    # (1, 3, 4)
x_out  = block(x_in)
print("Block output shape:", x_out.shape)   # torch.Size([1, 3, 4])

Output:

python
Block output shape: torch.Size([1, 3, 4])

A real GPT-2 stacks 12 of these blocks. GPT-3 stacks 96. Each block refines the token representations: early blocks capture local syntax, deeper blocks capture semantics, long-range dependencies, and factual associations.


How an LLM Generates Text

You’ve built the forward pass. Now let’s see how generation actually works.

At inference time, the model takes a sequence of token IDs, runs them through all transformer blocks, and outputs a probability distribution over the vocabulary at the last position. The highest-probability token is the “obvious” next word.

But the model doesn’t always pick the most probable token. The temperature parameter controls how peaked or flat the distribution is:

  • temperature < 1.0: sharper distribution, more repetitive output
  • temperature = 1.0: raw probabilities, unchanged
  • temperature > 1.0: flatter distribution, more creative (and less predictable) output

The formula: divide the logits by temperature before softmax.

python
# Toy example: 5 possible next tokens with raw logit scores
logits = np.array([3.5, 2.1, 1.8, 0.9, 0.3])

def softmax_with_temperature(logits, temperature=1.0):
    logits = logits / temperature
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

print("temp=0.5:", softmax_with_temperature(logits, 0.5).round(3))
print("temp=1.0:", softmax_with_temperature(logits, 1.0).round(3))
print("temp=2.0:", softmax_with_temperature(logits, 2.0).round(3))

Output:

python
temp=0.5: [0.908 0.055 0.03  0.005 0.002]
temp=1.0: [0.648 0.16  0.118 0.048 0.026]
temp=2.0: [0.417 0.207 0.178 0.114 0.084]

At temperature 0.5, the top token wins 90.8% of the time, nearly deterministic. At temperature 2.0, even the fifth-ranked token gets an 8.4% shot. In practice, I use temperature around 0.7 for structured output (code, JSON) and higher for open-ended creative tasks.

Key Insight: **Temperature doesn’t change what the model knows, only how confidently it expresses it.** A high-temperature model samples from the tail of the distribution: sometimes surprising and creative, sometimes wrong. A low-temperature model stays safe and repetitive. Most production systems land around 0.7–1.0.

[TRY IT YOURSELF – Exercise 3]

Top-k sampling is another generation strategy. Instead of sampling from the full vocabulary, you restrict to the top-k highest-probability tokens and redistribute probability mass among them. This prevents very unlikely tokens from ever being sampled, even at high temperatures.

Implement top_k_sample(logits, k, temperature=1.0):
1. Apply temperature scaling to logits
2. Keep only the top-k values; set the rest to −∞
3. Apply softmax over the remaining values
4. Return the sampled token index

python
def top_k_sample(logits, k, temperature=1.0):
    # YOUR CODE HERE
    pass

np.random.seed(42)
idx = top_k_sample(logits, k=3, temperature=1.0)
print("Sampled token index:", idx)
# Should be 0, 1, or 2 (the three highest-scoring tokens)

Hint 1: np.sort(logits)[-k] gives you the k-th largest value, which you can use as a cutoff threshold.

Hint 2: Use np.where(logits >= threshold, logits, -np.inf) to mask out below-threshold logits. Then apply the same softmax logic from softmax_with_temperature.

Show solution
python
def top_k_sample(logits, k, temperature=1.0):
    logits = np.array(logits, dtype=float) / temperature

    # Find the k-th largest value as the cutoff
    threshold = np.sort(logits)[-k]

    # Mask everything below the cutoff
    filtered = np.where(logits >= threshold, logits, -np.inf)

    # Softmax over remaining logits
    filtered -= filtered.max()
    probs = np.exp(filtered)
    probs /= probs.sum()

    return int(np.random.choice(len(probs), p=probs))

With `k=3` on our 5-logit example, only indices 0, 1, and 2 are eligible. The result will always be one of those three, no matter how high you set the temperature.


How LLMs Actually Learn

You’ve seen how the forward pass works (text in, probability distribution out). But you’re probably wondering: how do the weights end up containing anything useful? Here’s the short version.

During pretraining, the model sees billions of text sequences from the internet. For each sequence, it tries to predict the next token at every position. It compares its prediction to the real next token using cross-entropy loss, a number that measures how wrong the prediction was. Then backpropagation computes how much each weight contributed to that error, and gradient descent nudges every weight slightly in the direction that reduces the loss.
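Cross-entropy at one position is just the negative log of the probability the model assigned to the true next token. A minimal sketch over a hypothetical 4-token vocabulary:

```python
import numpy as np

def cross_entropy(logits, target_id):
    """-log(softmax probability of the correct token)."""
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_id])

logits = np.array([2.0, 0.5, -1.0, 0.1])           # model's scores for 4 tokens

print(cross_entropy(logits, target_id=0))  # true token got the top score: low loss
print(cross_entropy(logits, target_id=2))  # true token got a low score: high loss
```

Training is nothing more than driving this number down, averaged over every position of every sequence in the corpus.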

Do this enough times and something remarkable happens. The embedding vectors start grouping similar words together. The attention heads start resolving pronouns correctly. The FFN weights start storing factual associations. None of this was explicitly programmed; it emerged from predicting the next token, billions of times.

Pretraining produces a base model that can complete text fluently but doesn’t follow instructions. Fine-tuning on curated examples teaches it to follow instructions. RLHF (Reinforcement Learning from Human Feedback) further shapes it to give helpful, harmless, and honest responses; that’s how you get from a base GPT to ChatGPT.

Note: **Training vs. inference: completely different regimes.** During training, the model sees the full sequence with causal masking and adjusts its weights to reduce prediction error. During inference (when you’re using it), the weights are frozen and the model just does the forward pass: predict → sample → append → repeat. The causal masking code you wrote above is used in both, but only training updates the weights.

Running a Real LLM – GPT-2 in 15 Lines

You’ve built the pieces from scratch. Now let’s run a real pretrained model.

HuggingFace’s transformers library gives you access to the actual GPT-2 weights trained by OpenAI. Everything you’ve built above (tokenizer, embedding table, 12 transformer blocks, causal masking, softmax over the vocabulary) is happening under the hood:

python
# pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model     = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt    = "The transformer architecture works by"
inputs    = tokenizer(prompt, return_tensors="pt")

torch.manual_seed(42)
with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=30,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

GPT-2 (117M parameters) downloads in seconds and runs on CPU. It’s not GPT-4, but it uses exactly the architecture you’ve built — 12 transformer blocks, 12 attention heads, 768-dimensional embeddings, 50,257-token vocabulary.

Note: **GPU memory scales sharply with model size.** GPT-2 small (117M) runs on CPU with ~500 MB RAM. GPT-2 XL (1.5B) benefits from a GPU (~6 GB VRAM). Llama 3 8B needs ~16 GB VRAM in float16. To run large models without a big GPU, look into `llama.cpp` for 4-bit quantised inference — it cuts memory roughly 4×.

Common Coding Mistakes

These are the bugs I see most often when people first implement attention from scratch.

Forgetting to scale by √d_k. Without the scaling factor, dot products grow large in high dimensions and softmax saturates — most of the weight goes to one token and everything else gets ~0. Attention stops being informative. Always divide by np.sqrt(d_k) before softmax.

python
# WRONG — unscaled attention; scores grow with d_k
scores_wrong = Q @ K.T
weights_wrong = np.exp(scores_wrong - scores_wrong.max(axis=-1, keepdims=True))
weights_wrong /= weights_wrong.sum(axis=-1, keepdims=True)
print("Unscaled attention (row 1):", weights_wrong[1].round(3))
# Output: [0.148 0.611 0.241] — already skewed at d_k=4; at realistic d_k (64+)
# the skew becomes near-one-hot

# CORRECT — scaled attention distributes weight meaningfully
scores_right = Q @ K.T / np.sqrt(Q.shape[-1])
weights_right = np.exp(scores_right - scores_right.max(axis=-1, keepdims=True))
weights_right /= weights_right.sum(axis=-1, keepdims=True)
print("Scaled attention   (row 1):", weights_right[1].round(3))
# Output: [0.232 0.472 0.296]

Using d_model instead of d_head for the scaling factor. In multi-head attention, each head operates on d_head = d_model / n_heads dimensions. The scaling factor should be √d_head, not √d_model. Each head’s dot products grow with d_head, so that is the dimension to scale by; dividing by the larger √d_model over-flattens the attention distribution.
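A quick sanity check using GPT-2's published dimensions (d_model = 768, 12 heads) shows how far apart the two factors are:

```python
import numpy as np

d_model, n_heads = 768, 12
d_head = d_model // n_heads           # 64 dimensions per head

# Each head's Q and K vectors live in d_head dimensions, so their dot
# products grow with d_head -- that's the dimension to scale by.
print("sqrt(d_head)  =", np.sqrt(d_head))             # 8.0
print("sqrt(d_model) =", round(np.sqrt(d_model), 2))  # 27.71
```

Dividing by 27.71 instead of 8 shrinks every score by an extra factor of ~3.5 before softmax, flattening the attention weights.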

Omitting causal masking in decoder models. If you’re training a language model and forget the causal mask, the model can see future tokens during training. It will appear to train fine (low loss) because it’s essentially cheating. At inference, where there are no future tokens, performance collapses. Always verify your attention is actually causal during training.

Applying softmax before masking. The correct order is: compute scores → apply mask (set future positions to −∞) → apply softmax. If you apply softmax first, then zero out future positions, the remaining weights no longer sum to 1. Renormalising after the fact is ugly and error-prone.
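A minimal sketch of the correct order, using arbitrary scores for a 3-token sequence:

```python
import numpy as np

# Arbitrary attention scores for a 3-token sequence (row = query position)
scores = np.array([[2.0, 1.0, 0.5],
                   [1.5, 2.5, 0.2],
                   [0.3, 1.1, 2.2]])

# Correct order: mask FIRST (future positions to -inf), THEN softmax
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked = np.where(mask, -np.inf, scores)
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)              # exp(-inf) = 0, so future positions vanish
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.round(3))               # upper triangle is exactly 0
print(weights.sum(axis=-1))           # each row sums to 1
```

Because −∞ goes through softmax as exp(−∞) = 0, the normalisation already excludes future tokens and no renormalisation step is needed.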


Common Misconceptions

“LLMs understand language the way humans do.”
They don’t. An LLM learns statistical patterns between tokens at massive scale. It has no grounding in the physical world, no persistent memory across conversations, and no continuous experience. It produces output that looks like understanding because those patterns correlate strongly with correct answers — but the underlying mechanism is entirely different from human cognition.

“Hallucinations are a bug waiting to be fixed.”
They’re more like a feature working on the wrong objective. The model is doing exactly what it was trained to do: predict plausible next tokens. If a plausible-sounding continuation happens to be factually wrong, the model has no way to tell the difference. The fixes require retrieval-augmented generation, better training data curation, and output verification — not just better prompting.

“Bigger context windows solve memory.”
Not really. Attending over 128K tokens is computationally expensive (attention is O(n²) in sequence length). More importantly, research shows model performance on long-context tasks degrades when relevant information is in the middle of a long context — models attend to the beginning and end much more reliably. Longer windows help, but they don’t solve the problem completely.

Warning: **Don’t over-trust confident-sounding output.** LLMs generate fluent text even when they’re wrong. A confident wrong answer and a confident right answer look identical. For any factual claim you plan to act on — especially code, medical information, legal reasoning, or specific numbers — verify it independently.

Frequently Asked Questions

What’s the difference between a transformer and an LLM?
A transformer is an architecture. An LLM is a large model trained on language using that architecture. All major LLMs are transformers, but not all transformers are LLMs — BERT uses transformer encoder blocks and is trained on masked language modelling, not generation.

Do all LLMs use the same tokenizer?
No. GPT-4 uses cl100k_base (tiktoken, ~100K vocab). GPT-2 uses a smaller BPE tokenizer (~50K vocab). Llama 3 uses a ~128K-vocab BPE tokenizer (Llama 2 used SentencePiece). The same text will have different token counts across models — always count with the model’s own tokenizer.

Why do larger models perform better?
Scale in parameters improves memorisation, generalisation, and reasoning, but the returns aren’t linear: loss falls as a predictable power law in compute, data, and model size. The relationship is described by scaling laws (Hoffmann et al., 2022 — the Chinchilla paper), which also showed that many earlier models were trained on too little data for their parameter count.

What’s the difference between temperature, top-k, and top-p?
Temperature scales all logits before softmax. Top-k restricts the candidate pool to the k highest-probability tokens. Top-p (nucleus sampling) restricts to the smallest set of tokens whose cumulative probability exceeds p. Most production APIs let you combine all three. I typically set temperature first, then use top-p = 0.9 to cut off the very long tail.
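Here is a NumPy sketch of both filters on toy logits, with temperature applied first as described. The values are illustrative, not from any real model:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

logits = np.array([3.5, 2.1, 1.8, 0.9, 0.3])
probs  = softmax(logits / 0.8)        # 1) temperature applied first

# 2) top-k: keep the k most probable tokens, renormalise
k = 3
topk_idx = np.argsort(probs)[::-1][:k]
topk = np.zeros_like(probs)
topk[topk_idx] = probs[topk_idx]
topk /= topk.sum()

# 3) top-p: smallest prefix of probability-sorted tokens with cumulative prob >= p
p = 0.9
order = np.argsort(probs)[::-1]
cum   = np.cumsum(probs[order])
keep  = np.searchsorted(cum, p) + 1   # number of tokens in the nucleus
nucleus = np.zeros_like(probs)
nucleus[order[:keep]] = probs[order[:keep]]
nucleus /= nucleus.sum()

print("top-k (k=3):  ", topk.round(3))
print("top-p (p=0.9):", nucleus.round(3))
```

Note the difference in behaviour: top-k always keeps exactly k candidates, while top-p keeps however many it takes to cover probability mass p, so the nucleus shrinks when the model is confident and grows when it is uncertain.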

Can I fine-tune GPT-2 on my own data?
Yes. HuggingFace’s Trainer class handles most of the complexity. For small datasets, LoRA (Low-Rank Adaptation) fine-tuning is more efficient — it trains only a small set of adapter weights rather than the full model, which makes it feasible on a single consumer GPU.

What happens at the very end of the transformer — how does it pick a word?
After the final transformer block, the token representations are projected back to vocabulary size (one logit per token in the vocabulary) by a linear layer. Softmax converts those to probabilities. You sample from that distribution according to your temperature/top-k/top-p settings. That sampled token gets appended to the input, and the whole process repeats for the next position.
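As a toy sketch with random weights and a hypothetical 20-token vocabulary (real models use the sizes discussed above), that final step looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 20           # tiny toy sizes, not a real model

h_last = rng.normal(size=(d_model,))                   # final hidden state of the last token
W_out  = rng.normal(size=(d_model, vocab_size)) * 0.1  # "unembedding" projection

logits = h_last @ W_out               # one score per vocabulary token
logits = logits - logits.max()
probs  = np.exp(logits) / np.exp(logits).sum()

next_token = rng.choice(vocab_size, p=probs)           # sample according to probs
print("Sampled next token ID:", int(next_token))
```

In a real model the sampled ID would be appended to the input sequence and the whole forward pass repeated for the next position.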


Summary

Here’s what you’ve built and understood in this article:

  1. Tokenization converts text into integer IDs using BPE subword splitting
  2. Embeddings map those IDs to dense vectors that encode semantic meaning
  3. Positional encoding injects word-order information into each token’s vector
  4. Self-attention lets every token attend to every other token and weight their relevance, using learned Q, K, V projections
  5. Causal masking prevents decoder-only models from seeing future tokens during autoregressive training
  6. Multi-head attention runs attention in parallel from multiple learned perspectives
  7. Feed-forward layers apply non-linear transformations to each token independently
  8. LayerNorm + residuals keep gradients stable across stacks of transformer blocks
  9. Temperature and top-k control the randomness of text generation

The full code for all sections is in the complete script below.

Complete runnable script
python
# Complete code: How LLMs Work in Python
# Requires: pip install numpy torch transformers tiktoken
# Python 3.9+

import numpy as np
import torch
import torch.nn as nn

# --- Embedding table ---
embedding_table = np.array([
    [ 0.1,  0.2,  0.3,  0.1],
    [ 0.8,  0.1, -0.3,  0.5],
    [ 0.7,  0.2, -0.2,  0.4],
    [-0.1,  0.9,  0.2, -0.3],
    [-0.5,  0.1,  0.8,  0.2],
])
token_ids  = [0, 1, 3]
embeddings = embedding_table[token_ids]
print("Embeddings shape:", embeddings.shape)

# --- Positional encoding ---
def positional_encoding(seq_len, d_model):
    PE        = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len).reshape(-1, 1)
    dims      = np.arange(0, d_model, 2)
    div_term  = 10000 ** (dims / d_model)
    PE[:, 0::2] = np.sin(positions / div_term)
    PE[:, 1::2] = np.cos(positions / div_term)
    return PE

print("PE:\n", positional_encoding(3, 4).round(4))

# --- Token representations ---
X = np.array([
    [1.0, 0.0, 0.5, 0.2],
    [0.0, 1.0, 0.3, 0.8],
    [0.5, 0.5, 1.0, 0.0],
])

# --- Self-attention ---
def scaled_dot_product_attention(Q, K, V):
    d_k     = Q.shape[-1]
    scores  = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

Q, K, V = X @ np.eye(4), X @ np.eye(4), X @ np.eye(4)
_, w = scaled_dot_product_attention(Q, K, V)
print("Attention weights:\n", w.round(3))

# --- Causal masking ---
def causal_attention(Q, K, V):
    seq_len, d_k = Q.shape
    scores  = Q @ K.T / np.sqrt(d_k)
    mask    = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf
    finite  = np.isfinite(scores)
    row_max = np.where(finite, scores, -np.inf).max(axis=-1, keepdims=True)
    scores  = np.where(finite, scores - row_max, -np.inf)
    weights = np.where(finite, np.exp(scores), 0.0)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

_, cw = causal_attention(Q, K, V)
print("Causal weights:\n", cw.round(3))

# --- Multi-head attention ---
def multi_head_attention(X, n_heads=2):
    seq_len, d_model = X.shape
    d_head   = d_model // n_heads
    outputs  = []
    for h in range(n_heads):
        X_h      = X[:, h * d_head:(h + 1) * d_head]
        out_h, _ = scaled_dot_product_attention(X_h, X_h, X_h)
        outputs.append(out_h)
    return np.concatenate(outputs, axis=-1)

print("MHA shape:", multi_head_attention(X).shape)

# --- Layer norm ---
def layer_norm(X, eps=1e-6):
    mean = X.mean(axis=-1, keepdims=True)
    std  = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)

print("LayerNorm X[0]:", layer_norm(X)[0].round(4))

# --- Temperature softmax ---
logits = np.array([3.5, 2.1, 1.8, 0.9, 0.3])

def softmax_with_temperature(logits, temperature=1.0):
    logits  = logits / temperature
    logits -= logits.max()
    probs   = np.exp(logits)
    return probs / probs.sum()

for t in [0.5, 1.0, 2.0]:
    print(f"temp={t}:", softmax_with_temperature(logits, t).round(3))

print("Script completed successfully.")

References

  1. Vaswani, A. et al. — “Attention Is All You Need.” NeurIPS 2017. arXiv:1706.03762
  2. Radford, A. et al. — “Language Models are Unsupervised Multitask Learners.” OpenAI Blog (2019). GPT-2 paper.
  3. Hoffmann, J. et al. — “Training Compute-Optimal Large Language Models.” arXiv:2203.15556 (2022). Chinchilla scaling laws.
  4. Geva, M. et al. — “Transformer Feed-Forward Layers Are Key-Value Memories.” EMNLP 2021. arXiv:2012.14913.
  5. Michel, P. et al. — “Are Sixteen Heads Really Better Than One?” NeurIPS 2019. Head pruning research.
  6. HuggingFace — Transformers documentation. huggingface.co/docs/transformers
  7. Karpathy, A. — NanoGPT: a clean, minimal GPT implementation. github.com/karpathy/nanoGPT
  8. Raschka, S. — “Understanding and Coding Self-Attention from Scratch.” sebastianraschka.com
  9. Su, J. et al. — “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv:2104.09864.
  10. Phuong, M. & Hutter, M. — “Formal Algorithms for Transformers.” DeepMind (2022). arXiv:2207.09238.