Transformer Attention from Scratch in NumPy (Python)
Implement scaled dot-product attention, multi-head attention, and causal masking — with heatmaps at every step and runnable code.
Every large language model — GPT, Claude, Gemini — runs on the same core mechanism: attention. Strip away the billions of parameters, the RLHF fine-tuning, the safety filters. What’s left is one elegant idea. Each word looks at every other word and decides how much to care about it.
That’s attention. You can build it yourself in about 50 lines of NumPy.
By the end of this article, you won’t just understand attention in theory. You’ll have coded it, visualized it, and watched it work on real sentences. No PyTorch. No TensorFlow. Just NumPy arrays and matrix multiplication.
Before we write any code, here’s how the pieces connect.
You start with a sequence of words. Each word is a vector of numbers — its embedding. But embeddings alone don’t know about context. The word “bank” means different things in “river bank” and “bank account.”
Attention fixes this. Each word creates three vectors: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I carry?”).
The Query of one word gets compared against the Keys of all words. High similarity means “pay attention to that word.”
The resulting scores weight the Values. This produces a new representation that blends context from the entire sequence.
That’s one attention head. Multi-head attention runs several heads in parallel — each learning to focus on different relationships. Causal masking then ensures that during generation, a word only attends to words before it. No peeking at the future.
We’ll build each piece step by step: dot-product similarity, scaled attention, multi-head attention, and causal masking. Each step gets a heatmap so you can see exactly what the math produces.
Prerequisites
- Python version: 3.9+
- Required libraries: NumPy (1.24+), Matplotlib (3.7+)
- Install:
pip install numpy matplotlib
- Time to complete: 20-25 minutes
What Are Queries, Keys, and Values?
The biggest stumbling block with attention is the Q/K/V terminology. It sounds abstract. But there’s a concrete way to think about it.
Picture a library. You walk in with a question — that’s your Query. Every book on the shelf has a label describing its contents — that’s the Key. The actual information inside each book is the Value.
You compare your question against every label. The labels that match get high scores. Then you read the matching books and combine their information. That’s attention in one paragraph.
In a transformer, every word plays all three roles at the same time. Word A creates a Query and compares it against the Keys of words B, C, D. Word B does the same thing back. Every word is both asking and answering.
Here’s a quick reference:
| Vector | Role | Library Analogy | What It Does |
|---|---|---|---|
| Query (Q) | The question | Your search query | “What am I looking for?” |
| Key (K) | The label | Book spine labels | “What do I contain?” |
| Value (V) | The content | Book contents | “What information do I carry?” |
Let’s make this concrete with code. We’ll create a tiny sentence and build Q, K, V matrices from scratch using random embeddings and weight matrices.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# 4 words, each as a 6-dimensional embedding
sentence = ["The", "cat", "sat", "down"]
seq_len = len(sentence)
d_model = 6
# Random embeddings (in real models, these are learned)
embeddings = np.random.randn(seq_len, d_model)
print(f"Sentence: {sentence}")
print(f"Embedding shape: {embeddings.shape}")
Sentence: ['The', 'cat', 'sat', 'down']
Embedding shape: (4, 6)
Each word is a row of 6 numbers. Right now, these embeddings are context-free. “Cat” has the same vector whether it appears in “the cat sat” or “the cat burglar.” Attention will fix that.
To create Q, K, V, we multiply the embeddings by three separate weight matrices. Each weight matrix projects the embeddings into a different space — one for asking, one for describing, one for carrying information.
d_k = 4 # dimension of queries and keys
d_v = 4 # dimension of values
# Weight matrices (learned during training, random here)
W_Q = np.random.randn(d_model, d_k) * 0.5
W_K = np.random.randn(d_model, d_k) * 0.5
W_V = np.random.randn(d_model, d_v) * 0.5
# Project embeddings into Q, K, V spaces
Q = embeddings @ W_Q # (4, 6) @ (6, 4) = (4, 4)
K = embeddings @ W_K
V = embeddings @ W_V
print(f"Q shape: {Q.shape} (seq_len x d_k)")
print(f"K shape: {K.shape} (seq_len x d_k)")
print(f"V shape: {V.shape} (seq_len x d_v)")
Q shape: (4, 4) (seq_len x d_k)
K shape: (4, 4) (seq_len x d_k)
V shape: (4, 4) (seq_len x d_v)
Each row of Q is one word’s question. Each row of K is one word’s label. Each row of V is one word’s content.
Key Insight: Q, K, and V are just three different linear projections of the same embeddings. The weight matrices W_Q, W_K, W_V learn to project each word into three spaces: one for asking, one for describing, and one for carrying information.
How Dot-Product Similarity Works
Why do we use the dot product? When two vectors point in the same direction, their dot product is large and positive. Opposite directions give a large negative value. Perpendicular vectors give zero.
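To see that geometry in numbers, here's a tiny illustrative check (the vectors are made up purely for demonstration):

```python
import numpy as np

a = np.array([1.0, 2.0])
same = np.array([2.0, 4.0])       # same direction as a
opposite = np.array([-1.0, -2.0]) # opposite direction
perp = np.array([-2.0, 1.0])      # perpendicular to a

print(a @ same)      # 10.0 — large and positive
print(a @ opposite)  # -5.0 — negative
print(a @ perp)      # 0.0 — perpendicular vectors cancel out
```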
In attention, we compute the dot product between one word’s Query and another word’s Key. A large dot product means “these two words are relevant to each other.” Let’s compute this for our sentence.
# Raw attention scores: Q @ K^T
# Entry (i, j) = how much word i attends to word j
raw_scores = Q @ K.T # (4, 4) @ (4, 4) = (4, 4)
print("Raw attention scores (Q @ K^T):")
print(raw_scores.round(3))
Raw attention scores (Q @ K^T):
[[-0.622 -1.44 0.889 0.419]
[ 0.388 1.116 -0.422 -0.292]
[-0.38 -1.289 0.494 0.198]
[ 0.242 0.359 -0.367 -0.074]]
Row 0 is “The.” Its highest score (0.889) points to “sat” — meaning “The” finds “sat” most relevant. The lowest score (-1.440) means “cat” is least relevant to “The” in this random setup.
But raw dot products have a problem. They grow large as dimensions increase. And large values cause trouble when we apply softmax next.
Scaled Dot-Product Attention — The Full Formula
Here’s the issue. When d_k is large (say 512), dot products can reach the hundreds. Softmax converts scores into probabilities — but when inputs are large, it saturates. Almost all probability mass goes to one token. Gradients vanish and learning stalls.
The fix? Divide by the square root of d_k.
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

Where:
– \(Q\) = query matrix (seq_len x d_k)
– \(K\) = key matrix (seq_len x d_k)
– \(V\) = value matrix (seq_len x d_v)
– \(d_k\) = dimension of queries/keys
– \(\sqrt{d_k}\) = scaling factor that prevents softmax saturation
If the math notation isn’t your thing, skip ahead — the code below does the same thing.
Let’s implement this step by step. First, a numerically stable softmax. Then the full scaled attention.
def softmax(x):
"""Numerically stable softmax along the last axis."""
exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
# Step 1: Scale the raw scores
scale = np.sqrt(d_k)
scaled_scores = raw_scores / scale
print(f"Scaling factor (sqrt({d_k})): {scale:.2f}")
print(f"\nScaled scores:")
print(scaled_scores.round(3))
Scaling factor (sqrt(4)): 2.00
Scaled scores:
[[-0.311 -0.72 0.445 0.21 ]
[ 0.194 0.558 -0.211 -0.146]
[-0.19 -0.644 0.247 0.099]
[ 0.121 0.179 -0.184 -0.037]]
The scores halved because sqrt(4) = 2. With d_k = 512, scaling would reduce them by ~22.6x. That keeps the softmax in a healthy gradient range.
Now apply softmax. Each row becomes a probability distribution — it sums to 1.
# Step 2: Apply softmax to get attention weights
attention_weights = softmax(scaled_scores)
print("Attention weights (each row sums to 1):")
print(attention_weights.round(3))
print(f"\nRow sums: {attention_weights.sum(axis=1).round(3)}")
Attention weights (each row sums to 1):
[[0.187 0.124 0.399 0.315]
[0.277 0.399 0.184 0.197]
[0.192 0.122 0.297 0.256]
[0.265 0.281 0.196 0.226]]
Row sums: [1. 1. 1. 1.]
Look at row 0 (“The”). It places 39.9% of attention on “sat” and 31.5% on “down.” The model decides which words matter most for each position.
The final step multiplies these weights by the Value matrix. This produces a weighted blend of all words’ values for each position.
# Step 3: Weighted sum of values
attention_output = attention_weights @ V
print("Attention output (context-aware representations):")
print(attention_output.round(3))
print(f"Shape: {attention_output.shape}")
Attention output (context-aware representations):
[[ 0.044 0.313 -0.073 0.279]
[-0.045 0.236 0.018 0.271]
[ 0.04 0.306 -0.049 0.267]
[-0.015 0.263 -0.003 0.27 ]]
Shape: (4, 4)
Each row is now context-aware. “The” isn’t just its original embedding anymore — it’s a blend of information from all four words, weighted by relevance.
Let’s wrap everything into a clean function.
def scaled_dot_product_attention(Q, K, V):
"""
Compute scaled dot-product attention.
Returns output and attention weights.
"""
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)
weights = softmax(scores)
output = weights @ V
    return output, weights

Tip: The softmax needs numerical stability. Subtracting the max value before computing `exp()` prevents overflow. Without this, large values cause `exp()` to return infinity. Every production implementation does this.
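To see why the max-subtraction matters, here's a small demo (illustrative values, chosen to force the overflow):

```python
import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows to inf, and inf/inf gives nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(x) / np.sum(np.exp(x))
print(naive)  # [nan nan nan]

# Stable softmax: subtract the max first — mathematically identical,
# but the largest exponent is now exp(0) = 1, so nothing overflows
shifted = np.exp(x - x.max())
stable = shifted / shifted.sum()
print(stable.round(4))  # ≈ [0.09 0.2447 0.6652]
```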
Why Scaling Matters — An Experiment
I claimed scaling is important. Let me prove it. We’ll compare attention weights with and without scaling as d_k grows. Without scaling, the distribution should become increasingly extreme.
def compare_scaling(d_k_values):
"""Show how attention degrades without scaling."""
np.random.seed(42)
seq_len = 4
print(f"{'d_k':>6} | {'Max wt (scaled)':>16} | {'Max wt (unscaled)':>18}")
print("-" * 47)
for d_k in d_k_values:
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
scaled = softmax(Q @ K.T / np.sqrt(d_k))
unscaled = softmax(Q @ K.T)
print(f"{d_k:>6} | {scaled.max():>16.4f} | {unscaled.max():>18.4f}")
compare_scaling([4, 16, 64, 256, 512])
d_k | Max wt (scaled) | Max wt (unscaled)
-----------------------------------------------
4 | 0.3990 | 0.5765
16 | 0.3627 | 0.8435
64 | 0.3297 | 0.9983
256 | 0.2924 | 1.0000
512 | 0.2886 | 1.0000
The pattern is unmistakable. Without scaling, the max weight hits 1.0 at d_k=256 — all attention on a single token. With scaling, weights stay distributed. The model can learn nuanced relationships instead of hard one-hot attention.
[UNDER THE HOOD]
Why sqrt(d_k) specifically? When Q and K entries are independent with mean 0 and variance 1, the dot product Q·K has variance d_k. Dividing by sqrt(d_k) brings variance back to 1. This keeps softmax inputs in the range where gradients flow well.
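You can verify that variance claim empirically. This quick check (not part of the main pipeline) samples many random query/key pairs and measures the variance of their dot products before and after scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in [4, 64, 512]:
    # 10,000 independent (q, k) pairs with unit-variance entries
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = np.sum(q * k, axis=1)        # row-wise dot products
    scaled = dots / np.sqrt(d_k)
    print(f"d_k={d_k:>3}: var(dot) ~ {dots.var():7.1f}, "
          f"var(dot / sqrt(d_k)) ~ {scaled.var():.2f}")
```

The unscaled variance grows like d_k; the scaled variance stays near 1 regardless of dimension.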
Visualizing Attention Weights as a Heatmap
Numbers in a matrix are hard to read. A heatmap makes the attention pattern instantly visible. Each cell shows how much the row word attends to the column word. Brighter means stronger attention.
def plot_attention(weights, labels, title="Attention Weights"):
"""Plot attention weights as a heatmap."""
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(weights, cmap='Blues', vmin=0, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, fontsize=11)
ax.set_yticklabels(labels, fontsize=11)
ax.set_xlabel("Key (attending to)", fontsize=12)
ax.set_ylabel("Query (from)", fontsize=12)
ax.set_title(title, fontsize=13, fontweight='bold')
for i in range(len(labels)):
for j in range(len(labels)):
ax.text(j, i, f'{weights[i, j]:.2f}',
ha='center', va='center', fontsize=10,
color='white' if weights[i, j] > 0.5 else 'black')
fig.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout()
plt.show()
plot_attention(attention_weights, sentence, "Scaled Dot-Product Attention")

With random weights, the attention looks relatively uniform. In a trained model, you’d see sharp patterns — articles attending to their nouns, verbs linking to subjects.
Exercise 1: Compute Attention Weights Manually

Write a function compute_attention_weights(Q, K) that returns the attention weight matrix using scaled dot-product attention (without the V multiplication). Steps: (1) compute raw scores with Q @ K.T, (2) scale by sqrt(d_k), (3) apply softmax. Return the weight matrix.

Starter code:

import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def compute_attention_weights(Q, K):
    d_k = Q.shape[-1]
    # Step 1: raw scores
    # Step 2: scale
    # Step 3: softmax
    pass  # your code here

# Test
Q_test = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K_test = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
weights = compute_attention_weights(Q_test, K_test)
print(weights.round(3))
print("Shape:", weights.shape)

Hints:
- Remember: d_k is Q.shape[-1], the last dimension of the Q matrix.
- Full answer: softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

Solution:

def compute_attention_weights(Q, K):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores)

We get d_k from Q's last dimension, compute dot-product scores, scale them down, and apply row-wise softmax. The output shape is (seq_len, seq_len), and each row sums to 1. Scaling prevents saturation when d_k is large.
Multi-Head Attention — Multiple Perspectives at Once
Single-head attention has a limitation. It compresses all word relationships into one set of weights. But language has many types of relationships at once. “Cat” relates to “sat” syntactically (subject-verb) and to “The” grammatically (article-noun). Can one attention head capture both?
Multi-head attention solves this by running several attention operations in parallel. Each “head” gets its own W_Q, W_K, W_V matrices. Each head learns different relationship patterns.
The process has four steps:
- Split the embedding dimension among h heads (each head gets d_model / h dimensions)
- Run scaled dot-product attention per head independently
- Concatenate all head outputs back together
- Project through a final weight matrix
The math is identical to what we built — it just runs h times with smaller dimensions. Here’s the function. It’s longer, so I’ll walk through it after.
def multi_head_attention(embeddings, num_heads, d_model, causal=False):
"""
Multi-head attention with optional causal masking.
Args:
embeddings: (seq_len, d_model)
num_heads: number of attention heads
d_model: embedding dimension
causal: apply causal mask if True
Returns:
output: (seq_len, d_model)
all_weights: attention weights per head
"""
seq_len = embeddings.shape[0]
d_head = d_model // num_heads
mask = create_causal_mask(seq_len) if causal else None
head_outputs = []
all_weights = []
for h in range(num_heads):
# Each head: own projections
scale = 1.0 / np.sqrt(d_model)
W_Q = np.random.randn(d_model, d_head) * scale
W_K = np.random.randn(d_model, d_head) * scale
W_V = np.random.randn(d_model, d_head) * scale
Q_h = embeddings @ W_Q
K_h = embeddings @ W_K
V_h = embeddings @ W_V
out, w = masked_scaled_attention(Q_h, K_h, V_h, mask)
head_outputs.append(out)
all_weights.append(w)
# Concat heads: (seq_len, d_head * num_heads) = (seq_len, d_model)
concatenated = np.concatenate(head_outputs, axis=-1)
# Final output projection
W_O = np.random.randn(d_model, d_model) * (1.0 / np.sqrt(d_model))
    return concatenated @ W_O, all_weights

This function needs create_causal_mask and masked_scaled_attention, which we’ll define shortly. For now, focus on the structure.
Each head projects the full embeddings into a smaller d_head-dimensional space. It runs attention there, producing a (seq_len, d_head) output. We concatenate all heads to get back to d_model dimensions. The final W_O projection lets the model mix information across heads.
Let’s run it with 2 heads and visualize the different patterns each head discovers.
def create_causal_mask(seq_len):
"""Upper triangle = -inf, rest = 0."""
mask = np.zeros((seq_len, seq_len))
mask[np.triu_indices(seq_len, k=1)] = -np.inf
return mask
def masked_scaled_attention(Q, K, V, mask=None):
"""Scaled dot-product attention with optional mask."""
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)
if mask is not None:
scores = scores + mask
weights = softmax(scores)
return weights @ V, weights
np.random.seed(42)
num_heads = 2
mha_output, head_weights = multi_head_attention(
embeddings, num_heads, d_model, causal=False
)
print(f"Multi-head output shape: {mha_output.shape}")
print(f"Number of heads: {len(head_weights)}")
Multi-head output shape: (4, 6)
Number of heads: 2
The output keeps the same shape as the input. That’s by design — attention transforms representations without changing dimensions. Let’s visualize both heads side by side.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for idx, (ax, w) in enumerate(zip(axes, head_weights)):
im = ax.imshow(w, cmap='Blues', vmin=0, vmax=1)
ax.set_xticks(range(seq_len))
ax.set_yticks(range(seq_len))
ax.set_xticklabels(sentence, fontsize=10)
ax.set_yticklabels(sentence, fontsize=10)
ax.set_xlabel("Key", fontsize=11)
ax.set_ylabel("Query", fontsize=11)
ax.set_title(f"Head {idx + 1}", fontsize=12, fontweight='bold')
for i in range(seq_len):
for j in range(seq_len):
ax.text(j, i, f'{w[i, j]:.2f}',
ha='center', va='center', fontsize=9,
color='white' if w[i, j] > 0.5 else 'black')
fig.suptitle("Multi-Head Attention — Two Different Perspectives",
fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

The two heads produce different patterns. Even with random weights, each head distributes attention differently. In a trained GPT with 12+ heads, you’d see striking specialization — one head tracking adjacent words, another linking distant dependencies.
Key Insight: Multi-head attention doesn’t add new information — it adds new perspectives. Each head sees the same words but learns different relationships. Concatenation combines these perspectives into a richer representation than any single head could produce.
Causal Masking — No Peeking at the Future
Here’s a question you might be wondering about. Everything so far lets every word attend to every other word — including words that come after it. That’s fine for understanding text (like BERT). But for generating text? It’s a problem.
When a model predicts the next word, it can’t know what that word is. It should only use what came before. Causal masking enforces this rule.
The idea is simple. Before softmax, set the scores for future positions to negative infinity. After softmax, those positions get zero weight. The word can only look backward.
mask = create_causal_mask(seq_len)
print("Causal mask:")
print(mask)
Causal mask:
[[ 0. -inf -inf -inf]
[ 0. 0. -inf -inf]
[ 0. 0. 0. -inf]
[ 0. 0. 0. 0.]]
Row 0 (“The”) can only see itself. Row 1 (“cat”) sees “The” and itself. Row 3 (“down”) sees everything. This is exactly how GPT-style models work.
Let’s compare attention with and without the mask.
_, weights_no_mask = masked_scaled_attention(Q, K, V)
_, weights_masked = masked_scaled_attention(Q, K, V, mask=mask)
print("Without causal mask:")
print(weights_no_mask.round(3))
print("\nWith causal mask:")
print(weights_masked.round(3))
Without causal mask:
[[0.187 0.124 0.399 0.315]
[0.277 0.399 0.184 0.197]
[0.192 0.122 0.297 0.256]
[0.265 0.281 0.196 0.226]]
With causal mask:
[[1. 0. 0. 0. ]
[0.387 0.613 0. 0. ]
[0.268 0.171 0.561 0. ]
[0.265 0.281 0.196 0.226]]
The difference is dramatic. Row 0 (“The”) gives 100% attention to itself — it can’t see anything else. Row 3 (“down”) is unchanged because it’s the last word and already sees everything.
Let’s visualize both side by side.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
titles = ["No Mask (Bidirectional)", "Causal Mask (Autoregressive)"]
weight_sets = [weights_no_mask, weights_masked]
for ax, w, title in zip(axes, weight_sets, titles):
im = ax.imshow(w, cmap='Blues', vmin=0, vmax=1)
ax.set_xticks(range(seq_len))
ax.set_yticks(range(seq_len))
ax.set_xticklabels(sentence, fontsize=10)
ax.set_yticklabels(sentence, fontsize=10)
ax.set_xlabel("Key", fontsize=11)
ax.set_ylabel("Query", fontsize=11)
ax.set_title(title, fontsize=12, fontweight='bold')
for i in range(seq_len):
for j in range(seq_len):
ax.text(j, i, f'{w[i, j]:.2f}',
ha='center', va='center', fontsize=9,
color='white' if w[i, j] > 0.5 else 'black')
fig.suptitle("Effect of Causal Masking on Attention",
fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

The right heatmap shows the classic lower-triangular pattern. The upper triangle is all zeros — no future information leaks through.
Warning: The mask is added, not multiplied. We add -inf to the scores before softmax. Since exp(-inf) = 0, those positions get zero weight. Multiplying by zero instead won’t work — softmax would still distribute probability to those positions. Addition with -inf is the correct approach.
Quick check: If you also masked the diagonal (setting k=0 instead of k=1 in np.triu_indices), what would row 0 look like? Think about it before reading on. Answer: every entry would be -inf, softmax would produce NaN, and the model would break. Each word must be able to attend to at least itself.
Exercise 2: Implement Causal Masking

Write a function apply_causal_attention(Q, K, V) that computes scaled dot-product attention WITH causal masking. Steps: (1) compute scaled scores, (2) create and apply the causal mask, (3) softmax, (4) multiply by V. Return a tuple of (output, weights).

Starter code:

import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def apply_causal_attention(Q, K, V):
    seq_len = Q.shape[0]
    d_k = Q.shape[-1]
    # Step 1: scaled scores
    # Step 2: create and apply causal mask
    # Step 3: softmax
    # Step 4: multiply by V
    pass  # your code here

# Test
np.random.seed(0)
Q = np.random.randn(3, 2)
K = np.random.randn(3, 2)
V = np.random.randn(3, 2)
output, weights = apply_causal_attention(Q, K, V)
print("Weights:")
print(weights.round(3))
print("Output shape:", output.shape)

Hints:
- Create the mask with np.zeros((seq_len, seq_len)), then set the upper triangle to -np.inf using np.triu_indices(seq_len, k=1).
- Full sequence: scores = Q @ K.T / np.sqrt(d_k); scores = scores + mask; weights = softmax(scores); output = weights @ V.

Solution:

def apply_causal_attention(Q, K, V):
    seq_len = Q.shape[0]
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf
    scores = scores + mask
    weights = softmax(scores)
    output = weights @ V
    return output, weights

We create a causal mask where the upper triangle is -inf. Adding it to the scores ensures future positions become zero after softmax. The first word only attends to itself, the second to the first two, and so on.
Note: What about word order? You might have noticed that attention has no sense of position. Swap “cat sat” to “sat cat” and the scores don’t change. Real transformers fix this by adding positional encodings — sinusoidal functions or learned vectors — to the embeddings before attention. Each position gets a unique signature. We skip positional encoding here to focus on attention mechanics, but it’s the very first thing a full transformer adds.
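For the curious, here's a minimal sketch of the sinusoidal variant, following the formula from "Attention Is All You Need" (the function name is ours; real frameworks ship their own implementations):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(4, 6)
print(pe.shape)  # (4, 6) — same shape as the embeddings, so you can add them
```

In a full transformer you'd compute embeddings + pe before the first attention layer; each row is a unique positional signature.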
Putting It All Together — A Longer Example
We’ve built each piece. Let’s combine them on a longer sentence with 3 heads and causal masking to see richer patterns.
np.random.seed(7)
longer_sentence = ["The", "cat", "sat", "on", "the", "warm", "mat"]
longer_seq_len = len(longer_sentence)
d_model_larger = 12
num_heads_test = 3
longer_embeddings = np.random.randn(longer_seq_len, d_model_larger)
output, all_weights = multi_head_attention(
longer_embeddings, num_heads_test, d_model_larger, causal=True
)
print(f"Input shape: {longer_embeddings.shape}")
print(f"Output shape: {output.shape}")
print(f"Heads: {len(all_weights)}")
Input shape: (7, 12)
Output shape: (7, 12)
Heads: 3
Three heads, each with a (7, 7) attention matrix. Let’s visualize all three.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, (ax, w) in enumerate(zip(axes, all_weights)):
im = ax.imshow(w, cmap='Blues', vmin=0, vmax=1)
ax.set_xticks(range(longer_seq_len))
ax.set_yticks(range(longer_seq_len))
ax.set_xticklabels(longer_sentence, fontsize=9, rotation=45)
ax.set_yticklabels(longer_sentence, fontsize=9)
ax.set_title(f"Head {idx + 1}", fontsize=12, fontweight='bold')
ax.set_xlabel("Key", fontsize=10)
if idx == 0:
ax.set_ylabel("Query", fontsize=10)
fig.suptitle("3-Head Causal Attention on 'The cat sat on the warm mat'",
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Each head learns a different pattern. The causal mask enforces the lower-triangular shape across all three. In a trained model with 12+ heads, Head 1 might track nearby words, Head 2 might link “mat” back to “sat,” and Head 3 might connect the two instances of “the.”
Common Mistakes and How to Fix Them
Mistake 1: Forgetting to scale before softmax
The code runs fine. No error. But the model struggles to learn because attention weights saturate.
# Wrong: unscaled
scores_wrong = Q @ K.T
weights_wrong = softmax(scores_wrong)
# Correct: scaled
scores_right = Q @ K.T / np.sqrt(Q.shape[-1])
weights_right = softmax(scores_right)
print(f"Max weight (wrong): {weights_wrong.max():.4f}")
print(f"Max weight (right): {weights_right.max():.4f}")
Max weight (wrong): 0.5765
Max weight (right): 0.3990
The unscaled version already concentrates more weight on a single token. At higher dimensions this becomes much worse.
Mistake 2: Applying the mask after softmax
The mask must go before softmax. If you zero out weights after softmax, the remaining weights don’t sum to 1. The output is scaled incorrectly.
# Wrong: mask after softmax
weights_bad = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
mask_matrix = np.tril(np.ones((seq_len, seq_len)))
weights_bad = weights_bad * mask_matrix
print(f"Row sums (wrong): {weights_bad.sum(axis=1).round(3)}")
# Correct: mask before softmax
scores_c = Q @ K.T / np.sqrt(Q.shape[-1])
weights_good = softmax(scores_c + create_causal_mask(seq_len))
print(f"Row sums (right): {weights_good.sum(axis=1).round(3)}")
Row sums (wrong): [0.187 0.676 0.611 0.968]
Row sums (right): [1. 1. 1. 1.]
Mistake 3: Scaling by d_v instead of d_k
The scaling factor is always sqrt(d_k). Not d_v, not d_model. In most implementations d_k equals d_v, so the mistake hides. But it’s conceptually wrong and breaks when they differ.
d_k_ex, d_v_ex = 8, 16
print(f"Correct: sqrt(d_k) = {np.sqrt(d_k_ex):.2f}")
print(f"Wrong: sqrt(d_v) = {np.sqrt(d_v_ex):.2f}")
Correct: sqrt(d_k) = 2.83
Wrong: sqrt(d_v) = 4.00
When NOT to Build Attention from Scratch
I want to be upfront about when this NumPy approach is useful and when it isn’t.
Use this approach when:
– You’re learning how attention works and want to see every step
– You’re prototyping a custom attention variant before porting to PyTorch
– You need to explain attention to a colleague
– You’re debugging attention patterns in a trained model
Don’t use this approach when:
– You’re building a production model — use PyTorch’s nn.MultiheadAttention
– You need GPU acceleration — NumPy runs on CPU only
– You’re training on real data — autograd needs a framework
– Sequences exceed a few hundred tokens — the O(n^2) matrix explodes
For production work, PyTorch and JAX have optimized implementations like FlashAttention. These avoid materializing the full n x n matrix and run significantly faster. Our NumPy version is for understanding, not deployment.
Summary
You’ve built the core attention mechanism that powers every modern LLM — entirely in NumPy.
Here’s what we covered:
– Q, K, V projections — three linear transforms that create asking, describing, and carrying spaces
– Dot-product similarity — the engine measuring relevance between words
– Scaling by sqrt(d_k) — prevents softmax saturation as dimensions grow
– Softmax normalization — converts raw scores into attention probabilities
– Multi-head attention — parallel operations capturing different relationship types
– Causal masking — the lower-triangular pattern preventing future leakage
The same pattern scales from our 4-word example to GPT-4’s 128K-token context window. The math is identical — only the dimensions change.
Practice Exercise
Write a function that takes a list of words, an embedding dimension, and a head count. It should return the causal multi-head attention output and plot all head heatmaps.
Frequently Asked Questions
Why is the attention matrix O(n^2) in memory?
For a sequence of length n, the weight matrix has shape (n, n). At 100K tokens with float16, that’s 100K x 100K x 2 bytes = ~20 GB for one layer and one head. Techniques like FlashAttention compute attention without materializing this full matrix.
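That 20 GB figure is easy to verify yourself (simple arithmetic, assuming 2 bytes per float16 entry):

```python
n = 100_000          # sequence length in tokens
bytes_per_entry = 2  # float16
gb = n * n * bytes_per_entry / 1e9  # full (n, n) attention matrix
print(f"{gb:.0f} GB")  # 20 GB
```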
What’s the difference between self-attention and cross-attention?
In self-attention, Q, K, and V come from the same sequence. In cross-attention, Q comes from one sequence (decoder) while K and V come from another (encoder). Machine translation uses cross-attention so the decoder can query relevant source words.
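Here's a minimal sketch of that difference (the function name and shapes are illustrative; the softmax helper is repeated so the snippet is self-contained):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_states, W_Q, W_K, W_V):
    """Q comes from the decoder; K and V come from the encoder."""
    Q = decoder_states @ W_Q
    K = encoder_states @ W_K
    V = encoder_states @ W_V
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V  # one encoder-context vector per decoder token

np.random.seed(0)
dec = np.random.randn(3, 6)  # 3 target-side tokens
enc = np.random.randn(5, 6)  # 5 source-side tokens
W = [np.random.randn(6, 4) * 0.5 for _ in range(3)]
out = cross_attention(dec, enc, *W)
print(out.shape)  # (3, 4)
```

Note the attention matrix is rectangular here — (decoder_len, encoder_len) — because the two sequences can have different lengths.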
How does positional encoding interact with attention?
Attention has no built-in sense of word order. “Cat sat the” and “the cat sat” would produce identical patterns without positional info. Positional encodings (sinusoidal or learned) are added to embeddings before attention, giving each position a unique signature.
Can I use this NumPy code for training a model?
No. Training requires gradients via backpropagation. NumPy doesn’t track computational graphs. For training, use PyTorch nn.MultiheadAttention or JAX. Our code covers the forward pass only.
How many heads does GPT use?
GPT-2 Small uses 12 heads with d_model=768 (d_head=64). GPT-3 uses 96 heads with d_model=12288 (d_head=128). More heads help up to a point — each head needs at least 32-64 dimensions to capture meaningful patterns.
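Checking the arithmetic behind those per-head dimensions:

```python
# d_head = d_model / num_heads for each published configuration
configs = {"GPT-2 Small": (768, 12), "GPT-3": (12288, 96)}
for name, (d_model, heads) in configs.items():
    print(f"{name}: d_head = {d_model // heads}")
# GPT-2 Small: d_head = 64
# GPT-3: d_head = 128
```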
References
- Vaswani, A., et al. — “Attention Is All You Need.” NeurIPS 2017. arXiv:1706.03762
- Jay Mody — “GPT in 60 Lines of NumPy.” Blog post
- Eli Bendersky — “Notes on implementing Attention.” Blog post
- Peter Bloem — “Transformers from Scratch.” Blog post
- Dive into Deep Learning — “Multi-Head Attention.” D2L Chapter 11.5
- UvA Deep Learning Tutorials — “Transformers and Multi-Head Attention.” Tutorial 6
- NumPy Documentation — “Linear Algebra.” Link
- Dao, T., et al. — “FlashAttention: Fast and Memory-Efficient Exact Attention.” NeurIPS 2022. arXiv:2205.14135