
Positional Embeddings: RoPE & ALiBi Explained (Python)

Build sinusoidal, RoPE, and ALiBi positional embeddings from scratch in NumPy. Runnable code, heatmaps, and a clear comparison of all three schemes.

Written by Selva Prabhakaran | 26 min read


This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser.

Build three position encoding schemes used in real transformers, plot their patterns, and see how each one tells a model where tokens sit.

A transformer sees its input as a bag, not a sequence. Shuffle the words in a sentence and the self-attention scores stay the same. The model has no clue that “not” comes before “good” — unless you inject position info.

That’s the problem positional embeddings solve. Since 2017, three key methods have emerged. Sinusoidal encoding from the first transformer paper. Rotary Position Embeddings (RoPE), used in Llama and Mistral. And ALiBi — which skips embeddings and biases attention scores instead.

You’ll build all three from scratch in pure NumPy. No PyTorch. No TensorFlow. Just arrays, matrix math, and heatmaps.

Here’s how these three methods differ.

Sinusoidal embeddings make a fixed lookup table. Each position gets its own vector built from sine and cosine waves at different speeds. You add this vector to the token embedding before attention.

RoPE takes a different angle — literally. It spins the query and key vectors by an angle tied to position. When two spun vectors get dot-producted in attention, only the gap between them matters.

ALiBi is the simplest. It doesn’t touch embeddings at all. After you get the raw attention scores, ALiBi takes away a penalty based on how far apart two tokens are. Each head uses a different penalty slope.

We’ll build each piece, plot it, then run a side-by-side face-off.

Prerequisites

  • Python version: 3.9+
  • Required libraries: NumPy (1.24+), Matplotlib (3.7+)
  • Install: pip install numpy matplotlib
  • Time to complete: 25-30 minutes

What Problem Do Positional Embeddings Solve?

Self-attention takes dot products between all pairs of tokens. It’s order-blind — it gives the same output no matter how you arrange the tokens.

Here’s proof. We’ll create a toy embedding matrix, run a simplified attention step, then shuffle the rows and run it again.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Three token embeddings (dim=4)
embeddings = np.array([
    [1.0, 0.0, 1.0, 0.0],   # token 0
    [0.0, 1.0, 0.0, 1.0],   # token 1
    [1.0, 1.0, 0.0, 0.0],   # token 2
])

# Attention scores: Q @ K^T (using embeddings as both Q and K)
scores_original = embeddings @ embeddings.T
print("Original order attention scores:")
print(scores_original)

Output:

Original order attention scores:
[[2. 0. 1.]
 [0. 2. 1.]
 [1. 1. 2.]]

Now shuffle the token order and run the same math.

# Shuffle: move token 2 to position 0
shuffled = embeddings[[2, 0, 1]]
scores_shuffled = shuffled @ shuffled.T
print("Shuffled order attention scores:")
print(scores_shuffled)

Output:

Shuffled order attention scores:
[[2. 1. 1.]
 [1. 2. 0.]
 [1. 0. 2.]]

The diagonal stays at 2.0 — each token always matches itself. But the off-diagonal scores just rearranged. The model sees the same relationships with zero awareness that order changed.

Key Insight: Self-attention is a set operation, not a sequence operation. Without positional info, “the cat sat on the mat” and “mat the on sat cat the” look identical to the model.

That’s why every transformer needs a positional encoding scheme.

Sinusoidal Positional Encoding — The Original Approach

The “Attention Is All You Need” paper introduced this scheme in 2017. The idea: give each position its own vector using sine and cosine waves at different speeds.

Two equations. Even dims get sine. Odd dims get cosine:

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)\]
\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)\]

Where:
– \(pos\) = the token’s position (0, 1, 2, …)
– \(i\) = the dimension index (0, 1, … up to \(d/2\))
– \(d\) = the total embedding dimension

Why sines and cosines? For any fixed offset \(k\), the encoding at \(pos + k\) is a linear combination of the encoding at \(pos\). This gives the model a clean way to learn gaps between tokens.
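Here’s a quick numeric check of that property — a minimal sketch using one sin/cos pair at an arbitrary frequency (the values of `omega`, `pos`, and `k` are illustrative choices):

```python
import numpy as np

# One sin/cos pair at an arbitrary frequency (illustrative values)
omega = 1.0 / 10000 ** (2 * 3 / 64)   # frequency of dimension pair i=3, d=64
pos, k = 7, 5

pe_pos = np.array([np.sin(omega * pos), np.cos(omega * pos)])
pe_shifted = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])

# Fixed mixing matrix for offset k -- the angle-addition formulas in matrix
# form. It depends only on k, never on pos.
M = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

print(np.allclose(M @ pe_pos, pe_shifted))  # True
```

The same fixed matrix `M` works for every starting position, which is exactly what “linear combination” buys the model.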

If the math above isn’t your thing, skip ahead to the code — it does all the heavy lifting.

The function below builds the full encoding matrix. It makes a grid of positions and dims, works out the wave speed for each dim, then fills even columns with sine and odd columns with cosine.

def sinusoidal_encoding(seq_len, d_model):
    """Create sinusoidal positional encoding matrix.

    Returns shape (seq_len, d_model) with alternating sin/cos patterns.
    """
    position = np.arange(seq_len)[:, np.newaxis]       # (seq_len, 1)
    dim_indices = np.arange(d_model)[np.newaxis, :]     # (1, d_model)

    # Frequency term: 1 / 10000^(2i/d)
    angles = position / np.power(10000, (2 * (dim_indices // 2)) / d_model)

    # Even indices: sin, Odd indices: cos
    encoding = np.zeros_like(angles)
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])

    return encoding

This function builds the matrix but prints nothing yet. Let’s call it and inspect the results.

pe = sinusoidal_encoding(seq_len=50, d_model=64)
print(f"Encoding shape: {pe.shape}")
print(f"Position 0, first 8 dims: {np.round(pe[0, :8], 4)}")
print(f"Position 1, first 8 dims: {np.round(pe[1, :8], 4)}")

Output:

Encoding shape: (50, 64)
Position 0, first 8 dims: [0. 1. 0. 1. 0. 1. 0. 1.]
Position 1, first 8 dims: [0.8415 0.5403 0.6816 0.7318 0.5332 0.846  0.4093 0.9124]

Position 0 is all zeros for sine columns, all ones for cosine columns — since \(\sin(0)=0\) and \(\cos(0)=1\). Position 1 shows how each dim pair swings at a different speed.

The first pair changes fast. Later pairs change slowly. Let’s see this as a heatmap.

fig, ax = plt.subplots(figsize=(10, 6))
cax = ax.imshow(pe, aspect='auto', cmap='RdBu', interpolation='nearest')
ax.set_xlabel('Embedding Dimension')
ax.set_ylabel('Position')
ax.set_title('Sinusoidal Positional Encoding (50 positions, dim=64)')
fig.colorbar(cax, ax=ax, label='Encoding Value')
plt.tight_layout()
plt.show()

The heatmap shows it clearly: low dims swing fast (they tell nearby spots apart). High dims swing slowly (they tell far-apart spots apart). Together, every position gets a unique fingerprint.

Tip: Sinusoidal encoding is fixed and has zero learnable weights. Compute it once and add it to your embeddings. It works for any sequence length — even lengths the model never saw in training.
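To make that concrete, here’s a minimal sketch of the compute-once-and-add pattern (the `tokens` matrix is a random stand-in for real token embeddings, and the 512-position table size is an arbitrary choice):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Same function as above, repeated so this block runs on its own
    position = np.arange(seq_len)[:, np.newaxis]
    dim_indices = np.arange(d_model)[np.newaxis, :]
    angles = position / np.power(10000, (2 * (dim_indices // 2)) / d_model)
    encoding = np.zeros_like(angles)
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# Toy token embeddings for a 6-token sequence (random stand-ins)
np.random.seed(0)
tokens = np.random.randn(6, 64)

# Compute the table once, for the longest length you expect...
pe_table = sinusoidal_encoding(512, 64)

# ...then slice and add per batch. No weights, nothing to learn.
x = tokens + pe_table[:tokens.shape[0]]
print(x.shape)  # (6, 64)
```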

How similar are nearby positions? Let’s compute the cosine similarity between every pair.

norms = np.linalg.norm(pe, axis=1, keepdims=True)
pe_normalized = pe / norms
similarity = pe_normalized @ pe_normalized.T

fig, ax = plt.subplots(figsize=(7, 6))
cax = ax.imshow(similarity, cmap='viridis', interpolation='nearest')
ax.set_xlabel('Position')
ax.set_ylabel('Position')
ax.set_title('Cosine Similarity Between Sinusoidal Encodings')
fig.colorbar(cax, ax=ax, label='Similarity')
plt.tight_layout()
plt.show()

The bright diagonal confirms each position is most similar to itself. Similarity fades as the gap grows. That’s the signal the model needs.


Exercise 1 — Build a Custom Sinusoidal Encoding

{
  type: 'exercise',
  id: 'sinusoidal-custom',
  title: 'Exercise 1: Build Sinusoidal Encoding for a Short Sequence',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Complete the function `build_pe(seq_len, d_model)` that returns a sinusoidal positional encoding matrix. Use the formulas: sin for even dimensions, cos for odd dimensions, with frequency `1/10000^(2i/d)`. Then call it with `seq_len=4, d_model=8` and print the value at position 0, dimension 0.',
  starterCode: 'import numpy as np\n\ndef build_pe(seq_len, d_model):\n    position = np.arange(seq_len)[:, np.newaxis]\n    dim_indices = np.arange(d_model)[np.newaxis, :]\n    angles = position / np.power(10000, (2 * (dim_indices // 2)) / d_model)\n    encoding = np.zeros_like(angles)\n    # TODO: fill even columns with sin, odd columns with cos\n    encoding[:, 0::2] = ___\n    encoding[:, 1::2] = ___\n    return encoding\n\npe = build_pe(4, 8)\nprint(round(float(pe[0, 0]), 4))\nprint(round(float(pe[1, 0]), 4))',
  testCases: [
    { id: 'tc1', input: '', expectedOutput: '0.0\n0.8415', description: 'sin(0)=0 at position 0, sin(1)=0.8415 at position 1' },
    { id: 'tc2', input: 'print(round(float(pe[0, 1]), 4))', expectedOutput: '1.0', hidden: false, description: 'cos(0)=1 at position 0, dim 1' }
  ],
  hints: [
    'Even columns use np.sin(angles[:, 0::2]), odd columns use np.cos(angles[:, 1::2]).',
    'encoding[:, 0::2] = np.sin(angles[:, 0::2]); encoding[:, 1::2] = np.cos(angles[:, 1::2])'
  ],
  solution: 'import numpy as np\n\ndef build_pe(seq_len, d_model):\n    position = np.arange(seq_len)[:, np.newaxis]\n    dim_indices = np.arange(d_model)[np.newaxis, :]\n    angles = position / np.power(10000, (2 * (dim_indices // 2)) / d_model)\n    encoding = np.zeros_like(angles)\n    encoding[:, 0::2] = np.sin(angles[:, 0::2])\n    encoding[:, 1::2] = np.cos(angles[:, 1::2])\n    return encoding\n\npe = build_pe(4, 8)\nprint(round(float(pe[0, 0]), 4))\nprint(round(float(pe[1, 0]), 4))',
  solutionExplanation: 'The key is using np.sin on even-indexed columns (0, 2, 4, ...) and np.cos on odd-indexed columns (1, 3, 5, ...). At position 0, sin(0) = 0.0 and cos(0) = 1.0. At position 1, sin(1) = 0.8415.',
  xpReward: 15,
}

Now that sinusoidal encoding is clear, let’s move to the approach that modern LLMs actually use.

Rotary Position Embeddings (RoPE) — How Modern LLMs Handle Position

Sinusoidal encoding adds position before attention. RoPE works in a whole different way: it rotates the query and key vectors right inside the attention step.

Why does that matter? When you add a position vector, the dot product between tokens depends on both content and where each token sits. With RoPE, the dot product depends on content and the gap between tokens. That’s a better fit for language — “the word two spots back” matters more than “the word at spot 47.”

Here’s the core idea. RoPE groups dims into pairs. Each pair is a 2D plane. For each pair, it spins by an angle tied to the token’s position. Different pairs spin at different speeds.

The rotation matrix for one 2D pair at position \(pos\) is:

\[R(pos, \theta_i) = \begin{pmatrix} \cos(pos \cdot \theta_i) & -\sin(pos \cdot \theta_i) \\ \sin(pos \cdot \theta_i) & \cos(pos \cdot \theta_i) \end{pmatrix}\]

Where \(\theta_i = 10000^{-2i/d}\) — the same frequency schedule as sinusoidal encoding.

[UNDER THE HOOD]
Why rotation encodes relative position: When you dot-product two rotated vectors \(R(m)x\) and \(R(n)y\), the angles subtract. The result equals \(x^T R(m-n) y\). Only the gap \(m-n\) shows up. Where each token sits on its own drops out. That’s why RoPE grabs relative gaps with no extra work. Skip this box if the math isn’t your thing — the code below works the same either way.

The first function works out cosine and sine values for each position-dim pair. The second one does the actual spin by making a “swapped-and-flipped” copy of the input, then mixing it with the original using cos/sin weights.

def rope_frequencies(seq_len, d_model, base=10000):
    """Precompute RoPE rotation components.

    Returns cos and sin arrays, each shape (seq_len, d_model).
    """
    dim_pairs = np.arange(0, d_model, 2)
    freqs = 1.0 / np.power(base, dim_pairs / d_model)    # (d_model/2,)

    positions = np.arange(seq_len)                          # (seq_len,)
    angles = np.outer(positions, freqs)                     # (seq_len, d_model/2)

    # Duplicate each column: [cos0, cos0, cos1, cos1, ...]
    cos_vals = np.repeat(np.cos(angles), 2, axis=1)        # (seq_len, d_model)
    sin_vals = np.repeat(np.sin(angles), 2, axis=1)        # (seq_len, d_model)

    return cos_vals, sin_vals

apply_rope does the spin. It swaps each dim pair and flips the sign on one element, then blends the original and spun versions with cos and sin.

def apply_rope(x, cos_vals, sin_vals):
    """Apply rotary embeddings to input tensor x.

    x: shape (seq_len, d_model)
    Rotates consecutive dimension pairs by position-dependent angles.
    """
    # [x0, x1, x2, x3, ...] -> [-x1, x0, -x3, x2, ...]
    x_rotated = np.stack([-x[:, 1::2], x[:, 0::2]], axis=-1)
    x_rotated = x_rotated.reshape(x.shape)

    return x * cos_vals + x_rotated * sin_vals

Two functions, and we’ve got the full RoPE mechanism. Let’s test it on toy data.

We’ll make random query and key matrices, apply RoPE, and check what happens at position 0 (where the spin angle is zero) versus position 3.

seq_len, d_model = 8, 16
cos_vals, sin_vals = rope_frequencies(seq_len, d_model)

np.random.seed(42)
q = np.random.randn(seq_len, d_model) * 0.5
k = np.random.randn(seq_len, d_model) * 0.5

q_rope = apply_rope(q, cos_vals, sin_vals)
k_rope = apply_rope(k, cos_vals, sin_vals)

print(f"Original Q[0,:4]: {np.round(q[0,:4], 4)}")
print(f"RoPE Q[0,:4]:     {np.round(q_rope[0,:4], 4)}")

Output:

Original Q[0,:4]: [ 0.2484 -0.0691  0.3238  0.7615]
RoPE Q[0,:4]:     [ 0.2484 -0.0691  0.3238  0.7615]

At position 0, the spin angle is zero. \(\cos(0) = 1\), \(\sin(0) = 0\). So RoPE is the identity — nothing changes.

print(f"Original Q[3,:4]: {np.round(q[3,:4], 4)}")
print(f"RoPE Q[3,:4]:     {np.round(q_rope[3,:4], 4)}")

At position 3, the values shift. The first pair spins by \(3\theta_0\). Later pairs spin by smaller angles.

Key Insight: RoPE encodes position by spinning, not adding. The dot product of two RoPE-changed vectors depends only on the gap between them — not where each one sits. This is why Llama, Mistral, and Gemma all use RoPE.

Let’s plot the attention pattern with and without RoPE. We compute \(Q \cdot K^T\) both ways and display them as heatmaps.

scores_no_rope = q @ k.T / np.sqrt(d_model)
scores_with_rope = q_rope @ k_rope.T / np.sqrt(d_model)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].imshow(scores_no_rope, cmap='RdBu', interpolation='nearest')
axes[0].set_title('Attention Scores (No Positional Info)')
axes[0].set_xlabel('Key Position')
axes[0].set_ylabel('Query Position')

axes[1].imshow(scores_with_rope, cmap='RdBu', interpolation='nearest')
axes[1].set_title('Attention Scores (With RoPE)')
axes[1].set_xlabel('Key Position')
axes[1].set_ylabel('Query Position')

plt.tight_layout()
plt.show()

In the RoPE version, scores shift along the main line. Nearby spots get a similar push. Far-apart spots drift more. That’s relative position awareness at work.


Exercise 2 — Apply RoPE and Verify Relative Position

{
  type: 'exercise',
  id: 'rope-verify',
  title: 'Exercise 2: Verify RoPE Encodes Relative Position',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Create two identical vectors at positions 2 and 5 (distance=3), and two identical vectors at positions 10 and 13 (also distance=3). Apply RoPE to all four. Compute the dot product of the pair at (2,5) and the pair at (10,13). Print both dot products rounded to 4 decimal places. They should be equal — because RoPE encodes relative distance, not absolute position.',
  starterCode: 'import numpy as np\n\ndef rope_frequencies(seq_len, d_model, base=10000):\n    dim_pairs = np.arange(0, d_model, 2)\n    freqs = 1.0 / np.power(base, dim_pairs / d_model)\n    positions = np.arange(seq_len)\n    angles = np.outer(positions, freqs)\n    cos_vals = np.repeat(np.cos(angles), 2, axis=1)\n    sin_vals = np.repeat(np.sin(angles), 2, axis=1)\n    return cos_vals, sin_vals\n\ndef apply_rope(x, cos_vals, sin_vals):\n    x_rotated = np.stack([-x[:, 1::2], x[:, 0::2]], axis=-1)\n    x_rotated = x_rotated.reshape(x.shape)\n    return x * cos_vals + x_rotated * sin_vals\n\nd = 16\ncos_v, sin_v = rope_frequencies(20, d)\n\nnp.random.seed(99)\nvec = np.random.randn(d)  # same vector for all positions\n\n# TODO: place vec at positions 2, 5, 10, 13\n# Apply RoPE, compute dot products, print them\n',
  testCases: [
    { id: 'tc1', input: '', expectedOutput: '', description: 'Both dot products should be equal (same relative distance)' }
  ],
  hints: [
    'Use cos_v[pos] and sin_v[pos] to get the rotation for a single position. Apply RoPE to vec reshaped as (1, d).',
    'r2 = apply_rope(vec.reshape(1,-1), cos_v[2:3], sin_v[2:3]); r5 = apply_rope(vec.reshape(1,-1), cos_v[5:6], sin_v[5:6]); dot1 = (r2 @ r5.T)[0,0]'
  ],
  solution: 'import numpy as np\n\ndef rope_frequencies(seq_len, d_model, base=10000):\n    dim_pairs = np.arange(0, d_model, 2)\n    freqs = 1.0 / np.power(base, dim_pairs / d_model)\n    positions = np.arange(seq_len)\n    angles = np.outer(positions, freqs)\n    cos_vals = np.repeat(np.cos(angles), 2, axis=1)\n    sin_vals = np.repeat(np.sin(angles), 2, axis=1)\n    return cos_vals, sin_vals\n\ndef apply_rope(x, cos_vals, sin_vals):\n    x_rotated = np.stack([-x[:, 1::2], x[:, 0::2]], axis=-1)\n    x_rotated = x_rotated.reshape(x.shape)\n    return x * cos_vals + x_rotated * sin_vals\n\nd = 16\ncos_v, sin_v = rope_frequencies(20, d)\n\nnp.random.seed(99)\nvec = np.random.randn(d)\n\nr2 = apply_rope(vec.reshape(1, -1), cos_v[2:3], sin_v[2:3])\nr5 = apply_rope(vec.reshape(1, -1), cos_v[5:6], sin_v[5:6])\nr10 = apply_rope(vec.reshape(1, -1), cos_v[10:11], sin_v[10:11])\nr13 = apply_rope(vec.reshape(1, -1), cos_v[13:14], sin_v[13:14])\n\ndot_2_5 = float((r2 @ r5.T)[0, 0])\ndot_10_13 = float((r10 @ r13.T)[0, 0])\n\nprint(round(dot_2_5, 4))\nprint(round(dot_10_13, 4))',
  solutionExplanation: 'Both dot products are equal because RoPE encodes relative position. Positions (2,5) and (10,13) have the same distance of 3. The rotation angles at those positions differ in absolute value, but when you dot-product the rotated vectors, only the relative distance (3) determines the result.',
  xpReward: 20,
}

RoPE is the industry standard. But there’s an even simpler approach that works surprisingly well.

ALiBi — The Simplest Positional Scheme

ALiBi (Attention with Linear Biases) showed up at ICLR 2022. Its idea is dead simple: don’t change the embeddings at all. After you get the raw attention scores, subtract a penalty that grows with the gap between tokens.

The formula:

\[\text{ALiBi}(q_i, k_j) = q_i \cdot k_j - m \cdot |i - j|\]

Where:
– \(q_i \cdot k_j\) = raw attention score
– \(m\) = head-specific slope (fixed, not learned)
– \(|i - j|\) = distance between positions

Each head gets its own slope \(m\). The slopes form a geometric sequence — each is a fixed fraction of the previous. Steep-slope heads focus on nearby tokens. Gentle-slope heads can look further away.

Why does this work? The bias builds in a “closer is better” rule. But each head has a different reach. Together, they cover all gaps.

The slope math follows the paper’s rule. Each head’s slope is a power of \(2^{-8/H}\), where \(H\) is the head count.

def alibi_slopes(num_heads):
    """Compute ALiBi slopes for each attention head.

    Uses the geometric sequence from the paper.
    Returns array of shape (num_heads,).
    """
    ratio = 2 ** (-8 / num_heads)
    slopes = np.array([ratio ** (i + 1) for i in range(num_heads)])
    return slopes

The bias tensor multiplies each head’s slope by the distance matrix. The minus sign ensures the bias is always a penalty.

def alibi_bias(seq_len, num_heads):
    """Build the full ALiBi bias tensor.

    Returns shape (num_heads, seq_len, seq_len).
    Each head has a different slope applied to the distance matrix.
    """
    slopes = alibi_slopes(num_heads)

    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])

    bias = -slopes[:, None, None] * distance[None, :, :]

    return bias

Two small functions — that’s the entire ALiBi setup. No rotation matrices. No sinusoidal lookup tables. Just distances multiplied by slopes.

Let’s see the bias matrices for 4 heads with a 20-token sequence.

num_heads = 4
seq_len = 20
bias = alibi_bias(seq_len, num_heads)

slopes = alibi_slopes(num_heads)
print(f"Head slopes: {np.round(slopes, 4)}")
print(f"Bias shape: {bias.shape}")

Output:

Head slopes: [0.25   0.0625 0.0156 0.0039]
Bias shape: (4, 20, 20)
Bias shape: (4, 20, 20)

Head 0 has slope 0.25 — it penalizes distance heavily. Head 3 has slope roughly 0.0039. A token 16 positions away only picks up a bias of -0.0625 from head 3. That head can “see” far.
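One rough way to compare the heads: the distance at which a head’s penalty reaches 1.0 is \(1/m\). A quick sketch, inlining the slope rule from above so it runs on its own:

```python
import numpy as np

# Slope rule from the ALiBi paper, inlined for self-containment
num_heads = 4
ratio = 2 ** (-8 / num_heads)
slopes = np.array([ratio ** (i + 1) for i in range(num_heads)])

# Rough "reach" per head: the distance where the penalty hits 1.0
reach = 1.0 / slopes
print(reach)   # 4, 16, 64, 256 -- each head sees four times further than the last
```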

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for h in range(num_heads):
    im = axes[h].imshow(bias[h], cmap='RdBu', interpolation='nearest',
                         vmin=bias.min(), vmax=0)
    axes[h].set_title(f'Head {h} (m={slopes[h]:.4f})')
    axes[h].set_xlabel('Key Position')
    if h == 0:
        axes[h].set_ylabel('Query Position')

fig.colorbar(im, ax=axes, label='Bias Value', shrink=0.8)
fig.suptitle('ALiBi Bias Matrices — 4 Heads', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

Head 0 shows dark colors everywhere but the main line. Head 3 stays light even far from center. This mix is the key — the model gets both local and long-range attention with no learned weights.

Warning: ALiBi biases are always negative (or zero on the main line). If you forget the minus sign, you’ll boost attention to far-off tokens — the exact opposite of what you want.

Here’s how ALiBi changes attention in practice. We get raw scores, add the head-0 bias, then run softmax to see the weight spread.

raw_scores = q @ k.T / np.sqrt(d_model)

alibi_biases = alibi_bias(seq_len=8, num_heads=4)
adjusted_scores = raw_scores + alibi_biases[0]

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

attn_no_alibi = softmax(raw_scores)
attn_with_alibi = softmax(adjusted_scores)

print("Attention weights for token 4 (no ALiBi):")
print(np.round(attn_no_alibi[4], 4))
print("\nAttention weights for token 4 (with ALiBi, head 0):")
print(np.round(attn_with_alibi[4], 4))

Without ALiBi, token 4 spreads its focus across all spots. With the steep head-0 bias, focus clusters tightly near position 4. Far-off tokens get much less weight.

Tip: ALiBi shines when you need longer reach. A model trained on 1024-token runs can often handle 2048 or 4096 tokens at test time with no retraining. The bias scales to longer runs on its own.
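Here’s a sketch of that extrapolation, reusing the two functions from above (inlined so the block runs on its own; the toy lengths 128 and 512 stand in for real training and test lengths):

```python
import numpy as np

def alibi_slopes(num_heads):
    ratio = 2 ** (-8 / num_heads)
    return np.array([ratio ** (i + 1) for i in range(num_heads)])

def alibi_bias(seq_len, num_heads):
    slopes = alibi_slopes(num_heads)
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    return -slopes[:, None, None] * distance[None, :, :]

bias_train = alibi_bias(128, 8)   # "training" length
bias_long = alibi_bias(512, 8)    # 4x longer at test time: same formula, nothing retrained

# The long bias agrees exactly with the short one on the overlap --
# extending the context just extends the same linear ramp
print(np.allclose(bias_long[:, :128, :128], bias_train))  # True
```

There is no lookup table to outgrow, which is why the bias scales to longer runs for free.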

Exercise 3 — Build ALiBi Bias and Check Its Properties

{
  type: 'exercise',
  id: 'alibi-build',
  title: 'Exercise 3: Build ALiBi Bias for 8 Heads',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Compute ALiBi slopes for 8 heads and build the bias matrix for a 10-token sequence. Print the slope of head 0 (rounded to 6 decimals) and the bias value at position (0, 9) for head 0 (this is the penalty for the maximum distance of 9).',
  starterCode: 'import numpy as np\n\ndef alibi_slopes(num_heads):\n    ratio = 2 ** (-8 / num_heads)\n    slopes = np.array([ratio ** (i + 1) for i in range(num_heads)])\n    return slopes\n\ndef alibi_bias(seq_len, num_heads):\n    slopes = alibi_slopes(num_heads)\n    positions = np.arange(seq_len)\n    distance = np.abs(positions[:, None] - positions[None, :])\n    bias = -slopes[:, None, None] * distance[None, :, :]\n    return bias\n\n# TODO: build bias for 8 heads, seq_len=10\n# Print slope of head 0 and bias at [0, 0, 9]\n',
  testCases: [
    { id: 'tc1', input: '', expectedOutput: '0.5\n-4.5', description: 'Head 0 slope is 0.5, max-distance bias is -4.5' }
  ],
  hints: [
    'slopes = alibi_slopes(8) gives you the slopes array. Print slopes[0]. For bias: b = alibi_bias(10, 8); print b[0, 0, 9].',
    'With 8 heads, ratio = 2^(-8/8) = 0.5. slopes[0] = 0.5^1 = 0.5. Bias at (0,9) = -0.5 * 9 = -4.5.'
  ],
  solution: 'import numpy as np\n\ndef alibi_slopes(num_heads):\n    ratio = 2 ** (-8 / num_heads)\n    slopes = np.array([ratio ** (i + 1) for i in range(num_heads)])\n    return slopes\n\ndef alibi_bias(seq_len, num_heads):\n    slopes = alibi_slopes(num_heads)\n    positions = np.arange(seq_len)\n    distance = np.abs(positions[:, None] - positions[None, :])\n    bias = -slopes[:, None, None] * distance[None, :, :]\n    return bias\n\nslopes = alibi_slopes(8)\nprint(round(float(slopes[0]), 6))\n\nb = alibi_bias(10, 8)\nprint(round(float(b[0, 0, 9]), 6))',
  solutionExplanation: 'With 8 heads, ratio = 2^(-8/8) = 2^(-1) = 0.5. slopes[0] = 0.5^1 = 0.5. Bias at (0,9) for head 0 = -0.5 * 9 = -4.5. The steepest head penalizes 9 positions of distance by 4.5 units.',
  xpReward: 15,
}

Now you’ve built all three. Let’s see how they compare on the same data.

Side-by-Side Comparison — All Three Schemes

We’ll make a 12-token run, get attention with each scheme, and show the heatmaps side by side. This uses all the functions we’ve built so far.

The code makes random Q and K arrays, then runs all three schemes: sinusoidal adds to embeddings, RoPE spins Q and K, ALiBi biases scores. Softmax turns each into attention weights.

seq_len_cmp = 12
d_model_cmp = 32
num_heads_cmp = 4

np.random.seed(123)
q_cmp = np.random.randn(seq_len_cmp, d_model_cmp) * 0.5
k_cmp = np.random.randn(seq_len_cmp, d_model_cmp) * 0.5

# Raw scores (shared baseline)
raw = q_cmp @ k_cmp.T / np.sqrt(d_model_cmp)

# Sinusoidal: add PE to both Q and K
pe_cmp = sinusoidal_encoding(seq_len_cmp, d_model_cmp)
scores_sin = (q_cmp + pe_cmp) @ (k_cmp + pe_cmp).T / np.sqrt(d_model_cmp)

# RoPE: rotate Q and K
cos_cmp, sin_cmp = rope_frequencies(seq_len_cmp, d_model_cmp)
q_rope_cmp = apply_rope(q_cmp, cos_cmp, sin_cmp)
k_rope_cmp = apply_rope(k_cmp, cos_cmp, sin_cmp)
scores_rope = q_rope_cmp @ k_rope_cmp.T / np.sqrt(d_model_cmp)

# ALiBi: bias the raw scores
alibi_cmp = alibi_bias(seq_len_cmp, num_heads_cmp)
scores_alibi = raw + alibi_cmp[1]

attn_sin = softmax(scores_sin)
attn_rope = softmax(scores_rope)
attn_alibi = softmax(scores_alibi)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

titles = ['Sinusoidal', 'RoPE', 'ALiBi (head 1)']
matrices = [attn_sin, attn_rope, attn_alibi]

for ax, title, mat in zip(axes, titles, matrices):
    im = ax.imshow(mat, cmap='viridis', interpolation='nearest',
                    vmin=0, vmax=mat.max())
    ax.set_title(title, fontsize=13)
    ax.set_xlabel('Key Position')
    ax.set_ylabel('Query Position')

fig.colorbar(im, ax=axes, label='Attention Weight', shrink=0.8)
fig.suptitle('Attention Patterns: Three Positional Encoding Schemes',
             fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

What you’ll notice:

  • Sinusoidal shifts scores but doesn’t push a strong local bias. It gives position sense and lets the learned weights decide what to attend to.
  • RoPE makes a relative-gap pattern. Tokens at the same gap get the same shift.
  • ALiBi has the sharpest local focus. Attention clusters near the main line because the penalty directly pushes down far-off tokens.
| Feature | Sinusoidal | RoPE | ALiBi |
| --- | --- | --- | --- |
| Where applied | Added to embeddings | Rotates Q and K | Biases attention scores |
| Position type | Absolute | Relative (via rotation) | Relative (via linear bias) |
| Learnable? | No | No | No |
| Parameters | Zero | Zero | Zero |
| Length extrapolation | Weak | Moderate (with scaling) | Strong |
| Used in | Original Transformer, BERT | Llama, Mistral, Gemma | BLOOM, MPT |
| Computation | One-time add | Per-layer rotation | Per-layer bias add |
Key Insight: All three schemes have zero learnable weights. They encode position through math alone — waves, spins, or penalties.

When to Use Which Scheme

Choosing between these three isn’t always clear. Here are quick rules.

Pick sinusoidal when you’re building a basic transformer to learn. It’s the most studied and easiest to debug. But it locks in where each token sits, which hurts on longer runs.

Pick RoPE when you need solid results on inputs of mixed length. It’s the top choice in modern open-source LLMs. If I were starting a new project today, RoPE would be my default.

Pick ALiBi when the model will see longer inputs at test time than in training. The bias stretches on its own. It’s also the simplest to code.

When NOT to use these: Don’t swap the scheme on a pretrained model. Its weights fit one scheme. Changing it breaks what the model learned.

Warning: Don’t mix schemes. Adding sinusoidal embeddings AND using RoPE would double-encode position. The model gets mixed signals. Stick with one method.

Common Mistakes and How to Fix Them

Mistake 1: Applying RoPE to only Q (not K)

The most common RoPE bug. If you only spin the query, the dot product won’t encode the gap between tokens.

Wrong:

q_rotated = apply_rope(q, cos_vals, sin_vals)
scores = q_rotated @ k.T  # k is NOT rotated

Why it fails: The spin trick only works when both Q and K get spun.

Correct:

q_rotated = apply_rope(q, cos_vals, sin_vals)
k_rotated = apply_rope(k, cos_vals, sin_vals)
scores = q_rotated @ k_rotated.T

Mistake 2: Positive ALiBi biases (missing negative sign)

Wrong:

bias = slopes[:, None, None] * distance[None, :, :]

Why it fails: Positive biases boost attention to far-off tokens — the opposite of what ALiBi does.

Correct:

bias = -slopes[:, None, None] * distance[None, :, :]

Mistake 3: Swapped sin/cos in sinusoidal encoding

Wrong:

encoding[:, 0::2] = np.cos(angles[:, 0::2])  # Swapped!
encoding[:, 1::2] = np.sin(angles[:, 1::2])  # Swapped!

Why it fails: The paper uses sin for even dims, cos for odd. Swapping them changes how the encoding works.

Correct:

encoding[:, 0::2] = np.sin(angles[:, 0::2])
encoding[:, 1::2] = np.cos(angles[:, 1::2])

Practice Exercise

Test what you’ve learned by building a metric that scores how well an encoding tells close and far positions apart.

Click to see the exercise and solution

**Task:** Write `position_distinguishability(encoding_matrix)` that returns the average cosine similarity for adjacent positions (distance = 1) minus the average for distant positions (distance >= seq_len//2). Higher scores mean better separation.

def position_distinguishability(enc):
    """Measure how well an encoding separates close vs distant positions."""
    norms = np.linalg.norm(enc, axis=1, keepdims=True)
    normalized = enc / (norms + 1e-10)
    sim = normalized @ normalized.T

    seq_len = enc.shape[0]
    adjacent = np.mean([sim[i, i+1] for i in range(seq_len - 1)])

    half = seq_len // 2
    distant = []
    for i in range(seq_len):
        for j in range(seq_len):
            if abs(i - j) >= half:
                distant.append(sim[i, j])
    distant_avg = np.mean(distant)

    return adjacent - distant_avg

pe_test = sinusoidal_encoding(50, 64)
score = position_distinguishability(pe_test)
print(f"Sinusoidal distinguishability: {score:.4f}")

A positive score means nearby positions are more alike than distant ones. Try different `d_model` values and see how the dimension count changes the score.

Summary

Positional embeddings fix a core problem: self-attention is order-blind. Without position info, “dog bites man” and “man bites dog” look the same.

Three schemes have led since 2017:

Sinusoidal encoding makes a sine/cosine wave lookup table. Zero weights. Good as a baseline. Weak spot: it locks in where each token sits, which hurts on longer runs.

RoPE spins Q and K vectors so the dot product picks up the gap between them. The top choice in Llama, Mistral, and Gemma. Start here for new projects.

ALiBi adds a distance penalty straight to the attention scores. Simplest to code. Best at handling longer runs. Zero extra cost beyond a matrix add.

All three have zero learnable weights. The code here is pure NumPy — port it to any framework by swapping the array calls.

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code from: Positional Embeddings in Python
# Requires: pip install numpy matplotlib
# Python 3.9+

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Sinusoidal Encoding ---
def sinusoidal_encoding(seq_len, d_model):
    position = np.arange(seq_len)[:, np.newaxis]
    dim_indices = np.arange(d_model)[np.newaxis, :]
    angles = position / np.power(10000, (2 * (dim_indices // 2)) / d_model)
    encoding = np.zeros_like(angles)
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# --- RoPE ---
def rope_frequencies(seq_len, d_model, base=10000):
    dim_pairs = np.arange(0, d_model, 2)
    freqs = 1.0 / np.power(base, dim_pairs / d_model)
    positions = np.arange(seq_len)
    angles = np.outer(positions, freqs)
    cos_vals = np.repeat(np.cos(angles), 2, axis=1)
    sin_vals = np.repeat(np.sin(angles), 2, axis=1)
    return cos_vals, sin_vals

def apply_rope(x, cos_vals, sin_vals):
    x_rotated = np.stack([-x[:, 1::2], x[:, 0::2]], axis=-1)
    x_rotated = x_rotated.reshape(x.shape)
    return x * cos_vals + x_rotated * sin_vals

# --- ALiBi ---
def alibi_slopes(num_heads):
    ratio = 2 ** (-8 / num_heads)
    slopes = np.array([ratio ** (i + 1) for i in range(num_heads)])
    return slopes

def alibi_bias(seq_len, num_heads):
    slopes = alibi_slopes(num_heads)
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    bias = -slopes[:, None, None] * distance[None, :, :]
    return bias

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

# --- Demo: permutation invariance ---
embeddings = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])
print("Original attention scores:")
print(embeddings @ embeddings.T)
shuffled = embeddings[[2, 0, 1]]
print("\nShuffled attention scores:")
print(shuffled @ shuffled.T)

# Sinusoidal heatmap
pe = sinusoidal_encoding(50, 64)
fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(pe, aspect='auto', cmap='RdBu', interpolation='nearest')
ax.set_xlabel('Embedding Dimension')
ax.set_ylabel('Position')
ax.set_title('Sinusoidal Positional Encoding')
plt.tight_layout()
plt.show()

# Cosine similarity
norms = np.linalg.norm(pe, axis=1, keepdims=True)
similarity = (pe / norms) @ (pe / norms).T
fig, ax = plt.subplots(figsize=(7, 6))
ax.imshow(similarity, cmap='viridis', interpolation='nearest')
ax.set_title('Cosine Similarity Between Sinusoidal Encodings')
plt.tight_layout()
plt.show()

# RoPE face-off
seq_len, d_model = 8, 16
cos_vals, sin_vals = rope_frequencies(seq_len, d_model)
q = np.random.randn(seq_len, d_model) * 0.5
k = np.random.randn(seq_len, d_model) * 0.5
q_rope = apply_rope(q, cos_vals, sin_vals)
k_rope = apply_rope(k, cos_vals, sin_vals)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].imshow(q @ k.T / np.sqrt(d_model), cmap='RdBu')
axes[0].set_title('No Positional Info')
axes[1].imshow(q_rope @ k_rope.T / np.sqrt(d_model), cmap='RdBu')
axes[1].set_title('With RoPE')
plt.tight_layout()
plt.show()

# ALiBi bias
bias = alibi_bias(20, 4)
slopes_vis = alibi_slopes(4)
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for h in range(4):
    axes[h].imshow(bias[h], cmap='RdBu', vmin=bias.min(), vmax=0)
    axes[h].set_title(f'Head {h} (m={slopes_vis[h]:.4f})')
plt.tight_layout()
plt.show()

# Side-by-side face-off
seq_len_cmp, d_model_cmp, num_heads_cmp = 12, 32, 4
np.random.seed(123)
q_cmp = np.random.randn(seq_len_cmp, d_model_cmp) * 0.5
k_cmp = np.random.randn(seq_len_cmp, d_model_cmp) * 0.5
raw = q_cmp @ k_cmp.T / np.sqrt(d_model_cmp)

pe_cmp = sinusoidal_encoding(seq_len_cmp, d_model_cmp)
attn_sin = softmax((q_cmp + pe_cmp) @ (k_cmp + pe_cmp).T / np.sqrt(d_model_cmp))
cos_cmp, sin_cmp = rope_frequencies(seq_len_cmp, d_model_cmp)
attn_rope = softmax(apply_rope(q_cmp, cos_cmp, sin_cmp) @ apply_rope(k_cmp, cos_cmp, sin_cmp).T / np.sqrt(d_model_cmp))
attn_alibi = softmax(raw + alibi_bias(seq_len_cmp, num_heads_cmp)[1])

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, title, mat in zip(axes, ['Sinusoidal', 'RoPE', 'ALiBi'],
                           [attn_sin, attn_rope, attn_alibi]):
    ax.imshow(mat, cmap='viridis')
    ax.set_title(title)
plt.tight_layout()
plt.show()

print("Script completed successfully.")

Frequently Asked Questions

Can I combine RoPE with sinusoidal encoding?

Don’t. Both encode position — mixing them sends mixed signals. RoPE already puts the gap info into the Q/K dot product. Adding sinusoidal vectors on top makes two rival signals. Pick one and stick with it.

Does ALiBi work with causal masking?

Yes. They do different jobs on the same score matrix. Causal masking sets future spots to $-\infty$. ALiBi takes away a gap-based penalty. Apply both: ALiBi bias first, then the causal mask. They don’t clash.
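Here's a minimal sketch of that ordering in NumPy, with the `alibi_bias` and `softmax` helpers repeated so the block runs on its own (the sizes are arbitrary demo values):

```python
import numpy as np

def alibi_slopes(num_heads):
    ratio = 2 ** (-8 / num_heads)
    return np.array([ratio ** (i + 1) for i in range(num_heads)])

def alibi_bias(seq_len, num_heads):
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    return -alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

seq_len, num_heads = 6, 4
np.random.seed(0)
scores = np.random.randn(seq_len, seq_len)           # stand-in for Q @ K.T / sqrt(d)

biased = scores + alibi_bias(seq_len, num_heads)[0]  # step 1: ALiBi bias (head 0)
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
biased[causal_mask] = -np.inf                        # step 2: causal mask
attn = softmax(biased)

# Future positions get exactly zero weight; each row still sums to 1.
print(np.allclose(attn[np.triu_indices(seq_len, k=1)], 0.0))  # True
print(np.allclose(attn.sum(axis=1), 1.0))                     # True
```

The order matters only for clarity, not correctness: the $-\infty$ entries drown out any finite bias, so the masked spots end up at zero weight either way.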

Why don’t modern models use learned positional embeddings?

Some do — BERT and GPT-2 use them. But learned embeddings have a fixed max length. Train with 512 spots and you simply don’t have a vector for spot 513. The three schemes in this post dodge this because they compute position from formulas.
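A toy illustration of the hard limit — the table size here is made up for the demo:

```python
import numpy as np

max_len, d_model = 512, 64
# A learned table has exactly one row per trained position (0-indexed: 0..511).
learned_table = np.random.randn(max_len, d_model)

print(learned_table[511].shape)  # last trained position: works fine
try:
    learned_table[512]           # first position beyond training length
except IndexError:
    print("No vector exists for position 512")
```

Formula-based schemes have no such table: hand them position 512 (or 5,000) and they just compute the answer.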

How does RoPE handle sequences longer than training length?

Out of the box, RoPE’s fast-spinning parts wrap around and hurt output quality. Methods like NTK-aware scaling and YaRN rescale RoPE’s frequency base, letting models reach roughly 2-4x their training length.
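As a rough sketch of the base-stretching idea (not the full YaRN recipe), here's one common “NTK-aware” adjustment — it raises the base by $s^{d/(d-2)}$, where $s$ is the length ratio, so slow frequencies stretch more than fast ones. The `rope_frequencies` helper is repeated so the block runs on its own:

```python
import numpy as np

def rope_frequencies(seq_len, d_model, base=10000):
    dim_pairs = np.arange(0, d_model, 2)
    freqs = 1.0 / np.power(base, dim_pairs / d_model)
    angles = np.outer(np.arange(seq_len), freqs)
    return np.repeat(np.cos(angles), 2, axis=1), np.repeat(np.sin(angles), 2, axis=1)

def ntk_scaled_base(base, scale, d_model):
    # NTK-aware scaling: raise the base so low frequencies stretch
    # more than high ones. scale = target_len / train_len.
    return base * scale ** (d_model / (d_model - 2))

d_model, train_len, target_len = 64, 512, 2048
scale = target_len / train_len
new_base = ntk_scaled_base(10000, scale, d_model)
cos_vals, sin_vals = rope_frequencies(target_len, d_model, base=new_base)
print(f"scaled base: {new_base:.0f}")
```

With the bigger base, the tables now cover 2048 positions without the slowest waves running past the range they were trained on.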

Which scheme is cheapest to compute?

ALiBi — just a matrix add per layer. Sinusoidal is a one-time add before the first layer. RoPE costs the most because it spins Q and K at every layer. In practice, all three are tiny next to the attention step itself.

References

  1. Vaswani, A. et al. — “Attention Is All You Need.” NeurIPS 2017. arXiv:1706.03762
  2. Su, J. et al. — “RoFormer: Enhanced Transformer with Rotary Position Embedding.” 2021. arXiv:2104.09864
  3. Press, O., Smith, N.A., Lewis, M. — “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.” ICLR 2022. arXiv:2108.12409
  4. Touvron, H. et al. — “Llama 2: Open Foundation and Fine-Tuned Chat Models.” 2023. arXiv:2307.09288
  5. Jiang, A.Q. et al. — “Mistral 7B.” 2023. arXiv:2310.06825
  6. EleutherAI Blog — “Rotary Embeddings: A Relative Revolution.”
  7. NumPy Documentation — numpy.outer, numpy.power, numpy.repeat.
  8. Peng, B. et al. — “YaRN: Efficient Context Window Extension of Large Language Models.” 2023. arXiv:2309.00071