
Build a BPE Tokenizer from Scratch in Python
Learn how BPE tokenization works and build a byte pair encoding tokenizer step by step in pure Python — the same algorithm behind GPT-4 and Llama.
Every large language model starts by breaking your text into tokens. Here’s how to build the same kind of tokenizer that powers GPT — from raw bytes to a working encoder — in pure Python.
Every time you type a prompt into ChatGPT, the model does not read your words right away. First, it chops your text into small pieces called tokens. This step — called tokenization — happens before anything else in the pipeline.
If the tokenizer does a poor job, the model sees noise. If it does a good job, the model reads clean, useful chunks. In this guide, you will build a Byte Pair Encoding (BPE) tokenizer from scratch. By the end, you will know exactly how GPT-2, GPT-4, and Llama split text into tokens.
What Is Tokenization and Why Does It Matter?
Tokenization turns raw text into a list of numbers. Neural networks can’t work with letters or words on their own. They need integers — IDs that point to rows in a lookup table.
Think of it like a phone book. Each token gets a unique ID. The model uses that ID to grab a vector from its lookup table. That vector is what the model “reads.”
Here is a quick example. The phrase “Hello world” might turn into two token IDs:
python
text = "Hello world"
# A simple tokenizer might produce:
token_ids = [15496, 995]
# Each ID maps to a learned vector in the model's embedding table
Why should you care? Three reasons:
- Context window limits count tokens, not words. A 4096-token window might fit 3000 words — or 2000. It depends on the tokenizer.
- API costs are billed per token. Fewer tokens per message means a lower bill.
- Model quality rides on how well the tokenizer splits text. Bad splits force the model to waste effort gluing pieces back together.
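The billing point is easy to sketch with simple arithmetic. The numbers below (the tokens-per-word ratio and the per-token price) are illustrative assumptions, not real API rates:

```python
def estimate_cost(num_words, tokens_per_word=1.33, price_per_1k_tokens=0.01):
    """Back-of-envelope token and cost estimate. Both ratios are assumptions."""
    tokens = num_words * tokens_per_word
    cost = tokens * price_per_1k_tokens / 1000
    return tokens, cost

tokens, cost = estimate_cost(3000)
print(f"~{tokens:.0f} tokens, ~${cost:.4f}")  # ~3990 tokens, ~$0.0399
```

A tokenizer that squeezes the same text into fewer tokens lowers both the bill and the share of the context window you burn.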
Key Insight: Tokenization is not a step you can ignore — it shapes what the model sees, how much text fits in its window, and how much each API call costs.
Why Splitting on Spaces Doesn’t Work
The simplest tokenizer would just split on spaces. Let’s try it:
text = "I don't like low-quality tokenizers!"
naive_tokens = text.split(" ")
print(naive_tokens)
Output:
python
["I", "don't", "like", "low-quality", "tokenizers!"]
See the issues? “don’t” stays as one token, even though it holds two words (“do” and “n’t”). The exclamation mark sticks to “tokenizers!”. And “low-quality” stays as one big lump.
This is word-level tokenization. Its fatal flaw: every unique word needs its own slot in the vocab. English has over 170,000 words in use. Add typos, slang, jargon, and more languages — the vocab blows up to millions.
A vocab that large means a huge lookup table. That eats memory and slows training. Worse, any word not in the vocab becomes “unknown.” The model can’t read it at all.
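A quick count makes the contrast concrete. A word-level vocab grows with every new word it meets, while the set of distinct bytes can never exceed 256:

```python
text = "the cat sat on the mat the dog sat on the log"

word_vocab = set(text.split(" "))       # one slot per unique word
byte_vocab = set(text.encode("utf-8"))  # one slot per unique byte

print(f"Unique words: {len(word_vocab)}")  # 7, and this keeps growing with more text
print(f"Unique bytes: {len(byte_vocab)}")  # 13, capped at 256 no matter what
```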
Why Character-Level Tokenization Fails Too
The other extreme: split into single characters.
text = "Hello"
char_tokens = list(text)
print(char_tokens)
print(f"Number of tokens: {len(char_tokens)}")
Output:
python
['H', 'e', 'l', 'l', 'o']
Number of tokens: 5
The vocab is tiny — at the byte level, just 256 values cover every script in every language. No word is ever “unknown.” Sounds great, right?
But there’s a catch. Sequences get very long. The word “tokenization” becomes 12 tokens. A 500-word post turns into thousands of tokens. The model has to learn that “t”, “h”, “e” spell “the.” That’s like asking it to learn basic spelling.
This wastes the model’s brain power on putting words back together. It should be grasping meaning instead. In practice, models that use single characters do worse on most NLP tasks.
The Subword Fix — Enter BPE
Subword tokenization sits right between word-level and character-level. It keeps common words whole and splits rare words into useful pieces.
Take “tokenization.” It might split into [“token”, “ization”]. The model knows both parts. It can also handle “tokenize,” “tokenizer,” and “tokenized” by mixing “token” with other endings.
Byte Pair Encoding (BPE) is the most popular subword method. OpenAI uses it for GPT-2, GPT-3, and GPT-4. Meta uses it for Llama. It started life as a data compression technique, introduced by Philip Gage in 1994.
The core idea is simple and elegant. Start with single bytes. Then, over and over, find the most common pair of side-by-side tokens and merge them into one new token. Do this enough times and you build up a vocab of common word parts, full words, and even short phrases.

Key Insight: BPE hits the sweet spot: a small vocab (32K–128K tokens) that can handle any text, keeps common words as single tokens, and breaks rare words into known pieces.
How the BPE Algorithm Works — Intuition First
Before we write any code, let’s walk through it by hand. Say our full training text is:
python
"low low low low low lower lower newest newest newest newest newest newest widest"
Step 1: Start with characters. Split every word into single letters:
| Word | Count | Split |
|---|---|---|
| low | 5 | l o w |
| lower | 2 | l o w e r |
| newest | 6 | n e w e s t |
| widest | 1 | w i d e s t |
Step 2: Count pairs. Look at every pair of neighbors across the whole text:
| Pair | Count |
|---|---|
| (l, o) | 7 |
| (o, w) | 7 |
| (w, e) | 8 |
| (e, r) | 2 |
| (n, e) | 6 |
| (e, s) | 7 |
| (s, t) | 7 |
| (e, w) | 6 |
| (w, i) | 1 |
| (i, d) | 1 |
| (d, e) | 1 |
Step 3: Merge the top pair. The pair (w, e) shows up 8 times — more than any other. Merge it into “we”:
| Word | Count | Split |
|---|---|---|
| low | 5 | l o w |
| lower | 2 | l o we r |
| newest | 6 | n e we s t |
| widest | 1 | w i d e s t |
Note that “widest” did NOT get merged. Its “w” and “e” are not neighbors — “i” and “d” sit between them.
Step 4: Repeat. Count pairs again with the new tokens. Merge the next top pair. Keep going until you hit your target vocab size.
After a few rounds, you end up with tokens like “low”, “est”, “new”, “er” — useful word parts that compress the text well.
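The walkthrough above is easy to mirror in code. This sketch operates on a word-frequency table rather than the flat byte stream used later in this guide, so treat it as an illustration of the counting and merging steps, not the final implementation:

```python
from collections import Counter

# Word frequencies from the walkthrough, each word split into symbols
words = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 1,
}

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts.most_common(1)[0]

def merge_pair(words, pair):
    """Fuse every adjacent occurrence of pair into a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair, count = most_frequent_pair(words)
print(pair, count)              # ('w', 'e') 8, matching the table above
print(merge_pair(words, pair))  # "widest" is untouched: its w and e are not adjacent
```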

Step 1: Build the Base Vocab from Bytes
Time to code. We’ll build a BPE tokenizer in pure Python — no outside libraries needed.
The base of byte-level BPE is simple. Turn text into raw bytes. Each byte is a number from 0 to 255. This gives us 256 base tokens that can stand for any text in any language. No “unknown” tokens — ever.
text = "the cat sat on the mat"
tokens = list(text.encode("utf-8"))
print(f"Text: '{text}'")
print(f"Bytes: {tokens}")
print(f"Number of tokens: {len(tokens)}")
Output:
python
Text: 'the cat sat on the mat'
Bytes: [116, 104, 101, 32, 99, 97, 116, 32, 115, 97, 116, 32, 111, 110, 32, 116, 104, 101, 32, 109, 97, 116]
Number of tokens: 22
Each number is a byte. For ASCII text, these map straight to letters: 116 is “t”, 104 is “h”, 101 is “e”, and 32 is a space.
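You can check that mapping directly with Python’s chr():

```python
ids = [116, 104, 101, 32]
print([chr(b) for b in ids])  # ['t', 'h', 'e', ' ']
```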
For non-ASCII text like emojis or Chinese, one letter might span many bytes. Byte-level BPE handles this with no fuss.
# Even emojis and non-English text work
emoji_text = "Hello 🌍"
emoji_bytes = list(emoji_text.encode("utf-8"))
print(f"'{emoji_text}' → {emoji_bytes}")
print(f"The emoji 🌍 alone is {list('🌍'.encode('utf-8'))} — four bytes!")
Output:
python
'Hello 🌍' → [72, 101, 108, 108, 111, 32, 240, 159, 140, 141]
The emoji 🌍 alone is [240, 159, 140, 141] — four bytes!
Tip: UTF-8 uses 1 byte for ASCII, 2 bytes for accented letters, 3 bytes for CJK scripts, and 4 bytes for emojis. Knowing this helps you guess how many tokens a given language will use.
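A quick check confirms those widths:

```python
for sample in ["A", "é", "中", "🌍"]:
    n = len(sample.encode("utf-8"))
    print(f"{sample!r} → {n} byte(s)")  # 1, 2, 3, and 4 bytes respectively
```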
Step 2: Count Token Pairs
Next in BPE training: count how often each pair of neighbors shows up. This tells us which pair to merge first.
Here’s the function. It takes a list of token IDs and returns a dict that maps each pair to its count:
python
def get_pair_counts(token_ids):
    """Count the frequency of each adjacent pair of tokens."""
    counts = {}
    for i in range(len(token_ids) - 1):
        pair = (token_ids[i], token_ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts
Let’s see which pairs show up in our example:
def get_pair_counts(token_ids):
    """Count the frequency of each adjacent pair of tokens."""
    counts = {}
    for i in range(len(token_ids) - 1):
        pair = (token_ids[i], token_ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts
text = "the cat sat on the mat"
tokens = list(text.encode("utf-8"))
pair_counts = get_pair_counts(tokens)
# Show the top 5 most frequent pairs
sorted_pairs = sorted(pair_counts.items(), key=lambda x: x[1], reverse=True)
for pair, count in sorted_pairs[:5]:
    char_pair = (chr(pair[0]), chr(pair[1]))
    print(f"Pair {char_pair}: {count} times")
Output:
python
Pair ('a', 't'): 3 times
Pair ('t', 'h'): 2 times
Pair ('h', 'e'): 2 times
Pair ('e', ' '): 2 times
Pair ('t', ' '): 2 times
The pair (‘a’, ‘t’) shows up 3 times — in “cat”, “sat”, and “mat.” That makes it our first merge pick.
Step 3: Merge the Top Pair
When we merge a pair, we scan through the token list and swap every match with one new token ID. New IDs start at 256, since 0–255 are taken by the base bytes.
python
def merge(token_ids, pair, new_id):
    """Replace every occurrence of pair in token_ids with new_id."""
    new_tokens = []
    i = 0
    while i < len(token_ids):
        # Check if this position matches the pair
        if i < len(token_ids) - 1 and token_ids[i] == pair[0] and token_ids[i + 1] == pair[1]:
            new_tokens.append(new_id)
            i += 2  # skip both tokens in the pair
        else:
            new_tokens.append(token_ids[i])
            i += 1
    return new_tokens
Let’s merge the top pair and watch the list shrink:
def get_pair_counts(token_ids):
    counts = {}
    for i in range(len(token_ids) - 1):
        pair = (token_ids[i], token_ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(token_ids, pair, new_id):
    new_tokens = []
    i = 0
    while i < len(token_ids):
        if i < len(token_ids) - 1 and token_ids[i] == pair[0] and token_ids[i + 1] == pair[1]:
            new_tokens.append(new_id)
            i += 2
        else:
            new_tokens.append(token_ids[i])
            i += 1
    return new_tokens

text = "the cat sat on the mat"
tokens = list(text.encode("utf-8"))
print(f"Before merge: {len(tokens)} tokens")
print(f"Tokens: {tokens}")

# Find the most frequent pair
pair_counts = get_pair_counts(tokens)
best_pair = max(pair_counts, key=pair_counts.get)
print(f"\nMost frequent pair: ({chr(best_pair[0])}, {chr(best_pair[1])}) — {pair_counts[best_pair]} times")

# Merge it with new ID 256
tokens = merge(tokens, best_pair, 256)
print(f"\nAfter merge: {len(tokens)} tokens")
print(f"Tokens: {tokens}")
Output:
python
Before merge: 22 tokens
Tokens: [116, 104, 101, 32, 99, 97, 116, 32, 115, 97, 116, 32, 111, 110, 32, 116, 104, 101, 32, 109, 97, 116]
Most frequent pair: (a, t) — 3 times
After merge: 19 tokens
Tokens: [116, 104, 101, 32, 99, 256, 32, 115, 256, 32, 111, 110, 32, 116, 104, 101, 32, 109, 256]
We went from 22 tokens down to 19. Every "at" is now a single token (ID 256). This is BPE's compression in action — each merge makes the list shorter.
Step 4: Train the Full Tokenizer
Training just means running the count-and-merge loop over and over. Each pass adds one new token to the vocab. Here's the full training function:
python
def train_bpe(text, num_merges):
    """Train a BPE tokenizer on the given text.

    Returns the learned merge rules and the final token list.
    """
    tokens = list(text.encode("utf-8"))
    merges = {}  # (pair) -> new_token_id
    for i in range(num_merges):
        pair_counts = get_pair_counts(tokens)
        if not pair_counts:
            break  # nothing left to merge
        best_pair = max(pair_counts, key=pair_counts.get)
        new_id = 256 + i
        tokens = merge(tokens, best_pair, new_id)
        merges[best_pair] = new_id
    return merges, tokens
Let's train on a longer chunk of text and do 20 merges. We'll use a short write-up about machine learning:
def get_pair_counts(token_ids):
    counts = {}
    for i in range(len(token_ids) - 1):
        pair = (token_ids[i], token_ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(token_ids, pair, new_id):
    new_tokens = []
    i = 0
    while i < len(token_ids):
        if i < len(token_ids) - 1 and token_ids[i] == pair[0] and token_ids[i + 1] == pair[1]:
            new_tokens.append(new_id)
            i += 2
        else:
            new_tokens.append(token_ids[i])
            i += 1
    return new_tokens

def train_bpe(text, num_merges):
    tokens = list(text.encode("utf-8"))
    merges = {}
    for i in range(num_merges):
        pair_counts = get_pair_counts(tokens)
        if not pair_counts:
            break
        best_pair = max(pair_counts, key=pair_counts.get)
        new_id = 256 + i
        tokens = merge(tokens, best_pair, new_id)
        merges[best_pair] = new_id
    return merges, tokens

training_text = (
    "Machine learning is a subset of artificial intelligence. "
    "Machine learning algorithms learn from data. "
    "Deep learning is a subset of machine learning. "
    "Deep learning uses neural networks with many layers. "
    "Neural networks learn representations from data."
)

merges, final_tokens = train_bpe(training_text, num_merges=20)

print("Learned merge rules:")
for i, (pair, new_id) in enumerate(merges.items()):
    char_a = chr(pair[0]) if pair[0] < 256 else f"[{pair[0]}]"
    char_b = chr(pair[1]) if pair[1] < 256 else f"[{pair[1]}]"
    print(f"  {i+1}. Merge ({char_a}, {char_b}) → token {new_id}")

original_length = len(training_text.encode("utf-8"))
print(f"\nOriginal tokens: {original_length}")
print(f"After 20 merges: {len(final_tokens)}")
print(f"Compression ratio: {original_length / len(final_tokens):.2f}x")
Output:
python
Learned merge rules:
1. Merge (i, n) → token 256
2. Merge (l, e) → token 257
3. Merge (a, r) → token 258
4. Merge ( , l) → token 259
5. Merge (e, a) → token 260
6. Merge (r, n) → token 261
7. Merge ([259], e) → token 262
8. Merge ([256], g) → token 263
9. Merge ([262], [258]) → token 264
10. Merge ([264], n) → token 265
11. Merge ([265], [263]) → token 266
12. Merge (e, .) → token 267
13. Merge ( , a) → token 268
14. Merge (e, p) → token 269
15. Merge (D, e) → token 270
16. Merge ([269],  ) → token 271
17. Merge ([270], [271]) → token 272
18. Merge ([266],  ) → token 273
19. Merge ( , d) → token 274
20. Merge (t, [268]) → token 275
Original tokens: 243
After 20 merges: 187
Compression ratio: 1.30x
Look at how the merges grow. Early ones grab common letter pairs like "in", "le", and "ar." Later ones build longer chunks like "learning" and "Deep." BPE finds useful word parts from raw data alone — no grammar rules needed.
Key Insight: BPE learns word parts purely from data. Common words shrink to single tokens. Rare words break into known pieces. No hand-made rules or word lists required.
typescript
{
type: 'exercise',
id: 'bpe-train-ex1',
title: 'Exercise 1: Train BPE and Predict Merges',
difficulty: 'beginner',
exerciseType: 'write',
instructions: 'Given the text "aaabdaaabac", train BPE with 3 merges. Print each merge rule as "Merge X Y -> Z" where X and Y are the token values being merged and Z is the new token ID. Then print the final token count.',
starterCode: 'def get_pair_counts(token_ids):\n counts = {}\n for i in range(len(token_ids) - 1):\n pair = (token_ids[i], token_ids[i + 1])\n counts[pair] = counts.get(pair, 0) + 1\n return counts\n\ndef merge(token_ids, pair, new_id):\n new_tokens = []\n i = 0\n while i < len(token_ids):\n if i < len(token_ids) - 1 and token_ids[i] == pair[0] and token_ids[i + 1] == pair[1]:\n new_tokens.append(new_id)\n i += 2\n else:\n new_tokens.append(token_ids[i])\n i += 1\n return new_tokens\n\ntext = "aaabdaaabac"\ntokens = list(text.encode("utf-8"))\n\n# Perform 3 merges\nfor i in range(3):\n counts = get_pair_counts(tokens)\n best_pair = max(counts, key=counts.get)\n new_id = 256 + i\n tokens = merge(tokens, best_pair, new_id)\n # Print: Merge X Y -> Z\n # YOUR CODE HERE\n\nprint(f"Final token count: {len(tokens)}")',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'Merge 97 97 -> 256', description: 'First merge should be (a, a) -> 256' },
{ id: 'tc2', input: '', expectedOutput: 'Final token count: 5', description: 'After 3 merges, 5 tokens remain' }
],
hints: [
'Use an f-string: print(f"Merge {best_pair[0]} {best_pair[1]} -> {new_id}")',
'Full line: print(f"Merge {best_pair[0]} {best_pair[1]} -> {new_id}")'
],
solution: 'def get_pair_counts(token_ids):\n counts = {}\n for i in range(len(token_ids) - 1):\n pair = (token_ids[i], token_ids[i + 1])\n counts[pair] = counts.get(pair, 0) + 1\n return counts\n\ndef merge(token_ids, pair, new_id):\n new_tokens = []\n i = 0\n while i < len(token_ids):\n if i < len(token_ids) - 1 and token_ids[i] == pair[0] and token_ids[i + 1] == pair[1]:\n new_tokens.append(new_id)\n i += 2\n else:\n new_tokens.append(token_ids[i])\n i += 1\n return new_tokens\n\ntext = "aaabdaaabac"\ntokens = list(text.encode("utf-8"))\n\nfor i in range(3):\n counts = get_pair_counts(tokens)\n best_pair = max(counts, key=counts.get)\n new_id = 256 + i\n tokens = merge(tokens, best_pair, new_id)\n print(f"Merge {best_pair[0]} {best_pair[1]} -> {new_id}")\n\nprint(f"Final token count: {len(tokens)}")',
solutionExplanation: 'The first merge combines (97, 97) which is ("a", "a") — the most frequent pair. The second merge combines the new token 256 with 97 ("a"), creating "aaa". The third merge captures another common pair, reducing the sequence to just 5 tokens.',
xpReward: 15,
}
Step 5: Encode Text with Your Tokenizer
Training gave us a set of merge rules. Now we need an encode function that takes new text and turns it into token IDs. Here's the key twist: during encoding, we apply merges in the order they were learned — not by how often pairs show up in the new text.
Why does the order matter? The first merge rule captures the most common global pattern. If you apply merges in a different order, you get a different — and often worse — result.
python
def encode(text, merges):
    """Encode text into token IDs using learned BPE merge rules."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        # Count all pairs in current tokens
        pair_counts = get_pair_counts(tokens)
        # Find the pair with the lowest merge index (earliest learned)
        pair = min(
            pair_counts,
            key=lambda p: merges.get(p, float("inf"))
        )
        # If this pair was never learned, we're done
        if pair not in merges:
            break
        new_id = merges[pair]
        tokens = merge(tokens, pair, new_id)
    return tokens
The trick here is min with merges.get(p, float("inf")). It picks the pair that was learned first during training. Pairs that aren't in the merge table get a score of infinity, so they stay untouched.
Let's try encoding text the tokenizer has not seen before:
python
test_text = "learning is fun"
encoded = encode(test_text, merges)
print(f"Text: '{test_text}'")
print(f"Encoded: {encoded}")
print(f"Number of tokens: {len(encoded)}")
print(f"Original bytes: {len(test_text.encode('utf-8'))}")
Output:
python
Text: 'learning is fun'
Encoded: [257, 258, 110, 263, 32, 105, 115, 32, 102, 117, 110]
Number of tokens: 11
Original bytes: 15
The encoder shrank "learning" using merge rules from training. The word "fun" stays as raw bytes because its byte pairs were never merged.
Step 6: Decode Tokens Back to Text
Decoding goes the other way. We need a vocab that maps each token ID back to its bytes. IDs 0–255 each map to one byte. Each merge rule adds a new entry that glues two existing byte strings end to end.
python
def build_vocab(merges):
    """Build vocabulary mapping token IDs to byte sequences."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab

def decode(token_ids, vocab):
    """Decode token IDs back into a string."""
    byte_sequence = b"".join(vocab[t] for t in token_ids)
    return byte_sequence.decode("utf-8", errors="replace")
The build_vocab function is neat. Each new token is just the two parent tokens’ bytes joined together. In the merges we trained above, token 256 holds “in”, and token 263 chains it with “g” to form “ing”.
Let's prove that encoding and decoding round-trip cleanly:
python
vocab = build_vocab(merges)
# Encode then decode — should get back the original text
test_text = "learning is fun"
encoded = encode(test_text, merges)
decoded = decode(encoded, vocab)
print(f"Original: '{test_text}'")
print(f"Encoded: {encoded}")
print(f"Decoded: '{decoded}'")
print(f"Match: {test_text == decoded}")
Output:
python
Original: 'learning is fun'
Encoded: [257, 258, 110, 263, 32, 105, 115, 32, 102, 117, 110]
Decoded: 'learning is fun'
Match: True
Encode, then decode — you always get the same text back. This is a core trait of BPE: it is lossless. Nothing is lost in the round trip.
Note: The `errors="replace"` flag in `.decode()` is a safety net. If token IDs form broken UTF-8 (say, from hand-crafted IDs), Python swaps bad bytes with the "�" symbol instead of crashing.
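Here is that safety net in action on a deliberately broken byte sequence:

```python
broken = bytes([104, 105, 255])  # 'h', 'i', then 0xFF, which is never valid UTF-8
print(broken.decode("utf-8", errors="replace"))  # hi�
```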
Putting It All Together — The BPETokenizer Class
Let's wrap all the pieces into one clean Python class. It handles training, encoding, decoding, and peeking at the vocab:
python
class BPETokenizer:
    """A minimal Byte Pair Encoding tokenizer built from scratch."""

    def __init__(self):
        self.merges = {}  # (pair) -> new_token_id
        self.vocab = {}   # token_id -> bytes

    def train(self, text, vocab_size):
        """Train the tokenizer on text until vocab reaches vocab_size."""
        assert vocab_size >= 256, "Vocab size must be at least 256 (base byte tokens)"
        num_merges = vocab_size - 256
        tokens = list(text.encode("utf-8"))
        for i in range(num_merges):
            counts = get_pair_counts(tokens)
            if not counts:
                print(f"No more pairs to merge after {i} merges.")
                break
            best_pair = max(counts, key=counts.get)
            new_id = 256 + i
            tokens = merge(tokens, best_pair, new_id)
            self.merges[best_pair] = new_id
        self.vocab = build_vocab(self.merges)
        print(f"Trained! Vocabulary size: {len(self.vocab)}")

    def encode(self, text):
        """Encode text to token IDs."""
        return encode(text, self.merges)

    def decode(self, token_ids):
        """Decode token IDs back to text."""
        return decode(token_ids, self.vocab)

    def get_token_text(self, token_id):
        """Get the text representation of a token ID."""
        if token_id in self.vocab:
            return self.vocab[token_id].decode("utf-8", errors="replace")
        return f"[UNK:{token_id}]"
Now let's train it on a larger corpus and test it out:
python
corpus = """
Natural language processing is a field of artificial intelligence.
It gives machines the ability to read and understand human language.
Tokenization is the first step in any NLP pipeline.
A tokenizer splits text into smaller units called tokens.
These tokens are then converted into numerical IDs.
The model uses these IDs to look up embeddings.
Byte pair encoding is a popular tokenization algorithm.
It starts with individual bytes and merges frequent pairs.
BPE is used by GPT-2, GPT-3, GPT-4, and Llama models.
"""
tokenizer = BPETokenizer()
tokenizer.train(corpus, vocab_size=300)
Output:
python
Trained! Vocabulary size: 300
python
# Test encoding and decoding
test = "Tokenization is important for NLP"
token_ids = tokenizer.encode(test)
decoded = tokenizer.decode(token_ids)
print(f"Input: '{test}'")
print(f"Tokens: {token_ids}")
print(f"Decoded: '{decoded}'")
print(f"\nToken breakdown:")
for tid in token_ids:
    print(f" {tid} → '{tokenizer.get_token_text(tid)}'")
Output:
python
Input: 'Tokenization is important for NLP'
Tokens: [84, 111, 107, 101, 110, 105, 122, 256, 105, 111, 110, 32, 105, 115, 32, 105, 109, 112, 111, 114, 116, 97, 110, 116, 32, 102, 111, 114, 32, 78, 76, 80]
Decoded: 'Tokenization is important for NLP'
Token breakdown:
84 → 'T'
111 → 'o'
107 → 'k'
101 → 'e'
110 → 'n'
105 → 'i'
122 → 'z'
256 → 'at'
105 → 'i'
111 → 'o'
110 → 'n'
32 → ' '
105 → 'i'
115 → 's'
...
With a small corpus and only 44 merges, most words still break down into lone bytes. Real models like GPT-2 train on billions of words with 50,000 merges. That's enough to turn whole words and short phrases into single tokens.
typescript
{
type: 'exercise',
id: 'bpe-class-ex2',
title: 'Exercise 2: Analyze Compression Ratio',
difficulty: 'beginner',
exerciseType: 'write',
instructions: 'Create a BPETokenizer, train it on the provided corpus with vocab_size=350, then compute the compression ratio (original bytes / encoded tokens) for the test string "Natural language processing is a field of intelligence." Print the compression ratio rounded to 2 decimal places.',
starterCode: '# The helper functions (get_pair_counts, merge, build_vocab, encode, decode) \n# and BPETokenizer class are already defined above.\n\ncorpus = """\nNatural language processing is a field of artificial intelligence.\nIt gives machines the ability to read and understand human language.\nTokenization is the first step in any NLP pipeline.\nA tokenizer splits text into smaller units called tokens.\nThese tokens are then converted into numerical IDs.\nThe model uses these IDs to look up embeddings.\nByte pair encoding is a popular tokenization algorithm.\nIt starts with individual bytes and merges frequent pairs.\nBPE is used by GPT models and Llama models.\n"""\n\ntest_string = "Natural language processing is a field of intelligence."\n\n# 1. Create tokenizer\n# 2. Train it\n# 3. Encode test_string\n# 4. Compute and print compression ratio\n',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'Compression ratio:', description: 'Should print compression ratio' }
],
hints: [
'Compression ratio = len(test_string.encode("utf-8")) / len(tokenizer.encode(test_string))',
'Full answer: tokenizer = BPETokenizer()\ntokenizer.train(corpus, vocab_size=350)\nencoded = tokenizer.encode(test_string)\nratio = len(test_string.encode("utf-8")) / len(encoded)\nprint(f"Compression ratio: {ratio:.2f}")'
],
solution: 'tokenizer = BPETokenizer()\ntokenizer.train(corpus, vocab_size=350)\nencoded = tokenizer.encode(test_string)\noriginal_bytes = len(test_string.encode("utf-8"))\nratio = original_bytes / len(encoded)\nprint(f"Compression ratio: {ratio:.2f}")',
solutionExplanation: 'The compression ratio tells you how many bytes each token stands for on average. A ratio of 2.0 means each token covers 2 bytes. Real tokenizers like GPT-4 hit ratios of 3-4x on English text.',
xpReward: 15,
}
Byte-Level BPE — What GPT-2 and GPT-4 Actually Use
The tokenizer we built is already byte-level BPE. But real-world tokenizers like GPT-2 add a few more tricks. Let's look at the big ones.
Regex-based pre-tokenization. GPT-2 won't merge across word edges. Before BPE kicks in, a regex splits text into words, numbers, and punctuation. This stops merges like "the" + " cat" from forming cross-word tokens.
The GPT-2 regex pattern looks like this:
python
import re

GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+(?!\S)|\s+"""
text = "Hello, I don't like 123 things!"
pre_tokens = re.findall(GPT2_PATTERN, text)
print(pre_tokens)
Output:
python
['Hello', ',', ' I', ' don', "'t", ' like', ' 123', ' things', '!']
See how "don't" splits into " don" and "'t"? That mirrors how English contractions are built. Spaces stick to the start of the next word, not the end of the last one. This is a design choice that makes tokens cleaner.
GPT-4 uses an even sharper pattern:
python
GPT4_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
# Note: This pattern uses Unicode properties (\p{L}, \p{N})
# It requires the `regex` library, not the standard `re`
Special tokens. Real tokenizers set aside IDs for markers like <|endoftext|>, <|im_start|>, and <|im_end|>. These mark where one document ends and the next begins, or where a user message stops and the model's reply starts.
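One minimal way to bolt special tokens onto a tokenizer like ours (a sketch of the idea, not how production libraries implement it): reserve an ID above the learned vocab, and split on the marker before normal encoding runs. The helper name and the ID 300 here are made up for illustration:

```python
import re

SPECIAL_TOKENS = {"<|endoftext|>": 300}  # illustrative ID, just above a 0-299 vocab

def encode_with_specials(text, encode_fn, specials=SPECIAL_TOKENS):
    """Split text on special markers; encode ordinary spans with encode_fn."""
    pattern = "(" + "|".join(re.escape(s) for s in specials) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in specials:
            ids.append(specials[part])
        elif part:  # skip empty strings produced by the split
            ids.extend(encode_fn(part))
    return ids

# Stand-in encoder: raw UTF-8 bytes in place of a trained BPE encode()
ids = encode_with_specials("hi<|endoftext|>yo", lambda t: list(t.encode("utf-8")))
print(ids)  # [104, 105, 300, 121, 111]
```

Splitting first guarantees the marker is never broken apart or merged with surrounding text by the BPE rules.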

Warning: Each model has its own tokenizer. The same text might be 100 tokens in GPT-2 but only 80 in GPT-4 or 75 in Llama 3. Always check token counts with the right tokenizer before you budget costs or context space.
BPE vs WordPiece vs Unigram — Which Models Use What?
BPE is the most common, but it's not the only option. Here's how the three main methods stack up:
| Feature | BPE | WordPiece | Unigram |
|---|---|---|---|
| Used by | GPT-2/3/4, Llama, Mistral | BERT, DistilBERT, Electra | T5, mBART, ALBERT |
| How it picks merges | Most frequent pair | Pair that boosts likelihood most | Drops tokens that hurt least |
| Direction | Bottom-up (merge) | Bottom-up (merge) | Top-down (prune) |
| Same output each time? | Yes | Yes | Not always (can sample splits) |
| Handles unknown text? | Yes (byte-level) | Yes (## prefix markers) | Yes (keeps a fallback) |
| Training speed | Fast | Fast | Slower (starts with big vocab) |
WordPiece is what BERT uses. It's close to BPE, but it picks merges in a different way. Instead of the most frequent pair, it picks the pair whose merge lifts the training data's likelihood the most. The scoring formula is:
\[\text{score}(a, b) = \frac{\text{freq}(ab)}{\text{freq}(a) \times \text{freq}(b)}\]
Where:
- \(\text{freq}(ab)\) = how often tokens a and b sit side by side
- \(\text{freq}(a)\) and \(\text{freq}(b)\) = how often each token shows up alone
This favors rare tokens that almost always appear as a pair, over common tokens that just happen to be neighbors.
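Plugging toy counts into that formula shows the effect. All frequencies here are made-up numbers for illustration:

```python
def wordpiece_score(freq_ab, freq_a, freq_b):
    """WordPiece merge score: pair frequency normalized by the parts' frequencies."""
    return freq_ab / (freq_a * freq_b)

# A frequent pair of very common letters scores low...
common = wordpiece_score(freq_ab=500, freq_a=9000, freq_b=6000)
# ...while rare parts that almost always co-occur score high
rare = wordpiece_score(freq_ab=40, freq_a=50, freq_b=45)
print(f"{common:.2e} vs {rare:.2e}")
```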
Unigram (used by T5) goes the other way. It starts with a huge vocab and trims tokens that add the least value. It's also probabilistic by nature: the same word can be split in more than one way, each split with a score. By default, Unigram picks the split with the highest score, though training can sample alternatives.
SentencePiece is not a new method — it's a library that runs BPE or Unigram under the hood. What makes it special: it treats text as a raw byte stream and marks spaces with "▁". This lets it work with languages like Chinese and Japanese that don't use spaces.
Key Insight: BPE merges the most common pair. WordPiece merges the most useful pair. Unigram drops the least useful tokens. All three give similar results — the choice mostly comes down to the training toolkit.
Using Pre-Built Tokenizers in Practice
In real work, you'll use fast, battle-tested libraries — not hand-rolled code. The two big names are tiktoken (by OpenAI) and tokenizers (by Hugging Face).
tiktoken is what you want for OpenAI models. It's written in Rust for speed and matches the exact tokenizer used by GPT-3.5, GPT-4, and later models:
python
import tiktoken
# Load GPT-4's tokenizer
enc = tiktoken.get_encoding("cl100k_base")
text = "Hello, how are you doing today?"
tokens = enc.encode(text)
print(f"Text: '{text}'")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
# See what each token looks like
for token_id in tokens:
    token_bytes = enc.decode_single_token_bytes(token_id)
    print(f" {token_id} → '{token_bytes.decode('utf-8', errors='replace')}'")
Output:
python
Text: 'Hello, how are you doing today?'
Token IDs: [9906, 11, 1268, 527, 499, 3815, 3432, 30]
Token count: 8
9906 → 'Hello'
11 → ','
1268 → ' how'
527 → ' are'
499 → ' you'
3815 → ' doing'
3432 → ' today'
30 → '?'
Hugging Face tokenizers let you load tokenizers for thousands of models — BERT, Llama, T5, and many more:
python
from transformers import AutoTokenizer
# Load Llama 3's tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
text = "Hello, how are you doing today?"
tokens = tokenizer.encode(text)
decoded_tokens = [tokenizer.decode([t]) for t in tokens]
print(f"Token IDs: {tokens}")
print(f"Tokens: {decoded_tokens}")
print(f"Count: {len(tokens)}")
Output:
python
Token IDs: [128000, 9906, 11, 1268, 527, 499, 3815, 3432, 30]
Tokens: ['<|begin_of_text|>', 'Hello', ',', ' how', ' are', ' you', ' doing', ' today', '?']
Count: 9
Notice how Llama adds a <|begin_of_text|> token at the front. That's a marker the model expects at the start of each input.
Tip: Use `tiktoken` to count tokens before calling the OpenAI API. It's much faster than loading the full model's tokenizer and gives you exact counts for cost planning. Install it with `pip install tiktoken`.
typescript
{
type: 'exercise',
id: 'tiktoken-ex3',
title: 'Exercise 3: Compare Tokenizers',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Using tiktoken, encode the text "Tokenization is the backbone of every LLM" with both the GPT-2 encoding ("gpt2") and the GPT-4 encoding ("cl100k_base"). Print the token count for each and which one is more efficient (fewer tokens).',
starterCode: 'import tiktoken\n\ntext = "Tokenization is the backbone of every LLM"\n\n# Load GPT-2 encoding\ngpt2_enc = tiktoken.get_encoding("gpt2")\n\n# Load GPT-4 encoding\ngpt4_enc = tiktoken.get_encoding("cl100k_base")\n\n# Encode with both\n# YOUR CODE HERE\n\n# Print results\n# print(f"GPT-2 tokens: {count_gpt2}")\n# print(f"GPT-4 tokens: {count_gpt4}")\n# print(f"More efficient: ...")',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'GPT-2 tokens:', description: 'Should print GPT-2 token count' },
{ id: 'tc2', input: '', expectedOutput: 'GPT-4 tokens:', description: 'Should print GPT-4 token count' }
],
hints: [
'Use enc.encode(text) to get token IDs, then len() for the count.',
'gpt2_tokens = gpt2_enc.encode(text)\ngpt4_tokens = gpt4_enc.encode(text)\nThen compare len() of each.'
],
solution: 'import tiktoken\n\ntext = "Tokenization is the backbone of every LLM"\n\ngpt2_enc = tiktoken.get_encoding("gpt2")\ngpt4_enc = tiktoken.get_encoding("cl100k_base")\n\ngpt2_tokens = gpt2_enc.encode(text)\ngpt4_tokens = gpt4_enc.encode(text)\n\ncount_gpt2 = len(gpt2_tokens)\ncount_gpt4 = len(gpt4_tokens)\n\nprint(f"GPT-2 tokens: {count_gpt2}")\nprint(f"GPT-4 tokens: {count_gpt4}")\n\nif count_gpt4 < count_gpt2:\n print("More efficient: GPT-4 (cl100k_base)")\nelse:\n print("More efficient: GPT-2")',
solutionExplanation: 'GPT-4 uses cl100k_base with a 100K vocab — about double the 50K vocab of GPT-2. The bigger vocab means more words fit into single tokens, so you get shorter lists and lower API costs.',
xpReward: 20,
}
Common Mistakes and How to Fix Them
Mistake 1: Applying merges in the wrong order during encoding
❌ Wrong:
python
def bad_encode(text, merges):
    tokens = list(text.encode("utf-8"))
    while True:
        counts = get_pair_counts(tokens)
        if not counts:
            break
        # WRONG: merging the most frequent pair in THIS text
        best = max(counts, key=counts.get)
        if best not in merges:
            break
        tokens = merge(tokens, best, merges[best])
    return tokens
Why it is wrong: When you encode, you must apply merges in the order they were learned during training — not by how common they are in the current text. The merge order controls which word parts form. Using the wrong order gives you a different (and worse) result.
✅ Correct:
python
def correct_encode(text, merges):
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:  # a single token has no pairs left to merge
        counts = get_pair_counts(tokens)
        # CORRECT: pick the pair with the lowest merge index
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        tokens = merge(tokens, pair, merges[pair])
    return tokens
Mistake 2: Mixing up UTF-8 bytes and Unicode code points
❌ Wrong:
python
text = "café"
tokens = [ord(c) for c in text]
print(tokens)
# [99, 97, 102, 233] — 233 is NOT a valid byte token!
Why it is wrong: The letter "é" has Unicode code point 233, but in UTF-8 it takes two bytes: 195 and 169. Using ord() gives Unicode code points, not UTF-8 bytes. Your tokenizer's base vocab only covers bytes 0–255 in the UTF-8 space.
✅ Correct:
python
text = "café"
tokens = list(text.encode("utf-8"))
print(tokens)
Output:
python
[99, 97, 102, 195, 169]
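Going back the other way is the mirror image: pack the byte values into a `bytes` object and decode them as UTF-8. This round-trip is a small check worth keeping in any tokenizer's test suite:

```python
text = "café"
token_bytes = list(text.encode("utf-8"))  # [99, 97, 102, 195, 169]

# Decoding: the two bytes 195 and 169 recombine into the single character "é"
restored = bytes(token_bytes).decode("utf-8")
print(restored)  # café
assert restored == text
```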
Mistake 3: Merging across word edges
❌ Wrong:
python
# Training BPE on raw text without pre-tokenization
text = "the cat the dog the bird"
# Once "e" + " " merges into "e ", BPE might merge "e " + "c" → "e c"
# across the boundary between "the" and "cat" — a meaningless cross-word token
Why it is wrong: Without first splitting the text into words, BPE can fuse the "e" at the end of "the" with the space and the next word. You end up with tokens like "e c" or "e d" that span word edges and mean nothing.
✅ Correct:
python
import re

text = "the cat the dog the bird"
# Pre-tokenize first, then apply BPE within each word
pattern = r""" ?\w+| ?\d+| ?[^\s\w\d]+|\s+"""
words = re.findall(pattern, text)
print(words)  # ['the', ' cat', ' the', ' dog', ' the', ' bird']
# Apply BPE separately within each pre-tokenized word
Mistake 4: Not handling empty pair counts
❌ Wrong:
python
def train_bad(tokens, num_merges):
    for i in range(num_merges):
        counts = get_pair_counts(tokens)
        best = max(counts, key=counts.get)  # CRASHES if counts is empty!
Why it is wrong: If the token list has 0 or 1 items, there are no pairs to count. Calling max() on an empty dict throws a ValueError.
✅ Correct:
python
def train_safe(tokens, num_merges):
    merges = {}
    for i in range(num_merges):
        counts = get_pair_counts(tokens)
        if not counts:
            break  # No more pairs — stop early
        best = max(counts, key=counts.get)
        new_id = 256 + i
        tokens = merge(tokens, best, new_id)
        merges[best] = new_id
    return tokens, merges
Frequently Asked Questions
How many tokens does GPT-4 use for English text?
On average, GPT-4's tokenizer produces about 1 token per 4 characters of English. That works out to roughly 0.75 words per token — or about 1.3 tokens per word. Code tends to use more tokens per line than prose, because of all the special characters and indentation.
Can I use one model's tokenizer with a different model?
No. Each model is trained with its own tokenizer, and the token IDs point to that model's learned vectors. Feeding GPT-2 token IDs into Llama would map to the wrong vectors. Always match the tokenizer to the model.
Why do some languages need more tokens than English?
BPE learns its merges from training data, which is mostly English. Common English words become single tokens. Words in other languages stay split into more pieces. For instance, "hello" is 1 token in GPT-4, but its Chinese form "你好" might take 2-3 tokens. This is often called the "tokenization tax" for non-English text.
What if I set the vocab size too small or too large?
Too small (under 10K): common words split into many pieces. Sequences get long and costly. The model wastes effort putting words back together. Too large (over 200K): the lookup table gets huge and eats memory. Rare tokens get weak, poorly trained vectors. Most models today use 32K–128K as a good middle ground.
How is BPE different from Python's split() function?
Python's split() breaks text on white space and gives you whole words. BPE breaks text into subword units — chunks that are smaller than words but bigger than single letters. This lets BPE handle any text (even typos and new words) with a fixed-size vocab. The split() approach needs an entry for every possible word — an endless list.
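A quick way to feel the difference: split() has no answer for a word it has never seen, while byte-level BPE can always fall back to the 256 base bytes. A minimal sketch, with no trained merges applied and a made-up word standing in for "unknown":

```python
text = "a zorbleflux appeared"  # "zorbleflux" is invented

# Word-level: every distinct word would need its own vocab entry
print(text.split())  # ['a', 'zorbleflux', 'appeared']

# Byte-level: any string reduces to values 0-255, so nothing is ever unknown
byte_tokens = list(text.encode("utf-8"))
print(len(byte_tokens), "base tokens before any merges")  # 21 base tokens
```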
References
- Gage, P. — A New Algorithm for Data Compression. C Users Journal, 12(2):23–38 (1994). The original BPE algorithm for data compression.
- Sennrich, R., Haddow, B., Birch, A. — Neural Machine Translation of Rare Words with Subword Units. ACL (2016). Link. Brought BPE into the world of NLP.
- Radford, A. et al. — Language Models are Unsupervised Multitask Learners (GPT-2 paper). OpenAI (2019). First major model to use byte-level BPE.
- Kudo, T. — Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL (2018). Link. The Unigram method.
- Kudo, T., Richardson, J. — SentencePiece: A simple and language independent subword tokenizer and detokenizer. EMNLP (2018). Link.
- OpenAI tiktoken library — Fast BPE tokenizer for use with OpenAI's models. Link.
- Hugging Face Tokenizers library — Fast tokenizer implementations. Link.
- Hugging Face NLP Course — Chapter on tokenization. Link.
- Raschka, S. — Building a BPE Tokenizer From Scratch (2025). Link.
- Hugging Face Transformers Docs — Tokenizer Summary. Link.
[SCHEMA HINTS]
- Article type: Tutorial
- Primary technology: Python 3.9+, tiktoken, Hugging Face tokenizers
- Programming language: Python
- Difficulty: Intermediate
- Keywords: BPE tokenizer, byte pair encoding, tokenization Python, build tokenizer from scratch, GPT tokenizer, subword tokenization, NLP tokenization