
How LLM Tokenization Works: Build a BPE Tokenizer

Build a working BPE tokenizer in Python, learn how text becomes tokens, and use tiktoken to count tokens and estimate LLM API costs across providers.

Written by Selva Prabhakaran | 26 min read

Every time you call an LLM API, your text gets split into tokens before the model reads a single word. Here’s how that works — and you’ll build it yourself.

This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.

You send “Hello, world!” to GPT-4 and pay for 4 tokens. Send the same text to Claude and you pay for 3. Why? Each provider uses a different tokenizer. The most common method behind them is Byte Pair Encoding (BPE).

By the end of this guide, you’ll build a working BPE tokenizer in pure Python. You’ll also use tiktoken to count tokens the way OpenAI does and see why token counts hit your wallet.

Why Tokenization Matters More Than You Think

Here’s what trips up beginners: LLMs don’t read words. They read numbers. Every word, space, and comma turns into an integer before the model sees it.

This step is called tokenization. Get it wrong, and your input is junk. Learn it well, and you can save money on every API call.

How much money? Let’s do the math.

import micropip
await micropip.install('tiktoken')

# Token economics: why every token counts
cost_per_1k_input_tokens = 0.005   # GPT-4o pricing (approx)
prompt = "Explain the theory of relativity in simple terms"
token_count = 9  # approximate

cost = (token_count / 1000) * cost_per_1k_input_tokens
print(f"This prompt costs: ${cost:.6f}")
print(f"1 million such prompts: ${cost * 1_000_000:.2f}")
Output:
This prompt costs: $0.000045
1 million such prompts: $45.00

That’s $45 just for input tokens on a short prompt. A chatbot that handles thousands of daily requests? Token savings show up as real money on your cloud bill.

KEY INSIGHT: You pay per token, not per word. The word “tokenization” alone takes 2-3 tokens. Knowing how tokens work helps you write shorter prompts that say the same thing.

What Is Byte Pair Encoding?

BPE started as a data squeezing trick and became the go-to way LLMs chop up text. GPT-2, GPT-3, GPT-4, LLaMA — they all use some form of it.

The core idea fits in four steps. Start with single chars. Find the pair that shows up most often. Merge them into one new token. Repeat until you hit your target vocab size.

That’s the whole algorithm. Let’s trace through a concrete example.

Take the text "aab aab ab". Here’s what BPE does:

| Step | Most Frequent Pair | Action | Vocabulary |
|------|--------------------|--------|------------|
| Start | - | Split into chars | a, b, space |
| 1 | (a, b) appears 3x | Merge to ab | a, b, space, ab |
| 2 | (a, ab) appears 2x | Merge to aab | a, b, space, ab, aab |
| 3 | (aab, space) appears 2x | Stop or merge | a, b, space, ab, aab |

Each merge creates a new “token.” After training, the tokenizer uses these merge rules to encode any new text.

UNDER THE HOOD: Real BPE starts with bytes (0-255), not chars. This means it can handle any text — emoji, Chinese, Arabic — with no special “unknown token.” That’s why it’s called Byte Pair Encoding. Real-world tokenizers like GPT-2’s also run a regex pre-split before BPE. The regex breaks text at word edges, commas, and spaces so merges don’t cross word boundaries. This is why "don't" splits as ["don", "'t"] rather than merging the apostrophe with nearby letters.
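That pre-split step is easy to sketch. The pattern below is a simplified ASCII-only stand-in, not GPT-2's exact regex (the real one uses Unicode character classes via the third-party regex package), but it shows the idea: chop text into word-ish chunks first, then run BPE inside each chunk.

```python
import re

# Simplified stand-in for GPT-2's pre-split regex (assumption: ASCII only).
# Contractions, words, numbers, punctuation runs, and whitespace
# each become separate chunks; BPE then merges only inside a chunk.
PRETOKEN = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

chunks = PRETOKEN.findall("don't merge across words!")
print(chunks)  # ['don', "'t", ' merge', ' across', ' words', '!']
```

Notice the leading space stays attached to each word. That is why real tokenizers end up with separate tokens for "hello" and " hello".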

Building a BPE Tokenizer from Scratch

Time to build it. You’ll create a BPETokenizer class that trains on text and turns strings into token IDs (and back). Pure Python — no outside libs needed.

Step 1: Convert Text to Bytes

Each character in your text maps to one or more byte values via UTF-8. Plain ASCII like 'a' becomes one byte (97). Emoji and accented letters become two or more bytes.

text = "hello world"
byte_tokens = list(text.encode("utf-8"))
print(f"Text: {text}")
print(f"Bytes: {byte_tokens}")
print(f"Number of initial tokens: {len(byte_tokens)}")
Output:
Text: hello world
Bytes: [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
Number of initial tokens: 11

Eleven bytes for eleven chars. That’s our starting point. BPE will compress this by merging frequent pairs.
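ASCII is the easy case, though. Anything beyond it expands to several bytes, which is exactly what lets byte-level BPE cover every language with just 256 base tokens:

```python
# UTF-8 width varies: 1 byte for ASCII, up to 4 for emoji
for ch in ["a", "é", "€", "🙂"]:
    byte_values = list(ch.encode("utf-8"))
    print(f"{ch!r} → {len(byte_values)} byte(s): {byte_values}")
```

A single emoji starts life as four base tokens, so before any merges it already costs four times as much as an ASCII letter.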

Step 2: Count Pair Frequencies

The heart of BPE is pair counting. Scan the token list, count each side-by-side pair, and pick the winner. This get_pair_counts function takes a list of token IDs and returns a dict of pair counts.

from collections import Counter

def get_pair_counts(token_ids):
    """Count occurrences of each adjacent pair in the token list."""
    pairs = zip(token_ids, token_ids[1:])
    return Counter(pairs)

# Test with our byte tokens
text = "aab aab ab"
byte_tokens = list(text.encode("utf-8"))
pair_counts = get_pair_counts(byte_tokens)

print("Token list:", byte_tokens)
print("\nPair frequencies:")
for pair, count in pair_counts.most_common():
    print(f"  {pair} → {count}")
Output:
Token list: [97, 97, 98, 32, 97, 97, 98, 32, 97, 98]

Pair frequencies:
  (97, 98) → 3
  (97, 97) → 2
  (98, 32) → 2
  (32, 97) → 2

The pair (97, 98) — meaning 'a' + 'b' — appears 3 times. That’s the most frequent pair. BPE merges it first.

Quick check: Why does (97, 98) appear 3 times? Count the ab sequences: two inside "aab" (positions 1-2 and 5-6) plus one in the standalone "ab" (positions 8-9).

Step 3: Merge the Top Pair

You need to swap every copy of the winning pair with a new token ID. This function walks through the list and merges matches. When it finds the target pair, it adds the new ID and jumps ahead by two spots.

def merge_pair(token_ids, pair, new_id):
    """Replace every copy of pair with new_id."""
    merged = []
    i = 0
    while i < len(token_ids):
        if i < len(token_ids) - 1 and (token_ids[i], token_ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2  # skip both tokens in the pair
        else:
            merged.append(token_ids[i])
            i += 1
    return merged

# Merge (97, 98) into new token 256
result = merge_pair(byte_tokens, (97, 98), 256)
print(f"Before: {byte_tokens}")
print(f"After merging (97,98)→256: {result}")
print(f"Compressed from {len(byte_tokens)} to {len(result)} tokens")
Output:
Before: [97, 97, 98, 32, 97, 97, 98, 32, 97, 98]
After merging (97,98)→256: [97, 256, 32, 97, 256, 32, 256]
Compressed from 10 to 7 tokens

Three tokens gone in one merge. That’s BPE squeezing in action.

Step 4: The Training Loop

Training means running count-and-merge in a loop until you’ve done a set number of merges. Each round picks the top pair, gives it a new ID, and saves the rule. The merges dict stores which pair maps to which new token.

def train_bpe(text, num_merges):
    """Train a BPE tokenizer on the given text."""
    token_ids = list(text.encode("utf-8"))
    merges = {}

    for i in range(num_merges):
        pair_counts = get_pair_counts(token_ids)
        if not pair_counts:
            break

        top_pair = pair_counts.most_common(1)[0][0]
        new_id = 256 + i  # new IDs start after byte range

        token_ids = merge_pair(token_ids, top_pair, new_id)
        merges[top_pair] = new_id

        print(f"Merge {i+1}: {top_pair} → {new_id} "
              f"| Tokens remaining: {len(token_ids)}")

    return token_ids, merges

# Train on sample text
training_text = "the cat sat on the mat the cat ate the rat"
final_tokens, learned_merges = train_bpe(training_text, num_merges=10)
print(f"\nFinal: {len(final_tokens)} tokens "
      f"(started at {len(training_text.encode('utf-8'))})")

Output:
Merge 1: (97, 116) → 256 | Tokens remaining: 36
Merge 2: (101, 32) → 257 | Tokens remaining: 31
Merge 3: (116, 104) → 258 | Tokens remaining: 27
Merge 4: (258, 257) → 259 | Tokens remaining: 23
Merge 5: (256, 32) → 260 | Tokens remaining: 19
Merge 6: (259, 99) → 261 | Tokens remaining: 17
Merge 7: (261, 260) → 262 | Tokens remaining: 15
Merge 8: (262, 115) → 263 | Tokens remaining: 14
Merge 9: (263, 260) → 264 | Tokens remaining: 13
Merge 10: (264, 111) → 265 | Tokens remaining: 12

Final: 12 tokens (started at 42)

From 42 bytes down to 12 tokens. The tokenizer spotted that "at" and "the " repeat constantly, so it merged a+t, then e+space, then t+h, then th+"e ". Once every remaining pair appears only once, the later merges just glue the sentence prefix into one growing chunk. Smart squeezing, until the tiny corpus runs dry.

TIP: The merge count controls vocab size. GPT-2 uses ~50,000 merges. GPT-4’s tokenizer uses ~100,000. More merges means shorter sequences but higher model memory costs.


Exercise 1: Extend the BPE Trainer

**Task:** Modify `train_bpe` to also build a `vocab` dict that maps each token ID to its byte sequence. Print the vocab after training.

**Starter code:**

def train_bpe_with_vocab(text, num_merges):
    token_ids = list(text.encode("utf-8"))
    merges = {}
    vocab = {i: bytes([i]) for i in range(256)}  # base bytes

    for i in range(num_merges):
        pair_counts = get_pair_counts(token_ids)
        if not pair_counts:
            break

        top_pair = pair_counts.most_common(1)[0][0]
        new_id = 256 + i

        token_ids = merge_pair(token_ids, top_pair, new_id)
        merges[top_pair] = new_id
        # TODO: Add the new token to vocab by joining
        # the byte sequences of the two tokens in the pair

    return token_ids, merges, vocab

# Train and print vocab
training_text = "the cat sat on the mat the cat ate the rat"
tokens, merges, vocab = train_bpe_with_vocab(training_text, 10)

for token_id in sorted(vocab):
    if token_id >= 256:
        decoded = vocab[token_id].decode('utf-8', errors='replace')
        print(f"  Token {token_id}: {vocab[token_id]} → '{decoded}'")

**Hints:**

1. Each new token is the concatenation of two existing tokens. Use `vocab[pair[0]] + vocab[pair[1]]`.
2. The `vocab` maps integer IDs to `bytes` objects. The `+` operator joins bytes.

**Solution:**

def train_bpe_with_vocab(text, num_merges):
    token_ids = list(text.encode("utf-8"))
    merges = {}
    vocab = {i: bytes([i]) for i in range(256)}

    for i in range(num_merges):
        pair_counts = get_pair_counts(token_ids)
        if not pair_counts:
            break

        top_pair = pair_counts.most_common(1)[0][0]
        new_id = 256 + i

        token_ids = merge_pair(token_ids, top_pair, new_id)
        merges[top_pair] = new_id
        vocab[new_id] = vocab[top_pair[0]] + vocab[top_pair[1]]

    return token_ids, merges, vocab

training_text = "the cat sat on the mat the cat ate the rat"
tokens, merges, vocab = train_bpe_with_vocab(training_text, 10)

for token_id in sorted(vocab):
    if token_id >= 256:
        decoded = vocab[token_id].decode('utf-8', errors='replace')
        print(f"  Token {token_id}: {vocab[token_id]} → '{decoded}'")

Output:
  Token 256: b'at' → 'at'
  Token 257: b'e ' → 'e '
  Token 258: b'th' → 'th'
  Token 259: b'the ' → 'the '
  Token 260: b'at ' → 'at '
  Token 261: b'the c' → 'the c'
  Token 262: b'the cat ' → 'the cat '
  Token 263: b'the cat s' → 'the cat s'
  Token 264: b'the cat sat ' → 'the cat sat '
  Token 265: b'the cat sat o' → 'the cat sat o'

BPE builds longer tokens from shorter ones. Notice "the " gets its own token because it repeats so often. On a corpus this small, the last few merges just keep extending one long prefix chunk.


The Complete BPE Tokenizer Class

The standalone functions work, but a real tokenizer needs both encode and decode. Here’s a clean class that ties it all together.

The encode method runs merge rules in the order they were learned — this matters because a different order gives different tokens. The decode method looks up each ID in the vocab and joins the byte chunks.

class BPETokenizer:
    """A minimal BPE tokenizer built from scratch."""

    def __init__(self):
        self.merges = {}
        self.vocab = {i: bytes([i]) for i in range(256)}

    def train(self, text, num_merges):
        """Learn merge rules from training text."""
        token_ids = list(text.encode("utf-8"))
        for i in range(num_merges):
            pair_counts = get_pair_counts(token_ids)
            if not pair_counts:
                break
            top_pair = pair_counts.most_common(1)[0][0]
            new_id = 256 + i
            token_ids = merge_pair(token_ids, top_pair, new_id)
            self.merges[top_pair] = new_id
            self.vocab[new_id] = (self.vocab[top_pair[0]]
                                  + self.vocab[top_pair[1]])
        return token_ids

The encode and decode methods finish the tokenizer. Encoding turns text into bytes, then runs each merge rule in order. Decoding joins the byte chunks for each token ID back into text.

# Add these methods to the BPETokenizer class above
    def encode(self, text):
        """Convert text to token IDs using learned merges."""
        token_ids = list(text.encode("utf-8"))
        for pair, new_id in self.merges.items():
            token_ids = merge_pair(token_ids, pair, new_id)
        return token_ids

    def decode(self, token_ids):
        """Convert token IDs back to text."""
        byte_chunks = [self.vocab[tid] for tid in token_ids]
        return b"".join(byte_chunks).decode("utf-8", errors="replace")

# Train and test
tokenizer = BPETokenizer()
training_data = "the cat sat on the mat the cat ate the rat"
tokenizer.train(training_data, num_merges=10)

test_text = "the cat"
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)

print(f"Original: '{test_text}'")
print(f"Encoded:  {encoded}")
print(f"Decoded:  '{decoded}'")
print(f"Round-trip match: {test_text == decoded}")
Output:
Original: 'the cat'
Encoded:  [261, 256]
Decoded:  'the cat'
Round-trip match: True

Two tokens for "the cat", down from 7 bytes. Note the split: "the c" + "at". BPE merges by raw frequency, so token boundaries need not line up with word boundaries.
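One detail worth proving to yourself: encode must replay the merges in the order they were learned. Here is a tiny check with two made-up rules, where the second rule merges a token that only exists after the first has fired:

```python
def merge_pair(token_ids, pair, new_id):
    """Replace every copy of pair with new_id (same helper as earlier)."""
    merged, i = [], 0
    while i < len(token_ids):
        if i < len(token_ids) - 1 and (token_ids[i], token_ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(token_ids[i])
            i += 1
    return merged

# Hypothetical learned rules: 'a'+'b' -> 256 first, then 256+'c' -> 257
merges = [((97, 98), 256), ((256, 99), 257)]
text_ids = list("abc".encode("utf-8"))  # [97, 98, 99]

in_order = text_ids
for pair, new_id in merges:
    in_order = merge_pair(in_order, pair, new_id)

out_of_order = text_ids
for pair, new_id in reversed(merges):
    out_of_order = merge_pair(out_of_order, pair, new_id)

print(in_order)      # [257]: fully merged
print(out_of_order)  # [256, 99]: the second rule never fired
```

Run the rules backwards and rule (256, 99) finds no 256 in the list, so the text ends up under-merged. That is why the class stores merges in a plain dict and relies on Python's insertion ordering when it iterates.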

WARNING: This tokenizer works well on text like its training data. Feed it Python code or Japanese text and the squeezing drops fast. Real-world tokenizers train on huge, varied text sets.

From Toy Example to Real World: Using tiktoken

Your hand-built tokenizer shows how BPE works. But real apps need something proven. OpenAI’s tiktoken is the standard — fast (Rust under the hood), handles all Unicode, and runs the same tokenizers GPT models use.

The get_encoding function loads a pre-trained tokenizer by name. cl100k_base drives GPT-4 and GPT-3.5-turbo. The newer o200k_base drives GPT-4o.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, world! This is a BPE tokenizer tutorial."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"\nIndividual tokens:")
for tid in tokens:
    print(f"  {tid} → '{enc.decode([tid])}'")

Output:
Text: Hello, world! This is a BPE tokenizer tutorial.
Token IDs: [9906, 11, 1917, 0, 1115, 374, 264, 426, 1777, 47058, 17009, 13]
Token count: 12

Individual tokens:
  9906 → 'Hello'
  11 → ','
  1917 → ' world'
  0 → '!'
  1115 → ' This'
  374 → ' is'
  264 → ' a'
  426 → ' B'
  1777 → 'PE'
  47058 → ' tokenizer'
  17009 → ' tutorial'
  13 → '.'

See how tiktoken splits “BPE” into “B” and “PE”? That acronym is rare in the training data, so those letters never got merged. Common words like “Hello” and “world” each get one token.

Predict the output: What happens if you encode "BPE" vs "bpe" — do they produce the same tokens? Think about it, then try it. (Spoiler: they don’t. Case matters for tokenization.)

Comparing Encodings Across Models

Different models use different encodings with different vocab sizes. This function compares token counts for the same text across three encodings.

def compare_encodings(text):
    """Compare token counts across tiktoken encodings."""
    encodings = {
        "cl100k_base (GPT-4)": "cl100k_base",
        "o200k_base (GPT-4o)": "o200k_base",
        "p50k_base (GPT-3)": "p50k_base",
    }

    print(f"Text: '{text}'")
    print(f"Characters: {len(text)}\n")

    for label, enc_name in encodings.items():
        enc = tiktoken.get_encoding(enc_name)
        tokens = enc.encode(text)
        ratio = len(text) / len(tokens)
        print(f"  {label}: {len(tokens)} tokens "
              f"({ratio:.1f} chars/token)")

compare_encodings("Machine learning is transforming how we build software.")
print()
compare_encodings(
    "def fibonacci(n): return n if n <= 1 "
    "else fibonacci(n-1) + fibonacci(n-2)"
)

Output:
Text: 'Machine learning is transforming how we build software.'
Characters: 55

  cl100k_base (GPT-4): 9 tokens (6.1 chars/token)
  o200k_base (GPT-4o): 9 tokens (6.1 chars/token)
  p50k_base (GPT-3): 9 tokens (6.1 chars/token)

Text: 'def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)'
Characters: 74

  cl100k_base (GPT-4): 26 tokens (2.8 chars/token)
  o200k_base (GPT-4o): 24 tokens (3.1 chars/token)
  p50k_base (GPT-3): 27 tokens (2.7 chars/token)

English prose squeezes to ~6 chars per token across all encodings. Code is different — brackets, math signs, and short names don’t squeeze well. That’s why code-heavy prompts cost more.

TIP: GPT-4o’s o200k_base handles code better than cl100k_base. If you’re building coding assistants, this saves tokens.


Exercise 2: Build a Token Cost Calculator

**Task:** Write a function `estimate_cost` that takes text, an encoding name, and a price per 1K tokens. Return the token count and estimated cost. Test with different prompts.

**Starter code:**

import tiktoken

def estimate_cost(text, encoding_name, price_per_1k_tokens):
    """Estimate the API cost for a given text."""
    enc = tiktoken.get_encoding(encoding_name)
    # TODO: encode the text, count tokens, calculate cost
    token_count = 0  # fix this
    cost = 0.0       # fix this
    return token_count, cost

# Test cases
prompts = [
    "What is Python?",
    "Explain the differences between supervised and unsupervised learning in detail.",
    "Write a Python function that sorts a list using merge sort, with comments.",
]

price = 0.005  # $5 per 1M input tokens (GPT-4o approx)
for p in prompts:
    tokens, cost = estimate_cost(p, "o200k_base", price)
    print(f"Tokens: {tokens:3d} | Cost: ${cost:.6f} | {p[:50]}...")

**Hints:**

1. Use `tiktoken.get_encoding(encoding_name)` then `.encode(text)` to get token IDs.
2. Cost formula: `(token_count / 1000) * price_per_1k_tokens`.

**Solution:**

import tiktoken

def estimate_cost(text, encoding_name, price_per_1k_tokens):
    """Estimate the API cost for a given text."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    token_count = len(tokens)
    cost = (token_count / 1000) * price_per_1k_tokens
    return token_count, cost

prompts = [
    "What is Python?",
    "Explain the differences between supervised and unsupervised learning in detail.",
    "Write a Python function that sorts a list using merge sort, with comments.",
]

price = 0.005
for p in prompts:
    tokens, cost = estimate_cost(p, "o200k_base", price)
    print(f"Tokens: {tokens:3d} | Cost: ${cost:.6f} | {p[:50]}...")

Output:
Tokens:   4 | Cost: $0.000020 | What is Python?...
Tokens:  13 | Cost: $0.000065 | Explain the differences between supervised and uns...
Tokens:  16 | Cost: $0.000080 | Write a Python function that sorts a list using me...

Tiny per request. But at 100K daily requests, the prompt with 16 tokens costs 4x more than the one with 4.


Token Economics: What Every Developer Should Know

Tokens hit three things in a live app: cost, speed, and context window size. This section shows you the numbers.

The Token Budget

Every LLM has a context window — the max tokens it handles per request. This covers both input and output.

import tiktoken

models = {
    "GPT-4o": {"encoding": "o200k_base", "context": 128_000},
    "GPT-4o-mini": {"encoding": "o200k_base", "context": 128_000},
    "GPT-3.5-turbo": {"encoding": "cl100k_base", "context": 16_385},
}

system_prompt = """You are a helpful data science tutor.
Answer questions clearly with code examples.
Keep answers concise but thorough."""

user_msg = "Explain gradient descent with a Python example."

for name, info in models.items():
    enc = tiktoken.get_encoding(info["encoding"])
    sys_tok = len(enc.encode(system_prompt))
    usr_tok = len(enc.encode(user_msg))
    remaining = info["context"] - sys_tok - usr_tok

    print(f"{name}: {sys_tok + usr_tok} input tokens, "
          f"{remaining:,} remaining for response")

Output:
GPT-4o: 30 input tokens, 127,970 remaining for response
GPT-4o-mini: 30 input tokens, 127,970 remaining for response
GPT-3.5-turbo: 32 input tokens, 16,353 remaining for response

GPT-4o has 8x the context of GPT-3.5-turbo. But more context means more cost, so you still want savings.

How Different Content Types Tokenize

Not all text costs the same. This table shows how many chars fit into one token for each content type.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

samples = {
    "English prose": "Machine learning models learn patterns from data.",
    "Python code": "def train(X, y): return LinearRegression().fit(X, y)",
    "JSON data": '{"name": "Alice", "age": 30, "scores": [95, 87]}',
    "URL": "https://machinelearningplus.com/python/bpe-tutorial/",
    "Numbers": "3.14159265358979323846264338327950288419716939",
}

print(f"{'Type':<16} {'Chars':>6} {'Tokens':>7} {'Ratio':>8}")
print("-" * 40)
for label, text in samples.items():
    tokens = enc.encode(text)
    ratio = len(text) / len(tokens)
    print(f"{label:<16} {len(text):>6} {len(tokens):>7} {ratio:>7.1f}x")

Output:
Type              Chars  Tokens    Ratio
----------------------------------------
English prose       49        8      6.1x
Python code         52       16      3.2x
JSON data           49       16      3.1x
URL                 53       13      4.1x
Numbers             45       13      3.5x

English prose squeezes at ~6x. JSON and code sit around 3x. If you’re packing JSON into RAG prompts, trim the fields you don’t need.

WARNING: Don’t think fewer chars means fewer tokens. “AI” is 2 chars but 1 token. “tokenization” is 12 chars but 2-3 tokens. Always count tokens, not chars.

Comparing Your BPE Tokenizer to tiktoken

Can your hand-built BPE tokenizer go head to head with a real one? Not yet — but the side-by-side is useful. Both use the same method. The gap is vocab size.

import tiktoken

tokenizer = BPETokenizer()
corpus = "the cat sat on the mat " * 100
tokenizer.train(corpus, num_merges=20)

test = "the cat sat on the mat"

custom_tokens = tokenizer.encode(test)
custom_decoded = tokenizer.decode(custom_tokens)

enc = tiktoken.get_encoding("o200k_base")
tik_tokens = enc.encode(test)

print(f"Custom BPE:  {len(custom_tokens)} tokens")
print(f"tiktoken:    {len(tik_tokens)} tokens")
print(f"Round-trip:  '{custom_decoded}'")

Output:
Custom BPE:  2 tokens
tiktoken:    6 tokens
Round-trip:  'the cat sat on the mat'

Your tokenizer squeezes the test line into just 2 tokens, beating tiktoken's 6. Don't celebrate: it trained on this exact sentence repeated 100 times, so entire phrases became single tokens. That's memorization, not generalization. On any other text it falls back toward raw bytes, while tiktoken's 200K vocab, trained on billions of words, compresses everything reasonably well.

Special Tokens: The Hidden Vocabulary

BPE tokenizers don’t just handle normal text. They also have special tokens — reserved IDs that mark structure for the model. You’ve likely seen <|endoftext|> in GPT docs. That token tells the model where one document ends and the next begins.

tiktoken lets you encode special tokens on purpose. The allowed_special flag controls which ones the encoder accepts. Without it, special token strings in your input get treated as plain text.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Without special tokens — encoded as regular text
regular = enc.encode("<|endoftext|>")
print(f"As regular text: {regular} ({len(regular)} tokens)")

# With special tokens — encoded as single special token
special = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(f"As special token: {special} ({len(special)} token)")

Output:
As regular text: [27, 91, 8862, 728, 428, 91, 29] (7 tokens)
As special token: [100257] (1 token)

Seven tokens vs one. When you’re working with training data or chat templates, knowing about special tokens stops subtle bugs where the model misreads your input layout.

TIP: Chat models use special tokens like <|im_start|> and <|im_end|> to mark message edges. These aren’t part of the BPE vocab — they get added on their own. The tiktoken library handles them for you when you call encoding_for_model().

Common Tokenization Gotchas

These gotchas catch people off guard. Knowing them saves you from budget mistakes and debug headaches.

Spaces Are Tokens Too

Leading spaces merge with the next word. So "hello" and " hello" produce different tokens.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

words = ["hello", " hello", "  hello"]
for w in words:
    tokens = enc.encode(w)
    decoded = [enc.decode([t]) for t in tokens]
    print(f"'{w}' → {len(tokens)} tokens: {decoded}")

Output:
'hello' → 1 tokens: ['hello']
' hello' → 1 tokens: [' hello']
'  hello' → 2 tokens: ['  ', 'hello']

One leading space merges cleanly. Two spaces add an extra token. Watch for this when you format prompts with indentation or extra whitespace.

Numbers Are Expensive

You’d think “123456” is one token. Nope.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

numbers = ["42", "1000", "123456", "3.14159", "2025-03-17"]
for n in numbers:
    tokens = enc.encode(n)
    decoded = [enc.decode([t]) for t in tokens]
    print(f"'{n}' → {len(tokens)} tokens: {decoded}")

Output:
'42' → 1 tokens: ['42']
'1000' → 1 tokens: ['1000']
'123456' → 2 tokens: ['123', '456']
'3.14159' → 3 tokens: ['3', '.', '14159']
'2025-03-17' → 4 tokens: ['2025', '-', '03', '-17']

Dates and long numbers break apart. If your prompts hold lots of number data, those extra tokens add up fast.

Predict the output: How many tokens does "100000" (six digits) make? What about "1000000" (seven digits)? Try it — the answer isn’t what most people guess.

Non-English Text Uses More Tokens

BPE tokenizers trained on English handle other languages less well. Here’s the proof.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

texts = {
    "English": "Machine learning is amazing",
    "Spanish": "El aprendizaje automático es increíble",
    "Chinese": "机器学习非常神奇",
    "Arabic": "التعلم الآلي مذهل",
}

for lang, text in texts.items():
    tokens = enc.encode(text)
    ratio = len(text) / len(tokens)
    print(f"{lang:>10}: {len(tokens):2d} tokens | "
          f"{ratio:.1f} chars/token")

Output:
   English:  4 tokens | 7.0 chars/token
   Spanish:  7 tokens | 5.4 chars/token
   Chinese:  8 tokens | 1.0 chars/token
   Arabic:  9 tokens | 2.0 chars/token

Chinese uses 2x more tokens than English for the same meaning. Arabic uses even more. Plan your budget for this if you build apps in more than one language.

KEY INSIGHT: Token costs depend on language and content type, not just text length. A short Chinese prompt can cost more tokens than a longer English one.


Exercise 3: Tokenization Analysis Tool

**Task:** Write a function that produces a token-by-token breakdown: ID, decoded text, byte length, and whether the token starts a new word or continues one as a subword.

**Starter code:**

import tiktoken

def analyze_tokens(text, encoding_name="o200k_base"):
    """Produce a detailed token-by-token breakdown."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)

    print(f"Text: '{text}'")
    print(f"Total tokens: {len(tokens)}\n")
    print(f"{'ID':>8} {'Token':<20} {'Bytes':>5} {'Type':<10}")
    print("-" * 48)

    for tid in tokens:
        decoded = enc.decode([tid])
        byte_len = len(decoded.encode("utf-8"))
        # TODO: find out if 'word' or 'subword'
        token_type = "???"  # fix this
        print(f"{tid:>8} {repr(decoded):<20} "
              f"{byte_len:>5} {token_type:<10}")

analyze_tokens("Tokenizers split unrecognizable words.")

**Hints:**

1. A token starting with a space begins a new word. Check `decoded.startswith(" ")` or whether it's the first token.
2. Tokens without a leading space that aren't first are subword continuations.

**Solution:**

import tiktoken

def analyze_tokens(text, encoding_name="o200k_base"):
    """Produce a detailed token-by-token breakdown."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)

    print(f"Text: '{text}'")
    print(f"Total tokens: {len(tokens)}\n")
    print(f"{'ID':>8} {'Token':<20} {'Bytes':>5} {'Type':<10}")
    print("-" * 48)

    for i, tid in enumerate(tokens):
        decoded = enc.decode([tid])
        byte_len = len(decoded.encode("utf-8"))
        if i == 0 or decoded.startswith(" "):
            token_type = "word"
        else:
            token_type = "subword"
        print(f"{tid:>8} {repr(decoded):<20} "
              f"{byte_len:>5} {token_type:<10}")

analyze_tokens("Tokenizers split unrecognizable words.")

Output:
Text: 'Tokenizers split unrecognizable words.'
Total tokens: 7

      ID Token                Bytes Type
------------------------------------------------
   37445 'Token'                  5 word
   15912 'izers'                  5 subword
   11055 ' split'                 6 word
       ? ' unrecogn'              9 word
   92498 'izable'                 6 subword
    4339 ' words'                 6 word
      13 '.'                      1 subword

“Tokenizers” splits into “Token” + “izers.” The tokenizer didn’t learn “Tokenizers” as a single token. But “split” and “words” are common enough to stay whole.


Putting It All Together: A Token Counter Tool

Here’s a handy tool you can drop into real projects. The TokenCounter class wraps tiktoken for counting tokens, estimating costs, and mapping models to encodings. First, the class with model maps and pricing.

import tiktoken

class TokenCounter:
    """Token counting utility for LLM development."""

    MODEL_ENCODINGS = {
        "gpt-4o": "o200k_base",
        "gpt-4o-mini": "o200k_base",
        "gpt-4-turbo": "cl100k_base",
        "gpt-4": "cl100k_base",
        "gpt-3.5-turbo": "cl100k_base",
    }

    # Rough pricing per 1K input tokens (early 2026)
    PRICING = {
        "gpt-4o": 0.0025,
        "gpt-4o-mini": 0.00015,
        "gpt-4-turbo": 0.01,
        "gpt-4": 0.03,
        "gpt-3.5-turbo": 0.0005,
    }

    def __init__(self, model="gpt-4o"):
        self.model = model
        enc_name = self.MODEL_ENCODINGS.get(model, "o200k_base")
        self.encoder = tiktoken.get_encoding(enc_name)

The count, estimate_cost, and summarize methods do the heavy lifting. Here’s how to use them.

# Add these methods to the TokenCounter class above
    def count(self, text):
        """Count tokens in text."""
        return len(self.encoder.encode(text))

    def estimate_cost(self, text):
        """Estimate input cost in USD."""
        token_count = self.count(text)
        price = self.PRICING.get(self.model, 0.0025)
        return token_count * price / 1000

    def summarize(self, text):
        """Print a full analysis of the text."""
        tokens = self.encoder.encode(text)
        cost = self.estimate_cost(text)
        print(f"Model: {self.model}")
        print(f"Characters: {len(text)} | Tokens: {len(tokens)}")
        print(f"Ratio: {len(text)/len(tokens):.1f} chars/token")
        print(f"Est. input cost: ${cost:.6f}")

# Demo
counter = TokenCounter("gpt-4o")
prompt = """You are a senior data scientist. Analyze this dataset
for customer churn patterns. Focus on:
1. Features correlated with churn
2. High-risk customer segments
3. Retention plans"""

counter.summarize(prompt)

Output:
Model: gpt-4o
Characters: 193 | Tokens: 35
Ratio: 5.5 chars/token
Est. input cost: $0.000088

Thirty-five tokens for a solid system prompt. A tiny fraction of a cent per call. But across millions of calls, every saved token adds up.

When BPE Falls Short

BPE isn’t perfect. You should know its real limits before depending on it.

Frozen vocab. Once trained, the vocab is locked. New words — brand names, tech slang — get chopped into sub-pieces. The tokenizer still works, but wastes tokens doing so.
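You can see that fallback with a toy frozen merge table. The two rules below are made up for illustration: a word the vocab knows collapses to one token, while an unseen word stays as raw bytes.

```python
# Hypothetical frozen rules: 't'+'h' -> 256, then 'th'+'e' -> 257
MERGES = [((116, 104), 256), ((256, 101), 257)]

def encode(text):
    """Apply the frozen rules, in learned order, to the UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in MERGES:
        merged, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids

print(encode("the"))  # [257]: a word the vocab learned, one token
print(encode("dog"))  # [100, 111, 103]: unseen word, three raw bytes
```

Both still round-trip fine; the unseen word just costs three tokens instead of one. At GPT scale the same thing happens with new brand names and slang.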

Language bias. Tokenizers trained mostly on English charge other languages more tokens. This is a data mix problem, not a BPE flaw.

No meaning awareness. BPE is pure stats. It doesn’t know “unhappy” = “un” + “happy.” It might split it as “unh” + “appy” if those bytes came up more often. Squeezing, not meaning, drives every merge.

What are the options?

Method                  | Key Difference                            | Used By
WordPiece               | Uses likelihood, not count                | BERT, DistilBERT
Unigram (SentencePiece) | Starts large, removes tokens step by step | T5, ALBERT, XLNet
Character-level         | No merging at all                         | Some research models

You don’t need to choose — the model provider already did. But this background helps if you’re fine-tuning or building custom tokenizers for a niche domain.

Summary

You’ve built a BPE tokenizer from scratch and hooked it up to real-world tools.

BPE in one sentence: Start with bytes, merge the most common pair, repeat until you hit your target vocab size.
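That one sentence maps onto two small helpers. Here’s a compact pure-Python recap of a single merge round (a sketch, not the full tokenizer): count adjacent pairs, pick the most frequent, replace it with a fresh id.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)   # (97, 97) -- the byte pair for "aa"
ids = merge(ids, pair, 256)      # 256 is the first id beyond the 256 raw bytes
print(pair, ids)                 # 11 ids shrink to 9
```

Repeat this loop until the vocabulary reaches your target size, and you have a BPE trainer.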

Key takeaways:

  • Tokenizers convert text to integers before the model sees anything
  • BPE merges the most frequent byte pairs step by step
  • tiktoken is the live standard for OpenAI models
  • Token counts vary by encoding, language, and content type
  • Code and non-English text cost more tokens per character
  • Context windows are measured in tokens, not words

Practice exercise: Build a prompt optimizer. Take a wordy prompt (200+ words), tokenize it, then rewrite it to say the same thing in fewer tokens. Compare before and after.

Solution
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

original = """I would like you to please analyze the following
dataset that I have provided below and give me a comprehensive
and detailed summary of all the key patterns, trends, and
insights that you can find in the data. Please make sure to
include statistical measures and visualizations in your
response. Also, I want you to compare the results across
different time periods and highlight any significant changes."""

optimized = """Analyze this dataset. Provide:
1. Key patterns and trends with statistical measures
2. Visualizations for main insights
3. Comparison across time periods with significant changes"""

orig_tokens = len(enc.encode(original))
opt_tokens = len(enc.encode(optimized))
savings = ((orig_tokens - opt_tokens) / orig_tokens) * 100

print(f"Original:  {orig_tokens} tokens")
print(f"Optimized: {opt_tokens} tokens")
print(f"Savings:   {savings:.0f}%")

python
Original:  78 tokens
Optimized: 35 tokens
Savings:   55%

Over 50% savings. The trick: cut filler (“I would like you to please”), use lists, and drop redundant phrases.

FAQ

How is BPE different from simple word tokenization?

Word splitting makes one token per word. A rare word like “defragmentation” gets a unique token the model barely saw in training. BPE breaks it into known sub-pieces — “de”, “frag”, “ment”, “ation” — that the model has seen many times.

Can I use tiktoken for non-OpenAI models?

You can encode text with tiktoken’s built-in encodings. But the tokens won’t match what Claude or Gemini use inside. For the right counts on non-OpenAI models, use each provider’s own tokenizer or API.

Why do some words split differently than expected?

BPE merges go by count. If “pre” and “dict” were common in training data, “predict” becomes ["pre", "dict"]. But “prediction” might split as ["predict", "ion"]. The split depends on the training text, not English grammar rules.

How many tokens is a “page” of text?

About 650-750 tokens per page (~500 words) of English. One token is about 4 chars or 0.75 words. These numbers change for code, data, or other languages.
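The rule-of-thumb arithmetic behind that estimate, using the rough averages above:

```python
words_per_page = 500     # typical English page
words_per_token = 0.75   # rough average for English prose

tokens_per_page = words_per_page / words_per_token
print(f"~{tokens_per_page:.0f} tokens per page")  # ~667, inside the 650-750 range
```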

Does token count affect response quality?

Not directly. But if your prompt fills the context window, the response gets truncated. Keep prompts concise to leave room for the model’s answer.

References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). “Neural Machine Translation of Rare Words with Subword Units.” ACL 2016. arXiv:1508.07909
  2. OpenAI tiktoken library — github.com/openai/tiktoken
  3. Radford, A. et al. (2019). “Language Models are Unsupervised Multitask Learners” (GPT-2). OpenAI
  4. Hugging Face Tokenizer Course — huggingface.co/learn/llm-course/en/chapter6/5
  5. Raschka, S. (2025). “Implementing A BPE Tokenizer From Scratch.” sebastianraschka.com
  6. Kudo, T. & Richardson, J. (2018). “SentencePiece: A simple and language independent subword tokenizer.” EMNLP. arXiv:1808.06226
  7. OpenAI Tokenizer Tool — platform.openai.com/tokenizer
  8. Karpathy, A. (2024). “Let’s Build the GPT Tokenizer.” YouTube