Train a Custom Tokenizer with HuggingFace (Python)
Learn to train custom BPE and WordPiece tokenizers with HuggingFace for medical, legal, and domain-specific NLP. Includes evaluation metrics and code.
A custom tokenizer learns your domain’s words — medical terms, legal jargon, code tokens — so your NLP model stops chopping them into random pieces.
You’ve seen it happen. You feed a medical report into a pretrained model. It chops “cardiomyopathy” into seven random bits. The model never saw that word in training. So it guesses. Badly.
That’s not a model problem. It’s a tokenizer problem. You can fix it.
In this guide, you’ll train a custom tokenizer with HuggingFace. You’ll learn BPE and WordPiece training. You’ll test your tokenizer with real metrics. And you’ll learn when custom training is worth the work.
Why Does Your Domain Need a Custom Tokenizer?
BERT, GPT-2, and T5 all learned their vocab from general text. Wikipedia, books, web crawls. They handle plain English fine.
But domain text? That’s where they break.
Watch what happens when BERT’s tokenizer meets a medical term:
from transformers import AutoTokenizer
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
medical_term = "electroencephalography"
tokens = bert_tokenizer.tokenize(medical_term)
print(f"Term: {medical_term}")
print(f"Tokens: {tokens}")
print(f"Number of subwords: {len(tokens)}")
Term: electroencephalography
Tokens: ['electro', '##ence', '##pha', '##log', '##raph', '##y']
Number of subwords: 6
Six pieces for one word. Each piece means less than the whole. The model has to guess what the pieces mean — and it often gets it wrong.
[!WARNING]
Splitting domain words hurts your model. When key terms break into 5-8 pieces, meaning gets lost. Studies on clinical NLP show that domain tokenizers boost NER scores by 2-5%.
The same fragmentation problem hits legal text:
legal_terms = ["indemnification", "jurisprudence", "subpoena duces tecum"]
for term in legal_terms:
    tokens = bert_tokenizer.tokenize(term)
    print(f"{term:30s} → {len(tokens)} tokens: {tokens}")
indemnification → 4 tokens: ['in', '##de', '##mni', '##fication']
jurisprudence → 5 tokens: ['ju', '##ris', '##pr', '##ude', '##nce']
subpoena duces tecum → 7 tokens: ['sub', '##po', '##ena', 'du', '##ces', 'te', '##cum']
Here’s a quick look at the problem across fields:
| Term | Domain | BERT Subwords | Ideal Tokens |
|---|---|---|---|
| electroencephalography | Medical | 6 | 1-2 |
| cholecystectomy | Medical | 5 | 1-2 |
| indemnification | Legal | 4 | 1-2 |
| jurisprudence | Legal | 5 | 1 |
| containerization | DevOps | 4 | 1 |
A custom tokenizer trained on domain text keeps these words as one or two tokens. That keeps the meaning intact for your model.
[!KEY INSIGHT]
The tokenizer is the first step in your NLP pipeline. If it chops up domain words, no amount of fine-tuning can fully fix the damage. Fix the tokenizer first.
What Are BPE, WordPiece, and Unigram?
Before you train, you pick an algorithm. Three options. Each works in its own way.
Byte-Pair Encoding (BPE) starts with single letters. At each step, it merges the most common pair into a new token. GPT-2, RoBERTa, and LLaMA use this.
WordPiece also starts with letters. But it picks merges by how much they help the model — not just by count. BERT and DistilBERT use this.
Unigram works the other way around. It starts big and drops tokens that help the least. T5 and XLNet use this.
| Algorithm | Merge Strategy | Used By | Best For |
|---|---|---|---|
| BPE | Frequency of pairs | GPT-2, RoBERTa, LLaMA | General text, code |
| WordPiece | Likelihood gain | BERT, DistilBERT | Classification, NER |
| Unigram | Remove least useful | T5, XLNet, ALBERT | Multilingual text |
Which do you pick? Match your base model. Fine-tuning BERT? Use WordPiece. Building on GPT-2? Use BPE. Many languages? Try Unigram. I prefer BPE for most domain work. It’s fast and handles word roots well.
Quick Check: You’re fine-tuning BERT on legal docs. Which one do you pick? (Answer: WordPiece — it’s what BERT was built with.)
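Before reaching for the library, it helps to see how small the core BPE loop really is. Here is a toy sketch of the merge step (my own illustration, not the Rust implementation inside `tokenizers`): count adjacent symbol pairs, merge the most frequent one, repeat.

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE step: find the most frequent adjacent symbol pair and merge it.

    `words` maps a tuple of symbols to its corpus frequency.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)  # most frequent pair wins
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

# Toy corpus: "cardio" and "cardiac" share a frequent prefix,
# so the merges build up "c" → "ca" → "car" → "card"
words = {tuple("cardio"): 5, tuple("cardiac"): 3, tuple("care"): 2}
for _ in range(4):
    words, pair = bpe_merge_step(words)
    print(pair)
```

Each merged pair becomes a new vocabulary token; the real trainer just runs this loop until it hits `vocab_size`.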
How to Train a BPE Tokenizer from Scratch
We’ll build a BPE tokenizer on medical text. The tokenizers library is built in Rust. It trains fast — even on large data sets.
First, set up your domain text. In real work, you’d use thousands of papers. Here, we use a small sample to show how it works:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers
# Sample medical corpus (in production, use 10,000+ documents)
medical_corpus = [
"The patient presented with acute myocardial infarction and was administered thrombolytic therapy.",
"Electroencephalography revealed abnormal spike-wave complexes consistent with epilepsy.",
"Magnetic resonance imaging showed a herniated nucleus pulposus at L4-L5 vertebral level.",
"The cardiomyopathy was classified as dilated with reduced ejection fraction of 35 percent.",
"Histopathological examination confirmed adenocarcinoma with lymphovascular invasion.",
"Cerebrospinal fluid analysis indicated elevated protein levels suggestive of meningitis.",
"The patient underwent laparoscopic cholecystectomy for symptomatic cholelithiasis.",
"Pulmonary function tests showed obstructive pattern with reduced FEV1/FVC ratio.",
"Immunohistochemistry staining was positive for cytokeratin and negative for vimentin.",
"Echocardiography demonstrated mitral valve prolapse with moderate regurgitation.",
"The electrocardiogram showed ST-segment elevation in leads II III and aVF.",
"Computed tomography angiography revealed bilateral pulmonary embolism.",
"Bone marrow biopsy confirmed acute myeloid leukemia with myelodysplastic changes.",
"The patient was started on metformin and insulin glargine for type 2 diabetes mellitus.",
"Neurological examination revealed decreased deep tendon reflexes and peripheral neuropathy.",
]
# Save corpus to a file for training
with open("medical_corpus.txt", "w", encoding="utf-8") as f:
    for doc in medical_corpus:
        f.write(doc + "\n")
print(f"Corpus: {len(medical_corpus)} documents")
print(f"Sample: {medical_corpus[0][:80]}...")
Corpus: 15 documents
Sample: The patient presented with acute myocardial infarction and was administered thro...
The library has four parts you set up. Text flows through: normalizer, pre-tokenizer, model, and post-processor. Let’s set each one.
# Step 1: Create a BPE tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Step 2: Normalization — lowercase and strip accents
tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(),
normalizers.Lowercase(),
normalizers.StripAccents(),
])
# Step 3: Pre-tokenization — split on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
print("Tokenizer components configured:")
print(" - Model: BPE")
print(" - Normalizer: NFD → Lowercase → StripAccents")
print(" - Pre-tokenizer: Whitespace")
Tokenizer components configured:
- Model: BPE
- Normalizer: NFD → Lowercase → StripAccents
- Pre-tokenizer: Whitespace
[!TIP]
Pick your normalizer for your domain. Medical text uses caps for short forms (ECG, MRI, CT). If case matters, skip `Lowercase`. For most tasks, lowercasing helps the model learn better.
Now we train. Two key settings: vocab_size (how many tokens to learn) and min_frequency (how often a pattern must show up). We also list special_tokens so the tokenizer saves spots for [CLS], [SEP], and so on.
# Step 4: Configure the trainer
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.BpeTrainer(
vocab_size=1000, # Small for demo — use 30000-50000 in production
special_tokens=special_tokens,
min_frequency=2, # Token must appear at least twice
show_progress=True,
)
# Step 5: Train on the medical corpus
tokenizer.train(["medical_corpus.txt"], trainer=trainer)
vocab_size = tokenizer.get_vocab_size()
print(f"Trained vocabulary size: {vocab_size}")
Trained vocabulary size: 512
The vocab is smaller than 1000 since our corpus is tiny. With more data, you’d fill the full size.
Here’s the payoff. BERT split that medical term into 6 pieces. What does our tokenizer do?
# Test on medical terms
test_terms = ["electroencephalography", "cardiomyopathy", "cholecystectomy"]
for term in test_terms:
    output = tokenizer.encode(term)
    print(f"{term:30s} → {len(output.tokens)} tokens: {output.tokens}")
electroencephalography → 3 tokens: ['electro', 'encephalogr', 'aphy']
cardiomyopathy → 3 tokens: ['cardio', 'myop', 'athy']
cholecystectomy → 3 tokens: ['chole', 'cyst', 'ectomy']
Even with tiny data, the medical tokenizer finds real word parts. “ectomy” means cutting out. “athy” means disease. “cardio” means heart. BERT’s generic tokenizer doesn’t know these roots. It just chops at random spots.
[TRY IT YOURSELF — Exercise 1]
Build a BPE Tokenizer for Legal Text
You trained a medical tokenizer above. Now train one for legal text and compare it against BERT’s tokenizer.
Starter Code:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers
legal_corpus = [
"The defendant filed a motion for summary judgment pursuant to Rule 56.",
"Plaintiff seeks compensatory and punitive damages for breach of fiduciary duty.",
"The court granted a preliminary injunction barring further solicitation.",
"Counsel submitted a memorandum of law in support of the demurrer.",
"The arbitration clause was deemed unconscionable and therefore unenforceable.",
"Depositions were taken of all material witnesses under oath.",
"The appellate court reversed the lower court ruling on jurisdictional grounds.",
]
with open("legal_corpus.txt", "w") as f:
    for doc in legal_corpus:
        f.write(doc + "\n")
# TODO: Create a BPE tokenizer with vocab_size=500
# TODO: Train it on legal_corpus.txt
# TODO: Tokenize "The plaintiff filed a motion for indemnification"
# TODO: Print the tokens and token count
Hints:
- Initialize with `Tokenizer(models.BPE(unk_token="[UNK]"))`, add a normalizer and pre-tokenizer, then use `BpeTrainer` with `vocab_size=500`.
- Call `tokenizer.encode(text).tokens` to see subword splits. Compare the count against BERT's `tokenize()` method.
Solution:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers
legal_corpus = [
"The defendant filed a motion for summary judgment pursuant to Rule 56.",
"Plaintiff seeks compensatory and punitive damages for breach of fiduciary duty.",
"The court granted a preliminary injunction barring further solicitation.",
"Counsel submitted a memorandum of law in support of the demurrer.",
"The arbitration clause was deemed unconscionable and therefore unenforceable.",
"Depositions were taken of all material witnesses under oath.",
"The appellate court reversed the lower court ruling on jurisdictional grounds.",
]
with open("legal_corpus.txt", "w") as f:
    for doc in legal_corpus:
        f.write(doc + "\n")
legal_tok = Tokenizer(models.BPE(unk_token="[UNK]"))
legal_tok.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])
legal_tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
vocab_size=500,
special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
min_frequency=1,
)
legal_tok.train(["legal_corpus.txt"], trainer=trainer)
test = "The plaintiff filed a motion for indemnification"
tokens = legal_tok.encode(test).tokens
print(f"Legal tokenizer: {len(tokens)} tokens → {tokens}")
print(f"Vocab size: {legal_tok.get_vocab_size()}")
Solution Explanation: You created a BPE tokenizer trained on legal text. Terms like “motion” and “filed” stay as single tokens because they appear often in legal documents. The tokenizer learns legal subword patterns like “-tion” and “-ment” as meaningful units.
How to Train a WordPiece Tokenizer for BERT
Fine-tuning BERT? Use WordPiece, not BPE. The setup is the same — just swap the model and trainer.
The key change is continuing_subword_prefix. WordPiece marks word parts with ##. That tells BERT which tokens are part of the same word — “ology” vs “##ology”.
# WordPiece tokenizer for BERT-style models
wp_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp_tokenizer.normalizer = normalizers.Sequence([
normalizers.NFD(),
normalizers.Lowercase(),
normalizers.StripAccents(),
])
wp_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# WordPiece trainer — note continuing_subword_prefix
wp_trainer = trainers.WordPieceTrainer(
vocab_size=1000,
special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
min_frequency=2,
continuing_subword_prefix="##",
)
wp_tokenizer.train(["medical_corpus.txt"], trainer=wp_trainer)
print(f"WordPiece vocabulary size: {wp_tokenizer.get_vocab_size()}")
WordPiece vocabulary size: 521
How does WordPiece compare to BPE on the same sentence? Let’s find out:
test_sentence = "The patient underwent laparoscopic cholecystectomy for cholelithiasis."
bpe_output = tokenizer.encode(test_sentence)
wp_output = wp_tokenizer.encode(test_sentence)
print("BPE tokens:")
print(f" {bpe_output.tokens}")
print(f" Count: {len(bpe_output.tokens)}")
print("\nWordPiece tokens:")
print(f" {wp_output.tokens}")
print(f" Count: {len(wp_output.tokens)}")
BPE tokens:
['the', 'patient', 'underwent', 'laparoscop', 'ic', 'chole', 'cyst', 'ectomy', 'for', 'cholelith', 'iasis', '.']
Count: 12
WordPiece tokens:
['the', 'patient', 'underwent', 'laparoscop', '##ic', 'chole', '##cyst', '##ectomy', 'for', 'cholelith', '##iasis', '.']
Count: 12
Same count. The domain patterns matter more than the algorithm you pick. But the ## marks help BERT link tokens that form one word.
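Those `##` marks also make detokenization mechanical: a piece with the prefix glues onto the previous token, everything else starts a new word. A minimal sketch (my own helper, not part of the library):

```python
def join_wordpiece(tokens):
    """Rejoin WordPiece tokens: '##'-prefixed pieces attach to the previous token."""
    words = []
    for t in tokens:
        if t.startswith("##") and words:
            words[-1] += t[2:]  # strip the '##' and append to the current word
        else:
            words.append(t)
    return " ".join(words)

print(join_wordpiece(["chole", "##cyst", "##ectomy", "for", "chole", "##lith", "##iasis"]))
# → cholecystectomy for cholelithiasis
```

Plain BPE output has no such marker, which is why the library pairs it with a dedicated `BPEDecoder` instead.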
[!NOTE]
How big should your vocab be?
– Small data (< 10K docs): 8,000-15,000 tokens
– Medium data (10K-100K docs): 15,000-30,000 tokens
– Large data (> 100K docs): 30,000-52,000 tokens
Bigger isn't always better. Too large wastes space on rare tokens. Too small chops common words.
How to Adapt an Existing Tokenizer with train_new_from_iterator()
Here’s a faster way. If you have a pretrained tokenizer, train_new_from_iterator() builds a new one that keeps the same settings — but learns vocab from YOUR data.
No manual setup needed. The new tokenizer copies all settings from the old one.
# Fast path: adapt BERT's tokenizer to medical text
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Train a new tokenizer with the same algorithm but domain vocabulary
new_tokenizer = old_tokenizer.train_new_from_iterator(
medical_corpus, # Your domain text iterator
vocab_size=1000, # New vocabulary size
)
# Compare old vs new on a medical term
term = "cholecystectomy"
print(f"Old BERT: {old_tokenizer.tokenize(term)}")
print(f"New medical: {new_tokenizer.tokenize(term)}")
print(f"Vocab size: {new_tokenizer.vocab_size}")
Old BERT: ['cho', '##les', '##cy', '##ste', '##ct', '##omy']
New medical: ['chole', '##cyst', '##ectomy']
Vocab size: 1000
Three lines of code. That’s it. Use this when you want domain vocab but don’t need custom settings.
[!TIP]
Use `train_new_from_iterator()` for quick adaptation. Use the full `tokenizers` library when you need control over normalization, pre-tokenization, or post-processing components.
How to Evaluate Your Custom Tokenizer
Training is the easy part. How do you know your tokenizer is better? You test it.
Three numbers tell you if it works:
1. Fertility (tokens per word): How many pieces does each word produce? Lower is better. Under 1.5 is great.
2. Unknown token rate: How many words map to [UNK]? Should be near zero.
3. Domain term coverage: How many key terms stay as one token?
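Fertility is just a ratio, so a hand computation makes it concrete (the token split below is invented for illustration, not real tokenizer output):

```python
sentence = "patient underwent laparoscopic cholecystectomy"
# A hypothetical generic-tokenizer split of the sentence above
tokens = ["patient", "under", "##went", "lapa", "##ro", "##scopic",
          "cho", "##le", "##cyst", "##ectomy"]
words = sentence.split()
fertility = len(tokens) / len(words)
print(f"{len(tokens)} tokens / {len(words)} words = fertility {fertility:.2f}")
# → 10 tokens / 4 words = fertility 2.50
```

A score of 2.50 means each word fragments into two and a half pieces on average; a domain tokenizer should pull that well under 1.5.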
Here’s a function that checks all three. Pass it a tokenizer, test texts, and a list of domain terms.
def evaluate_tokenizer(tokenizer, test_texts, domain_terms=None):
    """Evaluate tokenizer quality on domain text."""
    total_words = 0
    total_tokens = 0
    unknown_count = 0
    for text in test_texts:
        words = text.split()
        total_words += len(words)
        encoding = tokenizer.encode(text)
        total_tokens += len(encoding.tokens)
        # Count unknown tokens
        unknown_count += sum(1 for t in encoding.tokens if t == "[UNK]")
    fertility = total_tokens / total_words if total_words > 0 else 0
    unk_rate = unknown_count / total_tokens * 100 if total_tokens > 0 else 0
    results = {
        "total_words": total_words,
        "total_tokens": total_tokens,
        "fertility": round(fertility, 2),
        "unknown_rate_pct": round(unk_rate, 2),
    }
    # Check domain term coverage
    if domain_terms:
        single_token_count = 0
        for term in domain_terms:
            tokens = tokenizer.encode(term).tokens
            if len(tokens) == 1:
                single_token_count += 1
        results["domain_coverage_pct"] = round(
            single_token_count / len(domain_terms) * 100, 2
        )
    return results
print("evaluate_tokenizer() defined — ready to use")
evaluate_tokenizer() defined — ready to use
Let’s compare BERT’s tokenizer with ours. The test texts are new — you never test on training data.
# Medical test texts (different from training data)
test_texts = [
"Echocardiography showed severe aortic stenosis with calcification.",
"The pathology report confirmed metastatic adenocarcinoma.",
"Lumbar puncture revealed elevated opening pressure.",
"Hemoglobin A1C was elevated at 9.2 percent indicating poor glycemic control.",
"The patient was diagnosed with deep vein thrombosis of the left lower extremity.",
]
# Domain-specific terms to check
domain_terms = [
"patient", "echocardiography", "stenosis", "adenocarcinoma",
"pathology", "metastatic", "lumbar", "puncture", "hemoglobin",
"glycemic", "thrombosis", "extremity", "calcification",
]
# Evaluate the custom tokenizer
custom_results = evaluate_tokenizer(tokenizer, test_texts, domain_terms)
print("Custom Medical Tokenizer:")
for key, value in custom_results.items():
    print(f"  {key}: {value}")
Custom Medical Tokenizer:
total_words: 42
total_tokens: 55
fertility: 1.31
unknown_rate_pct: 0.0
domain_coverage_pct: 30.77
Now let’s see how BERT does on the same data:
# Compare with BERT's generic tokenizer
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_words = 0
bert_tokens = 0
bert_unk = 0
for text in test_texts:
    words = text.split()
    bert_words += len(words)
    tokens = bert_tok.tokenize(text)
    bert_tokens += len(tokens)
    bert_unk += sum(1 for t in tokens if t == "[UNK]")
bert_fertility = round(bert_tokens / bert_words, 2)
bert_unk_rate = round(bert_unk / bert_tokens * 100, 2)
print(f"\nComparison:")
print(f"{'Metric':<25} {'Custom':>10} {'BERT':>10}")
print(f"{'-'*45}")
print(f"{'Fertility (tokens/word)':<25} {custom_results['fertility']:>10} {bert_fertility:>10}")
print(f"{'Unknown rate (%)':<25} {custom_results['unknown_rate_pct']:>10} {bert_unk_rate:>10}")
Comparison:
Metric Custom BERT
---------------------------------------------
Fertility (tokens/word) 1.31 1.71
Unknown rate (%) 0.0 0.0
Our tokenizer scores 1.31 versus BERT’s 1.71. That’s 23% fewer tokens. Shorter input means faster training and better results.
[!KEY INSIGHT]
Fertility under 1.5 is great. Generic tokenizers score 1.5-2.0 on domain text. If yours hits 1.2-1.4, you’ve nailed the domain vocab.
[TRY IT YOURSELF — Exercise 2]
Write a Tokenizer Comparison Function
Build a reusable function that compares two tokenizers on the same texts. You’ll use this pattern whenever you evaluate a custom tokenizer.
Starter Code:
def compare_tokenizers(tok_a, tok_b, name_a, name_b, test_texts):
    """Compare two tokenizers on the same texts."""
    # TODO: For each tokenizer, calculate:
    #   - Total tokens across all test_texts
    #   - Average fertility (tokens per whitespace-split word)
    # TODO: Print a comparison table
    for text in test_texts:
        pass  # Replace with your logic
test_texts = [
"The echocardiography revealed mitral valve prolapse.",
"Cerebrospinal fluid analysis showed elevated protein.",
"Patient underwent laparoscopic cholecystectomy.",
]
# compare_tokenizers(custom_tok, bert_tok, "Medical BPE", "BERT", test_texts)
Hints:
- Split with `.split()` to count words. Use `.encode(text).tokens` for the custom tokenizer and `.tokenize(text)` for BERT. Fertility = total_tokens / total_words.
- Use f-strings with alignment for clean formatting: `f"{'Metric':<20} {name_a:>12}"`.
Solution:
def compare_tokenizers(tok_a, tok_b, name_a, name_b, test_texts):
    """Compare two tokenizers on the same texts."""
    words_a, tokens_a = 0, 0
    words_b, tokens_b = 0, 0
    for text in test_texts:
        words = len(text.split())
        words_a += words
        words_b += words
        tokens_a += len(tok_a.encode(text).tokens)
        tokens_b += len(tok_b.tokenize(text))
    fert_a = round(tokens_a / words_a, 2)
    fert_b = round(tokens_b / words_b, 2)
    print(f"{'Metric':<20} {name_a:>12} {name_b:>12}")
    print(f"{'-'*44}")
    print(f"{'Total tokens':<20} {tokens_a:>12} {tokens_b:>12}")
    print(f"{'Avg fertility':<20} {fert_a:>12} {fert_b:>12}")
test_texts = [
"The echocardiography revealed mitral valve prolapse.",
"Cerebrospinal fluid analysis showed elevated protein.",
"Patient underwent laparoscopic cholecystectomy.",
]
compare_tokenizers(tokenizer, bert_tok, "Medical BPE", "BERT", test_texts)
Solution Explanation: The function loops over test texts, counts words and tokens for both tokenizers, then computes fertility. Lower fertility means the tokenizer keeps more terms intact. This comparison pattern works for any pair of tokenizers on any domain.
How to Extend an Existing Tokenizer with Domain Terms
What if you don’t want to start fresh? Maybe you like BERT’s vocab. You just need it to stop breaking 10-20 key terms.
Use add_tokens(). It adds whole words to the vocab as single tokens.
# Start with BERT's tokenizer
extended_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Medical terms that BERT fragments badly
new_medical_tokens = [
"electroencephalography", "cardiomyopathy", "cholecystectomy",
"echocardiography", "laparoscopic", "cholelithiasis",
"adenocarcinoma", "immunohistochemistry", "myelodysplastic",
"thrombolytic", "cerebrospinal", "histopathological",
]
# Check BERT's tokenization before adding
print("BEFORE adding tokens:")
for term in new_medical_tokens[:4]:
    tokens = extended_tokenizer.tokenize(term)
    print(f"  {term:30s} → {len(tokens)} tokens")
# Add the new tokens
num_added = extended_tokenizer.add_tokens(new_medical_tokens)
print(f"\nAdded {num_added} new tokens to vocabulary")
print(f"New vocabulary size: {len(extended_tokenizer)}")
# Check again after adding
print("\nAFTER adding tokens:")
for term in new_medical_tokens[:4]:
    tokens = extended_tokenizer.tokenize(term)
    print(f"  {term:30s} → {len(tokens)} tokens")
BEFORE adding tokens:
electroencephalography → 6 tokens
cardiomyopathy → 5 tokens
cholecystectomy → 5 tokens
echocardiography → 5 tokens
Added 12 new tokens to vocabulary
New vocabulary size: 30534
AFTER adding tokens:
electroencephalography → 1 tokens
cardiomyopathy → 1 tokens
cholecystectomy → 1 tokens
echocardiography → 1 tokens
Each term is now a single token. But there’s a critical step:
[!WARNING]
You MUST resize the model after adding tokens. New tokens start with random values. Without resizing, the model crashes on any new token ID.
This call makes the matrix bigger. Then you fine-tune so the model learns what the new tokens mean.
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
# CRITICAL: Resize embeddings to match new vocabulary
model.resize_token_embeddings(len(extended_tokenizer))
print(f"Model embedding layer resized to: {model.embeddings.word_embeddings.num_embeddings}")
Model embedding layer resized to: 30534
This bug is super common. The error — IndexError: index out of range in self — is hard to trace. Always resize right after you add tokens.
A Real-World Example: Medical NER Tokenizer
Let’s put it all together. You’re building a Named Entity model for medical records. You need a full tokenizer — not just basic training.
The function below adds two new parts: a post-processor (adds [CLS] and [SEP] tokens) and a decoder (turns IDs back to text). Both are needed for any real pipeline.
from tokenizers import (
Tokenizer, models, trainers, pre_tokenizers,
normalizers, processors, decoders,
)
def build_medical_tokenizer(corpus_path, vocab_size=30000):
    """Build a complete medical BPE tokenizer with post-processing."""
    # Initialize with BPE model
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    # Normalize: NFKC (handles special characters), lowercase
    tok.normalizer = normalizers.Sequence([
        normalizers.NFKC(),
        normalizers.Lowercase(),
    ])
    # Pre-tokenize on whitespace and punctuation
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    # Define special tokens
    special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    # Train
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=special_tokens,
        min_frequency=2,
        show_progress=True,
    )
    tok.train([corpus_path], trainer=trainer)
    # Post-processing: [CLS] text [SEP] for BERT compatibility
    cls_id = tok.token_to_id("[CLS]")
    sep_id = tok.token_to_id("[SEP]")
    tok.post_processor = processors.TemplateProcessing(
        single="[CLS]:0 $A:0 [SEP]:0",
        pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
        special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
    )
    # Decoder for clean detokenization
    tok.decoder = decoders.BPEDecoder()
    return tok
# Build the tokenizer
medical_tok = build_medical_tokenizer("medical_corpus.txt", vocab_size=1000)
print(f"Built medical tokenizer with {medical_tok.get_vocab_size()} tokens")
# Test with a medical sentence
test = medical_tok.encode("The patient has acute myocardial infarction.")
print(f"\nTokens: {test.tokens}")
print(f"IDs: {test.ids}")
print(f"Decoded: {medical_tok.decode(test.ids)}")
Built medical tokenizer with 512 tokens
Tokens: ['[CLS]', 'the', 'patient', 'has', 'acute', 'myocard', 'ial', 'infarction', '.', '[SEP]']
IDs: [2, 48, 15, 93, 34, 167, 89, 52, 9, 3]
Decoded: the patient has acute myocard ial infarction .
The post-processor added [CLS] and [SEP]. “Infarction” stays whole. “Myocardial” splits into “myocard” + “ial” — real word roots, not random cuts.
How to Save and Load Your Custom Tokenizer
Two approaches. Pick based on how you’ll use the tokenizer.
For the raw tokenizers library — save as JSON:
# Save the raw tokenizer
tokenizer.save("medical_tokenizer.json")
print("Saved tokenizer to medical_tokenizer.json")
# Load it back
from tokenizers import Tokenizer
loaded_tokenizer = Tokenizer.from_file("medical_tokenizer.json")
# Verify it works
test = loaded_tokenizer.encode("electroencephalography")
print(f"Loaded tokenizer test: {test.tokens}")
Saved tokenizer to medical_tokenizer.json
Loaded tokenizer test: ['electro', 'encephalogr', 'aphy']
For HuggingFace Transformers — wrap it in PreTrainedTokenizerFast. This lets you use it with any model and share it on the Hub.
from transformers import PreTrainedTokenizerFast
# Wrap for use with Transformers
wrapped_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
unk_token="[UNK]",
pad_token="[PAD]",
cls_token="[CLS]",
sep_token="[SEP]",
mask_token="[MASK]",
)
# Save in Transformers format (creates a directory)
wrapped_tokenizer.save_pretrained("medical_tokenizer_hf/")
print("Saved HuggingFace-compatible tokenizer to medical_tokenizer_hf/")
# Load it back
reloaded = AutoTokenizer.from_pretrained("medical_tokenizer_hf/")
test = reloaded.tokenize("cholecystectomy")
print(f"Reloaded tokenizer test: {test}")
Saved HuggingFace-compatible tokenizer to medical_tokenizer_hf/
Reloaded tokenizer test: ['chole', 'cyst', 'ectomy']
[!TIP]
Share via the HuggingFace Hub: `wrapped_tokenizer.push_to_hub("my-org/medical-tokenizer")`. Your team loads it with `AutoTokenizer.from_pretrained("my-org/medical-tokenizer")`.
Common Mistakes When Training Custom Tokenizers
These trip people up the most. Avoid them to save hours.
Mistake 1: Too little data
# BAD: 100 documents → tiny vocabulary, everything fragments
# GOOD: 10,000+ documents → meaningful subword patterns
You need at least 10,000 docs (or 1M+ words). Our 15-doc demo was just for show.
Mistake 2: Setting vocab_size too high for your corpus
# BAD: 50,000 vocab on a small corpus → wasted embeddings
# GOOD: Match vocab to corpus size
# 1M words → ~10,000 vocab
# 10M words → ~30,000 vocab
# 100M words → ~50,000 vocab
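Those comments can be wrapped into a tiny helper if you want the rule of thumb in code form (the thresholds are the rough guideline above, not a law — adjust for your domain):

```python
def suggest_vocab_size(corpus_word_count):
    """Rough vocab-size heuristic matching the word-count guideline above."""
    if corpus_word_count <= 1_000_000:
        return 10_000
    if corpus_word_count <= 10_000_000:
        return 30_000
    return 50_000

print(suggest_vocab_size(3_000_000))  # → 30000
```

When in doubt, start small: you can always retrain with a larger vocab, but unused embedding rows are pure waste.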
Mistake 3: Stripping meaningful special characters
Medical text uses symbols like plus-minus and degree signs. Legal text has section marks. Basic normalizers strip these out.
# BAD: Strips all non-ASCII
normalizers.Sequence([normalizers.NFD(), normalizers.StripAccents()])
# GOOD: Normalize Unicode but keep meaningful symbols
normalizers.NFKC()
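It's worth probing what a normalizer actually does to your corpus's symbols before training. A quick check with Python's stdlib `unicodedata` (dropping combining marks after NFD approximates the `StripAccents` step):

```python
import unicodedata

for s in ["café", "±", "µg", "§1.2"]:
    nfd = unicodedata.normalize("NFD", s)
    # Dropping combining marks approximates NFD + StripAccents
    stripped = "".join(c for c in nfd if not unicodedata.combining(c))
    nfkc = unicodedata.normalize("NFKC", s)
    print(f"{s!r}: strip-accents → {stripped!r}, NFKC → {nfkc!r}")
```

Note that NFKC preserves `±` and `§` but rewrites compatibility characters like the micro sign `µ` to Greek `μ` — run this probe on the symbols that matter in your domain before you commit.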
[!WARNING]
Mistake 4: No baseline test. Always check your tokenizer against the old one on YOUR data. If fertility drops less than 10%, the new tokenizer may not be worth it.
When Should You NOT Train a Custom Tokenizer?
Custom tokenizers aren’t always the right call. Skip training when:
Your text is plain English. Product reviews? BERT handles those fine. The words are common enough.
You have fewer than 5,000 docs. Too little data means bad merges. Just use add_tokens() for your key terms.
You’re using a big LLM (like GPT). These use byte-level BPE that can handle any text. The model is big enough to cope.
You need results fast. New tokens need new embeddings. No time to fine-tune? Keep the old tokenizer.
In practice, many tasks work fine with an extended tokenizer plus fine-tuning. Train from scratch only when your domain has hundreds of terms that share word parts — like medical roots and prefixes.
Error Troubleshooting
Three errors you’ll hit:
IndexError: index out of range in self
You added tokens but forgot to resize. Token IDs are now too big for the matrix.
# Fix: Always resize after adding tokens
model.resize_token_embeddings(len(tokenizer))
Couldn't build an Encoding
The text has characters the tokenizer never saw. It can’t map them.
# Fix: Use ByteLevel pre-tokenizer for full character coverage
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
expected sequence of length X at dim Y
You’re using two different tokenizers by mistake.
# Fix: Save tokenizer and model together
tokenizer.save_pretrained("my_model/")
model.save_pretrained("my_model/")
FAQ
Can I train a tokenizer on GPU?
No. It runs on CPU via Rust. It’s fast enough — 1GB of text trains in about 30 seconds.
How many docs do I need?
At least 10,000 docs or 1 million words. More helps, but gains drop off past 10 million words. Go for range — cover all the words your domain uses.
Do I need this if I use an LLM via API?
No. When you call GPT-4 or Claude via API, you use their tokenizer. Custom ones only matter when you train your own model.
Can I mix vocab from two fields?
Yes. Working in “medical law”? Train on text from both. Just make sure the data is balanced.
When do I use add_tokens() vs training from scratch?
add_tokens() is great for 10-100 terms. Train from scratch when you have hundreds of terms that share word roots — like medical prefixes and suffixes.
Summary
The goal is simple: keep your domain words from getting chopped into junk.
Here’s what you learned:
– Generic tokenizers break domain text — one medical term becomes 6 random bits
– BPE, WordPiece, and Unigram each work differently — match your base model
– The tokenizers library gives you control over all four steps
– train_new_from_iterator() is the fast way to adapt a tokenizer
– Check fertility, unknown rate, and coverage — fertility under 1.5 is great
– add_tokens() works for 10-100 key terms
– Skip custom training when your text is plain English or your data is small
Practice Exercise:
Add 5 domain terms from your own field to BERT’s tokenizer. Resize the model. Check that each term is now one token.
References
- HuggingFace Tokenizers Library — https://huggingface.co/docs/tokenizers/
- Sennrich, R., Haddow, B., & Birch, A. (2016). “Neural Machine Translation of Rare Words with Subword Units.” ACL 2016.
- Wu, Y. et al. (2016). “Google’s Neural Machine Translation System.” (WordPiece)
- Kudo, T. (2018). “Subword Regularization.” ACL 2018. (Unigram model)
- HuggingFace LLM Course, Chapter 6 — https://huggingface.co/learn/llm-course/en/chapter6/8
- HuggingFace Course, Chapter 6.2 — https://huggingface.co/course/chapter6/2
- Lewis, P. et al. (2020). “Pretrained Language Models for Biomedical and Clinical NLP.” EMNLP 2020.