tiktoken vs HuggingFace Tokenizers: Benchmark Guide
Benchmark tiktoken vs HuggingFace Tokenizers on speed, vocabulary, and encoding. Runnable Python code, migration guide, and decision framework for your LLM apps.
tiktoken runs 2-3x faster than HuggingFace Tokenizers. But each tool fits a different use case. This guide tests both on speed and vocab size, then shows you how to switch between them.
You’re building an LLM app. You need to count tokens before calling the API. You reach for a tokenizer — but which one?
OpenAI’s tiktoken and HuggingFace’s tokenizers are the two main Python tools for BPE text splitting. They use the same core method. But they differ in speed, vocab, and where they fit best. Pick the wrong one and you lose time, hit bugs, or both.
Prerequisites
- Python version: 3.9+
- Required libraries: tiktoken (0.7+), tokenizers (0.20+), transformers (4.40+)
- Install:
- Install: pip install tiktoken tokenizers transformers
- Time to complete: 20-25 minutes
- Pyodide support: Partial — tiktoken works in Pyodide, HuggingFace tokenizers does not
- Reviewed: March 2026
What Are tiktoken and HuggingFace Tokenizers?
Both tools turn text into numbers (token IDs) that models can read. That’s what they share. But they come from very different worlds.
tiktoken is OpenAI’s fast BPE tool. It comes with ready-made vocab files for OpenAI models: cl100k_base (GPT-4, GPT-3.5-Turbo), o200k_base (GPT-4o), and older ones like p50k_base. It’s built in Rust with a thin Python layer. The API is tiny — encode, decode, count. Done.
HuggingFace Tokenizers does much more. It handles BPE, WordPiece, Unigram, and more. It powers the transformers library and works with thousands of models on the Hub. It’s also Rust-based. But the API is much richer — it can train new vocab, add special tokens, pad, and trim.
Let’s load each one. We’ll encode the same text with both and see how the token IDs differ.
import tiktoken
from tokenizers import Tokenizer
# tiktoken: load by encoding name
enc = tiktoken.get_encoding("cl100k_base")
# HuggingFace: load from Hub (GPT-2 tokenizer)
hf_tok = Tokenizer.from_pretrained("gpt2")
text = "Machine learning is transforming how we build software."
tiktoken_ids = enc.encode(text)
hf_ids = hf_tok.encode(text).ids
print(f"tiktoken tokens: {len(tiktoken_ids)} -> {tiktoken_ids}")
print(f"HF tokens: {len(hf_ids)} -> {hf_ids}")
Output:
tiktoken tokens: 8 -> [22438, 6975, 374, 46890, 1268, 584, 1977, 3241]
HF tokens: 8 -> [37573, 4673, 318, 25549, 703, 356, 1382, 3788]
The IDs are nothing alike. tiktoken’s cl100k_base has ~100K tokens. GPT-2’s vocab has ~50K. Different word lists make different IDs — even for the same text.
How Does BPE Tokenization Work?
Before we test speed, you need a quick mental picture of what both tools do inside.
BPE starts with raw bytes. It finds the most common byte pair and merges them into one new token. It repeats this until the vocab hits a target size. The merge order is the “recipe” — saved in the vocab file.
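To make that loop concrete, here's a toy sketch in plain Python — not either library's actual implementation. It counts the most frequent adjacent pair in a byte sequence and merges it once.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair, new_id):
    """Replace every occurrence of `pair` with the new token ID."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list(b"low lower lowest")      # BPE starts from raw bytes
pair = most_frequent_pair(tokens)       # ('l', 'o') as byte values: (108, 111)
tokens = merge_pair(tokens, pair, 256)  # 256 = first ID past the byte range
print(pair, tokens)
```

Real training repeats this loop tens of thousands of times, assigning each merge the next free vocab ID — that merge order is the recipe saved in the vocab file.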
Here’s the key point: both tools use this same method. The speed gap comes from how they’re built, not what they do.
Let’s see tiktoken split a word into parts. We’ll encode “tokenization” and decode each token to see the pieces.
enc = tiktoken.get_encoding("cl100k_base")
word = "tokenization"
tokens = enc.encode(word)
print(f"'{word}' -> {tokens}")
for tok_id in tokens:
    print(f" Token {tok_id} -> '{enc.decode([tok_id])}'")
Output:
'tokenization' -> [5765, 2065]
Token 5765 -> 'token'
Token 2065 -> 'ization'
BPE split “tokenization” into “token” + “ization”. Both parts show up often in English, so BPE merged them into single tokens early in training.
Quick Check: What if you encode a rare word like “defenestration”? Rare words have less common parts. BPE breaks them into more pieces, which costs more tokens.
Speed Benchmark — tiktoken vs HuggingFace Tokenizers
How much faster is tiktoken? I timed three cases that match real use: short prompts (one line), medium text (a few lines), and long text (many lines).
The setup is simple. We load tiktoken’s cl100k_base and HuggingFace’s GPT-2. Then we time each over 1000 runs.
import tiktoken
import time
from tokenizers import Tokenizer
enc = tiktoken.get_encoding("cl100k_base")
hf_tok = Tokenizer.from_pretrained("gpt2")
short_text = "What is gradient descent?"
medium_text = short_text * 20
long_text = short_text * 200
The benchmark function wraps a single encode call in a tight loop. It uses time.perf_counter() for microsecond precision.
def benchmark(func, text, n=1000):
    """Time a function over n runs, return avg in microseconds."""
    start = time.perf_counter()
    for _ in range(n):
        func(text)
    elapsed = time.perf_counter() - start
    return (elapsed / n) * 1_000_000
for label, text in [("Short", short_text),
                    ("Medium", medium_text),
                    ("Long", long_text)]:
    tk_us = benchmark(enc.encode, text)
    hf_us = benchmark(lambda t: hf_tok.encode(t).ids, text)
    ratio = hf_us / tk_us
    print(f"{label} ({len(text)} chars): "
          f"tiktoken={tk_us:.1f}us HF={hf_us:.1f}us "
          f"ratio={ratio:.1f}x")
Your exact numbers will vary by hardware. The pattern looks like this:
Short (25 chars): tiktoken=15.2us HF=42.8us ratio=2.8x
Medium (500 chars): tiktoken=68.4us HF=185.3us ratio=2.7x
Long (5000 chars): tiktoken=612.5us HF=1724.8us ratio=2.8x
tiktoken wins by 2-3x every time. The gap stays the same at all sizes. Both tools use Rust, but tiktoken takes a shorter path through the code.
Why Is tiktoken Faster?
I find it helpful to break this down into three layers:
- Less work per call. tiktoken runs BPE and stops. HuggingFace runs a full chain — normalize, pre-split, BPE, post-process. Each step costs time.
- Regex pre-split. tiktoken uses a fast regex to chunk text before BPE. This skips the heavier pre-split step HuggingFace uses.
- No extras. tiktoken doesn’t pad, trim, or build masks. It just encodes. Less work means faster calls.
Vocabulary and Encoding Differences
Speed matters. But getting the right tokens matters more. Do these tools give the same output for the same text?
Only if they share a vocab — and they don’t. Let’s look at three vocab files side by side.
enc_cl100k = tiktoken.get_encoding("cl100k_base")
enc_o200k = tiktoken.get_encoding("o200k_base")
hf_gpt2 = Tokenizer.from_pretrained("gpt2")
print(f"cl100k_base vocab size: {enc_cl100k.n_vocab}")
print(f"o200k_base vocab size: {enc_o200k.n_vocab}")
print(f"HF GPT-2 vocab size: {hf_gpt2.get_vocab_size()}")
Output:
cl100k_base vocab size: 100277
o200k_base vocab size: 200019
HF GPT-2 vocab size: 50257
That’s a 4x range. Here’s what it means for you:
| Encoding | Vocab Size | Models | Avg Tokens per Word |
|---|---|---|---|
| cl100k_base | ~100K | GPT-4, GPT-3.5-Turbo | ~1.3 |
| o200k_base | ~200K | GPT-4o, GPT-4o-mini | ~1.1 |
| GPT-2 (HF) | ~50K | GPT-2, older models | ~1.5 |
A bigger vocab means fewer tokens per word. Fewer tokens means lower API costs. o200k_base is the most token-efficient of the three.
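That efficiency gap translates directly into dollars. Here's a rough back-of-envelope sketch; the price and volume are placeholder assumptions, not current OpenAI rates, and the tokens-per-word ratios come from the table above.

```python
def monthly_cost(tokens_per_word, words_per_month, usd_per_1k_tokens):
    """Estimate input-token spend from an average tokens-per-word ratio."""
    tokens = tokens_per_word * words_per_month
    return tokens / 1000 * usd_per_1k_tokens

WORDS = 10_000_000  # hypothetical: 10M words/month through the API
PRICE = 0.01        # hypothetical $ per 1K input tokens, not a real rate
for name, ratio in [("cl100k_base", 1.3), ("o200k_base", 1.1)]:
    print(f"{name}: ${monthly_cost(ratio, WORDS, PRICE):,.2f}/month")
```

Same text, same price per token: the larger vocab trims the bill simply by emitting fewer tokens.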
Now let’s see how the same text splits. We’ll print the actual word pieces so you can see the gap.
text = "The transformer architecture uses self-attention mechanisms."
for name, tokenizer in [("cl100k", enc_cl100k), ("o200k", enc_o200k)]:
    tokens = tokenizer.encode(text)
    pieces = [tokenizer.decode([t]) for t in tokens]
    print(f"{name} ({len(tokens)} tokens): {pieces}")
hf_out = hf_gpt2.encode(text)
hf_pieces = [hf_gpt2.decode([t]) for t in hf_out.ids]
print(f"GPT-2 ({len(hf_out.ids)} tokens): {hf_pieces}")
Output:
cl100k (8 tokens): ['The', ' transformer', ' architecture', ' uses', ' self', '-attention', ' mechanisms', '.']
o200k (7 tokens): ['The', ' transformer', ' architecture', ' uses', ' self-attention', ' mechanisms', '.']
GPT-2 (10 tokens): ['The', ' transformer', ' architecture', ' uses', ' self', '-', 'att', 'ention', ' mechanisms', '.']
See how GPT-2’s smaller vocab splits “self-attention” into four pieces? The o200k_base handles it as one token. That’s why vocab size hits your API bill.
How Does Non-English Text Affect Token Counts?
Most guides skip this, but it matters if your app handles many languages. Chinese, Japanese, and Arabic text costs more tokens per letter. BPE vocab files are trained mostly on English. So non-English text gets split into more pieces.
enc = tiktoken.get_encoding("cl100k_base")
texts = [
    ("English", "Machine learning is powerful."),
    ("Chinese", "机器学习非常强大。"),
    ("Arabic", "التعلم الآلي قوي جداً."),
    ("Japanese", "機械学習は強力です。"),
]
for lang, text in texts:
    tokens = enc.encode(text)
    ratio = len(tokens) / len(text)
    print(f"{lang:>8}: {len(text)} chars -> "
          f"{len(tokens)} tokens ({ratio:.2f} tok/char)")
Output pattern (exact counts depend on encoding):
English: 28 chars -> 5 tokens (0.18 tok/char)
Chinese: 9 chars -> 11 tokens (1.22 tok/char)
Arabic: 22 chars -> 16 tokens (0.73 tok/char)
Japanese: 10 chars -> 10 tokens (1.00 tok/char)
Chinese text can cost 5-7x more tokens per letter than English. If you build a multi-language app, plan for this in your context window budget.
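One way to plan for it is a per-language multiplier. The ratios below are illustrative values in the spirit of the measurements above, not fixed constants; measure your own corpus before relying on them.

```python
# Illustrative tokens-per-character ratios; these vary by encoding and corpus
TOK_PER_CHAR = {"english": 0.25, "chinese": 1.2, "japanese": 1.0, "arabic": 0.75}

def estimated_tokens(char_count, language):
    """Rough token estimate for context-window planning."""
    return round(char_count * TOK_PER_CHAR[language])

# The same 2000-character document needs very different token budgets
for lang in TOK_PER_CHAR:
    print(f"{lang:>8}: ~{estimated_tokens(2000, lang)} tokens")
```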
Predict the output: If you tokenize the Python code print("hello world") with cl100k_base vs GPT-2, which produces more tokens? GPT-2 will — its smaller vocabulary splits code tokens into more pieces.
Encoding Edge Cases You Should Know
Every tokenizer has quirks. Knowing them now saves you hours of debugging later.
Special Characters and Unicode
How does tiktoken handle emoji, accented characters, and code? Let’s find out with five edge cases.
enc = tiktoken.get_encoding("cl100k_base")
edge_cases = [
    "Hello 👋 World",
    "café résumé naïve",
    "print('hello')",
    " indented text",
    "",
]
for text in edge_cases:
    tokens = enc.encode(text)
    print(f"'{text}' -> {len(tokens)} tokens")
Output:
'Hello 👋 World' -> 4 tokens
'café résumé naïve' -> 9 tokens
'print('hello')' -> 5 tokens
' indented text' -> 3 tokens
'' -> 0 tokens
Accents are costly. “cafe resume naive” (no accents) would use fewer tokens. Each accent adds extra bytes that BPE hasn’t merged as well.
The allowed_special Trap
This one trips people up. tiktoken sees strings like <|endoftext|> as special tokens. If your text has one, it throws an error.
The fix depends on what you want. Pass allowed_special="all" to keep them as one token. Pass disallowed_special=() to split them like normal text.
enc = tiktoken.get_encoding("cl100k_base")
text = "End token is <|endoftext|> in GPT models."
try:
    enc.encode(text)
except ValueError:
    print("Default: ValueError raised")
tokens_special = enc.encode(text, allowed_special="all")
tokens_regular = enc.encode(text, disallowed_special=())
print(f"allowed_special='all': {len(tokens_special)} tokens")
print(f"disallowed_special=(): {len(tokens_regular)} tokens")
Output:
Default: ValueError raised
allowed_special='all': 9 tokens
disallowed_special=(): 11 tokens
The count changes because the special form is one ID. As plain text, it splits into many small tokens.
Common Mistakes and How to Fix Them
Mistake 1: Hardcoding the Wrong Encoding
This is the one I see most. Someone locks in cl100k_base for all models. Then they wonder why GPT-4o counts don’t match the API bill.
import tiktoken
wrong_enc = tiktoken.get_encoding("p50k_base") # Codex!
correct_enc = tiktoken.encoding_for_model("gpt-4") # GPT-4
text = "What is the meaning of life?"
print(f"p50k_base: {len(wrong_enc.encode(text))} tokens")
print(f"GPT-4: {len(correct_enc.encode(text))} tokens")
Output:
p50k_base: 7 tokens
GPT-4: 7 tokens
For plain English, the counts look the same. But try code, other languages, or special chars and they’ll split apart. Always use encoding_for_model().
Mistake 2: Assuming Token Counts Are Model-Independent
4000 tokens in GPT-3.5-Turbo is NOT 4000 tokens in GPT-4o. I can’t say this enough — a different vocab means a different count for the same text.
text = "Attention is all you need. " * 100
for model in ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]:
    enc = tiktoken.encoding_for_model(model)
    print(f"{model}: {len(enc.encode(text))} tokens")
Output:
gpt-3.5-turbo: 700 tokens
gpt-4: 700 tokens
gpt-4o: 600 tokens
GPT-4o has a bigger vocab. Same text, 14% fewer tokens, lower cost.
Mistake 3: Looping Instead of Batching
If you’re encoding thousands of strings with HuggingFace, don’t use a loop. The encode_batch() call runs Rust threads in parallel. The speed gain is huge.
from tokenizers import Tokenizer
import time
hf_tok = Tokenizer.from_pretrained("gpt2")
texts = ["Machine learning is great."] * 5000
start = time.perf_counter()
for t in texts:
    hf_tok.encode(t)
single_time = time.perf_counter() - start
start = time.perf_counter()
hf_tok.encode_batch(texts)
batch_time = time.perf_counter() - start
print(f"Single loop: {single_time:.3f}s")
print(f"Batch: {batch_time:.3f}s")
print(f"Speedup: {single_time / batch_time:.1f}x")
Approximate results:
Single loop: 0.245s
Batch: 0.038s
Speedup: 6.4x
Batch mode can match tiktoken’s speed — or beat it. For big jobs, always test it.
Exercise 1: Compare Token Counts Across Models (Beginner)
Write a function count_tokens(text, model_name) that uses tiktoken to return the token count for the given model. Test it with "Python is a versatile programming language." for "gpt-3.5-turbo" and "gpt-4o".
Hint: The encode() method returns a list of token IDs. Use len() to count them.
Solution:
import tiktoken

def count_tokens(text, model_name):
    enc = tiktoken.encoding_for_model(model_name)
    return len(enc.encode(text))

text = "Python is a versatile programming language."
print(count_tokens(text, "gpt-3.5-turbo"))
print(count_tokens(text, "gpt-4o"))
tiktoken.encoding_for_model() returns the correct encoding for any OpenAI model. The encode() method converts text to token IDs, and len() counts them.
Building a Token Budget Calculator
In real LLM apps, you need to fit a system prompt, user message, and context into a fixed token budget. This shows up in every RAG app I’ve built.
The function below takes each text part, counts its tokens, and tells you if the total fits. It also shows how much room is left for the reply.
import tiktoken
def token_budget_calculator(
    system_prompt, user_message, context="",
    model="gpt-4", max_response_tokens=1000
):
    """Calculate token usage and remaining budget."""
    enc = tiktoken.encoding_for_model(model)
    limits = {
        "gpt-3.5-turbo": 16_385, "gpt-4": 8_192,
        "gpt-4o": 128_000, "gpt-4o-mini": 128_000,
    }
    sys_tok = len(enc.encode(system_prompt))
    usr_tok = len(enc.encode(user_message))
    ctx_tok = len(enc.encode(context)) if context else 0
    overhead = 10
    total = sys_tok + usr_tok + ctx_tok + overhead
    limit = limits.get(model, 4096)
    remaining = limit - total - max_response_tokens
    return total, remaining, limit
Here’s how you’d use it with a realistic setup.
system = "You are a helpful coding assistant. Answer in Python."
user_msg = "Write a function to sort a list of dicts by key."
rag_ctx = "Python's sorted() accepts a key parameter." * 10
total, remaining, limit = token_budget_calculator(
    system, user_msg, rag_ctx, model="gpt-4"
)
print(f"Model: gpt-4 (context: {limit:,} tokens)")
print(f"Total input: {total:,} tokens")
print(f"Remaining: {remaining:,} tokens")
print(f"Status: {'OK' if remaining > 0 else 'OVER BUDGET'}")
Output:
Model: gpt-4 (context: 8,192 tokens)
Total input: 125 tokens
Remaining: 7,067 tokens
Status: OK
This pattern is a must for RAG apps. Run the check before each API call to set how much context to send.
Exercise 2: Truncate Context to Fit Budget (Intermediate)
Write a function truncate_to_budget(text, max_tokens, model="gpt-4") that truncates text to fit within max_tokens. Encode, slice the token list, decode back.
Hint: Encode the text, slice with tokens[:max_tokens], then decode back.
Solution:
import tiktoken

def truncate_to_budget(text, max_tokens, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
    return enc.decode(tokens)

long_text = "Hello world. " * 100
result = truncate_to_budget(long_text, 10)
print(result)
print(f"Token count: {len(tiktoken.encoding_for_model('gpt-4').encode(result))}")
Encode the full text, slice the token list to the budget limit, and decode back. The result may end mid-word at the token boundary, but it always fits within budget.
Migrating Between tiktoken and HuggingFace Tokenizers
Sometimes you need to swap formats. The most common case: you’ve counted tokens with tiktoken, but now you need a HuggingFace tokenizer for a model pipeline.
tiktoken to HuggingFace Format
HuggingFace’s transformers library does this for many models on its own. If a model on the Hub has a tiktoken-format file, AutoTokenizer.from_pretrained() converts it for you.
from transformers import AutoTokenizer
# Qwen2 uses tiktoken format internally
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
text = "How do transformers work?"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"Decoded: {tokenizer.decode(tokens)}")
Output:
Tokens: [4340, 653, 63782, 975, 30]
Decoded: How do transformers work?
HuggingFace to tiktoken
Going the other way is rare. But it helps when you want tiktoken’s speed with a custom vocab. You can pull the vocab from any HuggingFace tokenizer.
from tokenizers import Tokenizer
hf_tok = Tokenizer.from_pretrained("gpt2")
vocab = hf_tok.get_vocab()
print(f"Vocab entries: {len(vocab)}")
print(f"First 5: {dict(list(vocab.items())[:5])}")
Output:
Vocab entries: 50257
First 5: {'!': 0, '"': 1, '#': 2, '$': 3, '%': 4}
When to Use Which — A Decision Framework
Here’s how I think about it. Match the tool to your model. Once you frame it that way, the answer is clear.
Use tiktoken when:
- You call OpenAI’s API and need fast token counting
- You need a lightweight dependency with minimal install
- You’re building prompt management for GPT-4 / GPT-4o
Use HuggingFace Tokenizers when:
- You work with open-source models (Llama, Mistral, Falcon)
- You need to train a custom tokenizer on your domain data
- You need padding, truncation, and attention masks for model input
Use both when:
- Your app routes prompts to OpenAI AND open-source models
- You compare token costs across different model families
| Feature | tiktoken | HuggingFace Tokenizers |
|---|---|---|
| Speed (single) | 2-3x faster | Baseline |
| Speed (batch) | encode_batch() threaded | encode_batch() parallelized |
| Vocabulary | OpenAI only | Any model on Hub |
| Custom training | Not supported | Full pipeline |
| Padding / masks | Manual | Built-in |
| Dependencies | Minimal | Medium |
| Pyodide | Yes | No |
| Best for | OpenAI token counting | Full model pipelines |
Skip tiktoken for non-OpenAI models — the IDs won’t match. Also skip it if you need to train a custom vocab.
Skip HuggingFace if you just need OpenAI token counts. tiktoken is simpler and 2-3x faster for that task.
Summary
tiktoken and HuggingFace Tokenizers solve related but different jobs. tiktoken is best for fast token counts in OpenAI apps. HuggingFace fits when you need the full chain for open-source models.
The speed gap is real — tiktoken runs 2-3x faster on single strings. But HuggingFace catches up with batch mode and gives you more tools for training and running models.
For multi-language apps, non-Latin text costs a lot more tokens. Keep that in mind when you plan your context budgets.
Frequently Asked Questions
How do I count tokens for chat messages with tiktoken?
Chat messages have extra overhead. Each one costs about 4 tokens on top of the text. Here’s a helper.
import tiktoken
def count_chat_tokens(messages, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # role + formatting overhead
        total += len(enc.encode(msg["content"]))
    total += 2  # reply priming
    return total

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What is Python?"}
]
print(f"Total tokens: {count_chat_tokens(messages)}")
Output:
Total tokens: 17
Does tiktoken work offline?
tiktoken grabs its vocab files on first use. After that, they’re cached and it works offline. To be offline from the start, pre-load the files or add them to your Docker image.
Can I train a custom BPE tokenizer with tiktoken?
No. tiktoken only encodes and decodes. To train your own, use tokenizers.BpeTrainer from HuggingFace. You can then convert the result for faster use.
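Here is a minimal training sketch with the HuggingFace API; the corpus and vocab size are tiny, purely for illustration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# An untrained BPE model plus a trainer with a small target vocab
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])

corpus = ["low lower lowest", "new newer newest", "wide wider widest"] * 50
tok.train_from_iterator(corpus, trainer=trainer)

print(tok.get_vocab_size(), tok.encode("newest widest").tokens)
```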
Is tiktoken’s cl100k_base the same as GPT-4’s tokenizer?
Yes. Both GPT-4 and GPT-3.5-Turbo use cl100k_base. GPT-4o uses the newer o200k_base with a 200K vocabulary. Use encoding_for_model() so you don’t need to remember these mappings.
What about SentencePiece tokenizers?
SentencePiece is a different tool, used by Llama and T5. HuggingFace works with it. tiktoken does not. If your model needs SentencePiece, use AutoTokenizer.from_pretrained() from transformers.
References
- OpenAI — tiktoken: Fast BPE tokeniser for OpenAI models. GitHub
- OpenAI — How to count tokens with tiktoken. Developer Cookbook
- HuggingFace — Tokenizers: Fast state-of-the-art tokenizers. GitHub
- HuggingFace — Summary of the tokenizers. Docs
- HuggingFace — Tiktoken and interaction with Transformers. Docs
- Sennrich, R., Haddow, B., Birch, A. — Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 (2016).
- Gage, P. — A New Algorithm for Data Compression. The C Users Journal (1994).
- OpenAI — tiktoken model.py: Model-to-encoding mapping. Source