LLM Context Windows Explained: Token Budget Guide
Learn how LLM context windows work, count tokens with tiktoken, estimate API costs, and build a Python token budget manager that allocates context across system prompts, examples, and input.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
Count tokens, estimate costs, and split your context window across system prompts, examples, and user input — with a reusable budget planner.
You send a long prompt to GPT-4o. The response comes back missing half your instructions. No error. No warning. The model just… forgot the middle of your prompt.
This isn’t a bug. It’s a context window problem. Every LLM has a hard limit on how many tokens it can process at once. Go over that limit and the API either rejects your request or drops information silently. Cost is the other blind spot: a 10K-token prompt with a 1K-token response costs about $0.035 on GPT-4o but roughly $0.002 on GPT-4o-mini, and most developers never check.
Understanding context windows — and managing token budgets — is one of the most practical skills you can build.
What Is a Context Window?
A context window is the total number of tokens an LLM can handle in one request. It covers everything: system prompt, few-shot examples, user input, conversation history, and the model’s response.
Think of it as a desk. You can only spread so many papers before things fall off the edges. The desk size varies by model.
Here’s what the major models offer today:
| Model | Context Window | Max Output | Input $/1M | Output $/1M |
|---|---|---|---|---|
| GPT-4o | 128,000 | 16,384 | $2.50 | $10.00 |
| GPT-4o-mini | 128,000 | 16,384 | $0.15 | $0.60 |
| Claude 3.5 Sonnet | 200,000 | 8,192 | $3.00 | $15.00 |
| Claude Opus 4 | 200,000 | 32,000 | $15.00 | $75.00 |
| Gemini 2.0 Flash | 1,048,576 | 8,192 | $0.10 | $0.40 |
| Gemini 2.5 Pro | 1,048,576 | 65,536 | $1.25 | $10.00 |
Those numbers look generous. A million tokens sounds infinite. But you burn through tokens fast — especially when system prompts, few-shot examples, and conversation history stack up.
What Are Tokens?
Before you can manage a budget, you need to understand the currency. Tokens aren’t words. They’re chunks of text that the model’s tokenizer creates.
The word “tokenization” splits into two tokens: “token” and “ization”. Short common words like “the” or “is” are single tokens. Unusual words get split into smaller pieces.
A rough guide:
- 1 token is about 4 characters in English
- 100 tokens is about 75 words
- 1,000 tokens is roughly a page of text
But these are estimates. The actual count depends on each model’s tokenizer. That’s why we need a library to count precisely.
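Until a real tokenizer is loaded, those rules of thumb can be wrapped in a quick estimator. This is a rough heuristic of my own, not part of any library, and it can easily be off by 20% or more on code or non-English text:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

sentence = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(sentence))  # 44 characters -> 11 estimated tokens
```

Use it for ballpark budget checks only; the precise counts below come from tiktoken.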
Prerequisites
- Python version: 3.9+
- Required library: tiktoken (0.5+)
- Install: pip install tiktoken
- Time to complete: 20-25 minutes
- Pyodide note: tiktoken uses Rust extensions. In a browser environment, install via micropip.install("tiktoken"). If that fails, use the pure-Python token estimation fallback shown later in this article.
Let’s load the tokenizer for GPT-4o and see how it breaks text into tokens. The encoding_for_model function picks the right tokenizer automatically. We’ll encode a sentence, print the token IDs, then decode each one back to text.
import micropip
await micropip.install('tiktoken')
import tiktoken
# Load the tokenizer for GPT-4o
encoder = tiktoken.encoding_for_model("gpt-4o")
# Tokenize a sentence
text = "Context windows are shared between input and output tokens."
tokens = encoder.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[encoder.decode([t]) for t in tokens]}")
Output:
Text: Context windows are shared between input and output tokens.
Token count: 10
Tokens: [2014, 11, 527, 6222, 1948, 1946, 323, 2550, 11, 13]
Decoded: ['Context', ' windows', ' are', ' shared', ' between', ' input', ' and', ' output', ' tokens', '.']
Notice the spaces? They attach to the following word. The tokenizer doesn’t treat spaces as separate tokens. This is how byte-pair encoding (BPE) works under the hood.
[UNDER THE HOOD]
How BPE tokenization works: BPE starts with single characters. It then merges the most common pair, over and over. After training, “th” + “e” become “the” — one token. Common words stay whole. Rare words get split into pieces. GPT-4o’s o200k_base vocabulary has 200,000 tokens. A bigger vocabulary means fewer tokens per text. That’s why newer models use larger vocabularies — more fits in your context window.
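To make the merge step concrete, here’s a toy sketch of a single BPE training iteration. It is illustrative only: real tokenizer training runs thousands of merges over huge corpora and respects byte and word boundaries in ways this ignores.

```python
from collections import Counter

def most_common_pair(tokens):
    """Find the most frequent adjacent token pair."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace each occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("the theme thesis")   # start from single characters
pair = most_common_pair(tokens)     # ties resolve first-seen: ('t', 'h')
tokens = merge_pair(tokens, pair)
print(tokens)
```

Run enough merge steps and frequent sequences like "the" collapse into single tokens, which is exactly why common words cost one token and rare words cost several.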
Quick check: How many tokens do you think the word “MachineLearningPlus” would produce? It’s a single word, but it’s unusual. Try len(encoder.encode("MachineLearningPlus")) — you might be surprised.
Counting Tokens for Any Text
Let’s build a reusable function that counts tokens for different models. This is the foundation of our budget manager.
The count_tokens function takes a text string and a model name. It loads the right tokenizer, encodes the text, and returns the length. We’ll also build a cost estimator that uses per-model pricing.
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o") -> int:
"""Count tokens in a text string for a given model."""
encoder = tiktoken.encoding_for_model(model)
return len(encoder.encode(text))
# Pricing per 1M tokens (input, output)
MODEL_PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00, "context": 128_000},
"gpt-4o-mini": {"input": 0.15, "output": 0.60, "context": 128_000},
"gpt-3.5-turbo": {"input": 0.50, "output": 1.50, "context": 16_385},
}
def estimate_cost(input_tokens: int, output_tokens: int,
model: str = "gpt-4o") -> float:
"""Estimate API cost in USD for a given token count."""
pricing = MODEL_PRICING[model]
input_cost = (input_tokens / 1_000_000) * pricing["input"]
output_cost = (output_tokens / 1_000_000) * pricing["output"]
return round(input_cost + output_cost, 6)
Both functions are straightforward. Let’s test them with a realistic prompt. We’ll count tokens for a system prompt and a user message, then estimate what the API call would cost.
system_prompt = """You are a senior data analyst. Answer questions about sales data.
Be precise with numbers. Format large numbers with commas.
Always show your reasoning step by step."""
user_message = "What was our total revenue last quarter?"
system_tokens = count_tokens(system_prompt)
user_tokens = count_tokens(user_message)
total_input = system_tokens + user_tokens
# Assume the model generates about 500 tokens in response
estimated_output = 500
cost = estimate_cost(total_input, estimated_output)
print(f"System prompt: {system_tokens} tokens")
print(f"User message: {user_tokens} tokens")
print(f"Total input: {total_input} tokens")
print(f"Est. output: {estimated_output} tokens")
print(f"Est. cost: ${cost:.6f}")
print(f"Context used: {total_input / 128_000 * 100:.2f}%")
Output:
System prompt: 33 tokens
User message: 9 tokens
Total input: 42 tokens
Est. output: 500 tokens
Est. cost: $0.005105
Context used: 0.03%
We’re barely touching the context window here. But what happens when you add conversation history, few-shot examples, or retrieved documents? The numbers climb fast.
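One reason the numbers climb: in a chat loop, every request resends the entire history, so cumulative billed input grows much faster than the history itself. A sketch using the rough 4-characters-per-token estimate (swap in count_tokens for exact figures):

```python
def rough_tokens(text: str) -> int:
    """~4 characters per token; a crude stand-in for a real tokenizer."""
    return len(text) // 4

turn = "User: tell me more.\nAssistant: " + "word " * 150  # one mock exchange
history, billed = "", 0
for _ in range(10):
    history += turn
    billed += rough_tokens(history)  # each request resends all prior turns
print(f"final history: {rough_tokens(history):,} tokens")
print(f"billed across 10 requests: {billed:,} tokens")
```

Ten short exchanges, and you’ve paid for the first turn ten times over. This is why history pruning matters even when each message is tiny.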
{
type: 'exercise',
id: 'token-counting-cost',
title: 'Exercise 1: Count Tokens and Estimate Cost',
difficulty: 'beginner',
exerciseType: 'write',
instructions: 'You have a system prompt and a user message. Count the tokens in each using the `count_tokens` function, then compute the total estimated cost using `estimate_cost`. Assume 300 output tokens. Print the total input tokens and the cost.',
starterCode: 'system = "You are a Python tutor. Explain concepts simply."\nuser = "What is the difference between a list and a tuple?"\n\n# Count tokens for each\nsystem_tokens = count_tokens(system)\nuser_tokens = count_tokens(user)\ntotal_input = system_tokens + user_tokens\n\n# Estimate cost with 300 output tokens\ncost = # YOUR CODE HERE\n\nprint(f"Total input tokens: {total_input}")\nprint(f"Estimated cost: ${cost:.6f}")',
testCases: [
{ id: 'tc1', input: 'print(type(cost))', expectedOutput: "<class 'float'>", description: 'cost should be a float' },
{ id: 'tc2', input: 'print(total_input > 0)', expectedOutput: 'True', description: 'total_input should be positive' },
{ id: 'tc3', input: 'print(cost > 0)', expectedOutput: 'True', hidden: true, description: 'cost should be positive' },
],
hints: [
'Use the estimate_cost function with total_input, 300, and model="gpt-4o".',
'Full answer: cost = estimate_cost(total_input, 300, model="gpt-4o")',
],
solution: 'system = "You are a Python tutor. Explain concepts simply."\nuser = "What is the difference between a list and a tuple?"\nsystem_tokens = count_tokens(system)\nuser_tokens = count_tokens(user)\ntotal_input = system_tokens + user_tokens\ncost = estimate_cost(total_input, 300, model="gpt-4o")\nprint(f"Total input tokens: {total_input}")\nprint(f"Estimated cost: ${cost:.6f}")',
solutionExplanation: 'We count tokens for both the system prompt and user message, sum them for total input, then pass that plus 300 output tokens to estimate_cost. The function multiplies each count by the per-token price.',
xpReward: 15,
}
What Happens When You Exceed the Context Window Limit?
This is where things get dangerous. And honestly? This catches even experienced developers off guard. Exceeding the context window doesn’t always produce a clear error.
Three things can happen:
1. Hard rejection. The API returns an error like context_length_exceeded. This is the best outcome — you know something broke.
2. Silent truncation. Some APIs quietly trim your input to fit. You lose information without any warning. The model generates a response from incomplete context.
3. The “lost in the middle” problem. Even when your prompt fits, LLMs pay more attention to the start and end. Information buried in the middle gets ignored. Stanford researchers confirmed this in 2023. Models perform worst when the answer sits in the middle third of a long prompt.
Why does this matter to you? Because a RAG application that stuffs 50 retrieved documents into the prompt might get worse answers than one that includes just the 5 most relevant ones.
Let’s build a function that checks whether a prompt fits before you call the API. It takes input token count and desired output length, then reports whether you’re within budget.
def check_context_fit(input_tokens: int, desired_output: int,
model: str = "gpt-4o") -> dict:
"""Check if a prompt fits within the model's context window."""
context_limit = MODEL_PRICING[model]["context"]
total_needed = input_tokens + desired_output
fits = total_needed <= context_limit
remaining = context_limit - input_tokens
return {
"fits": fits,
"input_tokens": input_tokens,
"desired_output": desired_output,
"total_needed": total_needed,
"context_limit": context_limit,
"remaining_for_output": max(0, remaining),
"utilization_pct": round(input_tokens / context_limit * 100, 2),
}
# Test: does a 100K token input fit in GPT-4o with 16K output?
result = check_context_fit(100_000, 16_000, model="gpt-4o")
print(f"Input tokens: {result['input_tokens']:,}")
print(f"Desired output: {result['desired_output']:,}")
print(f"Total needed: {result['total_needed']:,}")
print(f"Context limit: {result['context_limit']:,}")
print(f"Fits in window: {result['fits']}")
print(f"Remaining: {result['remaining_for_output']:,}")
Output:
Input tokens: 100,000
Desired output: 16,000
Total needed: 116,000
Context limit: 128,000
Fits in window: True
Remaining: 28,000
It fits, but with only 12K tokens of headroom: 116K needed against a 128K limit. Add a long stretch of conversation history and you’re over the edge.
Predict the output: What happens if you change desired_output to 30,000? The total becomes 130,000 — which exceeds the 128K limit. The function would return fits: False.
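The arithmetic can be confirmed standalone, without the helper:

```python
context_limit = 128_000
input_tokens, desired_output = 100_000, 30_000
total_needed = input_tokens + desired_output
print(total_needed, total_needed <= context_limit)  # 130000 False
```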
Building the Token Budget Manager
Here’s the core idea. We’ll split the context window into four zones: system prompt, few-shot examples, user input (plus any RAG docs), and response. Each zone gets a token allocation. The manager tracks usage and warns you before you blow the budget.
I find this four-zone approach covers most real-world LLM applications. Some use cases might not need examples at all — we’ll handle that too.
Let’s start with the BudgetZone dataclass. Each zone stores its name, max token budget, current usage, and the actual content. Three properties compute remaining capacity, utilization percentage, and whether the zone is over budget.
from dataclasses import dataclass
@dataclass
class BudgetZone:
"""A single budget allocation zone."""
name: str
max_tokens: int
current_tokens: int = 0
content: str = ""
@property
def remaining(self) -> int:
return max(0, self.max_tokens - self.current_tokens)
@property
def utilization_pct(self) -> float:
if self.max_tokens == 0:
return 0.0
return round(self.current_tokens / self.max_tokens * 100, 1)
@property
def is_over_budget(self) -> bool:
return self.current_tokens > self.max_tokens
Simple and clean. The remaining property uses max(0, ...) so it never returns negative numbers.
Now the main manager class. It creates four zones based on percentage allocations you provide. The set_content method assigns text to a zone and counts its tokens. The get_summary method prints a visual budget report.
class TokenBudgetManager:
"""Manage token allocations across context window zones."""
def __init__(self, model: str = "gpt-4o",
system_pct: float = 0.10,
examples_pct: float = 0.15,
input_pct: float = 0.50,
response_pct: float = 0.25):
self.model = model
self.encoder = tiktoken.encoding_for_model(model)
self.context_limit = MODEL_PRICING[model]["context"]
# Allocate zones by percentage of context window
self.zones = {
"system": BudgetZone("System Prompt",
int(self.context_limit * system_pct)),
"examples": BudgetZone("Few-Shot Examples",
int(self.context_limit * examples_pct)),
"input": BudgetZone("User Input + RAG",
int(self.context_limit * input_pct)),
"response": BudgetZone("Response Reserve",
int(self.context_limit * response_pct)),
}
def _count(self, text: str) -> int:
"""Count tokens using the model's tokenizer."""
return len(self.encoder.encode(text))
The constructor takes a model name and four percentage values. They should add up to 1.0 (100% of the context window). Each percentage gets multiplied by the context limit to produce a token budget.
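The class as written doesn’t enforce that the percentages sum to 1.0. If you want a guard, here is a small standalone check (an optional addition of my own, not part of the manager above):

```python
import math

def validate_split(**pcts):
    """Raise early if zone percentages don't cover exactly 100% of the window."""
    total = sum(pcts.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"zone percentages sum to {total}, expected 1.0")

validate_split(system=0.10, examples=0.15, input=0.50, response=0.25)  # passes
```

Using math.isclose rather than == avoids false alarms from floating-point rounding.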
Now the methods that make it useful — set_content to fill zones and get_summary to see the budget at a glance:
def set_content(self, zone_name: str, content: str) -> dict:
"""Set content for a zone and return status."""
zone = self.zones[zone_name]
token_count = self._count(content)
zone.content = content
zone.current_tokens = token_count
return {
"zone": zone.name,
"tokens_used": token_count,
"budget": zone.max_tokens,
"remaining": zone.remaining,
"over_budget": zone.is_over_budget,
"trim_by": max(0, token_count - zone.max_tokens),
}
def get_summary(self) -> str:
"""Return a formatted budget summary."""
lines = [f"\n{'='*60}"]
lines.append(f" Token Budget — {self.model}")
lines.append(f" Context Window: {self.context_limit:,} tokens")
lines.append(f"{'='*60}")
total_used = 0
for zone in self.zones.values():
status = "OVER" if zone.is_over_budget else "OK"
bar_len = int(zone.utilization_pct / 5)
bar = "█" * bar_len + "░" * (20 - bar_len)
lines.append(
f" {zone.name:<20} {bar} "
f"{zone.current_tokens:>7,} / {zone.max_tokens:>7,} "
f"({zone.utilization_pct:>5.1f}%) [{status}]"
)
total_used += zone.current_tokens
lines.append(f"{'─'*60}")
total_pct = round(total_used / self.context_limit * 100, 1)
lines.append(
f" {'Total Used':<20} "
f"{total_used:>30,} / {self.context_limit:>7,} ({total_pct}%)"
)
lines.append(f"{'='*60}\n")
return "\n".join(lines)
# Attach methods to class (needed if running cells sequentially)
TokenBudgetManager.set_content = set_content
TokenBudgetManager.get_summary = get_summary
set_content returns a status dictionary. It tells you how many tokens the content uses, how much budget remains, whether you’re over, and how many tokens to trim.
get_summary builds a visual bar chart. Each zone gets a progress bar, token count, budget, and a status flag.
Using the Budget Manager
Let’s put it to work. We’ll simulate a chatbot with a system prompt, two few-shot examples, and a user question with retrieved context.
manager = TokenBudgetManager(
model="gpt-4o",
system_pct=0.10, # 12,800 tokens for system prompt
examples_pct=0.15, # 19,200 tokens for examples
input_pct=0.50, # 64,000 tokens for user input + docs
response_pct=0.25, # 32,000 tokens for response
)
# Fill the system zone
system_prompt = """You are a financial analyst assistant.
Answer questions about quarterly earnings using the provided context.
Be precise with numbers. Cite specific sections when possible."""
result = manager.set_content("system", system_prompt)
print(f"System: {result['tokens_used']} tokens used")
# Fill the examples zone
examples = """Example 1:
User: What was Q3 revenue?
Context: Q3 2025 revenue was $4.2B, up 12% YoY.
Assistant: Q3 2025 revenue was $4.2 billion, a 12% YoY increase.
Example 2:
User: How did margins change?
Context: Operating margin improved from 18.3% to 21.7%.
Assistant: Operating margins rose 3.4 points, from 18.3% to 21.7%."""
result = manager.set_content("examples", examples)
print(f"Examples: {result['tokens_used']} tokens used")
# Fill the input zone
user_input = """User: What drove the revenue increase in Q3?
Context:
Revenue growth was driven by three factors:
1. Cloud services revenue grew 28% to $1.8B
2. Enterprise licensing renewed at 95% rate
3. New product launches contributed $340M"""
result = manager.set_content("input", user_input)
print(f"Input: {result['tokens_used']} tokens used")
# See the full picture
print(manager.get_summary())
Output:
System: 36 tokens used
Examples: 99 tokens used
Input: 66 tokens used
============================================================
Token Budget — gpt-4o
Context Window: 128,000 tokens
============================================================
System Prompt ░░░░░░░░░░░░░░░░░░░░ 36 / 12,800 ( 0.3%) [OK]
Few-Shot Examples ░░░░░░░░░░░░░░░░░░░░ 99 / 19,200 ( 0.5%) [OK]
User Input + RAG ░░░░░░░░░░░░░░░░░░░░ 66 / 64,000 ( 0.1%) [OK]
Response Reserve ░░░░░░░░░░░░░░░░░░░░ 0 / 32,000 ( 0.0%) [OK]
──────────────────────────────────────────────────────────
Total Used 201 / 128,000 (0.2%)
============================================================
With short inputs, we’re barely using the window. Real applications look different. Let’s push the limits.
{
type: 'exercise',
id: 'budget-manager-setup',
title: 'Exercise 2: Configure a Budget Manager for a Summarizer',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Create a TokenBudgetManager for a document summarizer. A summarizer needs a small system prompt (5%), no few-shot examples (0%), a large input zone (80%), and a small response zone (15%). Use "gpt-4o-mini". Set a system prompt and print the summary.',
starterCode: '# Create a budget manager for a summarizer\nmanager = TokenBudgetManager(\n model="gpt-4o-mini",\n system_pct=0.05,\n examples_pct=# YOUR CODE HERE,\n input_pct=# YOUR CODE HERE,\n response_pct=# YOUR CODE HERE,\n)\n\nmanager.set_content("system", "Summarize the document in 3 bullet points.")\nprint(manager.get_summary())',
testCases: [
{ id: 'tc1', input: 'print(manager.zones["examples"].max_tokens)', expectedOutput: '0', description: 'Examples zone should be 0 tokens' },
{ id: 'tc2', input: 'print(manager.zones["input"].max_tokens)', expectedOutput: '102400', description: 'Input zone should be 80% of 128K' },
{ id: 'tc3', input: 'print(manager.zones["response"].max_tokens)', expectedOutput: '19200', hidden: true, description: 'Response zone should be 15% of 128K' },
],
hints: [
'A summarizer with no examples uses examples_pct=0.0. Input needs 80%.',
'Full answer: examples_pct=0.0, input_pct=0.80, response_pct=0.15',
],
solution: 'manager = TokenBudgetManager(\n model="gpt-4o-mini",\n system_pct=0.05,\n examples_pct=0.0,\n input_pct=0.80,\n response_pct=0.15,\n)\nmanager.set_content("system", "Summarize the document in 3 bullet points.")\nprint(manager.get_summary())',
solutionExplanation: 'A summarizer maximizes input space for the document. With 80% allocated to input, you get 102,400 tokens — about 75,000 words of source text. The response zone at 15% gives 19,200 tokens for the summary.',
xpReward: 15,
}
Stress-Testing with Large Inputs
RAG applications regularly stuff thousands of tokens into the input zone. Let’s see what happens when retrieved documents blow the budget.
We’ll create a large chunk of text and feed it to the input zone. The manager’s set_content method returns a trim_by value — the exact number of tokens you need to cut.
# Simulate a large RAG retrieval
large_context = (
"The quarterly earnings report shows continued growth "
"across all business segments with particular strength "
"in cloud computing and enterprise solutions. "
) * 500
result = manager.set_content("input", large_context)
print(f"Tokens used: {result['tokens_used']:,}")
print(f"Budget: {result['budget']:,}")
print(f"Over budget: {result['over_budget']}")
if result['over_budget']:
print(f"Trim by: {result['trim_by']:,} tokens")
print(manager.get_summary())
The input zone goes over budget. The trim_by field tells you exactly how many tokens to cut. In production, you’d use this signal to drop the least relevant retrieved documents.
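Here’s what "drop the least relevant documents" can look like in practice. A minimal sketch, assuming docs arrive sorted most-relevant-first and using a crude 4-characters-per-token estimate (swap in a real tokenizer for production):

```python
def fit_docs(docs, budget_tokens, est=lambda t: len(t) // 4):
    """Keep the most relevant docs that fit within budget_tokens.

    docs must be sorted most-relevant-first; est is a rough token
    estimator (replace with an exact count_tokens in production).
    """
    kept, used = [], 0
    for doc in docs:
        cost = est(doc)
        if used + cost > budget_tokens:
            break  # everything after this is less relevant, so stop
        kept.append(doc)
        used += cost
    return kept, used

docs = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
kept, used = fit_docs(docs, budget_tokens=250)
print(len(kept), used)  # 2 200
```

Greedy selection like this is the simplest policy; a smarter variant could truncate the last document instead of dropping it entirely.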
Adjusting Budget Allocations by Use Case
Different applications need different splits. A chatbot with long conversation history needs more input space. A code generator needs more response space. How do you decide?
Here’s a function that shows token allocations for four common use cases. It turns percentages into actual token counts.
def compare_allocations(model: str = "gpt-4o"):
"""Compare token budgets for different use cases."""
use_cases = {
"Chatbot": {"sys": 0.05, "ex": 0.05, "in": 0.60, "res": 0.30},
"RAG Q&A": {"sys": 0.10, "ex": 0.10, "in": 0.55, "res": 0.25},
"Code Gen": {"sys": 0.15, "ex": 0.20, "in": 0.15, "res": 0.50},
"Summarizer": {"sys": 0.05, "ex": 0.05, "in": 0.75, "res": 0.15},
}
context = MODEL_PRICING[model]["context"]
print(f"\nBudget Allocations — {model} ({context:,} tokens)")
header = f"{'Use Case':<12} {'System':>8} {'Examples':>10}"
header += f" {'Input':>8} {'Response':>10}"
print(header)
print("─" * 52)
for name, a in use_cases.items():
s = int(context * a["sys"])
e = int(context * a["ex"])
i = int(context * a["in"])
r = int(context * a["res"])
print(f"{name:<12} {s:>7,} {e:>9,} {i:>7,} {r:>9,}")
compare_allocations("gpt-4o")
Output:
Budget Allocations — gpt-4o (128,000 tokens)
Use Case System Examples Input Response
────────────────────────────────────────────────────
Chatbot 6,400 6,400 76,800 38,400
RAG Q&A 12,800 12,800 70,400 32,000
Code Gen 19,200 25,600 19,200 64,000
Summarizer 6,400 6,400 96,000 19,200
See the pattern? Code generation reserves half the window for the response because generated code can be long. Summarizers flip the ratio — they need maximum input space and a short output. A chatbot sits in the middle.
The split you choose depends on your application. Start with these templates, then adjust based on real usage patterns.
Cost Estimation Across Models
When you know your typical prompt size, picking the cheapest model that fits your quality needs saves serious money. Let me show you the numbers.
This function shows the per-request cost for every model in our pricing table.
def estimate_costs_across_models(input_tokens: int, output_tokens: int):
"""Compare costs across all tracked models."""
print(f"\nCost: {input_tokens:,} input + {output_tokens:,} output")
print(f"{'Model':<18} {'Input $':>10} {'Output $':>10} {'Total':>10}")
print("─" * 52)
results = []
for model, pricing in MODEL_PRICING.items():
in_cost = (input_tokens / 1_000_000) * pricing["input"]
out_cost = (output_tokens / 1_000_000) * pricing["output"]
total = in_cost + out_cost
results.append((model, in_cost, out_cost, total))
results.sort(key=lambda x: x[3])
for model, in_cost, out_cost, total in results:
        print(f"{model:<18} ${in_cost:>8.4f} ${out_cost:>8.4f} ${total:>8.4f}")
cheapest = results[0]
priciest = results[-1]
ratio = priciest[3] / cheapest[3] if cheapest[3] > 0 else 0
print(f"\n{priciest[0]} costs {ratio:.1f}x more than {cheapest[0]}")
# Typical RAG query
estimate_costs_across_models(10_000, 1_000)
# Heavy summarization
estimate_costs_across_models(50_000, 2_000)
Output:
Cost: 10,000 input + 1,000 output
Model Input $ Output $ Total
────────────────────────────────────────────────────
gpt-4o-mini $ 0.0015 $ 0.0006 $ 0.0021
gpt-3.5-turbo $ 0.0050 $ 0.0015 $ 0.0065
gpt-4o $ 0.0250 $ 0.0100 $ 0.0350
gpt-4o costs 16.7x more than gpt-4o-mini
Cost: 50,000 input + 2,000 output
Model Input $ Output $ Total
────────────────────────────────────────────────────
gpt-4o-mini $ 0.0075 $ 0.0012 $ 0.0087
gpt-3.5-turbo $ 0.0250 $ 0.0030 $ 0.0280
gpt-4o $ 0.1250 $ 0.0200 $ 0.1450
gpt-4o costs 16.7x more than gpt-4o-mini
At 1,000 requests per day with the heavy summarization profile, GPT-4o costs $145/day. GPT-4o-mini costs $8.70/day. That’s $4,350/month versus $261/month.
The gap is massive. And for many tasks, the quality difference is smaller than you’d expect.
{
type: 'exercise',
id: 'monthly-cost-calc',
title: 'Exercise 3: Calculate Monthly API Cost',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Your app handles 500 requests per day. Each request uses 8,000 input tokens and 1,500 output tokens on "gpt-4o". Calculate the daily and monthly cost (30 days). Print both formatted to 2 decimal places.',
starterCode: 'requests_per_day = 500\ninput_per_request = 8_000\noutput_per_request = 1_500\n\ncost_per_request = estimate_cost(\n input_per_request, output_per_request, model="gpt-4o"\n)\n\n# Calculate daily and monthly costs\ndaily_cost = # YOUR CODE HERE\nmonthly_cost = # YOUR CODE HERE\n\nprint(f"Daily cost: ${daily_cost:.2f}")\nprint(f"Monthly cost: ${monthly_cost:.2f}")',
testCases: [
{ id: 'tc1', input: 'print(f"{daily_cost:.2f}")', expectedOutput: '17.50', description: 'Daily cost should be $17.50' },
{ id: 'tc2', input: 'print(f"{monthly_cost:.2f}")', expectedOutput: '525.00', description: 'Monthly cost should be $525.00' },
],
hints: [
'Multiply cost_per_request by requests_per_day for daily cost.',
'daily_cost = cost_per_request * requests_per_day; monthly_cost = daily_cost * 30',
],
solution: 'requests_per_day = 500\ninput_per_request = 8_000\noutput_per_request = 1_500\ncost_per_request = estimate_cost(input_per_request, output_per_request, model="gpt-4o")\ndaily_cost = cost_per_request * requests_per_day\nmonthly_cost = daily_cost * 30\nprint(f"Daily cost: ${daily_cost:.2f}")\nprint(f"Monthly cost: ${monthly_cost:.2f}")',
solutionExplanation: 'Each request costs (8000/1M * $2.50) + (1500/1M * $10.00) = $0.02 + $0.015 = $0.035. At 500 requests/day: $17.50/day. Over 30 days: $525/month. This projection helps you budget API costs before scaling.',
xpReward: 15,
}
Common Context Window Mistakes (and How to Fix Them)
These are the mistakes I see most often when reviewing GenAI applications. Every one leads to wasted money or broken outputs.
Mistake 1: Ignoring the output reservation.
Your response gets cut mid-sentence because you gave the model no room to reply.
# BAD — uses entire context for input
max_input = 128_000  # Leaves nothing for response!

# GOOD — reserves space for the response
max_input = 128_000 - 4_000  # 4K reserved for output
Mistake 2: Forgetting that few-shot examples cost tokens.
Three detailed examples might cost 500-1,000 tokens. In a tight window, that matters.
# Count EVERYTHING going into the prompt
total_input = (
count_tokens(system_prompt)
+ count_tokens(few_shot_examples)
+ count_tokens(conversation_history)
+ count_tokens(user_message)
+ count_tokens(retrieved_docs)
)
print(f"Total input: {total_input:,} tokens")
Mistake 3: Stuffing the window to maximum capacity.
More context isn’t always better. The “lost in the middle” effect means the model may ignore documents in the middle of your prompt. Five relevant documents beat fifty irrelevant ones.
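One mitigation for the lost-in-the-middle effect is to reorder documents so the strongest sit at the edges of the prompt, where attention is highest. A sketch of my own, assuming the input list is sorted most-relevant-first:

```python
def edge_order(docs):
    """Place the most relevant docs at the start and end of the prompt.

    Alternates items between the front and the back so the weakest
    documents end up buried in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(edge_order(["d1", "d2", "d3", "d4", "d5"]))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

Reordering is a heuristic, not a substitute for retrieving fewer, better documents in the first place.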
Practice Exercise
Build on the TokenBudgetManager to add a truncate_to_fit method. This method should trim content from the end of a zone’s text until it fits within the budget.
Summary
Context windows define the hard limit of what an LLM can process. Every token counts — system prompt, examples, history, and response all share the same space.
Here’s what to take away:
- Count tokens with tiktoken, not word-count estimates. Different models use different tokenizers.
- Budget your context into zones. System, examples, input, and response each get a percentage.
- Reserve output space first. A full input window leaves no room for the response.
- Compare costs across models. The cheapest model that meets your bar saves thousands monthly.
- Relevance beats volume. The “lost in the middle” effect means fewer, better documents outperform more documents stuffed into a large context.
The TokenBudgetManager we built is a starting point. In production, you’d add dynamic reallocation, conversation history pruning, and integration with your RAG pipeline’s chunking logic.
FAQ
Q: Can I use tiktoken to count tokens for Claude or Gemini?
No. tiktoken works only with OpenAI models. Claude uses its own tokenizer, and Gemini uses SentencePiece, so the counts for the same text will differ.
For Anthropic, use their token counting API endpoint. For Gemini, use the count_tokens method in Google’s AI SDK.
| Provider | Tokenizer Library | Encoding | How to Count |
|---|---|---|---|
| OpenAI | tiktoken | BPE (o200k_base) | tiktoken.encoding_for_model("gpt-4o") |
| Anthropic | Anthropic API | Proprietary | client.count_tokens(messages) |
| Google | google-genai SDK | SentencePiece | model.count_tokens(text) |
# OpenAI — use tiktoken
encoder = tiktoken.encoding_for_model("gpt-4o")
count = len(encoder.encode("Hello world"))
print(f"OpenAI tokens: {count}") # 2
Q: Does a larger context window mean better results?
Not necessarily. Research shows LLM performance degrades on long contexts. Information in the middle gets less attention. Use the smallest effective context for your task.
Q: How do I handle conversation history that grows beyond budget?
Two strategies work well. First, the sliding window — keep only the last N messages. Second, summarization — compress older messages into a shorter summary. The sliding window is simpler. Summarization preserves more context but adds latency.
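The sliding window fits in a few lines with collections.deque; the window size of 3 here is an illustrative choice:

```python
from collections import deque

# Keep only the most recent messages; older ones fall off automatically
history = deque(maxlen=3)
for i in range(5):
    history.append(f"message {i}")
print(list(history))  # ['message 2', 'message 3', 'message 4']
```

Because deque handles eviction for you, there's no bookkeeping: append every message and the oldest ones silently drop out of budget.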
References
- OpenAI — Tiktoken: BPE tokenizer for OpenAI models. Link
- OpenAI Cookbook — How to count tokens with Tiktoken. Link
- OpenAI — GPT-4o model documentation and pricing. Link
- Anthropic — Claude context windows documentation. Link
- Liu, N. et al. — “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (2023). Link
- Google DeepMind — Gemini model family documentation. Link
- OpenAI — Tokenizer tool and BPE explanation. Link
- Anthropic — Token counting API. Link