LLM Context Windows Explained: Token Budget Guide
Learn how LLM context windows work, count tokens with tiktoken, estimate API costs, and build a Python token budget manager that allocates context across system prompts, examples, and input.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
Count tokens, estimate costs, and split your context window across system prompts, examples, and user input — with a reusable budget planner.
You send a long prompt to GPT-4o. The response comes back missing half your instructions. No error. No warning. The model just… forgot the middle of your prompt.
This isn’t a bug. It’s a context window problem. Every LLM has a hard limit on how many tokens it can process at once. Go over that limit and the API either rejects your request or drops information silently. Cost is the other blind spot: a 10K-token prompt with a 1K-token response costs about $0.035 on GPT-4o but roughly $0.002 on GPT-4o-mini, and most developers never check.
Understanding context windows — and managing token budgets — is one of the most practical skills you can build.
What Is a Context Window?
A context window is the total number of tokens an LLM can handle in one request. It covers everything: system prompt, few-shot examples, user input, conversation history, and the model’s response.
Think of it as a desk. You can only spread so many papers before things fall off the edges. The desk size varies by model.
Here’s what the major models offer today:
| Model | Context Window | Max Output | Input $/1M | Output $/1M |
|---|---|---|---|---|
| GPT-4o | 128,000 | 16,384 | $2.50 | $10.00 |
| GPT-4o-mini | 128,000 | 16,384 | $0.15 | $0.60 |
| Claude 3.5 Sonnet | 200,000 | 8,192 | $3.00 | $15.00 |
| Claude Opus 4 | 200,000 | 32,000 | $15.00 | $75.00 |
| Gemini 2.0 Flash | 1,048,576 | 8,192 | $0.10 | $0.40 |
| Gemini 2.5 Pro | 1,048,576 | 65,536 | $1.25 | $10.00 |
Those numbers look generous. A million tokens sounds infinite. But you burn through tokens fast — especially when system prompts, few-shot examples, and conversation history stack up.
What Are Tokens?
Before you can manage a budget, you need to understand the currency. Tokens aren’t words. They’re chunks of text that the model’s tokenizer creates.
The word “tokenization” splits into two tokens: “token” and “ization”. Short common words like “the” or “is” are single tokens. Unusual words get split into smaller pieces.
A rough guide:
- 1 token is about 4 characters in English
- 100 tokens is about 75 words
- 1,000 tokens is roughly a page of text
But these are estimates. The actual count depends on each model’s tokenizer. That’s why we need a library to count precisely.
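Until a real tokenizer is loaded, those rules of thumb can be wrapped in a quick estimator. This is a rough heuristic of my own, not part of any library, and it can easily be off by 20% or more on code or non-English text:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

sentence = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(sentence))  # 44 characters -> 11 estimated tokens
```

Use it for ballpark budget checks only; the precise counts below come from tiktoken.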
Prerequisites
- Python version: 3.9+
- Required library: tiktoken (0.5+)
- Install: pip install tiktoken
- Time to complete: 20-25 minutes
- Pyodide note: tiktoken uses Rust extensions. In a browser environment, install via micropip.install("tiktoken"). If that fails, use the pure-Python token estimation fallback shown later in this article.
Let’s load the tokenizer for GPT-4o and see how it breaks text into tokens. The encoding_for_model function picks the right tokenizer automatically. We’ll encode a sentence, print the token IDs, then decode each one back to text.
import micropip
await micropip.install('tiktoken')
import tiktoken
# Load the tokenizer for GPT-4o
encoder = tiktoken.encoding_for_model("gpt-4o")
# Tokenize a sentence
text = "Context windows are shared between input and output tokens."
tokens = encoder.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[encoder.decode([t]) for t in tokens]}")
Output:
Text: Context windows are shared between input and output tokens.
Token count: 10
Tokens: [2014, 11, 527, 6222, 1948, 1946, 323, 2550, 11, 13]
Decoded: ['Context', ' windows', ' are', ' shared', ' between', ' input', ' and', ' output', ' tokens', '.']
Notice the spaces? They attach to the following word. The tokenizer doesn’t treat spaces as separate tokens. This is how byte-pair encoding (BPE) works under the hood.
[UNDER THE HOOD]
How BPE tokenization works: BPE starts with single characters. It then merges the most common pair, over and over. After training, “th” + “e” become “the” — one token. Common words stay whole. Rare words get split into pieces. GPT-4o’s o200k_base vocabulary has 200,000 tokens. A bigger vocabulary means fewer tokens per text. That’s why newer models use larger vocabularies — more fits in your context window.
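To make the merge step concrete, here’s a toy sketch of a single BPE training iteration. It is illustrative only: real tokenizer training runs thousands of merges over huge corpora and respects byte and word boundaries in ways this ignores.

```python
from collections import Counter

def most_common_pair(tokens):
    """Find the most frequent adjacent token pair."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace each occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("the theme thesis")   # start from single characters
pair = most_common_pair(tokens)     # ties resolve first-seen: ('t', 'h')
tokens = merge_pair(tokens, pair)
print(tokens)
```

Run enough merge steps and frequent sequences like "the" collapse into single tokens, which is exactly why common words cost one token and rare words cost several.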
Quick check: How many tokens do you think the word “MachineLearningPlus” would produce? It’s a single word, but it’s unusual. Try len(encoder.encode("MachineLearningPlus")) — you might be surprised.
Counting Tokens for Any Text
Let’s build a reusable function that counts tokens for different models. This is the foundation of our budget manager.
The count_tokens function takes a text string and a model name. It loads the right tokenizer, encodes the text, and returns the length. We’ll also build a cost estimator that uses per-model pricing.
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o") -> int:
"""Count tokens in a text string for a given model."""
encoder = tiktoken.encoding_for_model(model)
return len(encoder.encode(text))
# Pricing per 1M tokens (input, output)
MODEL_PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00, "context": 128_000},
"gpt-4o-mini": {"input": 0.15, "output": 0.60, "context": 128_000},
"gpt-3.5-turbo": {"input": 0.50, "output": 1.50, "context": 16_385},
}
def estimate_cost(input_tokens: int, output_tokens: int,
model: str = "gpt-4o") -> float:
"""Estimate API cost in USD for a given token count."""
pricing = MODEL_PRICING[model]
input_cost = (input_tokens / 1_000_000) * pricing["input"]
output_cost = (output_tokens / 1_000_000) * pricing["output"]
return round(input_cost + output_cost, 6)
Both functions are straightforward. Let’s test them with a realistic prompt. We’ll count tokens for a system prompt and a user message, then estimate what the API call would cost.
system_prompt = """You are a senior data analyst. Answer questions about sales data.
Be precise with numbers. Format large numbers with commas.
Always show your reasoning step by step."""
user_message = "What was our total revenue last quarter?"
system_tokens = count_tokens(system_prompt)
user_tokens = count_tokens(user_message)
total_input = system_tokens + user_tokens
# Assume the model generates about 500 tokens in response
estimated_output = 500
cost = estimate_cost(total_input, estimated_output)
print(f"System prompt: {system_tokens} tokens")
print(f"User message: {user_tokens} tokens")
print(f"Total input: {total_input} tokens")
print(f"Est. output: {estimated_output} tokens")
print(f"Est. cost: ${cost:.6f}")
print(f"Context used: {total_input / 128_000 * 100:.2f}%")
Output:
System prompt: 33 tokens
User message: 9 tokens
Total input: 42 tokens
Est. output: 500 tokens
Est. cost: $0.005105
Context used: 0.03%
We’re barely touching the context window here. But what happens when you add conversation history, few-shot examples, or retrieved documents? The numbers climb fast.
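One reason the numbers climb: in a chat loop, every request resends the entire history, so cumulative billed input grows much faster than the history itself. A sketch using the rough 4-characters-per-token estimate (swap in count_tokens for exact figures):

```python
def rough_tokens(text: str) -> int:
    """~4 characters per token; a crude stand-in for a real tokenizer."""
    return len(text) // 4

turn = "User: tell me more.\nAssistant: " + "word " * 150  # one mock exchange
history, billed = "", 0
for _ in range(10):
    history += turn
    billed += rough_tokens(history)  # each request resends all prior turns
print(f"final history: {rough_tokens(history):,} tokens")
print(f"billed across 10 requests: {billed:,} tokens")
```

Ten short exchanges, and you’ve paid for the first turn ten times over. This is why history pruning matters even when each message is tiny.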
{
type: 'exercise',
id: 'token-counting-cost',
title: 'Exercise 1: Count Tokens and Estimate Cost',
difficulty: 'beginner',
exerciseType: 'write',
instructions: 'You have a system prompt and a user message. Count the tokens in each using the `count_tokens` function, then compute the total estimated cost using `estimate_cost`. Assume 300 output tokens. Print the total input tokens and the cost.',
starterCode: 'system = "You are a Python tutor. Explain concepts simply."\nuser = "What is the difference between a list and a tuple?"\n\n# Count tokens for each\nsystem_tokens = count_tokens(system)\nuser_tokens = count_tokens(user)\ntotal_input = system_tokens + user_tokens\n\n# Estimate cost with 300 output tokens\ncost = # YOUR CODE HERE\n\nprint(f"Total input tokens: {total_input}")\nprint(f"Estimated cost: ${cost:.6f}")',
testCases: [
{ id: 'tc1', input: 'print(type(cost))', expectedOutput: "<class 'float'>", description: 'cost should be a float' },
{ id: 'tc2', input: 'print(total_input > 0)', expectedOutput: 'True', description: 'total_input should be positive' },
{ id: 'tc3', input: 'print(cost > 0)', expectedOutput: 'True', hidden: true, description: 'cost should be positive' },
],
hints: [
'Use the estimate_cost function with total_input, 300, and model="gpt-4o".',
'Full answer: cost = estimate_cost(total_input, 300, model="gpt-4o")',
],
solution: 'system = "You are a Python tutor. Explain concepts simply."\nuser = "What is the difference between a list and a tuple?"\nsystem_tokens = count_tokens(system)\nuser_tokens = count_tokens(user)\ntotal_input = system_tokens + user_tokens\ncost = estimate_cost(total_input, 300, model="gpt-4o")\nprint(f"Total input tokens: {total_input}")\nprint(f"Estimated cost: ${cost:.6f}")',
solutionExplanation: 'We count tokens for both the system prompt and user message, sum them for total input, then pass that plus 300 output tokens to estimate_cost. The function multiplies each count by the per-token price.',
xpReward: 15,
}
What Happens When You Exceed the Context Window Limit?
This is where things get dangerous. And honestly? This catches even experienced developers off guard. Exceeding the context window doesn’t always produce a clear error.
Three things can happen:
1. Hard rejection. The API returns an error like context_length_exceeded. This is the best outcome — you know something broke.
2. Silent truncation. Some APIs quietly trim your input to fit. You lose information without any warning. The model generates a response from incomplete context.
3. The “lost in the middle” problem. Even when your prompt fits, LLMs pay more attention to the start and end. Information buried in the middle gets ignored. Stanford researchers confirmed this in 2023. Models perform worst when the answer sits in the middle third of a long prompt.
Why does this matter to you? Because a RAG application that stuffs 50 retrieved documents into the prompt might get worse answers than one that includes just the 5 most relevant ones.
Let’s build a function that checks whether a prompt fits before you call the API. It takes input token count and desired output length, then reports whether you’re within budget.
def check_context_fit(input_tokens: int, desired_output: int,
model: str = "gpt-4o") -> dict:
"""Check if a prompt fits within the model's context window."""
context_limit = MODEL_PRICING[model]["context"]
total_needed = input_tokens + desired_output
fits = total_needed <= context_limit
remaining = context_limit - input_tokens
return {
"fits": fits,
"input_tokens": input_tokens,
"desired_output": desired_output,
"total_needed": total_needed,
"context_limit": context_limit,
"remaining_for_output": max(0, remaining),
"utilization_pct": round(input_tokens / context_limit * 100, 2),
}
# Test: does a 100K token input fit in GPT-4o with 16K output?
result = check_context_fit(100_000, 16_000, model="gpt-4o")
print(f"Input tokens: {result['input_tokens']:,}")
print(f"Desired output: {result['desired_output']:,}")
print(f"Total needed: {result['total_needed']:,}")
print(f"Context limit: {result['context_limit']:,}")
print(f"Fits in window: {result['fits']}")
print(f"Remaining: {result['remaining_for_output']:,}")
Output:
Input tokens: 100,000
Desired output: 16,000
Total needed: 116,000
Context limit: 128,000
Fits in window: True
Remaining: 28,000
It fits, but with only 12K tokens of headroom: 116K needed against a 128K limit. Add a long stretch of conversation history and you’re over the edge.
Predict the output: What happens if you change desired_output to 30,000? The total becomes 130,000 — which exceeds the 128K limit. The function would return fits: False.
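The arithmetic can be confirmed standalone, without the helper:

```python
context_limit = 128_000
input_tokens, desired_output = 100_000, 30_000
total_needed = input_tokens + desired_output
print(total_needed, total_needed <= context_limit)  # 130000 False
```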
Building the Token Budget Manager
Here’s the core idea. We’ll split the context window into four zones: system prompt, few-shot examples, user input (plus any RAG docs), and response. Each zone gets a token allocation. The manager tracks usage and warns you before you blow the budget.
I find this four-zone approach covers most real-world LLM applications. Some use cases might not need examples at all — we’ll handle that too.
Let’s start with the BudgetZone dataclass. Each zone stores its name, max token budget, current usage, and the actual content. Three properties compute remaining capacity, utilization percentage, and whether the zone is over budget.
from dataclasses import dataclass
@dataclass
class BudgetZone:
"""A single budget allocation zone."""
name: str
max_tokens: int
current_tokens: int = 0
content: str = ""
@property
def remaining(self) -> int:
return max(0, self.max_tokens - self.current_tokens)
@property
def utilization_pct(self) -> float:
if self.max_tokens == 0:
return 0.0
return round(self.current_tokens / self.max_tokens * 100, 1)
@property
def is_over_budget(self) -> bool:
return self.current_tokens > self.max_tokens
Simple and clean. The remaining property uses max(0, ...) so it never returns negative numbers.
Now the main manager class. It creates four zones based on percentage allocations you provide. The set_content method assigns text to a zone and counts its tokens. The get_summary method prints a visual budget report.
class TokenBudgetManager:
"""Manage token allocations across context window zones."""
def __init__(self, model: str = "gpt-4o",
system_pct: float = 0.10,
examples_pct: float = 0.15,
input_pct: float = 0.50,
response_pct: float = 0.25):
self.model = model
self.encoder = tiktoken.encoding_for_model(model)
self.context_limit = MODEL_PRICING[model]["context"]
# Allocate zones by percentage of context window
self.zones = {
"system": BudgetZone("System Prompt",
int(self.context_limit * system_pct)),
"examples": BudgetZone("Few-Shot Examples",
int(self.context_limit * examples_pct)),
"input": BudgetZone("User Input + RAG",
int(self.context_limit * input_pct)),
"response": BudgetZone("Response Reserve",
int(self.context_limit * response_pct)),
}
def _count(self, text: str) -> int:
"""Count tokens using the model's tokenizer."""
return len(self.encoder.encode(text))
The constructor takes a model name and four percentage values. They should add up to 1.0 (100% of the context window). Each percentage gets multiplied by the context limit to produce a token budget.
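The class as written doesn’t enforce that the percentages sum to 1.0. If you want a guard, here is a small standalone check (an optional addition of my own, not part of the manager above):

```python
import math

def validate_split(**pcts):
    """Raise early if zone percentages don't cover exactly 100% of the window."""
    total = sum(pcts.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"zone percentages sum to {total}, expected 1.0")

validate_split(system=0.10, examples=0.15, input=0.50, response=0.25)  # passes
```

Using math.isclose rather than == avoids false alarms from floating-point rounding.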
Now the methods that make it useful — set_content to fill zones and get_summary to see the budget at a glance:
def set_content(self, zone_name: str, content: str) -> dict:
"""Set content for a zone and return status."""
zone = self.zones[zone_name]
token_count = self._count(content)
zone.content = content
zone.current_tokens = token_count
return {
"zone": zone.name,
"tokens_used": token_count,
"budget": zone.max_tokens,
"remaining": zone.remaining,
"over_budget": zone.is_over_budget,
"trim_by": max(0, token_count - zone.max_tokens),
}
def get_summary(self) -> str:
"""Return a formatted budget summary."""
lines = [f"\n{'='*60}"]
lines.append(f" Token Budget — {self.model}")
lines.append(f" Context Window: {self.context_limit:,} tokens")
lines.append(f"{'='*60}")
total_used = 0
for zone in self.zones.values():
status = "OVER" if zone.is_over_budget else "OK"
bar_len = int(zone.utilization_pct / 5)
bar = "█" * bar_len + "░" * (20 - bar_len)
lines.append(
f" {zone.name:<20} {bar} "
f"{zone.current_tokens:>7,} / {zone.max_tokens:>7,} "
f"({zone.utilization_pct:>5.1f}%) [{status}]"
)
total_used += zone.current_tokens
lines.append(f"{'─'*60}")
total_pct = round(total_used / self.context_limit * 100, 1)
lines.append(
f" {'Total Used':<20} "
f"{total_used:>30,} / {self.context_limit:>7,} ({total_pct}%)"
)
lines.append(f"{'='*60}\n")
return "\n".join(lines)
# Attach methods to class (needed if running cells sequentially)
TokenBudgetManager.set_content = set_content
TokenBudgetManager.get_summary = get_summary
set_content returns a status dictionary. It tells you how many tokens the content uses, how much budget remains, whether you’re over, and how many tokens to trim.
get_summary builds a visual bar chart. Each zone gets a progress bar, token count, budget, and a status flag.
Using the Budget Manager
Let’s put it to work. We’ll simulate a chatbot with a system prompt, two few-shot examples, and a user question with retrieved context.
manager = TokenBudgetManager(
model="gpt-4o",
system_pct=0.10, # 12,800 tokens for system prompt
examples_pct=0.15, # 19,200 tokens for examples
input_pct=0.50, # 64,000 tokens for user input + docs
response_pct=0.25, # 32,000 tokens for response
)
# Fill the system zone
system_prompt = """You are a financial analyst assistant.
Answer questions about quarterly earnings using the provided context.
Be precise with numbers. Cite specific sections when possible."""
result = manager.set_content("system", system_prompt)
print(f"System: {result['tokens_used']} tokens used")
# Fill the examples zone
examples = """Example 1:
User: What was Q3 revenue?
Context: Q3 2025 revenue was $4.2B, up 12% YoY.
Assistant: Q3 2025 revenue was $4.2 billion, a 12% YoY increase.
Example 2:
User: How did margins change?
Context: Operating margin improved from 18.3% to 21.7%.
Assistant: Operating margins rose 3.4 points, from 18.3% to 21.7%."""
result = manager.set_content("examples", examples)
print(f"Examples: {result['tokens_used']} tokens used")
# Fill the input zone
user_input = """User: What drove the revenue increase in Q3?
Context:
Revenue growth was driven by three factors:
1. Cloud services revenue grew 28% to $1.8B
2. Enterprise licensing renewed at 95% rate
3. New product launches contributed $340M"""
result = manager.set_content("input", user_input)
print(f"Input: {result['tokens_used']} tokens used")
# See the full picture
print(manager.get_summary())
Output:
System: 36 tokens used
Examples: 99 tokens used
Input: 66 tokens used
============================================================
Token Budget — gpt-4o
Context Window: 128,000 tokens
============================================================
System Prompt ░░░░░░░░░░░░░░░░░░░░ 36 / 12,800 ( 0.3%) [OK]
Few-Shot Examples ░░░░░░░░░░░░░░░░░░░░ 99 / 19,200 ( 0.5%) [OK]
User Input + RAG ░░░░░░░░░░░░░░░░░░░░ 66 / 64,000 ( 0.1%) [OK]
Response Reserve ░░░░░░░░░░░░░░░░░░░░ 0 / 32,000 ( 0.0%) [OK]
──────────────────────────────────────────────────────────
Total Used 201 / 128,000 (0.2%)
============================================================
With short inputs, we’re barely using the window. Real applications look different. Let’s push the limits.
{
type: 'exercise',
id: 'budget-manager-setup',
title: 'Exercise 2: Configure a Budget Manager for a Summarizer',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Create a TokenBudgetManager for a document summarizer. A summarizer needs a small system prompt (5%), no few-shot examples (0%), a large input zone (80%), and a small response zone (15%). Use "gpt-4o-mini". Set a system prompt and print the summary.',
starterCode: '# Create a budget manager for a summarizer\nmanager = TokenBudgetManager(\n model="gpt-4o-mini",\n system_pct=0.05,\n examples_pct=# YOUR CODE HERE,\n input_pct=# YOUR CODE HERE,\n response_pct=# YOUR CODE HERE,\n)\n\nmanager.set_content("system", "Summarize the document in 3 bullet points.")\nprint(manager.get_summary())',
testCases: [
{ id: 'tc1', input: 'print(manager.zones["examples"].max_tokens)', expectedOutput: '0', description: 'Examples zone should be 0 tokens' },
{ id: 'tc2', input: 'print(manager.zones["input"].max_tokens)', expectedOutput: '102400', description: 'Input zone should be 80% of 128K' },
{ id: 'tc3', input: 'print(manager.zones["response"].max_tokens)', expectedOutput: '19200', hidden: true, description: 'Response zone should be 15% of 128K' },
],
hints: [
'A summarizer with no examples uses examples_pct=0.0. Input needs 80%.',
'Full answer: examples_pct=0.0, input_pct=0.80, response_pct=0.15',
],
solution: 'manager = TokenBudgetManager(\n model="gpt-4o-mini",\n system_pct=0.05,\n examples_pct=0.0,\n input_pct=0.80,\n response_pct=0.15,\n)\nmanager.set_content("system", "Summarize the document in 3 bullet points.")\nprint(manager.get_summary())',
solutionExplanation: 'A summarizer maximizes input space for the document. With 80% allocated to input, you get 102,400 tokens — about 75,000 words of source text. The response zone at 15% gives 19,200 tokens for the summary.',
xpReward: 15,
}
Stress-Testing with Large Inputs
RAG applications regularly stuff thousands of tokens into the input zone. Let’s see what happens when retrieved documents blow the budget.
We’ll create a large chunk of text and feed it to the input zone. The manager’s set_content method returns a trim_by value — the exact number of tokens you need to cut.
# Simulate a large RAG retrieval
large_context = (
"The quarterly earnings report shows continued growth "
"across all business segments with particular strength "
"in cloud computing and enterprise solutions. "
) * 500
result = manager.set_content("input", large_context)
print(f"Tokens used: {result['tokens_used']:,}")
print(f"Budget: {result['budget']:,}")
print(f"Over budget: {result['over_budget']}")
if result['over_budget']:
print(f"Trim by: {result['trim_by']:,} tokens")
print(manager.get_summary())
The input zone goes over budget. The trim_by field tells you exactly how many tokens to cut. In production, you’d use this signal to drop the least relevant retrieved documents.
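Here’s what "drop the least relevant documents" can look like in practice. A minimal sketch, assuming docs arrive sorted most-relevant-first and using a crude 4-characters-per-token estimate (swap in a real tokenizer for production):

```python
def fit_docs(docs, budget_tokens, est=lambda t: len(t) // 4):
    """Keep the most relevant docs that fit within budget_tokens.

    docs must be sorted most-relevant-first; est is a rough token
    estimator (replace with an exact count_tokens in production).
    """
    kept, used = [], 0
    for doc in docs:
        cost = est(doc)
        if used + cost > budget_tokens:
            break  # everything after this is less relevant, so stop
        kept.append(doc)
        used += cost
    return kept, used

docs = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
kept, used = fit_docs(docs, budget_tokens=250)
print(len(kept), used)  # 2 200
```

Greedy selection like this is the simplest policy; a smarter variant could truncate the last document instead of dropping it entirely.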
Adjusting Budget Allocations by Use Case
Different applications need different splits. A chatbot with long conversation history needs more input space. A code generator needs more response space. How do you decide?
Here’s a function that shows token allocations for four common use cases. It turns percentages into actual token counts.
def compare_allocations(model: str = "gpt-4o"):
"""Compare token budgets for different use cases."""
use_cases = {
"Chatbot": {"sys": 0.05, "ex": 0.05, "in": 0.60, "res": 0.30},
"RAG Q&A": {"sys": 0.10, "ex": 0.10, "in": 0.55, "res": 0.25},
"Code Gen": {"sys": 0.15, "ex": 0.20, "in": 0.15, "res": 0.50},
"Summarizer": {"sys": 0.05, "ex": 0.05, "in": 0.75, "res": 0.15},
}
context = MODEL_PRICING[model]["context"]
print(f"\nBudget Allocations — {model} ({context:,} tokens)")
header = f"{'Use Case':<12} {'System':>8} {'Examples':>10}"
header += f" {'Input':>8} {'Response':>10}"
print(header)
print("─" * 52)
for name, a in use_cases.items():
s = int(context * a["sys"])
e = int(context * a["ex"])
i = int(context * a["in"])
r = int(context * a["res"])
print(f"{name:<12} {s:>7,} {e:>9,} {i:>7,} {r:>9,}")
compare_allocations("gpt-4o")
Output:
Budget Allocations — gpt-4o (128,000 tokens)
Use Case System Examples Input Response
────────────────────────────────────────────────────
Chatbot 6,400 6,400 76,800 38,400
RAG Q&A 12,800 12,800 70,400 32,000
Code Gen 19,200 25,600 19,200 64,000
Summarizer 6,400 6,400 96,000 19,200
See the pattern? Code generation reserves half the window for the response because generated code can be long. Summarizers flip the ratio — they need maximum input space and a short output. A chatbot sits in the middle.
The split you choose depends on your application. Start with these templates, then adjust based on real usage patterns.
Cost Estimation Across Models
When you know your typical prompt size, picking the cheapest model that fits your quality needs saves serious money. Let me show you the numbers.
This function shows the per-request cost for every model in our pricing table.
def estimate_costs_across_models(input_tokens: int, output_tokens: int):
"""Compare costs across all tracked models."""
print(f"\nCost: {input_tokens:,} input + {output_tokens:,} output")
print(f"{'Model':<18} {'Input $':>10} {'Output $':>10} {'Total':>10}")
print("─" * 52)
results = []
for model, pricing in MODEL_PRICING.items():
in_cost = (input_tokens / 1_000_000) * pricing["input"]
out_cost = (output_tokens / 1_000_000) * pricing["output"]
total = in_cost + out_cost
results.append((model, in_cost, out_cost, total))
results.sort(key=lambda x: x[3])
for model, in_cost, out_cost, total in results:
        print(f"{model:<18} ${in_cost:>8.4f} ${out_cost:>8.4f} ${total:>8.4f}")
cheapest = results[0]
priciest = results[-1]
ratio = priciest[3] / cheapest[3] if cheapest[3] > 0 else 0
print(f"\n{priciest[0]} costs {ratio:.1f}x more than {cheapest[0]}")
# Typical RAG query
estimate_costs_across_models(10_000, 1_000)
# Heavy summarization
estimate_costs_across_models(50_000, 2_000)
Output:
Cost: 10,000 input + 1,000 output
Model Input $ Output $ Total
────────────────────────────────────────────────────
gpt-4o-mini $ 0.0015 $ 0.0006 $ 0.0021
gpt-3.5-turbo $ 0.0050 $ 0.0015 $ 0.0065
gpt-4o $ 0.0250 $ 0.0100 $ 0.0350
gpt-4o costs 16.7x more than gpt-4o-mini
Cost: 50,000 input + 2,000 output
Model Input $ Output $ Total
────────────────────────────────────────────────────
gpt-4o-mini $ 0.0075 $ 0.0012 $ 0.0087
gpt-3.5-turbo $ 0.0250 $ 0.0030 $ 0.0280
gpt-4o $ 0.1250 $ 0.0200 $ 0.1450
gpt-4o costs 16.7x more than gpt-4o-mini
At 1,000 requests per day with the heavy summarization profile, GPT-4o costs $145/day. GPT-4o-mini costs $8.70/day. That’s $4,350/month versus $261/month.
The gap is massive. And for many tasks, the quality difference is smaller than you’d expect.
{
type: 'exercise',
id: 'monthly-cost-calc',
title: 'Exercise 3: Calculate Monthly API Cost',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Your app handles 500 requests per day. Each request uses 8,000 input tokens and 1,500 output tokens on "gpt-4o". Calculate the daily and monthly cost (30 days). Print both formatted to 2 decimal places.',
starterCode: 'requests_per_day = 500\ninput_per_request = 8_000\noutput_per_request = 1_500\n\ncost_per_request = estimate_cost(\n input_per_request, output_per_request, model="gpt-4o"\n)\n\n# Calculate daily and monthly costs\ndaily_cost = # YOUR CODE HERE\nmonthly_cost = # YOUR CODE HERE\n\nprint(f"Daily cost: ${daily_cost:.2f}")\nprint(f"Monthly cost: ${monthly_cost:.2f}")',
testCases: [
{ id: 'tc1', input: 'print(f"{daily_cost:.2f}")', expectedOutput: '17.50', description: 'Daily cost should be $17.50' },
{ id: 'tc2', input: 'print(f"{monthly_cost:.2f}")', expectedOutput: '525.00', description: 'Monthly cost should be $525.00' },
],
hints: [
'Multiply cost_per_request by requests_per_day for daily cost.',
'daily_cost = cost_per_request * requests_per_day; monthly_cost = daily_cost * 30',
],
solution: 'requests_per_day = 500\ninput_per_request = 8_000\noutput_per_request = 1_500\ncost_per_request = estimate_cost(input_per_request, output_per_request, model="gpt-4o")\ndaily_cost = cost_per_request * requests_per_day\nmonthly_cost = daily_cost * 30\nprint(f"Daily cost: ${daily_cost:.2f}")\nprint(f"Monthly cost: ${monthly_cost:.2f}")',
solutionExplanation: 'Each request costs (8000/1M * $2.50) + (1500/1M * $10.00) = $0.02 + $0.015 = $0.035. At 500 requests/day: $17.50/day. Over 30 days: $525/month. This projection helps you budget API costs before scaling.',
xpReward: 15,
}
Common Context Window Mistakes (and How to Fix Them)
These are the mistakes I see most often when reviewing GenAI applications. Every one leads to wasted money or broken outputs.
Mistake 1: Ignoring the output reservation.
Your response gets cut mid-sentence because you gave the model no room to reply.
# BAD — uses entire context for input
max_input = 128_000  # Leaves nothing for response!

# GOOD — reserves space for the response
max_input = 128_000 - 4_000  # 4K reserved for output
Mistake 2: Forgetting that few-shot examples cost tokens.
Three detailed examples might cost 500-1,000 tokens. In a tight window, that matters.
# Count EVERYTHING going into the prompt
total_input = (
count_tokens(system_prompt)
+ count_tokens(few_shot_examples)
+ count_tokens(conversation_history)
+ count_tokens(user_message)
+ count_tokens(retrieved_docs)
)
print(f"Total input: {total_input:,} tokens")
Mistake 3: Stuffing the window to maximum capacity.
More context isn’t always better. The “lost in the middle” effect means the model may ignore documents in the middle of your prompt. Five relevant documents beat fifty irrelevant ones.
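One mitigation for the lost-in-the-middle effect is to reorder documents so the strongest sit at the edges of the prompt, where attention is highest. A sketch of my own, assuming the input list is sorted most-relevant-first:

```python
def edge_order(docs):
    """Place the most relevant docs at the start and end of the prompt.

    Alternates items between the front and the back so the weakest
    documents end up buried in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(edge_order(["d1", "d2", "d3", "d4", "d5"]))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

Reordering is a heuristic, not a substitute for retrieving fewer, better documents in the first place.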
Practice Exercise
Build on the TokenBudgetManager to add a truncate_to_fit method. This method should trim content from the end of a zone’s text until it fits within the budget.
Summary
Context windows define the hard limit of what an LLM can process. Every token counts — system prompt, examples, history, and response all share the same space.
Here’s what to take away:
- Count tokens with tiktoken, not word-count estimates. Different models use different tokenizers.
- Budget your context into zones. System, examples, input, and response each get a percentage.
- Reserve output space first. A full input window leaves no room for the response.
- Compare costs across models. The cheapest model that meets your bar saves thousands monthly.
- Relevance beats volume. The “lost in the middle” effect means fewer, better documents outperform more documents stuffed into a large context.
The TokenBudgetManager we built is a starting point. In production, you’d add dynamic reallocation, conversation history pruning, and integration with your RAG pipeline’s chunking logic.
FAQ
Q: Can I use tiktoken to count tokens for Claude or Gemini?
No. tiktoken works only with OpenAI models. Claude uses its own tokenizer, and Gemini uses SentencePiece, so the counts for the same text will differ.
For Anthropic, use their token counting API endpoint. For Gemini, use the count_tokens method in Google’s AI SDK.
| Provider | Tokenizer Library | Encoding | How to Count |
|---|---|---|---|
| OpenAI | tiktoken | BPE (o200k_base) | tiktoken.encoding_for_model("gpt-4o") |
| Anthropic | Anthropic API | Proprietary | client.count_tokens(messages) |
| Google | google-genai SDK | SentencePiece | model.count_tokens(text) |
# OpenAI — use tiktoken
encoder = tiktoken.encoding_for_model("gpt-4o")
count = len(encoder.encode("Hello world"))
print(f"OpenAI tokens: {count}") # 2
Q: Does a larger context window mean better results?
Not necessarily. Research shows LLM performance degrades on long contexts. Information in the middle gets less attention. Use the smallest effective context for your task.
Q: How do I handle conversation history that grows beyond budget?
Two strategies work well. First, the sliding window — keep only the last N messages. Second, summarization — compress older messages into a shorter summary. The sliding window is simpler. Summarization preserves more context but adds latency.
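The sliding window fits in a few lines with collections.deque; the window size of 3 here is an illustrative choice:

```python
from collections import deque

# Keep only the most recent messages; older ones fall off automatically
history = deque(maxlen=3)
for i in range(5):
    history.append(f"message {i}")
print(list(history))  # ['message 2', 'message 3', 'message 4']
```

Because deque handles eviction for you, there's no bookkeeping: append every message and the oldest ones silently drop out of budget.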
References
- OpenAI — Tiktoken: BPE tokenizer for OpenAI models. Link
- OpenAI Cookbook — How to count tokens with Tiktoken. Link
- OpenAI — GPT-4o model documentation and pricing. Link
- Anthropic — Claude context windows documentation. Link
- Liu, N. et al. — “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (2023). Link
- Google DeepMind — Gemini model family documentation. Link
- OpenAI — Tokenizer tool and BPE explanation. Link
- Anthropic — Token counting API. Link