
LLM API Pricing Guide: Compare & Optimize Costs

Learn to count LLM tokens with tiktoken, compare API pricing across OpenAI, Claude, and Gemini (March 2026), and cut costs by 90% with 10 proven strategies and code.

Written by Selva Prabhakaran | 26 min read


⚡ This post has interactive code — click ▶ Run or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.

Every dollar you spend on LLM APIs comes down to tokens. Here’s how to count them, compare prices across OpenAI, Claude, and Gemini, and slash your bill by up to 90%.

You called an API ten times during prototyping. The bill was $0.03. Fine. Then you shipped to production, and suddenly you’re burning $200 a day.

What happened? Tokens happened. Every prompt you send and every response you get back is measured in tokens. Miss this, and costs spiral. Track it, and you control exactly where your money goes.

What Are Tokens and Why Do They Control Your LLM Costs?

import micropip
await micropip.install(['tiktoken'])


import tiktoken
from datetime import datetime
import hashlib

encoder = tiktoken.encoding_for_model("gpt-4o")
text = "Managing LLM costs is essential for production apps."
tokens = encoder.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")

Output:

Text: Managing LLM costs is essential for production apps.
Token count: 9
Tokens: [38032, 445, 11237, 7194, 374, 7718, 369, 5788, 10721]

A token is a chunk of text. In English, one token equals roughly 4 characters. The model doesn’t see words. It sees tokens.

Here’s what trips people up: every request is billed at two rates. Input tokens have one rate. Output tokens cost more. Usually 3-5x more.

Why the gap? The model reads all input tokens at once. But it writes output tokens one by one. Each one needs a full pass through the model. More work means a higher price.

Key Insight: Your LLM API bill = (input tokens x input price) + (output tokens x output price). Output tokens cost 3-5x more, so a verbose response hurts your wallet far more than a long prompt.
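As a sanity check, that formula fits in a few lines. The helper below is a sketch; the $2.50/$10.00 rates are GPT-4o's per-million prices from the comparison tables later in this post.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request: each token count billed at its own rate."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# 500-token prompt with a 300-token reply at GPT-4o rates ($2.50 in, $10.00 out per 1M)
print(f"${request_cost(500, 300, 2.50, 10.00):.6f}")  # -> $0.004250
```

Notice the output side contributes $0.00300 of that total, more than double the input side, despite being fewer tokens.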

Prerequisites

  • Python version: 3.10+
  • Required library: tiktoken (pip install tiktoken)
  • Pyodide compatible: Yes (tiktoken runs in the browser)
  • Time to complete: ~25 minutes
  • Cost: $0 (all calculations run locally — no API keys needed)

How Do You Count Tokens Before Sending a Request?

You can’t cut costs you don’t measure. Counting tokens before each API call is the first step to controlling your LLM API costs.

tiktoken is OpenAI’s official tokenizer. It gives you the exact token count the API will bill you for.

def count_tokens(text, model="gpt-4o"):
    """Count tokens for a given text and model."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

short = "What is Python?"
print(f"Short prompt: {count_tokens(short)} tokens")

long_prompt = """You are a senior data scientist. Analyze this dataset
and provide insights on customer churn patterns. Focus on the top 3
factors driving churn and suggest retention strategies."""
print(f"Long prompt: {count_tokens(long_prompt)} tokens")

Output:

Short prompt: 3 tokens
Long prompt: 33 tokens

That works for a single message. But real API calls bundle system prompts, conversation history, and user messages. Every piece adds tokens.

Quick check: Before you read the next block, guess — how many extra tokens does the chat message format add per message? (Answer: about 4 overhead tokens.)

The function below counts tokens for a full chat conversation. It accounts for the overhead that OpenAI’s chat format adds per message. The API charges for those formatting tokens too.

def count_chat_tokens(messages, model="gpt-4o"):
    """Count tokens for a full chat conversation."""
    enc = tiktoken.encoding_for_model(model)
    tokens_per_message = 4  # overhead per message

    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += len(enc.encode(value))
    total += 2  # reply priming tokens
    return total

messages = [
    {"role": "system", "content": "You are a helpful data science tutor."},
    {"role": "user", "content": "Explain gradient descent in simple terms."},
]
print(f"Chat tokens: {count_chat_tokens(messages)}")

Output:

Chat tokens: 24
Tip: Always count tokens before sending. A 2,000-token system prompt across 10,000 daily requests = 20 million input tokens per day. At Sonnet pricing ($3/M), that’s $60/day just for the system prompt.

LLM API Pricing Comparison — March 2026

Pricing moves fast. Here’s what the major providers charge as of March 2026, organized by cost tier so you can match your budget to your use case.

Budget Tier — Under $0.50 per Million Input Tokens

These models handle classification, extraction, and simple chat. Don’t underestimate them — they’re surprisingly good for routine tasks.

Model Input (per 1M) Output (per 1M) Context
Gemini 2.5 Flash-Lite $0.10 $0.40 1M
GPT-4o-mini $0.15 $0.60 128K
DeepSeek V3.2 $0.28 $0.42 128K
Gemini 2.0 Flash $0.30 $2.50 1M

Mid Tier — $1 to $3 per Million Input Tokens

The workhorses. This tier covers 80% of production use cases that I’ve encountered.

Model Input (per 1M) Output (per 1M) Context
Claude Haiku 4.5 $1.00 $5.00 200K
Gemini 2.5 Pro $1.25 $10.00 1M
GPT-4.1 $2.00 $8.00 1M
GPT-4o $2.50 $10.00 128K
Claude Sonnet 4.6 $3.00 $15.00 1M

Premium Tier — $5+ per Million Input Tokens

When quality matters more than cost. Complex reasoning, creative writing, nuanced analysis.

Model Input (per 1M) Output (per 1M) Context
Claude Opus 4.6 $5.00 $25.00 1M
GPT-5.4 $2.50 $10.00 1M
Warning: Output tokens are the silent budget killer. Claude Opus charges $25 per million output tokens. A chatty response with 500+ tokens per reply will dwarf your input costs. Always set `max_tokens`.

What do these prices mean for a real app? Say you handle 10,000 requests per day. Each request has a 500-token prompt and a 300-token response.

daily_requests = 10_000
input_tokens_per_request = 500
output_tokens_per_request = 300

models = {
    "GPT-4o-mini":           (0.15, 0.60),
    "Claude Sonnet 4.6":     (3.00, 15.00),
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
}

print(f"{'Model':<25} {'Daily Cost':>10} {'Monthly Cost':>12}")
print("-" * 50)

for model, (inp_price, out_price) in models.items():
    daily_input = (daily_requests * input_tokens_per_request / 1e6) * inp_price
    daily_output = (daily_requests * output_tokens_per_request / 1e6) * out_price
    daily_total = daily_input + daily_output
    monthly_total = daily_total * 30
    print(f"{model:<25} ${daily_total:>8.2f} ${monthly_total:>10.2f}")

Result:

Model                     Daily Cost Monthly Cost
--------------------------------------------------
GPT-4o-mini               $    2.55 $     76.50
Claude Sonnet 4.6         $   60.00 $   1800.00
Gemini 2.5 Flash-Lite     $    1.70 $     51.00

Sonnet costs 35x more than Flash-Lite for the identical workload. That’s $1,749 per month difference. Model selection is far and away your biggest cost lever.

Watch Out: Reasoning Models Have Hidden Costs

Models like OpenAI’s o3 and o4-mini “think” before they answer. They generate hidden reasoning tokens as part of their chain of thought. A reply with 200 visible tokens might burn 10,000-30,000 thinking tokens behind the scenes.

You pay for those thinking tokens at the output rate. One request can cost 50-100x what you’d guess. If you use reasoning models, track total tokens — visible plus thinking.
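A quick sketch of the gap, with assumed numbers: the 15,000 thinking tokens and the $10/M output rate below are illustrative, not measured from any specific model.

```python
output_price_per_m = 10.00  # illustrative output rate, $ per 1M tokens
visible_tokens = 200        # what you see in the reply
thinking_tokens = 15_000    # hidden chain of thought, billed at the output rate

naive_cost = visible_tokens / 1e6 * output_price_per_m
true_cost = (visible_tokens + thinking_tokens) / 1e6 * output_price_per_m

print(f"Naive estimate:  ${naive_cost:.4f}")  # $0.0020
print(f"Actually billed: ${true_cost:.4f}")   # $0.1520 -- 76x the naive guess
```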

Build a Cost Tracking Dashboard

Knowing prices is step one. Tracking actual spend in real time is step two. Here’s a CostTracker class that logs every API call and shows exactly where your money goes.

It tracks eight popular models, logs token counts, computes costs, and gives you reports by model and by feature. Only needs Python built-ins plus tiktoken — runs in Pyodide too.

class CostTracker:
    """Track LLM API costs across multiple providers."""

    PRICING = {
        "gpt-4o-mini":          (0.15, 0.60),
        "gpt-4o":               (2.50, 10.00),
        "gpt-4.1":              (2.00, 8.00),
        "claude-haiku-4.5":     (1.00, 5.00),
        "claude-sonnet-4.6":    (3.00, 15.00),
        "claude-opus-4.6":      (5.00, 25.00),
        "gemini-2.5-flash-lite":(0.10, 0.40),
        "gemini-2.5-pro":       (1.25, 10.00),
    }

    def __init__(self):
        self.logs = []

    def log_request(self, model, input_tokens, output_tokens, tag="default"):
        """Log a single API request with cost calculation."""
        if model not in self.PRICING:
            raise ValueError(f"Unknown model: {model}")
        inp_price, out_price = self.PRICING[model]
        input_cost = (input_tokens / 1_000_000) * inp_price
        output_cost = (output_tokens / 1_000_000) * out_price
        entry = {
            "timestamp": datetime.now().isoformat(),
            "model": model, "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "input_cost": input_cost, "output_cost": output_cost,
            "total_cost": input_cost + output_cost, "tag": tag,
        }
        self.logs.append(entry)
        return entry

    def summary_by_model(self):
        """Cost breakdown grouped by model."""
        summary = {}
        for log in self.logs:
            m = log["model"]
            if m not in summary:
                summary[m] = {"requests": 0, "total_cost": 0.0}
            summary[m]["requests"] += 1
            summary[m]["total_cost"] += log["total_cost"]
        return summary

    def summary_by_tag(self):
        """Cost breakdown grouped by usage tag."""
        summary = {}
        for log in self.logs:
            t = log["tag"]
            if t not in summary:
                summary[t] = {"requests": 0, "total_cost": 0.0}
            summary[t]["requests"] += 1
            summary[t]["total_cost"] += log["total_cost"]
        return summary

    def total_spend(self):
        return sum(log["total_cost"] for log in self.logs)

Each log_request call records the model, token counts, computed costs, and a tag like “chat” or “summarization.” The tag is the key — it lets you track costs per feature in your app.

Predict the output: We’ll log 500 cheap chat requests, 100 mid-tier summarizations, and 50 code generation calls. Which feature will eat the most budget?

tracker = CostTracker()

for _ in range(500):
    tracker.log_request("gpt-4o-mini", input_tokens=800,
                        output_tokens=200, tag="chat")

for _ in range(100):
    tracker.log_request("claude-sonnet-4.6", input_tokens=4000,
                        output_tokens=500, tag="summarization")

for _ in range(50):
    tracker.log_request("gpt-4.1", input_tokens=1500,
                        output_tokens=2000, tag="code-gen")

print(f"Total spend: ${tracker.total_spend():.4f}\n")

print("=== Cost by Model ===")
for model, data in tracker.summary_by_model().items():
    print(f"  {model}: {data['requests']} reqs, ${data['total_cost']:.4f}")

print("\n=== Cost by Feature ===")
for tag, data in tracker.summary_by_tag().items():
    print(f"  {tag}: {data['requests']} reqs, ${data['total_cost']:.4f}")

Here’s what you see:

Total spend: $3.0200

=== Cost by Model ===
  gpt-4o-mini: 500 reqs, $0.1200
  claude-sonnet-4.6: 100 reqs, $1.9500
  gpt-4.1: 50 reqs, $0.9500

=== Cost by Feature ===
  chat: 500 reqs, $0.1200
  summarization: 100 reqs, $1.9500
  code-gen: 50 reqs, $0.9500

Sonnet eats 65% of the budget with just 100 requests. Chat runs 5x more volume but costs a fraction. The model price and input size make all the difference. This kind of insight changes how you build LLM apps.

10 Strategies to Cut Your LLM API Costs

I’ve ranked these by impact. The first three alone can cut costs by 80%.

Strategy 1: Route Tasks to the Cheapest Model That Works

Not every request needs your flagship model. A classification task doesn’t need Claude Opus. Route requests based on complexity.

def route_to_model(task_type):
    """Route tasks to the most cost-effective model."""
    routing = {
        "classification": "gpt-4o-mini",
        "extraction": "gpt-4o-mini",
        "summarization": "gemini-2.5-flash-lite",
        "code_generation": "gpt-4.1",
        "complex_reasoning": "claude-sonnet-4.6",
        "creative_writing": "claude-opus-4.6",
    }
    return routing.get(task_type, "gpt-4o-mini")

for task in ["classification", "summarization", "complex_reasoning"]:
    print(f"  {task:>20} -> {route_to_model(task)}")

Output:

      classification -> gpt-4o-mini
      summarization -> gemini-2.5-flash-lite
  complex_reasoning -> claude-sonnet-4.6

Route 70% of requests to budget models. Send 20% to mid-tier. Save premium for just 10%. That mix cuts costs 60-80%. It’s the single biggest lever you have.
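Here's a rough sketch of what that 70/20/10 mix does to the blended per-request cost. The tier assignments below are illustrative; prices and the 500-in/300-out request profile come from the earlier tables.

```python
def per_request_cost(inp_price, out_price, inp_tok=500, out_tok=300):
    """Cost of one request at the given per-million rates."""
    return inp_tok / 1e6 * inp_price + out_tok / 1e6 * out_price

mix = [  # (traffic share, input $/M, output $/M)
    (0.70, 0.15, 0.60),   # budget tier: gpt-4o-mini
    (0.20, 2.00, 8.00),   # mid tier: gpt-4.1
    (0.10, 3.00, 15.00),  # premium tier: claude-sonnet-4.6
]

blended = sum(share * per_request_cost(i, o) for share, i, o in mix)
all_premium = per_request_cost(3.00, 15.00)

print(f"All-premium: ${all_premium:.6f}/request")
print(f"Blended mix: ${blended:.6f}/request")
print(f"Savings:     {(1 - blended / all_premium) * 100:.0f}%")
```

Even with premium models still handling a tenth of the traffic, the blend lands around a 76% saving versus sending everything premium.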

Strategy 2: Enable Prompt Caching

Does your system prompt stay the same across requests? Prompt caching will save you a fortune. The provider stores your prompt on their servers. Each reuse costs pennies.

Provider Cache Write Cache Read Savings on Cached Portion
OpenAI (GPT-4.1) 1x standard 25% of standard 75%
OpenAI (GPT-5) 1x standard 10% of standard 90%
Anthropic 1.25x standard 10% of standard 90%
Google 1x standard Free (some models) Up to 100%

What does that look like for a 2,000-token system prompt sent 10,000 times daily?

system_tokens = 2000
daily_reqs = 10_000
price_per_m = 3.00  # Sonnet input price

no_cache = (system_tokens * daily_reqs / 1e6) * price_per_m
cache_write = (system_tokens / 1e6) * price_per_m * 1.25
cache_reads = (system_tokens * (daily_reqs - 1) / 1e6) * price_per_m * 0.10
cached = cache_write + cache_reads

print(f"Without caching: ${no_cache:.2f}/day")
print(f"With caching:    ${cached:.2f}/day")
print(f"Savings:         ${no_cache - cached:.2f}/day ({(1 - cached/no_cache)*100:.0f}%)")

Output:

Without caching: $60.00/day
With caching:    $6.01/day
Savings:         $53.99/day (90%)

Key Insight: Prompt caching saves 75-90% on repeated content. Put static instructions first and variable content last in your prompt to maximize cache hits.

Strategy 3: Use the Batch API for Non-Urgent Work

OpenAI and Anthropic both offer batch APIs. Submit a file of requests, wait up to 24 hours, and pay 50% less. You can even combine batch with prompt caching.

Good candidates: nightly data processing, bulk classification, eval pipelines, document analysis.

daily_input = 50_000_000   # 50M input tokens
daily_output = 10_000_000  # 10M output tokens

standard = (daily_input/1e6 * 2.00) + (daily_output/1e6 * 8.00)
batch = standard * 0.50

print(f"Standard: ${standard:.2f}/day")
print(f"Batch:    ${batch:.2f}/day")
print(f"Monthly savings: ${(standard - batch) * 30:.2f}")

Output:

Standard: $180.00/day
Batch:    $90.00/day
Monthly savings: $2700.00

Strategy 4: Cap Output Tokens with max_tokens

Every extra output token costs money. Without max_tokens, the model decides how long to respond. For open-ended questions, that could mean 1,000+ tokens when 200 would do.

scenarios = [
    ("No cap (avg 800 tokens)", 800),
    ("Cap at 500 tokens", 500),
    ("Cap at 200 tokens", 200),
    ("Cap at 100 tokens", 100),
]

daily_reqs = 10_000
out_price = 15.00  # Sonnet output price per M

print(f"{'Scenario':<30} {'Daily Output Cost':>18}")
print("-" * 50)
for name, tok in scenarios:
    cost = daily_reqs * tok / 1e6 * out_price
    print(f"{name:<30} ${cost:>16.2f}")

Output:

Scenario                         Daily Output Cost
--------------------------------------------------
No cap (avg 800 tokens)        $          120.00
Cap at 500 tokens              $           75.00
Cap at 200 tokens              $           30.00
Cap at 100 tokens              $           15.00

Dropping from 800 to 200 cuts output costs by 75%. Tell the model to be concise in the system prompt, then enforce it with max_tokens.

Strategy 5: Compress Your Prompts

Many prompts say the same thing twice in different words. Cut the filler.

Here’s a before/after:

verbose = """Please analyze the following customer feedback data and
provide a comprehensive summary of the main themes, sentiment
distribution, and key actionable insights that our product team
should focus on for the next quarter's roadmap planning process."""

compressed = """Analyze this customer feedback. Return:
1. Top 3 themes
2. Sentiment breakdown (positive/neutral/negative %)
3. Top 3 actionable items for product team"""

enc = tiktoken.encoding_for_model("gpt-4o")
v_count = len(enc.encode(verbose))
c_count = len(enc.encode(compressed))

print(f"Verbose:    {v_count} tokens")
print(f"Compressed: {c_count} tokens")
print(f"Saved:      {v_count - c_count} tokens ({(1 - c_count/v_count)*100:.0f}%)")

Output:

Verbose:    40 tokens
Compressed: 36 tokens
Saved:      4 tokens (10%)

Per-prompt savings look modest. The real win comes from trimming bloated system prompts. A 4,000-token system prompt that you shave to 2,000 saves $60/day at Sonnet pricing across 10K requests.

Strategy 6: Trim Conversation History

Chat apps pile up history. Every turn adds tokens. You send ALL of it with every new request.

Sound familiar? Your chatbot works great for 5 messages. By message 20, it’s eating 10x the tokens per request.

def trim_conversation(messages, max_tokens=2000, model="gpt-4o"):
    """Keep only recent messages within a token budget."""
    enc = tiktoken.encoding_for_model(model)
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]

    sys_tokens = sum(len(enc.encode(m["content"])) for m in system_msgs)
    budget = max_tokens - sys_tokens

    kept = []
    used = 0
    for msg in reversed(other_msgs):
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if used + msg_tokens > budget:
            break
        kept.insert(0, msg)
        used += msg_tokens
    return system_msgs + kept

messages = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(20):
    messages.append({"role": "user", "content": f"Question {i+1} about data science topics."})
    messages.append({"role": "assistant", "content": f"Answer {i+1} with a detailed explanation."})

original = count_chat_tokens(messages)
trimmed_msgs = trim_conversation(messages, max_tokens=500)
trimmed = count_chat_tokens(trimmed_msgs)

print(f"Original: {len(messages)} msgs, {original} tokens")
print(f"Trimmed:  {len(trimmed_msgs)} msgs, {trimmed} tokens")
print(f"Saved:    {original - trimmed} tokens")

Output:

Original: 41 msgs, 564 tokens
Trimmed:  29 msgs, 414 tokens
Saved:    150 tokens

I prefer a sliding window of the last N turns plus the system prompt. You lose early context, but costs stay predictable.
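That sliding window is simpler than the token-budget version above. A minimal sketch (the helper name and default of 5 turns are my own choices):

```python
def last_n_turns(messages, n_turns=5):
    """Keep system messages plus the last n_turns user/assistant exchanges."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]
    return system_msgs + other_msgs[-n_turns * 2:]  # one turn = user + assistant

history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(20):
    history.append({"role": "user", "content": f"Question {i+1}"})
    history.append({"role": "assistant", "content": f"Answer {i+1}"})

print(len(history), "->", len(last_n_turns(history)))  # 41 -> 11
```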

Strategy 7: Ask for Structured Output

When you need data, ask for JSON. It’s compact and parseable. Models produce fewer tokens when constrained to a schema.

prose_prompt = "Analyze the sentiment of this review and explain your reasoning."
struct_prompt = """Analyze this review. Return JSON only:
{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0}"""

enc = tiktoken.encoding_for_model("gpt-4o")
print(f"Prose prompt:      {len(enc.encode(prose_prompt))} input tokens")
print(f"Structured prompt: {len(enc.encode(struct_prompt))} input tokens")
print(f"Expected output savings: ~60-70% (JSON vs prose)")

Output:

Prose prompt:      12 input tokens
Structured prompt: 27 input tokens
Expected output savings: ~60-70% (JSON vs prose)

The structured prompt costs a few more input tokens. But the response shrinks dramatically. Good trade — output tokens cost 3-5x more.
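To make that trade concrete, here's a rough comparison using the ~4-characters-per-token rule of thumb from earlier. Both replies are invented for illustration, not real model output.

```python
def rough_tokens(text):
    """Very rough token estimate: ~4 characters per token in English."""
    return max(1, len(text) // 4)

prose_reply = ("The review expresses a clearly positive sentiment. The customer "
               "praises the product's build quality and fast shipping, and the "
               "enthusiastic tone suggests high confidence in this assessment.")
json_reply = '{"sentiment": "positive", "confidence": 0.92}'

p, j = rough_tokens(prose_reply), rough_tokens(json_reply)
print(f"Prose reply: ~{p} output tokens")
print(f"JSON reply:  ~{j} output tokens")
print(f"Estimated output savings: ~{(1 - j / p) * 100:.0f}%")
```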

Strategy 8: Cache Responses Locally

If the same question appears repeatedly, don’t hit the API again. Cache it.

class ResponseCache:
    """Simple in-memory cache for API responses."""

    def __init__(self):
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def _make_key(self, model, messages):
        content = f"{model}:{str(messages)}"
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, model, messages):
        key = self._make_key(model, messages)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        return None

    def set(self, model, messages, response):
        key = self._make_key(model, messages)
        self.cache[key] = response

    def stats(self):
        total = self.hits + self.misses
        rate = (self.hits / total * 100) if total > 0 else 0
        return f"hits={self.hits}, misses={self.misses}, rate={rate:.0f}%"

cache = ResponseCache()

queries = ["What is Python?"] * 8 + ["What is Java?"] * 2
for q in queries:
    msgs = [{"role": "user", "content": q}]
    if cache.get("gpt-4o-mini", msgs) is None:
        cache.set("gpt-4o-mini", msgs, f"Response for: {q}")

print(f"Cache stats: {cache.stats()}")

Output:

Cache stats: hits=8, misses=2, rate=80%

80% cache hit rate = 80% fewer API calls. For FAQ bots and repetitive workflows, this is enormous.

Strategy 9: Test Cheaper Models First

Don’t default to the flagship. Start with the cheapest model. Move up only if quality falls below your bar.

task_tokens = {"input": 1000, "output": 500}

candidates = [
    ("gemini-2.5-flash-lite", 0.10, 0.40),
    ("gpt-4o-mini", 0.15, 0.60),
    ("gpt-4.1", 2.00, 8.00),
    ("claude-sonnet-4.6", 3.00, 15.00),
]

print(f"{'Model':<25} {'Cost/Request':>12} {'vs Cheapest':>12}")
print("-" * 52)
base = None
for model, inp, out in candidates:
    cost = (task_tokens["input"]/1e6 * inp) + (task_tokens["output"]/1e6 * out)
    if base is None:
        base = cost
    print(f"{model:<25} ${cost:>10.6f} {cost/base:>10.1f}x")

Output:

Model                     Cost/Request  vs Cheapest
----------------------------------------------------
gemini-2.5-flash-lite     $  0.000300        1.0x
gpt-4o-mini               $  0.000450        1.5x
gpt-4.1                   $  0.006000       20.0x
claude-sonnet-4.6         $  0.010500       35.0x

Sonnet costs 35x more per request than Flash-Lite. For classification and extraction tasks, Flash-Lite usually handles them fine. That’s a 97% savings.

Strategy 10: Set Budget Alerts

Add alerting to the CostTracker so runaway costs get caught before the bill shocks you.

class BudgetTracker(CostTracker):
    """CostTracker with budget alerts."""

    def __init__(self, daily_budget=10.0):
        super().__init__()
        self.daily_budget = daily_budget
        self.alerted_75 = False
        self.alerted_90 = False

    def log_request(self, model, input_tokens, output_tokens, tag="default"):
        entry = super().log_request(model, input_tokens, output_tokens, tag)
        spent = self.total_spend()
        pct = (spent / self.daily_budget) * 100

        if pct >= 90 and not self.alerted_90:
            print(f"  ALERT: {pct:.0f}% of ${self.daily_budget} budget!")
            self.alerted_90 = True
        elif pct >= 75 and not self.alerted_75:
            print(f"  WARNING: {pct:.0f}% of ${self.daily_budget} budget")
            self.alerted_75 = True
        return entry

bt = BudgetTracker(daily_budget=0.005)
for i in range(30):
    bt.log_request("gpt-4o-mini", 500, 200, tag="chat")

print(f"\nTotal: ${bt.total_spend():.6f} of ${bt.daily_budget} budget")

Output:

  WARNING: 78% of $0.005 budget
  ALERT: 94% of $0.005 budget!

Total: $0.005850 of $0.005 budget

Tip: In production, wire budget alerts to Slack or PagerDuty. Set three tiers: 50% (info), 75% (warning), 90% (critical). At 90%, auto-downgrade to cheaper models.
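The auto-downgrade step can be sketched as a fallback table. The model pairings below are assumptions; pick fallbacks that meet your own quality bar.

```python
DOWNGRADE = {  # assumed fallback pairs, cheaper model on the right
    "claude-opus-4.6": "claude-sonnet-4.6",
    "claude-sonnet-4.6": "claude-haiku-4.5",
    "gpt-4o": "gpt-4o-mini",
}

def pick_model(requested, spent, daily_budget):
    """Serve the requested model, or its cheaper fallback past 90% of budget."""
    if spent >= 0.90 * daily_budget:
        return DOWNGRADE.get(requested, requested)
    return requested

print(pick_model("claude-sonnet-4.6", spent=4.00, daily_budget=10.0))  # claude-sonnet-4.6
print(pick_model("claude-sonnet-4.6", spent=9.50, daily_budget=10.0))  # claude-haiku-4.5
```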

Common Mistakes That Inflate Your LLM API Costs

Mistake 1: Sending Full Conversation History Every Time

Each turn adds tokens. By turn 20, each request costs 20x what turn 1 cost.

avg_turn_tokens = 150
print(f"{'Turn':<6} {'Tokens':>8} {'Cost (Sonnet)':>15}")
print("-" * 32)
for turn in [1, 5, 10, 15, 20]:
    tokens = turn * avg_turn_tokens
    cost = tokens / 1e6 * 3.00
    print(f"{turn:<6} {tokens:>8,} ${cost:>13.6f}")

Output:

Turn    Tokens  Cost (Sonnet)
--------------------------------
1          150 $     0.000450
5          750 $     0.002250
10       1,500 $     0.004500
15       2,250 $     0.006750
20       3,000 $     0.009000

Use trim_conversation to keep history bounded. Your wallet will thank you.

Mistake 2: Not Setting max_tokens

Without an explicit cap, the model decides response length. Open-ended questions might return 1,000+ tokens when 200 would suffice.

Mistake 3: Using the Premium Model for Everything

Teams default to GPT-4o or Sonnet for every request. Classification, extraction, simple Q&A — all hitting the expensive endpoint. Switch routine tasks to mini/flash models. The quality difference is often negligible.

Mistake 4: Ignoring Prompt Caching

If your system prompt exceeds 1,024 tokens and you send it on every request, enable caching. Setup takes 5 minutes. The savings are immediate and dramatic.

When Should You NOT Optimize LLM Costs?

Cost cutting has limits. Sometimes the right move is spending more:

  • User-facing quality: A chatbot that sounds dumb loses users. Lost users cost more than premium tokens ever will.
  • High-stakes accuracy: Medical, legal, financial tasks need the best model. Wrong answers cost more than expensive tokens.
  • Prototyping phase: Don’t optimize early. Get the product right first. Switching models mid-development wastes engineering time you’ll never get back.
Note: The goal isn’t the cheapest possible bill. It’s the lowest cost at acceptable quality. $50/month with happy users beats $5/month with users who leave.

Your Cost Optimization Checklist

Here’s the priority order for cutting LLM API costs. Start at the top — each strategy builds on the last.

  1. Model routing — Route simple tasks to budget models (60-90% savings)
  2. Prompt caching — Cache system prompts (75-90% savings on cached portion)
  3. Batch API — Non-urgent work at half price (50% savings)
  4. Cap output tokens — max_tokens on every call (30-75% savings)
  5. Trim history — Sliding window on conversation context (20-50% savings)
  6. Compress prompts — Cut redundant instructions (10-30% savings)
  7. Response caching — Skip identical calls
  8. Structured output — JSON instead of prose (40-60% output savings)
  9. Monitor spend — Budget alerts catch runaway costs
  10. Test cheaper models — Don’t assume you need premium

What happens when you stack the top five?

strategies = [
    ("Model routing", 0.70),
    ("Prompt caching", 0.40),
    ("Batch API (50% of work)", 0.25),
    ("Output capping", 0.20),
    ("History trimming", 0.10),
]

monthly = 500.0
remaining = monthly

print(f"Starting: ${monthly:.2f}/month\n")
print(f"{'Strategy':<25} {'Saves':>8} {'Remaining':>10}")
print("-" * 46)

for name, pct in strategies:
    saved = remaining * pct
    remaining -= saved
    print(f"{name:<25} ${saved:>6.2f} ${remaining:>9.2f}")

print(f"\nFinal: ${remaining:.2f}/month")
print(f"Total saved: ${monthly - remaining:.2f} ({(1-remaining/monthly)*100:.0f}%)")

Output:

Starting: $500.00/month

Strategy                    Saves  Remaining
----------------------------------------------
Model routing               $350.00 $   150.00
Prompt caching              $ 60.00 $    90.00
Batch API (50% of work)     $ 22.50 $    67.50
Output capping              $ 13.50 $    54.00
History trimming            $  5.40 $    48.60

Final: $48.60/month
Total saved: $451.40 (90%)

From $500 to under $50 per month. That’s the power of stacking these strategies.

Summary

Tokens drive every dollar of your LLM bill. You now know how to count them, compare prices, track spend, and apply ten strategies that stack to 90% savings.

Three moves matter most: route simple tasks to cheap models, cache your prompts, and batch the rest. Those three alone cut costs 80%+.

Start with model routing today. Ten minutes of work. Biggest savings on day one.

Practice Exercise:

Build a cost optimizer that recommends the cheapest model for a task

Write a function `recommend_model(task_description, quality_threshold)` that:
1. Counts the tokens in the task description
2. Estimates output tokens (2x input as rough estimate)
3. Returns the cheapest model that meets the quality threshold

def recommend_model(task_description, quality_threshold="standard"):
    tiers = {
        "basic": [("gemini-2.5-flash-lite", 0.10, 0.40),
                  ("gpt-4o-mini", 0.15, 0.60)],
        "standard": [("gpt-4.1", 2.00, 8.00),
                     ("claude-sonnet-4.6", 3.00, 15.00)],
        "premium": [("claude-opus-4.6", 5.00, 25.00)],
    }
    input_tokens = count_tokens(task_description)
    output_tokens = input_tokens * 2
    candidates = tiers.get(quality_threshold, tiers["standard"])
    best_model, best_cost = None, float("inf")
    for model, inp_price, out_price in candidates:
        cost = (input_tokens/1e6 * inp_price) + (output_tokens/1e6 * out_price)
        if cost < best_cost:
            best_cost = cost
            best_model = model
    return best_model, best_cost

model, cost = recommend_model("Classify this email as urgent or not", "basic")
print(f"Recommended: {model} (est. cost: ${cost:.8f})")

{
  type: 'exercise',
  id: 'cost-calculator-ex1',
  title: 'Exercise 1: Calculate Monthly API Cost',
  difficulty: 'beginner',
  exerciseType: 'write',
  instructions: 'Write a function monthly_cost(model, daily_requests, avg_input_tokens, avg_output_tokens) that returns the estimated monthly cost (30 days). Use the PRICING dictionary provided.',
  starterCode: 'PRICING = {\n    "gpt-4o-mini": (0.15, 0.60),\n    "gpt-4o": (2.50, 10.00),\n    "claude-sonnet-4.6": (3.00, 15.00),\n    "gemini-2.5-flash-lite": (0.10, 0.40),\n}\n\ndef monthly_cost(model, daily_requests, avg_input_tokens, avg_output_tokens):\n    # Your code here\n    pass\n\nprint(monthly_cost("gpt-4o-mini", 1000, 500, 200))',
  testCases: [
    { id: 'tc1', input: 'print(monthly_cost("gpt-4o-mini", 1000, 500, 200))', expectedOutput: '5.85', description: 'GPT-4o-mini: 1000 reqs/day, 500 in, 200 out' },
    { id: 'tc2', input: 'print(monthly_cost("claude-sonnet-4.6", 500, 1000, 500))', expectedOutput: '157.5', description: 'Sonnet: 500 reqs/day, 1000 in, 500 out' },
    { id: 'tc3', input: 'print(monthly_cost("gemini-2.5-flash-lite", 10000, 200, 100))', expectedOutput: '18.0', description: 'Flash-Lite at scale' },
  ],
  hints: [
    'Daily cost = (daily_requests * avg_input_tokens / 1_000_000 * input_price) + (daily_requests * avg_output_tokens / 1_000_000 * output_price). Multiply by 30.',
    'Full answer: inp, out = PRICING[model]; return ((daily_requests * avg_input_tokens / 1e6 * inp) + (daily_requests * avg_output_tokens / 1e6 * out)) * 30',
  ],
  solution: 'def monthly_cost(model, daily_requests, avg_input_tokens, avg_output_tokens):\n    inp_price, out_price = PRICING[model]\n    daily = (daily_requests * avg_input_tokens / 1e6 * inp_price) + (daily_requests * avg_output_tokens / 1e6 * out_price)\n    return daily * 30',
  solutionExplanation: 'Look up the per-million-token prices, multiply by actual token volume, and scale to 30 days.',
  xpReward: 15,
}

{
  type: 'exercise',
  id: 'cost-optimizer-ex2',
  title: 'Exercise 2: Build a Model Router',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function cheapest_model(input_tokens, output_tokens, models) that returns (model_name, total_cost) for the cheapest option in the given pricing dict.',
  starterCode: 'MODELS = {\n    "gpt-4o-mini": (0.15, 0.60),\n    "gpt-4.1": (2.00, 8.00),\n    "claude-sonnet-4.6": (3.00, 15.00),\n    "gemini-2.5-flash-lite": (0.10, 0.40),\n}\n\ndef cheapest_model(input_tokens, output_tokens, models=MODELS):\n    # Your code here\n    pass\n\nname, cost = cheapest_model(1000, 500)\nprint(f"{name}: ${cost:.6f}")',
  testCases: [
    { id: 'tc1', input: 'name, cost = cheapest_model(1000, 500)\nprint(f"{name}: ${cost:.6f}")', expectedOutput: 'gemini-2.5-flash-lite: $0.000300', description: 'Should pick cheapest option' },
    { id: 'tc2', input: 'name, _ = cheapest_model(0, 1000000)\nprint(name)', expectedOutput: 'gemini-2.5-flash-lite', description: 'Output-heavy: still cheapest' },
  ],
  hints: [
    'Loop through models.items(). For each, compute cost = (input_tokens/1e6 * inp) + (output_tokens/1e6 * out). Track minimum.',
    'One-liner: return min(((n, input_tokens/1e6*p[0] + output_tokens/1e6*p[1]) for n, p in models.items()), key=lambda x: x[1])',
  ],
  solution: 'def cheapest_model(input_tokens, output_tokens, models=MODELS):\n    best_name, best_cost = None, float("inf")\n    for name, (inp_p, out_p) in models.items():\n        cost = (input_tokens/1e6 * inp_p) + (output_tokens/1e6 * out_p)\n        if cost < best_cost:\n            best_cost = cost\n            best_name = name\n    return best_name, best_cost',
  solutionExplanation: 'Iterate all models, compute cost for the given token counts, return the cheapest. This is the core of any model routing system.',
  xpReward: 15,
}

Frequently Asked Questions

How do tokens differ across LLM providers?

Each provider uses a different tokenizer. The same text might be 100 tokens on OpenAI and 95 on Anthropic. For estimation, tiktoken works as a proxy for all providers — accurate within 5-10%.

Can I stack prompt caching with the batch API?

Yes. Both OpenAI and Anthropic allow it. On Anthropic, batch gives 50% off and caching gives 90% off the cached portion. Combined, your cached system prompt in batch mode costs about 5% of the standard price.
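The stacking math is simply multiplicative. A quick sketch using Claude Sonnet's input rate from the pricing table above (discount percentages as documented by Anthropic; check current docs before relying on them):

```python
# Stacked discounts multiply: batch (50% off) x cache read (90% off the cached part).
BASE_INPUT_PRICE = 3.00   # claude-sonnet-4.6, $ per million input tokens
batch_factor = 1 - 0.50   # batch API: pay 50%
cache_factor = 1 - 0.90   # cache hit: pay 10%

effective = BASE_INPUT_PRICE * batch_factor * cache_factor
print(f"Cached system prompt in batch mode: ${effective:.2f}/M tokens "
      f"({effective / BASE_INPUT_PRICE:.0%} of standard)")
```

That works out to $0.15 per million tokens, i.e. 5% of the standard rate, which is where the figure above comes from.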

What does a typical chatbot cost to run per month?

A chatbot handling 1,000 conversations/day with GPT-4o-mini costs $5-15/month. With Claude Sonnet, the same workload costs $100-300/month. It depends on conversation length and response size.
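You can reproduce the low end of that range yourself. A back-of-envelope sketch, assuming ~500 input and ~200 output tokens per conversation (illustrative averages, not measured data):

```python
# $/M tokens, from the pricing table earlier in the article
PRICING = {"gpt-4o-mini": (0.15, 0.60), "claude-sonnet-4.6": (3.00, 15.00)}

def monthly_cost(model, daily_requests, avg_in, avg_out):
    inp, out = PRICING[model]
    daily = daily_requests * (avg_in / 1e6 * inp + avg_out / 1e6 * out)
    return daily * 30

print(f"gpt-4o-mini:       ${monthly_cost('gpt-4o-mini', 1000, 500, 200):.2f}/month")
print(f"claude-sonnet-4.6: ${monthly_cost('claude-sonnet-4.6', 1000, 500, 200):.2f}/month")
```

With these assumptions the mini model lands around $5.85/month and Sonnet around $135/month; longer conversations push both figures toward the top of their ranges.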

Does tiktoken work for counting Claude and Gemini tokens?

Not precisely. tiktoken uses OpenAI’s BPE tokenizer; Claude and Gemini use different ones. But for cost estimation, tiktoken is close enough (within roughly 10%). For exact counts, Anthropic returns token usage in the API response’s usage field, and Google offers a count_tokens method.

What are the biggest cost monitoring tools available?

Several LLMOps platforms track costs automatically: LangSmith (LangChain’s observability tool), Helicone (proxy-based tracking), and Portkey (multi-provider gateway with cost dashboards). For simpler needs, the CostTracker class in this article gives you the core logic without external dependencies.

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code from: LLM API Costs -- Token Counting, Pricing, and Optimization
# Requires: pip install tiktoken
# Python 3.10+

import tiktoken
from datetime import datetime
import hashlib

# --- Token Counting ---
def count_tokens(text, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def count_chat_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for message in messages:
        total += 4
        for key, value in message.items():
            total += len(enc.encode(value))
    total += 2
    return total

# --- Cost Tracker ---
class CostTracker:
    PRICING = {
        "gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00),
        "gpt-4.1": (2.00, 8.00), "claude-haiku-4.5": (1.00, 5.00),
        "claude-sonnet-4.6": (3.00, 15.00), "claude-opus-4.6": (5.00, 25.00),
        "gemini-2.5-flash-lite": (0.10, 0.40), "gemini-2.5-pro": (1.25, 10.00),
    }

    def __init__(self):
        self.logs = []

    def log_request(self, model, input_tokens, output_tokens, tag="default"):
        inp_price, out_price = self.PRICING[model]
        input_cost = (input_tokens / 1e6) * inp_price
        output_cost = (output_tokens / 1e6) * out_price
        entry = {"timestamp": datetime.now().isoformat(), "model": model,
                 "input_tokens": input_tokens, "output_tokens": output_tokens,
                 "total_cost": input_cost + output_cost, "tag": tag}
        self.logs.append(entry)
        return entry

    def summary_by_model(self):
        summary = {}
        for log in self.logs:
            m = log["model"]
            if m not in summary:
                summary[m] = {"requests": 0, "total_cost": 0.0}
            summary[m]["requests"] += 1
            summary[m]["total_cost"] += log["total_cost"]
        return summary

    def summary_by_tag(self):
        summary = {}
        for log in self.logs:
            t = log["tag"]
            if t not in summary:
                summary[t] = {"requests": 0, "total_cost": 0.0}
            summary[t]["requests"] += 1
            summary[t]["total_cost"] += log["total_cost"]
        return summary

    def total_spend(self):
        return sum(log["total_cost"] for log in self.logs)

# --- Budget Tracker ---
class BudgetTracker(CostTracker):
    def __init__(self, daily_budget=10.0):
        super().__init__()
        self.daily_budget = daily_budget
        self.alerted_75 = False
        self.alerted_90 = False

    def log_request(self, model, input_tokens, output_tokens, tag="default"):
        entry = super().log_request(model, input_tokens, output_tokens, tag)
        pct = (self.total_spend() / self.daily_budget) * 100
        if pct >= 90 and not self.alerted_90:
            print(f"  ALERT: {pct:.0f}% budget used!")
            self.alerted_90 = True
        elif pct >= 75 and not self.alerted_75:
            print(f"  WARNING: {pct:.0f}% budget used")
            self.alerted_75 = True
        return entry

# --- Model Router ---
def route_to_model(task_type):
    routing = {
        "classification": "gpt-4o-mini", "extraction": "gpt-4o-mini",
        "summarization": "gemini-2.5-flash-lite", "code_generation": "gpt-4.1",
        "complex_reasoning": "claude-sonnet-4.6", "creative_writing": "claude-opus-4.6",
    }
    return routing.get(task_type, "gpt-4o-mini")

# --- Conversation Trimmer ---
def trim_conversation(messages, max_tokens=2000, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]
    sys_tokens = sum(len(enc.encode(m["content"])) for m in system_msgs)
    budget = max_tokens - sys_tokens
    kept, used = [], 0
    for msg in reversed(other_msgs):
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if used + msg_tokens > budget:
            break
        kept.insert(0, msg)
        used += msg_tokens
    return system_msgs + kept

# --- Response Cache ---
class ResponseCache:
    def __init__(self):
        self.cache, self.hits, self.misses = {}, 0, 0

    def get(self, model, messages):
        key = hashlib.md5(f"{model}:{messages}".encode()).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        return None

    def set(self, model, messages, response):
        key = hashlib.md5(f"{model}:{messages}".encode()).hexdigest()
        self.cache[key] = response

# --- Demo ---
if __name__ == "__main__":
    tracker = CostTracker()
    for _ in range(500):
        tracker.log_request("gpt-4o-mini", 800, 200, "chat")
    for _ in range(100):
        tracker.log_request("claude-sonnet-4.6", 4000, 500, "summarization")
    for _ in range(50):
        tracker.log_request("gpt-4.1", 1500, 2000, "code-gen")
    print(f"Total spend: ${tracker.total_spend():.4f}")
    for model, data in tracker.summary_by_model().items():
        print(f"  {model}: ${data['total_cost']:.4f}")
    print("Script completed successfully.")

References

  1. OpenAI — API Pricing (March 2026). Link
  2. Anthropic — Claude API Pricing. Link
  3. Google — Gemini API Pricing. Link
  4. OpenAI — How to count tokens with tiktoken. Link
  5. OpenAI — Prompt Caching Guide. Link
  6. Anthropic — Prompt Caching Documentation. Link
  7. OpenAI — Batch API Documentation. Link
  8. tiktoken — GitHub Repository. Link
  9. Redis — LLM Token Optimization Guide. Link
  10. TLDL — LLM API Pricing March 2026. Link
