LLM API Pricing Guide: Compare & Optimize Costs
Every dollar you spend on LLM APIs comes down to tokens. Here’s how to count them, compare prices across OpenAI, Claude, and Gemini, and slash your bill by up to 90%.
You called an API ten times during prototyping. The bill was $0.03. Fine. Then you shipped to production, and suddenly you’re burning $200 a day.
What happened? Tokens happened. Every prompt you send and every response you get back is measured in tokens. Miss this, and costs spiral. Track it, and you control exactly where your money goes.
What Are Tokens and Why Do They Control Your LLM Costs?
# Running in Pyodide (browser). In a local environment, install with pip instead:
# pip install tiktoken
import micropip
await micropip.install(['tiktoken'])
import tiktoken
from datetime import datetime
import hashlib
encoder = tiktoken.encoding_for_model("gpt-4o")
text = "Managing LLM costs is essential for production apps."
tokens = encoder.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
Output:
Text: Managing LLM costs is essential for production apps.
Token count: 9
Tokens: [38032, 445, 11237, 7194, 374, 7718, 369, 5788, 10721]
A token is a chunk of text. In English, one token equals roughly 4 characters. The model doesn’t see words. It sees tokens.
Here’s what trips people up: you pay twice per request. Input tokens have one rate. Output tokens cost more. Usually 3-5x more.
Why the gap? The model reads all input tokens at once. But it writes output tokens one by one. Each one needs a full pass through the model. More work means a higher price.
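The two rates combine per request. Here's a quick sketch using GPT-4o's rates from the pricing tables below ($2.50 input / $10.00 output per million tokens):

```python
def request_cost(input_tokens, output_tokens, input_price=2.50, output_price=10.00):
    """Per-request cost with separate input/output rates (prices in $ per 1M tokens)."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

# A 500-token prompt with a 300-token reply:
print(f"${request_cost(500, 300):.6f}")  # → $0.004250
```

Notice that the 300 output tokens ($0.003) cost more than the 500 input tokens ($0.00125), even though there are fewer of them.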
Prerequisites
- Python version: 3.10+
- Required library: tiktoken (pip install tiktoken)
- Pyodide compatible: Yes (tiktoken runs in the browser)
- Time to complete: ~25 minutes
- Cost: $0 (all calculations run locally — no API keys needed)
How Do You Count Tokens Before Sending a Request?
You can’t cut costs you don’t measure. Counting tokens before each API call is the first step to controlling your LLM API costs.
tiktoken is OpenAI’s official tokenizer. It gives you the exact token count the API will bill you for.
def count_tokens(text, model="gpt-4o"):
"""Count tokens for a given text and model."""
enc = tiktoken.encoding_for_model(model)
return len(enc.encode(text))
short = "What is Python?"
print(f"Short prompt: {count_tokens(short)} tokens")
long_prompt = """You are a senior data scientist. Analyze this dataset
and provide insights on customer churn patterns. Focus on the top 3
factors driving churn and suggest retention strategies."""
print(f"Long prompt: {count_tokens(long_prompt)} tokens")
Output:
Short prompt: 3 tokens
Long prompt: 33 tokens
That works for a single message. But real API calls bundle system prompts, conversation history, and user messages. Every piece adds tokens.
Quick check: Before you read the next block, guess — how many extra tokens does the chat message format add per message? (Answer: about 4 overhead tokens.)
The function below counts tokens for a full chat conversation. It accounts for the overhead that OpenAI’s chat format adds per message. The API charges for those formatting tokens too.
def count_chat_tokens(messages, model="gpt-4o"):
"""Count tokens for a full chat conversation."""
enc = tiktoken.encoding_for_model(model)
tokens_per_message = 4 # overhead per message
total = 0
for message in messages:
total += tokens_per_message
for key, value in message.items():
total += len(enc.encode(value))
total += 2 # reply priming tokens
return total
messages = [
{"role": "system", "content": "You are a helpful data science tutor."},
{"role": "user", "content": "Explain gradient descent in simple terms."},
]
print(f"Chat tokens: {count_chat_tokens(messages)}")
Output:
Chat tokens: 24
LLM API Pricing Comparison — March 2026
Pricing moves fast. Here’s what the major providers charge as of March 2026, organized by cost tier so you can match your budget to your use case.
Budget Tier — Under $0.50 per Million Input Tokens
These models handle classification, extraction, and simple chat. Don’t underestimate them — they’re surprisingly good for routine tasks.
| Model | Input (per 1M) | Output (per 1M) | Context |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M |
| GPT-4o-mini | $0.15 | $0.60 | 128K |
| DeepSeek V3.2 | $0.28 | $0.42 | 128K |
| Gemini 2.0 Flash | $0.30 | $2.50 | 1M |
Mid Tier — $1 to $3 per Million Input Tokens
The workhorses. This tier covers 80% of the production use cases I’ve encountered.
| Model | Input (per 1M) | Output (per 1M) | Context |
|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M |
| GPT-4.1 | $2.00 | $8.00 | 1M |
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-5.4 | $2.50 | $10.00 | 1M |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M |
Premium Tier — $5+ per Million Input Tokens
When quality matters more than cost. Complex reasoning, creative writing, nuanced analysis.
| Model | Input (per 1M) | Output (per 1M) | Context |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 1M |
What do these prices mean for a real app? Say you handle 10,000 requests per day. Each request has a 500-token prompt and a 300-token response.
daily_requests = 10_000
input_tokens_per_request = 500
output_tokens_per_request = 300
models = {
"GPT-4o-mini": (0.15, 0.60),
"Claude Sonnet 4.6": (3.00, 15.00),
"Gemini 2.5 Flash-Lite": (0.10, 0.40),
}
print(f"{'Model':<25} {'Daily Cost':>10} {'Monthly Cost':>12}")
print("-" * 50)
for model, (inp_price, out_price) in models.items():
daily_input = (daily_requests * input_tokens_per_request / 1e6) * inp_price
daily_output = (daily_requests * output_tokens_per_request / 1e6) * out_price
daily_total = daily_input + daily_output
monthly_total = daily_total * 30
    print(f"{model:<25} ${daily_total:>9.2f} ${monthly_total:>11.2f}")
Result:
Model Daily Cost Monthly Cost
--------------------------------------------------
GPT-4o-mini $ 2.55 $ 76.50
Claude Sonnet 4.6 $ 60.00 $ 1800.00
Gemini 2.5 Flash-Lite $ 1.70 $ 51.00
Sonnet costs 35x more than Flash-Lite for the identical workload. That’s $1,749 per month difference. Model selection is far and away your biggest cost lever.
Watch Out: Reasoning Models Have Hidden Costs
Models like OpenAI’s o3 and o4-mini “think” before they answer. They generate hidden tokens as part of their chain of thought. A reply with 200 visible tokens might burn 10,000-30,000 thinking tokens behind the scenes.
You pay for those thinking tokens at the output rate. One request can cost 50-100x what you’d guess. If you use reasoning models, track total tokens — visible plus thinking.
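A rough sketch of that effect. The output rate and the thinking-token count here are illustrative assumptions, not published figures for any specific model:

```python
# Illustrative numbers — not published pricing for any specific reasoning model.
output_price = 8.00       # $ per 1M output tokens (assumed)
visible_tokens = 200      # tokens you actually see in the reply
thinking_tokens = 15_000  # hidden chain-of-thought tokens, billed at the same rate

naive_cost = (visible_tokens / 1e6) * output_price
actual_cost = ((visible_tokens + thinking_tokens) / 1e6) * output_price
print(f"Naive estimate: ${naive_cost:.4f}")
print(f"Actual cost:    ${actual_cost:.4f} ({actual_cost / naive_cost:.0f}x)")
```

With these assumed numbers the request costs 76x the naive estimate, squarely in the 50-100x range.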
Build a Cost Tracking Dashboard
Knowing prices is step one. Tracking actual spend in real time is step two. Here’s a CostTracker class that logs every API call and shows exactly where your money goes.
It tracks eight popular models, logs token counts, computes costs, and gives you reports by model and by feature. It needs only Python built-ins, so it runs in Pyodide too.
class CostTracker:
"""Track LLM API costs across multiple providers."""
PRICING = {
"gpt-4o-mini": (0.15, 0.60),
"gpt-4o": (2.50, 10.00),
"gpt-4.1": (2.00, 8.00),
"claude-haiku-4.5": (1.00, 5.00),
"claude-sonnet-4.6": (3.00, 15.00),
"claude-opus-4.6": (5.00, 25.00),
"gemini-2.5-flash-lite":(0.10, 0.40),
"gemini-2.5-pro": (1.25, 10.00),
}
def __init__(self):
self.logs = []
def log_request(self, model, input_tokens, output_tokens, tag="default"):
"""Log a single API request with cost calculation."""
if model not in self.PRICING:
raise ValueError(f"Unknown model: {model}")
inp_price, out_price = self.PRICING[model]
input_cost = (input_tokens / 1_000_000) * inp_price
output_cost = (output_tokens / 1_000_000) * out_price
entry = {
"timestamp": datetime.now().isoformat(),
"model": model, "input_tokens": input_tokens,
"output_tokens": output_tokens,
"input_cost": input_cost, "output_cost": output_cost,
"total_cost": input_cost + output_cost, "tag": tag,
}
self.logs.append(entry)
return entry
def summary_by_model(self):
"""Cost breakdown grouped by model."""
summary = {}
for log in self.logs:
m = log["model"]
if m not in summary:
summary[m] = {"requests": 0, "total_cost": 0.0}
summary[m]["requests"] += 1
summary[m]["total_cost"] += log["total_cost"]
return summary
def summary_by_tag(self):
"""Cost breakdown grouped by usage tag."""
summary = {}
for log in self.logs:
t = log["tag"]
if t not in summary:
summary[t] = {"requests": 0, "total_cost": 0.0}
summary[t]["requests"] += 1
summary[t]["total_cost"] += log["total_cost"]
return summary
def total_spend(self):
return sum(log["total_cost"] for log in self.logs)
Each log_request call records the model, token counts, computed costs, and a tag like “chat” or “summarization.” The tag is the key — it lets you track costs per feature in your app.
Predict the output: We’ll log 500 cheap chat requests, 100 mid-tier summarizations, and 50 code generation calls. Which feature will eat the most budget?
tracker = CostTracker()
for _ in range(500):
tracker.log_request("gpt-4o-mini", input_tokens=800,
output_tokens=200, tag="chat")
for _ in range(100):
tracker.log_request("claude-sonnet-4.6", input_tokens=4000,
output_tokens=500, tag="summarization")
for _ in range(50):
tracker.log_request("gpt-4.1", input_tokens=1500,
output_tokens=2000, tag="code-gen")
print(f"Total spend: ${tracker.total_spend():.4f}\n")
print("=== Cost by Model ===")
for model, data in tracker.summary_by_model().items():
print(f" {model}: {data['requests']} reqs, ${data['total_cost']:.4f}")
print("\n=== Cost by Feature ===")
for tag, data in tracker.summary_by_tag().items():
print(f" {tag}: {data['requests']} reqs, ${data['total_cost']:.4f}")
Here’s what you see:
Total spend: $3.0200
=== Cost by Model ===
gpt-4o-mini: 500 reqs, $0.1200
claude-sonnet-4.6: 100 reqs, $1.9500
gpt-4.1: 50 reqs, $0.9500
=== Cost by Feature ===
chat: 500 reqs, $0.1200
summarization: 100 reqs, $1.9500
code-gen: 50 reqs, $0.9500
Sonnet eats 65% of the budget with just 100 requests. Chat runs 5x more volume but costs a fraction. The model price and input size make all the difference. This kind of insight changes how you build LLM apps.
10 Strategies to Cut Your LLM API Costs
I’ve ranked these by impact. The first three alone can cut costs by 80%.
Strategy 1: Route Tasks to the Cheapest Model That Works
Not every request needs your flagship model. A classification task doesn’t need Claude Opus. Route requests based on complexity.
def route_to_model(task_type):
"""Route tasks to the most cost-effective model."""
routing = {
"classification": "gpt-4o-mini",
"extraction": "gpt-4o-mini",
"summarization": "gemini-2.5-flash-lite",
"code_generation": "gpt-4.1",
"complex_reasoning": "claude-sonnet-4.6",
"creative_writing": "claude-opus-4.6",
}
return routing.get(task_type, "gpt-4o-mini")
for task in ["classification", "summarization", "complex_reasoning"]:
print(f" {task:>20} -> {route_to_model(task)}")
Output:
classification -> gpt-4o-mini
summarization -> gemini-2.5-flash-lite
complex_reasoning -> claude-sonnet-4.6
Route 70% of requests to budget models. Send 20% to mid-tier. Save premium for just 10%. That mix cuts costs 60-80%. It’s the single biggest lever you have.
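To sanity-check that range, here's the blended per-request cost of a 70/20/10 mix, reusing the 500-in/300-out workload and prices from the tables above (which model represents each tier is an illustrative choice):

```python
def req_cost(inp_price, out_price, inp_tok=500, out_tok=300):
    """Per-request cost, prices in $ per 1M tokens."""
    return (inp_tok / 1e6) * inp_price + (out_tok / 1e6) * out_price

budget = req_cost(0.15, 0.60)    # gpt-4o-mini
mid = req_cost(2.00, 8.00)       # gpt-4.1
premium = req_cost(3.00, 15.00)  # claude-sonnet-4.6

# 70% budget, 20% mid-tier, 10% premium
blended = 0.70 * budget + 0.20 * mid + 0.10 * premium
print(f"All premium:  ${premium:.6f}/request")
print(f"70/20/10 mix: ${blended:.6f}/request")
print(f"Savings: {(1 - blended / premium) * 100:.0f}%")
```

With these tier picks the mix comes out to a 76% saving versus sending everything to Sonnet.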
Strategy 2: Enable Prompt Caching
Does your system prompt stay the same across requests? Prompt caching will save you a fortune. The provider stores your prompt on their servers. Each reuse costs pennies.
| Provider | Cache Write | Cache Read | Savings on Cached Portion |
|---|---|---|---|
| OpenAI (GPT-4.1) | 1x standard | 25% of standard | 75% |
| OpenAI (GPT-5) | 1x standard | 10% of standard | 90% |
| Anthropic | 1.25x standard | 10% of standard | 90% |
| Google (Gemini) | 1x standard | Free (some models) | Up to 100% |
What does that look like for a 2,000-token system prompt sent 10,000 times daily?
system_tokens = 2000
daily_reqs = 10_000
price_per_m = 3.00 # Sonnet input price
no_cache = (system_tokens * daily_reqs / 1e6) * price_per_m
cache_write = (system_tokens / 1e6) * price_per_m * 1.25
cache_reads = (system_tokens * (daily_reqs - 1) / 1e6) * price_per_m * 0.10
cached = cache_write + cache_reads
print(f"Without caching: ${no_cache:.2f}/day")
print(f"With caching: ${cached:.2f}/day")
print(f"Savings: ${no_cache - cached:.2f}/day ({(1 - cached/no_cache)*100:.0f}%)")
Output:
Without caching: $60.00/day
With caching: $6.01/day
Savings: $53.99/day (90%)
Strategy 3: Use the Batch API for Non-Urgent Work
OpenAI and Anthropic both offer batch APIs. Submit a file of requests, wait up to 24 hours, and pay 50% less. You can even combine batch with prompt caching.
Good candidates: nightly data processing, bulk classification, eval pipelines, document analysis.
daily_input = 50_000_000 # 50M input tokens
daily_output = 10_000_000 # 10M output tokens
standard = (daily_input/1e6 * 2.00) + (daily_output/1e6 * 8.00)
batch = standard * 0.50
print(f"Standard: ${standard:.2f}/day")
print(f"Batch: ${batch:.2f}/day")
print(f"Monthly savings: ${(standard - batch) * 30:.2f}")
Output:
Standard: $180.00/day
Batch: $90.00/day
Monthly savings: $2700.00
Strategy 4: Cap Output Tokens with max_tokens
Every extra output token costs money. Without max_tokens, the model decides how long to respond. For open-ended questions, that could mean 1,000+ tokens when 200 would do.
scenarios = [
("No cap (avg 800 tokens)", 800),
("Cap at 500 tokens", 500),
("Cap at 200 tokens", 200),
("Cap at 100 tokens", 100),
]
daily_reqs = 10_000
out_price = 15.00 # Sonnet output price per M
print(f"{'Scenario':<30} {'Daily Output Cost':>18}")
print("-" * 50)
for name, tok in scenarios:
cost = daily_reqs * tok / 1e6 * out_price
print(f"{name:<30} ${cost:>16.2f}")
Output:
Scenario Daily Output Cost
--------------------------------------------------
No cap (avg 800 tokens) $ 120.00
Cap at 500 tokens $ 75.00
Cap at 200 tokens $ 30.00
Cap at 100 tokens $ 15.00
Dropping from 800 to 200 cuts output costs by 75%. Tell the model to be concise in the system prompt, then enforce it with max_tokens.
Strategy 5: Compress Your Prompts
Many prompts say the same thing twice in different words. Cut the filler.
Here’s a before/after:
verbose = """Please analyze the following customer feedback data and
provide a comprehensive summary of the main themes, sentiment
distribution, and key actionable insights that our product team
should focus on for the next quarter's roadmap planning process."""
compressed = """Analyze this customer feedback. Return:
1. Top 3 themes
2. Sentiment breakdown (positive/neutral/negative %)
3. Top 3 actionable items for product team"""
enc = tiktoken.encoding_for_model("gpt-4o")
v_count = len(enc.encode(verbose))
c_count = len(enc.encode(compressed))
print(f"Verbose: {v_count} tokens")
print(f"Compressed: {c_count} tokens")
print(f"Saved: {v_count - c_count} tokens ({(1 - c_count/v_count)*100:.0f}%)")
Output:
Verbose: 40 tokens
Compressed: 36 tokens
Saved: 4 tokens (10%)
Per-prompt savings look modest. The real win comes from trimming bloated system prompts. A 4,000-token system prompt that you shave to 2,000 saves $60/day at Sonnet pricing across 10K requests.
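The arithmetic behind that estimate, as a reusable helper:

```python
def daily_savings(tokens_saved_per_request, daily_requests, input_price):
    """Daily savings from trimming a prompt (input_price in $ per 1M tokens)."""
    return (tokens_saved_per_request * daily_requests / 1e6) * input_price

# 2,000 tokens trimmed, 10,000 requests/day, Sonnet input rate
print(daily_savings(2_000, 10_000, 3.00))  # → 60.0
```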
Strategy 6: Trim Conversation History
Chat apps pile up history. Every turn adds tokens. You send ALL of it with every new request.
Sound familiar? Your chatbot works great for 5 messages. By message 20, it’s eating 10x the tokens per request.
def trim_conversation(messages, max_tokens=2000, model="gpt-4o"):
"""Keep only recent messages within a token budget."""
enc = tiktoken.encoding_for_model(model)
system_msgs = [m for m in messages if m["role"] == "system"]
other_msgs = [m for m in messages if m["role"] != "system"]
sys_tokens = sum(len(enc.encode(m["content"])) for m in system_msgs)
budget = max_tokens - sys_tokens
kept = []
used = 0
for msg in reversed(other_msgs):
msg_tokens = len(enc.encode(msg["content"])) + 4
if used + msg_tokens > budget:
break
kept.insert(0, msg)
used += msg_tokens
return system_msgs + kept
messages = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(20):
messages.append({"role": "user", "content": f"Question {i+1} about data science topics."})
messages.append({"role": "assistant", "content": f"Answer {i+1} with a detailed explanation."})
original = count_chat_tokens(messages)
trimmed_msgs = trim_conversation(messages, max_tokens=500)
trimmed = count_chat_tokens(trimmed_msgs)
print(f"Original: {len(messages)} msgs, {original} tokens")
print(f"Trimmed: {len(trimmed_msgs)} msgs, {trimmed} tokens")
print(f"Saved: {original - trimmed} tokens")
Output:
Original: 41 msgs, 564 tokens
Trimmed: 29 msgs, 414 tokens
Saved: 150 tokens
I prefer a sliding window of the last N turns plus the system prompt. You lose early context, but costs stay predictable.
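A minimal sketch of that sliding window, as an alternative to the token-budget trimmer above (the default of 5 turns is an arbitrary choice):

```python
def last_n_turns(messages, n=5):
    """Keep system messages plus the last n user/assistant exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * n:]  # each turn = one user + one assistant message

history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(20):
    history.append({"role": "user", "content": f"Question {i + 1}"})
    history.append({"role": "assistant", "content": f"Answer {i + 1}"})

print(len(last_n_turns(history)))  # → 11 (system prompt + 5 turns)
```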
Strategy 7: Ask for Structured Output
When you need data, ask for JSON. It’s compact and parseable. Models produce fewer tokens when constrained to a schema.
prose_prompt = "Analyze the sentiment of this review and explain your reasoning."
struct_prompt = """Analyze this review. Return JSON only:
{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0}"""
enc = tiktoken.encoding_for_model("gpt-4o")
print(f"Prose prompt: {len(enc.encode(prose_prompt))} input tokens")
print(f"Structured prompt: {len(enc.encode(struct_prompt))} input tokens")
print(f"Expected output savings: ~60-70% (JSON vs prose)")
Output:
Prose prompt: 12 input tokens
Structured prompt: 27 input tokens
Expected output savings: ~60-70% (JSON vs prose)
The structured prompt costs a few more input tokens. But the response shrinks dramatically. Good trade — output tokens cost 3-5x more.
Strategy 8: Cache Responses Locally
If the same question appears repeatedly, don’t hit the API again. Cache it.
class ResponseCache:
"""Simple in-memory cache for API responses."""
def __init__(self):
self.cache = {}
self.hits = 0
self.misses = 0
def _make_key(self, model, messages):
content = f"{model}:{str(messages)}"
return hashlib.md5(content.encode()).hexdigest()
def get(self, model, messages):
key = self._make_key(model, messages)
if key in self.cache:
self.hits += 1
return self.cache[key]
self.misses += 1
return None
def set(self, model, messages, response):
key = self._make_key(model, messages)
self.cache[key] = response
def stats(self):
total = self.hits + self.misses
rate = (self.hits / total * 100) if total > 0 else 0
return f"hits={self.hits}, misses={self.misses}, rate={rate:.0f}%"
cache = ResponseCache()
queries = ["What is Python?"] * 8 + ["What is Java?"] * 2
for q in queries:
msgs = [{"role": "user", "content": q}]
if cache.get("gpt-4o-mini", msgs) is None:
cache.set("gpt-4o-mini", msgs, f"Response for: {q}")
print(f"Cache stats: {cache.stats()}")
Output:
Cache stats: hits=8, misses=2, rate=80%
80% cache hit rate = 80% fewer API calls. For FAQ bots and repetitive workflows, this is enormous.
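The savings scale linearly with the hit rate. A rough sketch using the per-request mini cost from the earlier pricing example (500 input / 300 output tokens at GPT-4o-mini rates):

```python
daily_calls = 10_000
cost_per_call = 0.000255  # gpt-4o-mini: 500 in + 300 out tokens

for hit_rate in (0.2, 0.5, 0.8):
    net = daily_calls * (1 - hit_rate) * cost_per_call
    print(f"{hit_rate:.0%} hit rate -> ${net:.2f}/day "
          f"(vs ${daily_calls * cost_per_call:.2f} uncached)")
```

At an 80% hit rate the daily API spend drops from $2.55 to $0.51.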
Strategy 9: Test Cheaper Models First
Don’t default to the flagship. Start with the cheapest model. Move up only if quality falls below your bar.
task_tokens = {"input": 1000, "output": 500}
candidates = [
("gemini-2.5-flash-lite", 0.10, 0.40),
("gpt-4o-mini", 0.15, 0.60),
("gpt-4.1", 2.00, 8.00),
("claude-sonnet-4.6", 3.00, 15.00),
]
print(f"{'Model':<25} {'Cost/Request':>12} {'vs Cheapest':>12}")
print("-" * 52)
base = None
for model, inp, out in candidates:
cost = (task_tokens["input"]/1e6 * inp) + (task_tokens["output"]/1e6 * out)
if base is None:
base = cost
print(f"{model:<25} ${cost:>10.6f} {cost/base:>10.1f}x")
Output:
Model Cost/Request vs Cheapest
----------------------------------------------------
gemini-2.5-flash-lite $ 0.000300 1.0x
gpt-4o-mini $ 0.000450 1.5x
gpt-4.1 $ 0.006000 20.0x
claude-sonnet-4.6 $ 0.010500 35.0x
Sonnet costs 35x more per request than Flash-Lite. For classification and extraction tasks, Flash-Lite usually handles them fine. That’s a 97% savings.
Strategy 10: Set Budget Alerts
Add alerting to the CostTracker so runaway costs get caught before the bill shocks you.
class BudgetTracker(CostTracker):
"""CostTracker with budget alerts."""
def __init__(self, daily_budget=10.0):
super().__init__()
self.daily_budget = daily_budget
self.alerted_75 = False
self.alerted_90 = False
def log_request(self, model, input_tokens, output_tokens, tag="default"):
entry = super().log_request(model, input_tokens, output_tokens, tag)
spent = self.total_spend()
pct = (spent / self.daily_budget) * 100
if pct >= 90 and not self.alerted_90:
print(f" ALERT: {pct:.0f}% of ${self.daily_budget} budget!")
self.alerted_90 = True
elif pct >= 75 and not self.alerted_75:
print(f" WARNING: {pct:.0f}% of ${self.daily_budget} budget")
self.alerted_75 = True
return entry
bt = BudgetTracker(daily_budget=0.005)
for i in range(30):
bt.log_request("gpt-4o-mini", 500, 200, tag="chat")
print(f"\nTotal: ${bt.total_spend():.6f} of ${bt.daily_budget} budget")
Output:
WARNING: 78% of $0.005 budget
ALERT: 94% of $0.005 budget!
Total: $0.005850 of $0.005 budget
Common Mistakes That Inflate Your LLM API Costs
Mistake 1: Sending Full Conversation History Every Time
Each turn adds tokens. By turn 20, each request costs 20x what turn 1 cost.
avg_turn_tokens = 150
print(f"{'Turn':<6} {'Tokens':>8} {'Cost (Sonnet)':>15}")
print("-" * 32)
for turn in [1, 5, 10, 15, 20]:
tokens = turn * avg_turn_tokens
cost = tokens / 1e6 * 3.00
print(f"{turn:<6} {tokens:>8,} ${cost:>13.6f}")
Output:
Turn Tokens Cost (Sonnet)
--------------------------------
1 150 $ 0.000450
5 750 $ 0.002250
10 1,500 $ 0.004500
15 2,250 $ 0.006750
20 3,000 $ 0.009000
Use trim_conversation to keep history bounded. Your wallet will thank you.
Mistake 2: Not Setting max_tokens
Without an explicit cap, the model decides response length. Open-ended questions might return 1,000+ tokens when 200 would suffice.
Mistake 3: Using the Premium Model for Everything
Teams default to GPT-4o or Sonnet for every request. Classification, extraction, simple Q&A — all hitting the expensive endpoint. Switch routine tasks to mini/flash models. The quality difference is often negligible.
Mistake 4: Ignoring Prompt Caching
If your system prompt exceeds 1,024 tokens and you send it on every request, enable caching. Setup takes 5 minutes. The savings are immediate and dramatic.
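When does caching pay off? With Anthropic's multipliers from the caching table above (1.25x to write, 0.10x to read), the breakeven is a quick calculation:

```python
def relative_cached_cost(n_uses, write_mult=1.25, read_mult=0.10):
    """Cost of n uses of a cached prompt relative to sending it uncached n times."""
    return (write_mult + read_mult * (n_uses - 1)) / n_uses

for n in (1, 2, 10, 100):
    print(f"{n:>3} uses: {relative_cached_cost(n):.2f}x of uncached cost")
```

Caching costs slightly more for a single use, breaks even by the second, and approaches the 0.10x read rate at scale.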
When Should You NOT Optimize LLM Costs?
Cost cutting has limits. Sometimes the right move is spending more:
- User-facing quality: A chatbot that sounds dumb loses users. Lost users cost more than premium tokens ever will.
- High-stakes accuracy: Medical, legal, financial tasks need the best model. Wrong answers cost more than expensive tokens.
- Prototyping phase: Don’t optimize early. Get the product right first. Switching models mid-development wastes engineering time you’ll never get back.
Your Cost Optimization Checklist
Here’s the priority order for cutting LLM API costs. Start at the top — each strategy builds on the last.
- Model routing — Route simple tasks to budget models (60-90% savings)
- Prompt caching — Cache system prompts (75-90% savings on cached portion)
- Batch API — Non-urgent work at half price (50% savings)
- Cap output tokens — max_tokens on every call (30-75% savings)
- Trim history — Sliding window on conversation context (20-50% savings)
- Compress prompts — Cut redundant instructions (10-30% savings)
- Response caching — Skip identical calls
- Structured output — JSON instead of prose (40-60% output savings)
- Monitor spend — Budget alerts catch runaway costs
- Test cheaper models — Don’t assume you need premium
What happens when you stack the top five?
strategies = [
("Model routing", 0.70),
("Prompt caching", 0.40),
("Batch API (50% of work)", 0.25),
("Output capping", 0.20),
("History trimming", 0.10),
]
monthly = 500.0
remaining = monthly
print(f"Starting: ${monthly:.2f}/month\n")
print(f"{'Strategy':<25} {'Saves':>8} {'Remaining':>10}")
print("-" * 46)
for name, pct in strategies:
saved = remaining * pct
remaining -= saved
    print(f"{name:<25} ${saved:>7.2f} ${remaining:>9.2f}")
print(f"\nFinal: ${remaining:.2f}/month")
print(f"Total saved: ${monthly - remaining:.2f} ({(1-remaining/monthly)*100:.0f}%)")
Output:
Starting: $500.00/month
Strategy Saves Remaining
----------------------------------------------
Model routing $350.00 $ 150.00
Prompt caching $ 60.00 $ 90.00
Batch API (50% of work) $ 22.50 $ 67.50
Output capping $ 13.50 $ 54.00
History trimming $ 5.40 $ 48.60
Final: $48.60/month
Total saved: $451.40 (90%)
From $500 to under $50 per month. That’s the power of stacking these strategies.
Summary
Tokens drive every dollar of your LLM bill. You now know how to count them, compare prices, track spend, and apply ten strategies that stack to 90% savings.
Three moves matter most: route simple tasks to cheap models, cache your prompts, and batch the rest. Those three alone cut costs 80%+.
Start with model routing today. Ten minutes of work. Biggest savings on day one.
Practice Exercises

Exercise 1: Calculate Monthly API Cost (beginner)

Write a function monthly_cost(model, daily_requests, avg_input_tokens, avg_output_tokens) that returns the estimated monthly cost over 30 days. Use the PRICING dictionary provided.

PRICING = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gemini-2.5-flash-lite": (0.10, 0.40),
}

def monthly_cost(model, daily_requests, avg_input_tokens, avg_output_tokens):
    # Your code here
    pass

print(monthly_cost("gpt-4o-mini", 1000, 500, 200))

Expected results:

monthly_cost("gpt-4o-mini", 1000, 500, 200) -> 5.85
monthly_cost("claude-sonnet-4.6", 500, 1000, 500) -> 157.5
monthly_cost("gemini-2.5-flash-lite", 10000, 200, 100) -> 18.0

Hint: daily cost = (daily_requests * avg_input_tokens / 1_000_000 * input_price) + (daily_requests * avg_output_tokens / 1_000_000 * output_price). Multiply by 30.

Solution:

def monthly_cost(model, daily_requests, avg_input_tokens, avg_output_tokens):
    inp_price, out_price = PRICING[model]
    daily = (daily_requests * avg_input_tokens / 1e6 * inp_price) + \
            (daily_requests * avg_output_tokens / 1e6 * out_price)
    return daily * 30

Look up the per-million-token prices, multiply by actual token volume, and scale to 30 days.

Exercise 2: Build a Model Router (intermediate)

Write a function cheapest_model(input_tokens, output_tokens, models) that returns (model_name, total_cost) for the cheapest option in the given pricing dict.

MODELS = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4.1": (2.00, 8.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gemini-2.5-flash-lite": (0.10, 0.40),
}

def cheapest_model(input_tokens, output_tokens, models=MODELS):
    # Your code here
    pass

name, cost = cheapest_model(1000, 500)
print(f"{name}: ${cost:.6f}")

Expected results:

cheapest_model(1000, 500) -> gemini-2.5-flash-lite: $0.000300
cheapest_model(0, 1000000) -> gemini-2.5-flash-lite (still cheapest for output-heavy work)

Hint: loop through models.items(); for each model, compute cost = (input_tokens/1e6 * inp) + (output_tokens/1e6 * out) and track the minimum.

Solution:

def cheapest_model(input_tokens, output_tokens, models=MODELS):
    best_name, best_cost = None, float("inf")
    for name, (inp_p, out_p) in models.items():
        cost = (input_tokens / 1e6 * inp_p) + (output_tokens / 1e6 * out_p)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost

Iterate all models, compute the cost for the given token counts, and return the cheapest. This is the core of any model routing system.
Frequently Asked Questions
How do tokens differ across LLM providers?
Each provider uses a different tokenizer. The same text might be 100 tokens on OpenAI and 95 on Anthropic. For estimation, tiktoken works as a proxy for all providers — accurate within 5-10%.
Can I stack prompt caching with the batch API?
Yes. Both OpenAI and Anthropic allow it. On Anthropic, batch gives 50% off and caching gives 90% off the cached portion. Combined, your cached system prompt in batch mode costs about 5% of the standard price.
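The combined discount is just the product of the two multipliers:

```python
batch_mult = 0.50       # batch API: 50% of standard price
cache_read_mult = 0.10  # Anthropic cached read: 10% of standard price

effective = batch_mult * cache_read_mult
print(f"Cached tokens in batch mode cost {effective:.0%} of the standard rate")
```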
What does a typical chatbot cost to run per month?
A chatbot handling 1,000 conversations/day with GPT-4o-mini costs $5-15/month. With Claude Sonnet, the same workload costs $100-300/month. It depends on conversation length and response size.
Does tiktoken work for counting Claude and Gemini tokens?
Not precisely. tiktoken uses OpenAI’s BPE tokenizer. Claude and Gemini use different ones. But for cost estimation, tiktoken is close enough (within 10%). For exact counts, Anthropic returns token usage in the API response, and Google offers a count_tokens method.
What are the biggest cost monitoring tools available?
Several LLMOps platforms track costs automatically: LangSmith (LangChain’s observability tool), Helicone (proxy-based tracking), and Portkey (multi-provider gateway with cost dashboards). For simpler needs, the CostTracker class in this article gives you the core logic without external dependencies.