LLM Scaling Laws: Model Comparison Dashboard (2026)
Learn LLM scaling laws and compare GPT, Claude, Gemini, Llama, Mistral, and DeepSeek on benchmarks, pricing, and context windows with interactive Python charts.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser.
Build an interactive dashboard comparing GPT, Claude, Gemini, Llama, Mistral, and DeepSeek on parameters, benchmarks, pricing, and context windows — all in pure Python.
GPT-5. Claude Opus 4.6. Gemini 3. Llama 4. A new model drops every few months. Each one is bigger, faster, or cheaper than the last. But bigger doesn’t always mean better.
Model size, training data, and performance follow strict math called scaling laws. Miss these rules, and you burn millions on a model that a smaller one beats.
This article breaks down those laws. Then we’ll build a comparison dashboard — right in your browser — so you can see how today’s top models stack up.
Prerequisites
- Python version: 3.9+
- Required libraries: numpy (1.24+), matplotlib (3.7+)
- Install: pip install numpy matplotlib
- Pyodide: All code runs in-browser via Pyodide — no local install needed
- Time to complete: 20-25 minutes
What Are LLM Scaling Laws?
Scaling laws tell you what happens when you make a model bigger, feed it more data, or throw more compute at it. The core finding: performance gets better in a predictable way.
In 2020, OpenAI published the first big scaling laws paper. They found that loss (a fancy word for error) drops smoothly as you scale up. Here’s the math:
\[L(X) = \left(\frac{X_0}{X}\right)^\alpha\]
Where:
- \(L(X)\) = the model’s loss (lower is better)
- \(X\) = the scaling variable (parameters, data, or compute)
- \(X_0\) = a constant that depends on the variable
- \(\alpha\) = the scaling exponent (typically 0.05 to 0.1)
If the math isn’t your thing, skip ahead — the code below makes the concept concrete.
Here’s the power law in action. This code shows how loss drops as you go from 100M to 100B params. We use the values from the original Kaplan et al. paper:
import numpy as np

# Constants from Kaplan et al. (2020) for the parameter scaling law
parameter_counts = np.array([1e8, 5e8, 1e9, 5e9, 1e10, 5e10, 1e11])
alpha = 0.076   # scaling exponent for parameters
x0 = 8.8e13     # N_c, the constant for the parameter term

loss = (x0 / parameter_counts) ** alpha

labels = ["100M", "500M", "1B", "5B", "10B", "50B", "100B"]
for label, l in zip(labels, loss):
    print(f"Parameters: {label:>5s} → Loss: {l:.3f}")
Output:
Parameters:  100M → Loss: 2.830
Parameters:  500M → Loss: 2.504
Parameters:    1B → Loss: 2.376
Parameters:    5B → Loss: 2.102
Parameters:   10B → Loss: 1.994
Parameters:   50B → Loss: 1.765
Parameters:  100B → Loss: 1.674
See the pattern? Going from 100M to 1B (10x bigger) cuts loss by about 0.45. Going from 10B to 100B (also 10x) only cuts it by about 0.32. Each jump helps less.
Quick check: Doubling from 50B to 100B drops loss from 1.765 to 1.674 — a gain of just 0.091, roughly 5%. Would you spend 2x the compute for a 5% boost? That’s what scaling laws make you think about.
Key Insight: Scaling laws follow a power law, not a straight line. Doubling model size doesn’t halve the error. You get steady but shrinking gains — so brute-force scaling hits a cost wall fast.
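You can see the diminishing returns directly from the exponent. For a pure power law, every doubling of the scaling variable cuts loss by the same fixed fraction, no matter where you start. A quick check with the Kaplan exponent:

```python
# For L(X) = (X0 / X)^alpha, doubling X multiplies loss by 2**(-alpha),
# regardless of the starting size. With alpha = 0.076:
alpha = 0.076
reduction_per_doubling = 1 - 2 ** (-alpha)
print(f"Loss reduction per doubling: {reduction_per_doubling:.1%}")  # → 5.1%
```

Each doubling buys the same ~5% loss reduction, but each doubling costs twice as much compute as the last one. That asymmetry is the cost wall.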
Chinchilla Scaling Laws: The Training Data Revolution
The first scaling laws had a blind spot. They cared about model size but ignored training data. In 2022, DeepMind’s Chinchilla paper changed everything.
The key idea: for a fixed budget, grow model size and data equally. The rule was about 20 tokens per parameter. A 10B model needs 200B tokens.
Why does this matter? GPT-3 had 175B params but trained on just 300B tokens — only 1.7 per parameter. Way too few. DeepMind trained Chinchilla: 70B params on 1.4T tokens (20:1 ratio). Despite being 2.5x smaller, it beat GPT-3 on most tests.
This code shows how many tokens each model “should” have used at the 20:1 ratio, and what they really used:
models = {
    "GPT-3": {"params_b": 175, "tokens_t": 0.3},
    "Chinchilla": {"params_b": 70, "tokens_t": 1.4},
    "Llama 2 70B": {"params_b": 70, "tokens_t": 2.0},
    "Llama 3 8B": {"params_b": 8, "tokens_t": 15.0},
    "Llama 4 Scout": {"params_b": 17, "tokens_t": 12.0},
}

print(f"{'Model':<16} {'Params':>7} {'Actual':>8} {'Optimal':>8} {'Ratio':>7}")
print("-" * 50)
for name, m in models.items():
    optimal_t = m["params_b"] * 20 / 1000  # Chinchilla-optimal tokens, in trillions
    ratio = m["tokens_t"] / optimal_t
    print(f"{name:<16} {m['params_b']:>5.0f}B {m['tokens_t']:>6.1f}T {optimal_t:>6.1f}T {ratio:>6.1f}x")
Output:
Model Params Actual Optimal Ratio
--------------------------------------------------
GPT-3 175B 0.3T 3.5T 0.1x
Chinchilla 70B 1.4T 1.4T 1.0x
Llama 2 70B 70B 2.0T 1.4T 1.4x
Llama 3 8B 8B 15.0T 0.2T 93.8x
Llama 4 Scout 17B 12.0T 0.3T 35.3x
GPT-3 got just 0.1x the right amount. Chinchilla nailed 1.0x. But Llama 3 and Llama 4 use 35-94x more tokens than the rule says. What gives?
The field moved past Chinchilla toward inference-optimal training. When millions of users hit your API, a small model trained on tons of data gives the best bang per dollar. That’s why Llama 3 8B, trained on 15T tokens, punches way above its weight.
Warning: The “20 tokens per parameter” rule is outdated. Modern runs use 100-1,000+ tokens per parameter. Chinchilla saves on training cost. Real-world apps save on inference cost — and that’s where the money goes.
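The tradeoff is easy to sketch with rough FLOP accounting. The standard approximations are training cost ≈ 6·N·D FLOPs and inference cost ≈ 2·N FLOPs per generated token; the two configurations below are illustrative, not vendor numbers:

```python
# Rough FLOP accounting: training ~ 6*N*D, inference ~ 2*N per token
# (standard approximations; these configs are illustrative examples)
configs = {
    "70B on 1.4T tokens (Chinchilla-style)": (70e9, 1.4e12),
    "8B on 15T tokens (over-trained)": (8e9, 15e12),
}
for name, (n_params, n_tokens) in configs.items():
    train_flops = 6 * n_params * n_tokens
    infer_flops_per_token = 2 * n_params
    print(f"{name}: train ~{train_flops:.1e} FLOPs, "
          f"serve ~{infer_flops_per_token:.1e} FLOPs/token")
```

The over-trained 8B run costs somewhat more to train here, but every served token needs roughly 9x fewer FLOPs. Spread over billions of API calls, that is the whole inference-optimal argument.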
The Six Major LLM Families in 2026
Who’s building what? Before we code, here’s your cheat sheet. Six families run the show, each with a different bet.
| Family | Creator | Open/Closed | Philosophy |
|---|---|---|---|
| GPT | OpenAI | Closed | Frontier intelligence, API monetization |
| Claude | Anthropic | Closed | Safety-first, strong reasoning and coding |
| Gemini | Google | Closed | Multimodal-native, massive context windows |
| Llama | Meta | Open-weight | Democratize access, community ecosystem |
| Mistral | Mistral AI | Mix | European challenger, efficiency-focused |
| DeepSeek | DeepSeek | Open-weight | Cost-efficient reasoning, MoE architecture |
Each family ships at many sizes. OpenAI has GPT-5 (top tier) down to GPT-4o-mini (cheap). Anthropic runs Opus, Sonnet, and Haiku. Google has Gemini Pro and Flash. Meta ships Llama at 8B, 70B, and 405B.
One design stands out: mixture-of-experts (MoE). Llama 4 Scout and DeepSeek V3 both use it. Instead of using all params for every token, MoE sends each token to a small group of “expert” networks.
What does that mean in practice? Llama 4 Scout has 109B total params but only turns on 17B per token. You get 109B worth of knowledge at the cost of running 17B.
Tip: Always check if a param count is “total” or “active.” MoE models like Llama 4 Scout (109B total, 17B active) and DeepSeek V3 (671B total, 37B active) look huge on paper but run much cheaper than you’d think.
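A quick way to see the gap is to compute the active fraction for the two MoE models named above, using the parameter counts from this article:

```python
# Total vs. active parameters (billions) for the MoE models discussed above
moe_models = {"Llama 4 Scout": (109, 17), "DeepSeek V3": (671, 37)}
for name, (total_b, active_b) in moe_models.items():
    print(f"{name}: {active_b}B of {total_b}B active ({active_b / total_b:.0%} per token)")
```

Scout activates about 16% of its weights per token; DeepSeek V3 only about 6%. Compute cost tracks the active number, not the headline one.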
Building the Model Comparison Dataset
Time to code. We’ll build a dataset of 14 LLMs that drives every chart in this tutorial. The numbers come from model cards, pricing pages, and verified benchmarks as of March 2026.
We track seven columns: total and active param counts, context window, MMLU (general knowledge), HumanEval (coding), and input/output pricing:
model_data = {
"Model": [
"GPT-5.2", "GPT-4o", "GPT-4o-mini",
"Claude Opus 4.6", "Claude Sonnet 4.6", "Claude Haiku 3.5",
"Gemini 2.5 Pro", "Gemini 2.0 Flash",
"Llama 4 Scout", "Llama 3.3 70B",
"Mistral Large 2", "Mistral Small 3.1",
"DeepSeek V3", "DeepSeek R1",
],
"Family": [
"GPT", "GPT", "GPT",
"Claude", "Claude", "Claude",
"Gemini", "Gemini",
"Llama", "Llama",
"Mistral", "Mistral",
"DeepSeek", "DeepSeek",
],
"Params_B": [
None, None, None,
None, None, None,
None, None,
109, 70,
123, 24,
671, 671,
],
"Active_Params_B": [
None, None, None,
None, None, None,
None, None,
17, 70,
123, 24,
37, 37,
],
"Context_Window_K": [
128, 128, 128,
200, 200, 200,
1000, 1000,
10000, 128,
128, 128,
128, 128,
],
"MMLU": [
90.2, 88.7, 82.0,
89.5, 87.2, 75.1,
91.8, 83.5,
85.8, 86.0,
84.7, 81.3,
87.1, 90.8,
],
"HumanEval": [
90.5, 90.2, 87.2,
91.0, 85.0, 75.0,
87.5, 71.0,
79.0, 81.0,
78.5, 72.0,
85.0, 96.1,
],
"Price_In": [
1.75, 2.50, 0.15,
5.00, 3.00, 0.25,
1.25, 0.10,
0.11, 0.18,
2.00, 0.10,
0.27, 0.55,
],
"Price_Out": [
14.00, 10.00, 0.60,
25.00, 15.00, 1.25,
10.00, 0.40,
0.34, 0.18,
6.00, 0.30,
1.10, 2.19,
],
}
print(f"Dataset: {len(model_data['Model'])} models, {len(model_data)} columns")
print(f"Families: {sorted(set(model_data['Family']))}")
Output:
Dataset: 14 models, 9 columns
Families: ['Claude', 'DeepSeek', 'GPT', 'Gemini', 'Llama', 'Mistral']
Closed models (GPT, Claude, Gemini) don’t share their param counts, so those are None. Prices are in dollars per million tokens. Scores come from model papers and benchmarks like Artificial Analysis and LM Council.
Note: Benchmark scores shift fast. MMLU Pro (harder) and SWE-bench (real coding) are taking over. We use standard MMLU and HumanEval here so you can compare across all models.
Interactive Benchmark Comparison Chart
Here’s where it gets fun. We’ll build a grouped bar chart with MMLU and HumanEval for all 14 models. Two bars per model, sorted by MMLU, using bar() with an offset for side-by-side layout:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
models = model_data["Model"]
mmlu = np.array(model_data["MMLU"])
humaneval = np.array(model_data["HumanEval"])
sort_idx = np.argsort(mmlu)[::-1]
models_sorted = [models[i] for i in sort_idx]
mmlu_sorted = mmlu[sort_idx]
he_sorted = humaneval[sort_idx]
x = np.arange(len(models_sorted))
width = 0.35
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(x - width/2, mmlu_sorted, width, label="MMLU", color="#2563eb")
ax.bar(x + width/2, he_sorted, width, label="HumanEval", color="#16a34a")
ax.set_ylabel("Score (%)")
ax.set_title("LLM Benchmark Comparison: MMLU vs HumanEval (March 2026)")
ax.set_xticks(x)
ax.set_xticklabels(models_sorted, rotation=45, ha="right", fontsize=8)
ax.legend()
ax.set_ylim(65, 100)
ax.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()
The chart shows a clear split. DeepSeek R1 tops HumanEval at 96.1% while staying near the top on general knowledge (90.8% MMLU). Gemini 2.5 Pro leads MMLU at 91.8%. Claude Opus 4.6 is the most balanced: 89.5% MMLU and 91.0% HumanEval.
Predict the output: Which model has the biggest gap between the two scores? If you guessed DeepSeek R1 (+5.3 on HumanEval), you’re close. But Gemini 2.0 Flash has a bigger gap: 83.5% MMLU vs 71.0% HumanEval, a 12.5-point spread the other way.
So which model should you pick? For code generation, DeepSeek R1 is hard to beat. For a general assistant, GPT-5.2 or Claude Opus 4.6 give you strength on both fronts.
Interactive Pricing Dashboard
Benchmarks show what a model can do. Price shows what it costs. This scatter plot maps every model by input price (x) and output price (y). Dot size tracks MMLU score.
Big dots near the bottom-left? Great — high score, low cost. Small dots in the top-right? Skip those. Here’s the code:
family_colors = {
"GPT": "#10b981", "Claude": "#8b5cf6", "Gemini": "#f59e0b",
"Llama": "#3b82f6", "Mistral": "#ef4444", "DeepSeek": "#06b6d4",
}
fig, ax = plt.subplots(figsize=(10, 7))
for i, model in enumerate(model_data["Model"]):
    family = model_data["Family"][i]
    size = (model_data["MMLU"][i] - 70) * 15  # scale dot size by MMLU
    ax.scatter(
        model_data["Price_In"][i],
        model_data["Price_Out"][i],
        s=size, c=family_colors[family],
        alpha=0.7, edgecolors="white", linewidth=0.8,
    )
    ax.annotate(
        model, (model_data["Price_In"][i], model_data["Price_Out"][i]),
        fontsize=6, ha="left", va="bottom",
        xytext=(5, 5), textcoords="offset points",
    )

# Empty scatters just to build a one-dot-per-family legend
for family, color in family_colors.items():
    ax.scatter([], [], c=color, s=80, label=family, alpha=0.7)
ax.set_xlabel("Input Price ($/1M tokens)")
ax.set_ylabel("Output Price ($/1M tokens)")
ax.set_title("LLM Price vs Performance (dot size = MMLU score)")
ax.legend(loc="upper left", fontsize=8)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Three pricing tiers emerge clearly:
Budget tier (under $0.50 input): Gemini 2.0 Flash ($0.10), GPT-4o-mini ($0.15), Mistral Small 3.1 ($0.10), Llama 4 Scout ($0.11), DeepSeek V3 ($0.27). These handle high-volume production workloads.
Mid tier ($1-3 input): GPT-5.2 ($1.75), Gemini 2.5 Pro ($1.25), Claude Sonnet 4.6 ($3.00), Mistral Large 2 ($2.00). Strong performance at moderate cost.
Premium tier ($5+ input): Claude Opus 4.6 ($5.00). You pay a steep premium for Anthropic’s flagship.
The standout value? DeepSeek R1 at $0.55 input with the top HumanEval score. For coding tasks, it’s the best deal in AI right now.
Key Insight: The best model isn’t the top scorer — it’s the cheapest one that meets your quality bar. A $0.11/M-token model scoring 85% on your task beats a $5.00/M-token model scoring 89%.
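That rule is easy to turn into code. A minimal sketch, using a hypothetical shortlist of (MMLU score, blended price) pairs pulled from the article’s dataset — the quality bar of 86.0 is an assumed requirement, not a recommendation:

```python
# (MMLU score, blended $/1M tokens) for a hypothetical shortlist
candidates = {
    "Claude Opus 4.6": (89.5, 15.00),
    "GPT-5.2": (90.2, 7.88),
    "DeepSeek V3": (87.1, 0.69),
    "Llama 3.3 70B": (86.0, 0.18),
}
quality_bar = 86.0  # assumed minimum score your task tolerates

# Keep only models that meet the bar, then take the cheapest
meets_bar = {name: price for name, (score, price) in candidates.items()
             if score >= quality_bar}
winner = min(meets_bar, key=meets_bar.get)
print(f"Cheapest model meeting the bar: {winner} (${meets_bar[winner]:.2f}/M)")
```

With this shortlist the winner is Llama 3.3 70B: every candidate clears the bar, so price decides. Raise the bar to 90 and the answer flips to GPT-5.2.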
Context Window Comparison
How much text can a model read in one go? The range is huge — from 128K tokens to 10 million. We’ll use a log-scale bar chart to show them all:
ctx_models = model_data["Model"]
ctx_values = model_data["Context_Window_K"]
families = model_data["Family"]
sorted_pairs = sorted(zip(ctx_values, ctx_models, families))
ctx_sorted = [p[0] for p in sorted_pairs]
names_sorted = [p[1] for p in sorted_pairs]
fam_sorted = [p[2] for p in sorted_pairs]
fig, ax = plt.subplots(figsize=(10, 7))
colors = [family_colors[f] for f in fam_sorted]
bars = ax.barh(range(len(names_sorted)), ctx_sorted, color=colors, alpha=0.8)
ax.set_xscale("log")
ax.set_xlabel("Context Window (thousands of tokens, log scale)")
ax.set_title("LLM Context Windows (March 2026)")
ax.set_yticks(range(len(names_sorted)))
ax.set_yticklabels(names_sorted, fontsize=8)
ax.grid(axis="x", alpha=0.3)
for bar, val in zip(bars, ctx_sorted):
    label = f"{val:,}K"
    ax.text(bar.get_width() * 1.1, bar.get_y() + bar.get_height()/2,
            label, va="center", fontsize=7)
plt.tight_layout()
plt.show()
Llama 4 Scout’s 10M window dwarfs the rest. That’s about 7.5 million words — a full codebase in one call. Gemini 2.5 Pro sits at 1M. Claude gets 200K. Most others share 128K.
But more context isn’t always better. Longer inputs cost more and add latency. Models also tend to miss facts buried in the middle of very long texts — the “lost in the middle” problem. For most tasks, 128K (about 100K words) is plenty.
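The cost side is easy to underestimate. Here is a small sketch of what one maximally full prompt costs at each model’s input price, using the context windows and prices from the dataset above:

```python
# Cost to fill the entire context window once, at input-token prices
# (context in thousands of tokens, price in $/1M tokens, from the dataset)
windows = {
    "Llama 4 Scout": (10000, 0.11),
    "Gemini 2.5 Pro": (1000, 1.25),
    "Claude Opus 4.6": (200, 5.00),
}
for model, (ctx_k, price_in) in windows.items():
    cost = ctx_k * 1000 / 1_000_000 * price_in
    print(f"{model}: ${cost:.2f} per maxed-out prompt")
```

All three land near a dollar per full prompt, despite a 50x spread in window size. Do that on every request in a chat loop and the bill multiplies fast, which is another reason to send only the context you need.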
Scaling in Practice: Parameters vs. Performance
Do scaling laws hold up in practice? For open-weight models with known param counts, we can plot size vs. MMLU and check.
This code grabs models with known active params, plots each as a dot, and fits a trend line with np.polyfit:
known_params = []
known_mmlu = []
known_names = []
for i, model in enumerate(model_data["Model"]):
    if model_data["Active_Params_B"][i] is not None:
        known_params.append(model_data["Active_Params_B"][i])
        known_mmlu.append(model_data["MMLU"][i])
        known_names.append(model)
params_arr = np.array(known_params)
mmlu_arr = np.array(known_mmlu)
log_params = np.log10(params_arr)
coeffs = np.polyfit(log_params, mmlu_arr, 1)
trend_x = np.linspace(log_params.min() - 0.2, log_params.max() + 0.2, 50)
trend_y = np.polyval(coeffs, trend_x)
fig, ax = plt.subplots(figsize=(9, 6))
for i, name in enumerate(known_names):
    family = model_data["Family"][model_data["Model"].index(name)]
    ax.scatter(params_arr[i], mmlu_arr[i], s=120,
               c=family_colors[family], zorder=5, edgecolors="white")
    ax.annotate(name, (params_arr[i], mmlu_arr[i]),
                fontsize=7, xytext=(8, 5), textcoords="offset points")
ax.plot(10**trend_x, trend_y, "--", color="gray", alpha=0.5, label="Log trend")
ax.set_xscale("log")
ax.set_xlabel("Active Parameters (Billions)")
ax.set_ylabel("MMLU Score (%)")
ax.set_title("Scaling in Practice: Active Parameters vs. MMLU")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
The trend holds: more params usually means higher scores. But the outliers are what matter.
DeepSeek V3 (37B active) hits 87.1% — close to Llama 3.3 70B (86.0%) with half the params. Llama 4 Scout (17B active) scores 85.8%. Both punch above their weight thanks to MoE and lots of training data.
The takeaway: how you train matters as much as how big you build.
{
type: 'exercise',
id: 'scaling-ratio-ex1',
title: 'Exercise 1: Compute Training Efficiency Ratios',
difficulty: 'beginner',
exerciseType: 'write',
instructions: 'Given the model data below, compute the "tokens per active parameter" ratio for each model. This ratio tells you how much training data was used relative to model size. Print each model name and its ratio, sorted from highest to lowest ratio. Use the provided dictionaries.',
starterCode: '# Model training data\ntraining_info = {\n    "Llama 3 8B": {"active_b": 8, "tokens_t": 15.0},\n    "Llama 4 Scout": {"active_b": 17, "tokens_t": 12.0},\n    "Llama 2 70B": {"active_b": 70, "tokens_t": 2.0},\n    "DeepSeek V3": {"active_b": 37, "tokens_t": 14.8},\n    "Mistral Small 3.1": {"active_b": 24, "tokens_t": 8.0},\n}\n\n# Compute tokens_per_param = (tokens_t * 1e12) / (active_b * 1e9)\n# Sort by ratio descending and print\nratios = {}\nfor name, info in training_info.items():\n    pass  # Replace with your calculation\n\n# Sort and print\nfor name, ratio in sorted(ratios.items(), key=lambda x: x[1], reverse=True):\n    print(f"{name}: {ratio:.0f} tokens/param")',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'Llama 3 8B: 1875 tokens/param\nLlama 4 Scout: 706 tokens/param\nDeepSeek V3: 400 tokens/param\nMistral Small 3.1: 333 tokens/param\nLlama 2 70B: 29 tokens/param', description: 'Correct ratios computed and sorted' },
],
hints: [
'The formula is: (tokens_t * 1e12) / (active_b * 1e9), which simplifies to (tokens_t / active_b) * 1000',
'Full line: ratios[name] = (info["tokens_t"] / info["active_b"]) * 1000',
],
solution: 'ratios = {}\nfor name, info in training_info.items():\n    ratios[name] = (info["tokens_t"] / info["active_b"]) * 1000\n\nfor name, ratio in sorted(ratios.items(), key=lambda x: x[1], reverse=True):\n    print(f"{name}: {ratio:.0f} tokens/param")',
solutionExplanation: 'We divide total training tokens by active parameters. Since tokens are in trillions (1e12) and params in billions (1e9), the ratio simplifies to (tokens_t / active_b) * 1000. Llama 3 8B has the highest ratio at 1,875 tokens per parameter — nearly 100x the Chinchilla-optimal 20:1 ratio.',
xpReward: 15,
}
Cost-Efficiency Analysis: Performance Per Dollar
Raw scores don’t tell the full story. A model scoring 2% higher but costing 20x more? Not always worth it. I like to think of this as “bang for your buck.”
We’ll divide MMLU by the blended price (average of input + output). Higher = more score per dollar. Here’s every model ranked:
print(f"{'Model':<22} {'MMLU':>5} {'Blend $/M':>9} {'Efficiency':>10}")
print("-" * 50)
efficiency_data = []
for i, model in enumerate(model_data["Model"]):
    blend = (model_data["Price_In"][i] + model_data["Price_Out"][i]) / 2
    eff = model_data["MMLU"][i] / blend
    efficiency_data.append((model, model_data["MMLU"][i], blend, eff))
efficiency_data.sort(key=lambda x: x[3], reverse=True)
for model, mmlu, blend, eff in efficiency_data:
    print(f"{model:<22} {mmlu:>5.1f} {blend:>8.2f} {eff:>9.1f}")
Output:
Model MMLU Blend $/M Efficiency
--------------------------------------------------
Llama 3.3 70B 86.0 0.18 477.8
Mistral Small 3.1 81.3 0.20 406.5
Llama 4 Scout 85.8 0.22 381.3
Gemini 2.0 Flash 83.5 0.25 334.0
GPT-4o-mini 82.0 0.38 218.7
DeepSeek V3 87.1 0.69 127.2
Claude Haiku 3.5 75.1 0.75 100.1
DeepSeek R1 90.8 1.37 66.3
Mistral Large 2 84.7 4.00 21.2
Gemini 2.5 Pro 91.8 5.63 16.3
GPT-4o 88.7 6.25 14.2
GPT-5.2 90.2 7.88 11.5
Claude Sonnet 4.6 87.2 9.00 9.7
Claude Opus 4.6 89.5 15.00 6.0
Open-weight models own the top. Llama 3.3 70B leads — served via Together AI or Groq, you get 86% MMLU at rock-bottom cost.
For closed APIs, Gemini 2.0 Flash and GPT-4o-mini win on value. Claude Opus 4.6 sits last. It’s not bad — you’re paying for elite reasoning. For legal or medical tasks, that’s worth it. For bulk work, pick from the top.
Tip: Run this on YOUR task, not generic benchmarks. A model scoring 82% on MMLU might hit 95% on your use case. Build a test set of 50-100 real examples and re-rank by what matters to you.
The Full Dashboard: Side-by-Side Model Cards
Let’s pull it all into one view. This function prints a comparison card for any models you pick. It marks the best value in each row with an asterisk.
Pass in a list of names, and it looks up their stats. The higher_better flag tells it whether max or min gets the star:
def compare_models(names):
    indices = [model_data["Model"].index(n) for n in names]
    metrics = [
        ("Context Window", "Context_Window_K", "K tokens", False),
        ("MMLU Score", "MMLU", "%", True),
        ("HumanEval", "HumanEval", "%", True),
        ("Input Price", "Price_In", "$/M tok", False),
        ("Output Price", "Price_Out", "$/M tok", False),
    ]
    header = f"{'Metric':<16}" + "".join(f"{n:>18}" for n in names)
    print(header)
    print("-" * len(header))
    for label, key, unit, higher_better in metrics:
        values = [model_data[key][i] for i in indices]
        best = max(values) if higher_better else min(values)
        row = f"{label:<16}"
        for v in values:
            marker = " *" if v == best else "  "
            if isinstance(v, float) and v == int(v):
                row += f"{int(v):>14}{unit}{marker}"
            else:
                row += f"{v:>14}{unit}{marker}"
        print(row)
print("=== Flagship Models ===")
compare_models(["GPT-5.2", "Claude Opus 4.6", "Gemini 2.5 Pro"])
print()
print("=== Budget Models ===")
compare_models(["GPT-4o-mini", "Claude Haiku 3.5", "Gemini 2.0 Flash"])
This function is reusable. Want to compare open-weight options? Call compare_models(["Llama 4 Scout", "DeepSeek V3", "Mistral Large 2"]). Swap model names freely.
{
type: 'exercise',
id: 'cost-calc-ex2',
title: 'Exercise 2: Estimate Monthly API Cost',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Write a function monthly_cost(model_name, daily_requests, avg_input_tokens, avg_output_tokens) that calculates the estimated monthly API cost for a given model. Use the pricing data from model_data. Assume 30 days per month. Print the result formatted as shown in the test case.',
starterCode: '# model_data is already available from earlier code\n\ndef monthly_cost(model_name, daily_requests, avg_input_tokens, avg_output_tokens):\n    idx = model_data["Model"].index(model_name)\n    price_in = model_data["Price_In"][idx]  # $/1M tokens\n    price_out = model_data["Price_Out"][idx]  # $/1M tokens\n\n    # Calculate monthly cost\n    # Hint: daily tokens = daily_requests * avg tokens per request\n    # Monthly = daily * 30\n    # Cost = tokens / 1_000_000 * price_per_million\n    monthly = 0  # Replace this\n    return monthly\n\nresult = monthly_cost("GPT-4o-mini", 10000, 500, 200)\nprint(f"Monthly cost: ${result:.2f}")',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'Monthly cost: $58.50', description: 'GPT-4o-mini at 10K requests/day' },
],
hints: [
'Total monthly input tokens = daily_requests * avg_input_tokens * 30. Then divide by 1,000,000 and multiply by price_in.',
'monthly = 30 * daily_requests * (avg_input_tokens * price_in + avg_output_tokens * price_out) / 1_000_000',
],
solution: 'def monthly_cost(model_name, daily_requests, avg_input_tokens, avg_output_tokens):\n    idx = model_data["Model"].index(model_name)\n    price_in = model_data["Price_In"][idx]\n    price_out = model_data["Price_Out"][idx]\n    monthly = 30 * daily_requests * (avg_input_tokens * price_in + avg_output_tokens * price_out) / 1_000_000\n    return monthly\n\nresult = monthly_cost("GPT-4o-mini", 10000, 500, 200)\nprint(f"Monthly cost: ${result:.2f}")',
solutionExplanation: 'We compute daily token volume (requests x tokens per request), multiply by 30 for the month, convert to millions, then multiply by the per-million price. For GPT-4o-mini: input cost = 30 * 10000 * 500 * 0.15 / 1M = $22.50. Output cost = 30 * 10000 * 200 * 0.60 / 1M = $36.00. Total = $58.50.',
xpReward: 20,
}
When Scaling Laws Don’t Apply
Scaling laws are useful, but they don’t cover everything. Three cases where they break down:
Fine-tuning rewrites the rules. Scaling laws describe pretraining only. A fine-tuned Llama 3 8B can beat a raw GPT-5 on narrow tasks. A few thousand good examples matter more than billions of params.
Thinking longer beats being bigger. Models like DeepSeek R1 use chain-of-thought at inference time. They spend more time per answer instead of more params. Scaling laws don’t capture this at all.
Data quality beats data quantity. Training on 15T noisy web tokens gives worse results than 5T clean tokens. The scaling law math assumes data quality stays the same. It never does.
Warning: Don’t use scaling laws to predict too far ahead. A curve fit on 1B to 100B models won’t reliably tell you what happens at 1T params. The trend can flatten, shift, or hit new walls.
Common Mistakes When Comparing LLMs
Mistake 1: Comparing total parameters instead of active parameters
❌ Wrong thinking: “Llama 4 Scout (109B) must beat Llama 3.3 (70B) — it has more parameters.”
Why it’s wrong: Scout uses MoE. Only 17B of its 109B parameters activate per token. Llama 3.3 70B uses 4x more active compute. Always compare active parameters.
Mistake 2: Using a single benchmark to choose a model
❌ Wrong: “DeepSeek R1 tops HumanEval (96.1%), so it’s the best model.”
Why it’s wrong: HumanEval tests isolated coding puzzles. It doesn’t measure instruction following, safety, or conversation quality. Always test on your actual task.
Mistake 3: Ignoring output price when estimating costs
❌ Wrong calculation:
daily_cost = 10000 * 1000 * 3.00 / 1_000_000
print(f"Daily cost: ${daily_cost:.2f}")
Output:
Daily cost: $30.00
✅ Correct:
input_cost = 10000 * 1000 * 3.00 / 1_000_000
output_cost = 10000 * 500 * 15.00 / 1_000_000
daily_cost = input_cost + output_cost
print(f"Daily cost: ${daily_cost:.2f}")
Output:
Daily cost: $105.00
The real cost is 3.5x higher. Output tokens cost 3-5x more than input across every provider. Always account for both.
Frequently Asked Questions
Do scaling laws apply to fine-tuned models?
Scaling laws describe pretraining only. Fine-tuning uses a different regime. A few thousand quality examples can improve task performance regardless of model size.
# Fine-tuning doesn't follow parameter scaling laws
small_model_finetuned_accuracy = 0.94
large_model_generic_accuracy = 0.88
print(f"Fine-tuned 8B: {small_model_finetuned_accuracy:.0%}")
print(f"Generic 175B: {large_model_generic_accuracy:.0%}")
Output:
Fine-tuned 8B: 94%
Generic 175B: 88%
How often do LLM pricing and benchmarks change?
Rapidly. Prices dropped 5-10x between 2023 and 2026, and they’re still falling. New models ship every 2-3 months. Bookmark Artificial Analysis and check monthly.
Are open-weight models really free?
The weights are free. The GPUs aren’t. Hosting Llama 4 Scout needs multiple high-end GPUs. For most teams, API providers (Together AI, Fireworks, Groq) cost less than self-hosting unless you run millions of daily requests.
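The break-even math is worth sketching. All hardware numbers below are assumptions for illustration (rental rate, cluster size, throughput), not vendor quotes; the API price is Llama 4 Scout’s output price from the article’s dataset:

```python
# Illustrative self-hosting vs. API math; hardware numbers are assumptions
gpu_hourly = 2.50        # assumed $/GPU-hour rental rate
num_gpus = 8             # assumed cluster size to serve the model
tokens_per_sec = 2500    # assumed aggregate serving throughput
api_per_m = 0.34         # $/1M output tokens (Llama 4 Scout, from the article)

cluster_hourly = gpu_hourly * num_gpus
self_host_per_m = cluster_hourly / (tokens_per_sec * 3600) * 1_000_000
print(f"Self-hosted: ${self_host_per_m:.2f}/M tokens vs API: ${api_per_m:.2f}/M")

# Throughput needed for self-hosting to match the API price, fully utilized
breakeven_tps = cluster_hourly / (api_per_m * 3600) * 1_000_000
print(f"Break-even throughput: ~{breakeven_tps:,.0f} tokens/sec, 24/7")
```

Under these assumptions, self-hosting runs about $2.22/M tokens against the API’s $0.34/M, and break-even requires sustaining roughly 16,000 tokens per second around the clock. That is why self-hosting only pays off at very high, steady volume.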
What’s the difference between MMLU and MMLU Pro?
MMLU tests general knowledge across 57 subjects with 4-choice questions. MMLU Pro is harder — it uses 10 options and tougher questions. Models scoring 90%+ on MMLU often drop to 60-70% on MMLU Pro. Always check which version a benchmark report uses.
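One concrete consequence: the random-guess floor differs between the two, so raw scores are not directly comparable. A small sketch that normalizes scores above chance (the 90% and 65% figures are hypothetical examples):

```python
# Normalize raw accuracy above the random-guess baseline:
# normalized = (score - chance) / (1 - chance)
def above_chance(score, num_options):
    chance = 1 / num_options
    return (score - chance) / (1 - chance)

print(f"90% on MMLU (4 options): {above_chance(0.90, 4):.1%} above chance")
print(f"65% on MMLU Pro (10 options): {above_chance(0.65, 10):.1%} above chance")
```

A 90% MMLU score sits 86.7% of the way above its 25% baseline, while 65% on MMLU Pro is 61.1% above its 10% baseline, so the apparent 25-point drop overstates the real gap somewhat.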
Summary
Scaling laws explain why models perform the way they do. Chinchilla showed that data matters as much as size. Today’s best teams go further — small models trained on massive data for cheap inference.
The dashboard we built covers what matters: benchmarks, pricing, context, and cost per point of quality. Use it as a start, then test on your own data.
Three rules to carry with you:
- Compare active parameters, not total. MoE models are smaller than they look.
- Optimize for cost-efficiency, not raw score. The cheapest model that meets your bar wins.
- Re-evaluate quarterly. This landscape moves fast.
Practice Exercise
Extend the dashboard yourself. Add Qwen 2.5 72B (or any model you’re curious about) to model_data. Look up its MMLU, HumanEval, pricing, and context window. Re-run the benchmark chart and pricing scatter to see where it falls.
References
- Kaplan, J. et al. — Scaling Laws for Neural Language Models. arXiv:2001.08361 (2020).
- Hoffmann, J. et al. — Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556 (2022). NeurIPS 2022.
- Artificial Analysis — LLM Leaderboard and Pricing. artificialanalysis.ai/models.
- OpenAI — GPT-5 Technical Report. openai.com.
- Anthropic — Claude Model Card. docs.anthropic.com.
- Meta — Llama 4 Model Card. llama.meta.com.
- Epoch AI — Chinchilla Scaling: A Replication Attempt. epoch.ai.
- Cameron R. Wolfe — Scaling Laws for LLMs: From GPT-3 to o3. Substack.
- LM Council — AI Model Benchmarks (March 2026). lmcouncil.ai/benchmarks.
