LLM Scaling Laws: Model Comparison Dashboard (2026)
Learn LLM scaling laws and compare GPT, Claude, Gemini, Llama, Mistral, and DeepSeek on benchmarks, pricing, and context windows with interactive Python charts.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser.
Build an interactive dashboard comparing GPT, Claude, Gemini, Llama, Mistral, and DeepSeek on parameters, benchmarks, pricing, and context windows — all in pure Python.
GPT-5. Claude Opus 4.6. Gemini 3. Llama 4. A new model drops every few months. Each one is bigger, faster, or cheaper than the last. But bigger doesn’t always mean better.
Model size, training data, and performance follow strict math called scaling laws. Miss these rules, and you burn millions on a model that a smaller one beats.
This article breaks down those laws. Then we’ll build a comparison dashboard — right in your browser — so you can see how today’s top models stack up.
Prerequisites
- Python version: 3.9+
- Required libraries: numpy (1.24+), matplotlib (3.7+)
- Install: pip install numpy matplotlib
- Pyodide: All code runs in-browser via Pyodide — no local install needed
- Time to complete: 20-25 minutes
What Are LLM Scaling Laws?
Scaling laws tell you what happens when you make a model bigger, feed it more data, or throw more compute at it. The core finding: performance gets better in a predictable way.
In 2020, OpenAI published the first big scaling laws paper. They found that loss (a fancy word for error) drops smoothly as you scale up. Here’s the math:
\[L(X) = \left(\frac{X_0}{X}\right)^\alpha\]
Where:
- \(L(X)\) = the model’s loss (lower is better)
- \(X\) = the scaling variable (parameters, data, or compute)
- \(X_0\) = a constant that depends on the variable
- \(\alpha\) = the scaling exponent (typically 0.05 to 0.1)
If the math isn’t your thing, skip ahead — the code below makes the concept concrete.
Here’s the power law in action. This code shows how loss drops as you go from 100M to 100B params. We use the values from the original Kaplan et al. paper:
import numpy as np

# Constants from Kaplan et al. (2020) for the parameter scaling law
parameter_counts = np.array([1e8, 5e8, 1e9, 5e9, 1e10, 5e10, 1e11])
alpha = 0.076   # scaling exponent for parameters
x0 = 8.8e13     # N_c, the constant for the parameter term

loss = (x0 / parameter_counts) ** alpha

labels = ["100M", "500M", "1B", "5B", "10B", "50B", "100B"]
for label, l in zip(labels, loss):
    print(f"Parameters: {label:>5s} → Loss: {l:.3f}")
Output:
Parameters:  100M → Loss: 2.830
Parameters:  500M → Loss: 2.504
Parameters:    1B → Loss: 2.376
Parameters:    5B → Loss: 2.102
Parameters:   10B → Loss: 1.994
Parameters:   50B → Loss: 1.765
Parameters:  100B → Loss: 1.674
See the pattern? Going from 100M to 1B (10x bigger) cuts loss by about 0.45. Going from 10B to 100B (also 10x) only cuts it by about 0.32. Each jump helps less.
Quick check: Doubling from 50B to 100B drops loss from 1.765 to 1.674 — a gain of just 0.091, roughly 5%. Would you spend 2x the compute for a 5% boost? That’s what scaling laws make you think about.
Key Insight: Scaling laws follow a power law, not a straight line. Doubling model size doesn’t halve the error. You get steady but shrinking gains — so brute-force scaling hits a cost wall fast.
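You can see the diminishing returns directly from the exponent. For a pure power law, every doubling of the scaling variable cuts loss by the same fixed fraction, no matter where you start. A quick check with the Kaplan exponent:

```python
# For L(X) = (X0 / X)^alpha, doubling X multiplies loss by 2**(-alpha),
# regardless of the starting size. With alpha = 0.076:
alpha = 0.076
reduction_per_doubling = 1 - 2 ** (-alpha)
print(f"Loss reduction per doubling: {reduction_per_doubling:.1%}")  # → 5.1%
```

Each doubling buys the same ~5% loss reduction, but each doubling costs twice as much compute as the last one. That asymmetry is the cost wall.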
Chinchilla Scaling Laws: The Training Data Revolution
The first scaling laws had a blind spot. They cared about model size but ignored training data. In 2022, DeepMind’s Chinchilla paper changed everything.
The key idea: for a fixed budget, grow model size and data equally. The rule was about 20 tokens per parameter. A 10B model needs 200B tokens.
Why does this matter? GPT-3 had 175B params but trained on just 300B tokens — only 1.7 per parameter. Way too few. DeepMind trained Chinchilla: 70B params on 1.4T tokens (20:1 ratio). Despite being 2.5x smaller, it beat GPT-3 on most tests.
This code shows how many tokens each model “should” have used at the 20:1 ratio, and what they really used:
models = {
    "GPT-3": {"params_b": 175, "tokens_t": 0.3},
    "Chinchilla": {"params_b": 70, "tokens_t": 1.4},
    "Llama 2 70B": {"params_b": 70, "tokens_t": 2.0},
    "Llama 3 8B": {"params_b": 8, "tokens_t": 15.0},
    "Llama 4 Scout": {"params_b": 17, "tokens_t": 12.0},
}

print(f"{'Model':<16} {'Params':>7} {'Actual':>8} {'Optimal':>8} {'Ratio':>7}")
print("-" * 50)
for name, m in models.items():
    optimal_t = m["params_b"] * 20 / 1000  # Chinchilla-optimal tokens, in trillions
    ratio = m["tokens_t"] / optimal_t
    print(f"{name:<16} {m['params_b']:>5.0f}B {m['tokens_t']:>6.1f}T {optimal_t:>6.1f}T {ratio:>6.1f}x")
Output:
Model Params Actual Optimal Ratio
--------------------------------------------------
GPT-3 175B 0.3T 3.5T 0.1x
Chinchilla 70B 1.4T 1.4T 1.0x
Llama 2 70B 70B 2.0T 1.4T 1.4x
Llama 3 8B 8B 15.0T 0.2T 93.8x
Llama 4 Scout 17B 12.0T 0.3T 35.3x
GPT-3 got just 0.1x the right amount. Chinchilla nailed 1.0x. But Llama 3 and Llama 4 use 35-94x more tokens than the rule says. What gives?
The field moved past Chinchilla toward inference-optimal training. When millions of users hit your API, a small model trained on tons of data gives the best bang per dollar. That’s why Llama 3 8B, trained on 15T tokens, punches way above its weight.
Warning: The “20 tokens per parameter” rule is outdated. Modern runs use 100-1,000+ tokens per parameter. Chinchilla saves on training cost. Real-world apps save on inference cost — and that’s where the money goes.
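The tradeoff is easy to sketch with rough FLOP accounting. The standard approximations are training cost ≈ 6·N·D FLOPs and inference cost ≈ 2·N FLOPs per generated token; the two configurations below are illustrative, not vendor numbers:

```python
# Rough FLOP accounting: training ~ 6*N*D, inference ~ 2*N per token
# (standard approximations; these configs are illustrative examples)
configs = {
    "70B on 1.4T tokens (Chinchilla-style)": (70e9, 1.4e12),
    "8B on 15T tokens (over-trained)": (8e9, 15e12),
}
for name, (n_params, n_tokens) in configs.items():
    train_flops = 6 * n_params * n_tokens
    infer_flops_per_token = 2 * n_params
    print(f"{name}: train ~{train_flops:.1e} FLOPs, "
          f"serve ~{infer_flops_per_token:.1e} FLOPs/token")
```

The over-trained 8B run costs somewhat more to train here, but every served token needs roughly 9x fewer FLOPs. Spread over billions of API calls, that is the whole inference-optimal argument.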
The Six Major LLM Families in 2026
Who’s building what? Before we code, here’s your cheat sheet. Six families run the show, each with a different bet.
| Family | Creator | Open/Closed | Philosophy |
|---|---|---|---|
| GPT | OpenAI | Closed | Frontier intelligence, API monetization |
| Claude | Anthropic | Closed | Safety-first, strong reasoning and coding |
| Gemini | Google | Closed | Multimodal-native, massive context windows |
| Llama | Meta | Open-weight | Democratize access, community ecosystem |
| Mistral | Mistral AI | Mix | European challenger, efficiency-focused |
| DeepSeek | DeepSeek | Open-weight | Cost-efficient reasoning, MoE architecture |
Each family ships at many sizes. OpenAI has GPT-5 (top tier) down to GPT-4o-mini (cheap). Anthropic runs Opus, Sonnet, and Haiku. Google has Gemini Pro and Flash. Meta ships Llama at 8B, 70B, and 405B.
One design stands out: mixture-of-experts (MoE). Llama 4 Scout and DeepSeek V3 both use it. Instead of using all params for every token, MoE sends each token to a small group of “expert” networks.
What does that mean in practice? Llama 4 Scout has 109B total params but only turns on 17B per token. You get 109B worth of knowledge at the cost of running 17B.
Tip: Always check if a param count is “total” or “active.” MoE models like Llama 4 Scout (109B total, 17B active) and DeepSeek V3 (671B total, 37B active) look huge on paper but run much cheaper than you’d think.
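A quick way to see the gap is to compute the active fraction for the two MoE models named above, using the parameter counts from this article:

```python
# Total vs. active parameters (billions) for the MoE models discussed above
moe_models = {"Llama 4 Scout": (109, 17), "DeepSeek V3": (671, 37)}
for name, (total_b, active_b) in moe_models.items():
    print(f"{name}: {active_b}B of {total_b}B active ({active_b / total_b:.0%} per token)")
```

Scout activates about 16% of its weights per token; DeepSeek V3 only about 6%. Compute cost tracks the active number, not the headline one.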
Building the Model Comparison Dataset
Time to code. We’ll build a dataset of 14 LLMs that drives every chart in this tutorial. The numbers come from model cards, pricing pages, and verified benchmarks as of March 2026.
We track seven columns: total and active param counts, context window, MMLU (general knowledge), HumanEval (coding), and input/output pricing:
model_data = {
"Model": [
"GPT-5.2", "GPT-4o", "GPT-4o-mini",
"Claude Opus 4.6", "Claude Sonnet 4.6", "Claude Haiku 3.5",
"Gemini 2.5 Pro", "Gemini 2.0 Flash",
"Llama 4 Scout", "Llama 3.3 70B",
"Mistral Large 2", "Mistral Small 3.1",
"DeepSeek V3", "DeepSeek R1",
],
"Family": [
"GPT", "GPT", "GPT",
"Claude", "Claude", "Claude",
"Gemini", "Gemini",
"Llama", "Llama",
"Mistral", "Mistral",
"DeepSeek", "DeepSeek",
],
"Params_B": [
None, None, None,
None, None, None,
None, None,
109, 70,
123, 24,
671, 671,
],
"Active_Params_B": [
None, None, None,
None, None, None,
None, None,
17, 70,
123, 24,
37, 37,
],
"Context_Window_K": [
128, 128, 128,
200, 200, 200,
1000, 1000,
10000, 128,
128, 128,
128, 128,
],
"MMLU": [
90.2, 88.7, 82.0,
89.5, 87.2, 75.1,
91.8, 83.5,
85.8, 86.0,
84.7, 81.3,
87.1, 90.8,
],
"HumanEval": [
90.5, 90.2, 87.2,
91.0, 85.0, 75.0,
87.5, 71.0,
79.0, 81.0,
78.5, 72.0,
85.0, 96.1,
],
"Price_In": [
1.75, 2.50, 0.15,
5.00, 3.00, 0.25,
1.25, 0.10,
0.11, 0.18,
2.00, 0.10,
0.27, 0.55,
],
"Price_Out": [
14.00, 10.00, 0.60,
25.00, 15.00, 1.25,
10.00, 0.40,
0.34, 0.18,
6.00, 0.30,
1.10, 2.19,
],
}
print(f"Dataset: {len(model_data['Model'])} models, {len(model_data)} columns")
print(f"Families: {sorted(set(model_data['Family']))}")
Output:
Dataset: 14 models, 9 columns
Families: ['Claude', 'DeepSeek', 'GPT', 'Gemini', 'Llama', 'Mistral']
Closed models (GPT, Claude, Gemini) don’t share their param counts, so those are None. Prices are in dollars per million tokens. Scores come from model papers and benchmarks like Artificial Analysis and LM Council.
Note: Benchmark scores shift fast. MMLU Pro (harder) and SWE-bench (real coding) are taking over. We use standard MMLU and HumanEval here so you can compare across all models.
Interactive Benchmark Comparison Chart
Here’s where it gets fun. We’ll build a grouped bar chart with MMLU and HumanEval for all 14 models. Two bars per model, sorted by MMLU, using bar() with an offset for side-by-side layout:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
models = model_data["Model"]
mmlu = np.array(model_data["MMLU"])
humaneval = np.array(model_data["HumanEval"])
sort_idx = np.argsort(mmlu)[::-1]
models_sorted = [models[i] for i in sort_idx]
mmlu_sorted = mmlu[sort_idx]
he_sorted = humaneval[sort_idx]
x = np.arange(len(models_sorted))
width = 0.35
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(x - width/2, mmlu_sorted, width, label="MMLU", color="#2563eb")
ax.bar(x + width/2, he_sorted, width, label="HumanEval", color="#16a34a")
ax.set_ylabel("Score (%)")
ax.set_title("LLM Benchmark Comparison: MMLU vs HumanEval (March 2026)")
ax.set_xticks(x)
ax.set_xticklabels(models_sorted, rotation=45, ha="right", fontsize=8)
ax.legend()
ax.set_ylim(65, 100)
ax.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()
The chart shows a clear split. DeepSeek R1 tops HumanEval at 96.1% while staying near the top on general knowledge (90.8% MMLU). Gemini 2.5 Pro leads MMLU at 91.8%. Claude Opus 4.6 is the most balanced: 89.5% MMLU and 91.0% HumanEval.
Predict the output: Which model has the biggest gap between the two scores? If you guessed DeepSeek R1 (+5.3 on HumanEval), you’re close. But Gemini 2.0 Flash has a bigger gap: 83.5% MMLU vs 71.0% HumanEval, a 12.5-point spread the other way.
So which model should you pick? For code generation, DeepSeek R1 is hard to beat. For a general assistant, GPT-5.2 or Claude Opus 4.6 give you strength on both fronts.
Interactive Pricing Dashboard
Benchmarks show what a model can do. Price shows what it costs. This scatter plot maps every model by input price (x) and output price (y). Dot size tracks MMLU score.
Big dots near the bottom-left? Great — high score, low cost. Small dots in the top-right? Skip those. Here’s the code:
family_colors = {
"GPT": "#10b981", "Claude": "#8b5cf6", "Gemini": "#f59e0b",
"Llama": "#3b82f6", "Mistral": "#ef4444", "DeepSeek": "#06b6d4",
}
fig, ax = plt.subplots(figsize=(10, 7))
for i, model in enumerate(model_data["Model"]):
    family = model_data["Family"][i]
    size = (model_data["MMLU"][i] - 70) * 15  # scale dot size by MMLU
    ax.scatter(
        model_data["Price_In"][i],
        model_data["Price_Out"][i],
        s=size, c=family_colors[family],
        alpha=0.7, edgecolors="white", linewidth=0.8,
    )
    ax.annotate(
        model, (model_data["Price_In"][i], model_data["Price_Out"][i]),
        fontsize=6, ha="left", va="bottom",
        xytext=(5, 5), textcoords="offset points",
    )

# Empty scatters just to build a one-dot-per-family legend
for family, color in family_colors.items():
    ax.scatter([], [], c=color, s=80, label=family, alpha=0.7)
ax.set_xlabel("Input Price ($/1M tokens)")
ax.set_ylabel("Output Price ($/1M tokens)")
ax.set_title("LLM Price vs Performance (dot size = MMLU score)")
ax.legend(loc="upper left", fontsize=8)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Three pricing tiers emerge clearly:
Budget tier (under $0.50 input): Gemini 2.0 Flash ($0.10), GPT-4o-mini ($0.15), Mistral Small 3.1 ($0.10), Llama 4 Scout ($0.11), DeepSeek V3 ($0.27). These handle high-volume production workloads.
Mid tier ($1-3 input): GPT-5.2 ($1.75), Gemini 2.5 Pro ($1.25), Claude Sonnet 4.6 ($3.00), Mistral Large 2 ($2.00). Strong performance at moderate cost.
Premium tier ($5+ input): Claude Opus 4.6 ($5.00). You pay a steep premium for Anthropic’s flagship.
The standout value? DeepSeek R1 at $0.55 input with the top HumanEval score. For coding tasks, it’s the best deal in AI right now.
Key Insight: The best model isn’t the top scorer — it’s the cheapest one that meets your quality bar. A $0.11/M-token model scoring 85% on your task beats a $5.00/M-token model scoring 89%.
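That rule is easy to turn into code. A minimal sketch, using a hypothetical shortlist of (MMLU score, blended price) pairs pulled from the article’s dataset — the quality bar of 86.0 is an assumed requirement, not a recommendation:

```python
# (MMLU score, blended $/1M tokens) for a hypothetical shortlist
candidates = {
    "Claude Opus 4.6": (89.5, 15.00),
    "GPT-5.2": (90.2, 7.88),
    "DeepSeek V3": (87.1, 0.69),
    "Llama 3.3 70B": (86.0, 0.18),
}
quality_bar = 86.0  # assumed minimum score your task tolerates

# Keep only models that meet the bar, then take the cheapest
meets_bar = {name: price for name, (score, price) in candidates.items()
             if score >= quality_bar}
winner = min(meets_bar, key=meets_bar.get)
print(f"Cheapest model meeting the bar: {winner} (${meets_bar[winner]:.2f}/M)")
```

With this shortlist the winner is Llama 3.3 70B: every candidate clears the bar, so price decides. Raise the bar to 90 and the answer flips to GPT-5.2.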
Context Window Comparison
How much text can a model read in one go? The range is huge — from 128K tokens to 10 million. We’ll use a log-scale bar chart to show them all:
ctx_models = model_data["Model"]
ctx_values = model_data["Context_Window_K"]
families = model_data["Family"]
sorted_pairs = sorted(zip(ctx_values, ctx_models, families))
ctx_sorted = [p[0] for p in sorted_pairs]
names_sorted = [p[1] for p in sorted_pairs]
fam_sorted = [p[2] for p in sorted_pairs]
fig, ax = plt.subplots(figsize=(10, 7))
colors = [family_colors[f] for f in fam_sorted]
bars = ax.barh(range(len(names_sorted)), ctx_sorted, color=colors, alpha=0.8)
ax.set_xscale("log")
ax.set_xlabel("Context Window (thousands of tokens, log scale)")
ax.set_title("LLM Context Windows (March 2026)")
ax.set_yticks(range(len(names_sorted)))
ax.set_yticklabels(names_sorted, fontsize=8)
ax.grid(axis="x", alpha=0.3)
for bar, val in zip(bars, ctx_sorted):
    label = f"{val:,}K"
    ax.text(bar.get_width() * 1.1, bar.get_y() + bar.get_height()/2,
            label, va="center", fontsize=7)
plt.tight_layout()
plt.show()
Llama 4 Scout’s 10M window dwarfs the rest. That’s about 7.5 million words — a full codebase in one call. Gemini 2.5 Pro sits at 1M. Claude gets 200K. Most others share 128K.
But more context isn’t always better. Longer inputs cost more and add latency. Models also tend to miss facts buried in the middle of very long texts — the “lost in the middle” problem. For most tasks, 128K (about 100K words) is plenty.
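The cost side is easy to underestimate. Here is a small sketch of what one maximally full prompt costs at each model’s input price, using the context windows and prices from the dataset above:

```python
# Cost to fill the entire context window once, at input-token prices
# (context in thousands of tokens, price in $/1M tokens, from the dataset)
windows = {
    "Llama 4 Scout": (10000, 0.11),
    "Gemini 2.5 Pro": (1000, 1.25),
    "Claude Opus 4.6": (200, 5.00),
}
for model, (ctx_k, price_in) in windows.items():
    cost = ctx_k * 1000 / 1_000_000 * price_in
    print(f"{model}: ${cost:.2f} per maxed-out prompt")
```

All three land near a dollar per full prompt, despite a 50x spread in window size. Do that on every request in a chat loop and the bill multiplies fast, which is another reason to send only the context you need.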
Scaling in Practice: Parameters vs. Performance
Do scaling laws hold up in practice? For open-weight models with known param counts, we can plot size vs. MMLU and check.
This code grabs models with known active params, plots each as a dot, and fits a trend line with np.polyfit:
known_params = []
known_mmlu = []
known_names = []
for i, model in enumerate(model_data["Model"]):
    if model_data["Active_Params_B"][i] is not None:
        known_params.append(model_data["Active_Params_B"][i])
        known_mmlu.append(model_data["MMLU"][i])
        known_names.append(model)
params_arr = np.array(known_params)
mmlu_arr = np.array(known_mmlu)
log_params = np.log10(params_arr)
coeffs = np.polyfit(log_params, mmlu_arr, 1)
trend_x = np.linspace(log_params.min() - 0.2, log_params.max() + 0.2, 50)
trend_y = np.polyval(coeffs, trend_x)
fig, ax = plt.subplots(figsize=(9, 6))
for i, name in enumerate(known_names):
    family = model_data["Family"][model_data["Model"].index(name)]
    ax.scatter(params_arr[i], mmlu_arr[i], s=120,
               c=family_colors[family], zorder=5, edgecolors="white")
    ax.annotate(name, (params_arr[i], mmlu_arr[i]),
                fontsize=7, xytext=(8, 5), textcoords="offset points")
ax.plot(10**trend_x, trend_y, "--", color="gray", alpha=0.5, label="Log trend")
ax.set_xscale("log")
ax.set_xlabel("Active Parameters (Billions)")
ax.set_ylabel("MMLU Score (%)")
ax.set_title("Scaling in Practice: Active Parameters vs. MMLU")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
The trend holds: more params usually means higher scores. But the outliers are what matter.
DeepSeek V3 (37B active) hits 87.1% — close to Llama 3.3 70B (86.0%) with half the params. Llama 4 Scout (17B active) scores 85.8%. Both punch above their weight thanks to MoE and lots of training data.
The takeaway: how you train matters as much as how big you build.
{
type: 'exercise',
id: 'scaling-ratio-ex1',
title: 'Exercise 1: Compute Training Efficiency Ratios',
difficulty: 'beginner',
exerciseType: 'write',
instructions: 'Given the model data below, compute the "tokens per active parameter" ratio for each model. This ratio tells you how much training data was used relative to model size. Print each model name and its ratio, sorted from highest to lowest ratio. Use the provided dictionaries.',
starterCode: '# Model training data\ntraining_info = {\n    "Llama 3 8B": {"active_b": 8, "tokens_t": 15.0},\n    "Llama 4 Scout": {"active_b": 17, "tokens_t": 12.0},\n    "Llama 2 70B": {"active_b": 70, "tokens_t": 2.0},\n    "DeepSeek V3": {"active_b": 37, "tokens_t": 14.8},\n    "Mistral Small 3.1": {"active_b": 24, "tokens_t": 8.0},\n}\n\n# Compute tokens_per_param = (tokens_t * 1e12) / (active_b * 1e9)\n# Sort by ratio descending and print\nratios = {}\nfor name, info in training_info.items():\n    pass  # Replace with your calculation\n\n# Sort and print\nfor name, ratio in sorted(ratios.items(), key=lambda x: x[1], reverse=True):\n    print(f"{name}: {ratio:.0f} tokens/param")',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'Llama 3 8B: 1875 tokens/param\nLlama 4 Scout: 706 tokens/param\nDeepSeek V3: 400 tokens/param\nMistral Small 3.1: 333 tokens/param\nLlama 2 70B: 29 tokens/param', description: 'Correct ratios computed and sorted' },
],
hints: [
'The formula is: (tokens_t * 1e12) / (active_b * 1e9), which simplifies to (tokens_t / active_b) * 1000',
'Full line: ratios[name] = (info["tokens_t"] / info["active_b"]) * 1000',
],
solution: 'ratios = {}\nfor name, info in training_info.items():\n    ratios[name] = (info["tokens_t"] / info["active_b"]) * 1000\n\nfor name, ratio in sorted(ratios.items(), key=lambda x: x[1], reverse=True):\n    print(f"{name}: {ratio:.0f} tokens/param")',
solutionExplanation: 'We divide total training tokens by active parameters. Since tokens are in trillions (1e12) and params in billions (1e9), the ratio simplifies to (tokens_t / active_b) * 1000. Llama 3 8B has the highest ratio at 1,875 tokens per parameter — nearly 100x the Chinchilla-optimal 20:1 ratio.',
xpReward: 15,
}
Cost-Efficiency Analysis: Performance Per Dollar
Raw scores don’t tell the full story. A model scoring 2% higher but costing 20x more? Not always worth it. I like to think of this as “bang for your buck.”
We’ll divide MMLU by the blended price (average of input + output). Higher = more score per dollar. Here’s every model ranked:
print(f"{'Model':<22} {'MMLU':>5} {'Blend $/M':>9} {'Efficiency':>10}")
print("-" * 50)
efficiency_data = []
for i, model in enumerate(model_data["Model"]):
    blend = (model_data["Price_In"][i] + model_data["Price_Out"][i]) / 2
    eff = model_data["MMLU"][i] / blend
    efficiency_data.append((model, model_data["MMLU"][i], blend, eff))
efficiency_data.sort(key=lambda x: x[3], reverse=True)
for model, mmlu, blend, eff in efficiency_data:
    print(f"{model:<22} {mmlu:>5.1f} {blend:>8.2f} {eff:>9.1f}")
Output:
Model MMLU Blend $/M Efficiency
--------------------------------------------------
Llama 3.3 70B 86.0 0.18 477.8
Mistral Small 3.1 81.3 0.20 406.5
Llama 4 Scout 85.8 0.22 381.3
Gemini 2.0 Flash 83.5 0.25 334.0
GPT-4o-mini 82.0 0.38 218.7
DeepSeek V3 87.1 0.69 127.2
Claude Haiku 3.5 75.1 0.75 100.1
DeepSeek R1 90.8 1.37 66.3
Mistral Large 2 84.7 4.00 21.2
Gemini 2.5 Pro 91.8 5.63 16.3
GPT-4o 88.7 6.25 14.2
GPT-5.2 90.2 7.88 11.5
Claude Sonnet 4.6 87.2 9.00 9.7
Claude Opus 4.6 89.5 15.00 6.0
Open-weight models own the top. Llama 3.3 70B leads — served via Together AI or Groq, you get 86% MMLU at rock-bottom cost.
For closed APIs, Gemini 2.0 Flash and GPT-4o-mini win on value. Claude Opus 4.6 sits last. It’s not bad — you’re paying for elite reasoning. For legal or medical tasks, that’s worth it. For bulk work, pick from the top.
Tip: Run this on YOUR task, not generic benchmarks. A model scoring 82% on MMLU might hit 95% on your use case. Build a test set of 50-100 real examples and re-rank by what matters to you.
The Full Dashboard: Side-by-Side Model Cards
Let’s pull it all into one view. This function prints a comparison card for any models you pick. It marks the best value in each row with an asterisk.
Pass in a list of names, and it looks up their stats. The higher_better flag tells it whether max or min gets the star:
def compare_models(names):
    indices = [model_data["Model"].index(n) for n in names]
    metrics = [
        ("Context Window", "Context_Window_K", "K tokens", False),
        ("MMLU Score", "MMLU", "%", True),
        ("HumanEval", "HumanEval", "%", True),
        ("Input Price", "Price_In", "$/M tok", False),
        ("Output Price", "Price_Out", "$/M tok", False),
    ]
    header = f"{'Metric':<16}" + "".join(f"{n:>18}" for n in names)
    print(header)
    print("-" * len(header))
    for label, key, unit, higher_better in metrics:
        values = [model_data[key][i] for i in indices]
        best = max(values) if higher_better else min(values)
        row = f"{label:<16}"
        for v in values:
            marker = " *" if v == best else "  "
            if isinstance(v, float) and v == int(v):
                row += f"{int(v):>14}{unit}{marker}"
            else:
                row += f"{v:>14}{unit}{marker}"
        print(row)
print("=== Flagship Models ===")
compare_models(["GPT-5.2", "Claude Opus 4.6", "Gemini 2.5 Pro"])
print()
print("=== Budget Models ===")
compare_models(["GPT-4o-mini", "Claude Haiku 3.5", "Gemini 2.0 Flash"])
This function is reusable. Want to compare open-weight options? Call compare_models(["Llama 4 Scout", "DeepSeek V3", "Mistral Large 2"]). Swap model names freely.
{
type: 'exercise',
id: 'cost-calc-ex2',
title: 'Exercise 2: Estimate Monthly API Cost',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Write a function monthly_cost(model_name, daily_requests, avg_input_tokens, avg_output_tokens) that calculates the estimated monthly API cost for a given model. Use the pricing data from model_data. Assume 30 days per month. Print the result formatted as shown in the test case.',
starterCode: '# model_data is already available from earlier code\n\ndef monthly_cost(model_name, daily_requests, avg_input_tokens, avg_output_tokens):\n    idx = model_data["Model"].index(model_name)\n    price_in = model_data["Price_In"][idx]  # $/1M tokens\n    price_out = model_data["Price_Out"][idx]  # $/1M tokens\n\n    # Calculate monthly cost\n    # Hint: daily tokens = daily_requests * avg tokens per request\n    # Monthly = daily * 30\n    # Cost = tokens / 1_000_000 * price_per_million\n    monthly = 0  # Replace this\n    return monthly\n\nresult = monthly_cost("GPT-4o-mini", 10000, 500, 200)\nprint(f"Monthly cost: ${result:.2f}")',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'Monthly cost: $58.50', description: 'GPT-4o-mini at 10K requests/day' },
],
hints: [
'Total monthly input tokens = daily_requests * avg_input_tokens * 30. Then divide by 1,000,000 and multiply by price_in.',
'monthly = 30 * daily_requests * (avg_input_tokens * price_in + avg_output_tokens * price_out) / 1_000_000',
],
solution: 'def monthly_cost(model_name, daily_requests, avg_input_tokens, avg_output_tokens):\n    idx = model_data["Model"].index(model_name)\n    price_in = model_data["Price_In"][idx]\n    price_out = model_data["Price_Out"][idx]\n    monthly = 30 * daily_requests * (avg_input_tokens * price_in + avg_output_tokens * price_out) / 1_000_000\n    return monthly\n\nresult = monthly_cost("GPT-4o-mini", 10000, 500, 200)\nprint(f"Monthly cost: ${result:.2f}")',
solutionExplanation: 'We compute daily token volume (requests x tokens per request), multiply by 30 for the month, convert to millions, then multiply by the per-million price. For GPT-4o-mini: input cost = 30 * 10000 * 500 * 0.15 / 1M = $22.50. Output cost = 30 * 10000 * 200 * 0.60 / 1M = $36.00. Total = $58.50.',
xpReward: 20,
}
When Scaling Laws Don’t Apply
Scaling laws are useful, but they don’t cover everything. Three cases where they break down:
Fine-tuning rewrites the rules. Scaling laws describe pretraining only. A fine-tuned Llama 3 8B can beat a raw GPT-5 on narrow tasks. A few thousand good examples matter more than billions of params.
Thinking longer beats being bigger. Models like DeepSeek R1 use chain-of-thought at inference time. They spend more time per answer instead of more params. Scaling laws don’t capture this at all.
Data quality beats data quantity. Training on 15T noisy web tokens gives worse results than 5T clean tokens. The scaling law math assumes data quality stays the same. It never does.
Warning: Don’t use scaling laws to predict too far ahead. A curve fit on 1B to 100B models won’t reliably tell you what happens at 1T params. The trend can flatten, shift, or hit new walls.
Common Mistakes When Comparing LLMs
Mistake 1: Comparing total parameters instead of active parameters
❌ Wrong thinking: “Llama 4 Scout (109B) must beat Llama 3.3 (70B) — it has more parameters.”
Why it’s wrong: Scout uses MoE. Only 17B of its 109B parameters activate per token. Llama 3.3 70B uses 4x more active compute. Always compare active parameters.
Mistake 2: Using a single benchmark to choose a model
❌ Wrong: “DeepSeek R1 tops HumanEval (96.1%), so it’s the best model.”
Why it’s wrong: HumanEval tests isolated coding puzzles. It doesn’t measure instruction following, safety, or conversation quality. Always test on your actual task.
Mistake 3: Ignoring output price when estimating costs
❌ Wrong calculation:
daily_cost = 10000 * 1000 * 3.00 / 1_000_000
print(f"Daily cost: ${daily_cost:.2f}")
Output:
Daily cost: $30.00
✅ Correct:
input_cost = 10000 * 1000 * 3.00 / 1_000_000
output_cost = 10000 * 500 * 15.00 / 1_000_000
daily_cost = input_cost + output_cost
print(f"Daily cost: ${daily_cost:.2f}")
Output:
Daily cost: $105.00
The real cost is 3.5x higher. Output tokens cost 3-5x more than input across every provider. Always account for both.
Frequently Asked Questions
Do scaling laws apply to fine-tuned models?
Scaling laws describe pretraining only. Fine-tuning uses a different regime. A few thousand quality examples can improve task performance regardless of model size.
# Fine-tuning doesn't follow parameter scaling laws
small_model_finetuned_accuracy = 0.94
large_model_generic_accuracy = 0.88
print(f"Fine-tuned 8B: {small_model_finetuned_accuracy:.0%}")
print(f"Generic 175B: {large_model_generic_accuracy:.0%}")
Output:
Fine-tuned 8B: 94%
Generic 175B: 88%
How often do LLM pricing and benchmarks change?
Rapidly. Prices dropped 5-10x between 2023 and 2026, and they’re still falling. New models ship every 2-3 months. Bookmark Artificial Analysis and check monthly.
Are open-weight models really free?
The weights are free. The GPUs aren’t. Hosting Llama 4 Scout needs multiple high-end GPUs. For most teams, API providers (Together AI, Fireworks, Groq) cost less than self-hosting unless you run millions of daily requests.
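The break-even math is worth sketching. All hardware numbers below are assumptions for illustration (rental rate, cluster size, throughput), not vendor quotes; the API price is Llama 4 Scout’s output price from the article’s dataset:

```python
# Illustrative self-hosting vs. API math; hardware numbers are assumptions
gpu_hourly = 2.50        # assumed $/GPU-hour rental rate
num_gpus = 8             # assumed cluster size to serve the model
tokens_per_sec = 2500    # assumed aggregate serving throughput
api_per_m = 0.34         # $/1M output tokens (Llama 4 Scout, from the article)

cluster_hourly = gpu_hourly * num_gpus
self_host_per_m = cluster_hourly / (tokens_per_sec * 3600) * 1_000_000
print(f"Self-hosted: ${self_host_per_m:.2f}/M tokens vs API: ${api_per_m:.2f}/M")

# Throughput needed for self-hosting to match the API price, fully utilized
breakeven_tps = cluster_hourly / (api_per_m * 3600) * 1_000_000
print(f"Break-even throughput: ~{breakeven_tps:,.0f} tokens/sec, 24/7")
```

Under these assumptions, self-hosting runs about $2.22/M tokens against the API’s $0.34/M, and break-even requires sustaining roughly 16,000 tokens per second around the clock. That is why self-hosting only pays off at very high, steady volume.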
What’s the difference between MMLU and MMLU Pro?
MMLU tests general knowledge across 57 subjects with 4-choice questions. MMLU Pro is harder — it uses 10 options and tougher questions. Models scoring 90%+ on MMLU often drop to 60-70% on MMLU Pro. Always check which version a benchmark report uses.
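One concrete consequence: the random-guess floor differs between the two, so raw scores are not directly comparable. A small sketch that normalizes scores above chance (the 90% and 65% figures are hypothetical examples):

```python
# Normalize raw accuracy above the random-guess baseline:
# normalized = (score - chance) / (1 - chance)
def above_chance(score, num_options):
    chance = 1 / num_options
    return (score - chance) / (1 - chance)

print(f"90% on MMLU (4 options): {above_chance(0.90, 4):.1%} above chance")
print(f"65% on MMLU Pro (10 options): {above_chance(0.65, 10):.1%} above chance")
```

A 90% MMLU score sits 86.7% of the way above its 25% baseline, while 65% on MMLU Pro is 61.1% above its 10% baseline, so the apparent 25-point drop overstates the real gap somewhat.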
Summary
Scaling laws explain why models perform the way they do. Chinchilla showed that data matters as much as size. Today’s best teams go further — small models trained on massive data for cheap inference.
The dashboard we built covers what matters: benchmarks, pricing, context, and cost per point of quality. Use it as a start, then test on your own data.
Three rules to carry with you:
- Compare active parameters, not total. MoE models are smaller than they look.
- Optimize for cost-efficiency, not raw score. The cheapest model that meets your bar wins.
- Re-evaluate quarterly. This landscape moves fast.
Practice Exercise
Extend the dashboard yourself. Add Qwen 2.5 72B (or any model you’re curious about) to model_data. Look up its MMLU, HumanEval, pricing, and context window. Re-run the benchmark chart and pricing scatter to see where it falls.
References
- Kaplan, J. et al. — Scaling Laws for Neural Language Models. arXiv:2001.08361 (2020).
- Hoffmann, J. et al. — Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556 (2022). NeurIPS 2022.
- Artificial Analysis — LLM Leaderboard and Pricing. artificialanalysis.ai/models.
- OpenAI — GPT-5 Technical Report. openai.com.
- Anthropic — Claude Model Card. docs.anthropic.com.
- Meta — Llama 4 Model Card. llama.meta.com.
- Epoch AI — Chinchilla Scaling: A Replication Attempt. epoch.ai.
- Cameron R. Wolfe — Scaling Laws for LLMs: From GPT-3 to o3. Substack.
- LM Council — AI Model Benchmarks (March 2026). lmcouncil.ai/benchmarks.
