
GPT vs Claude vs Gemini: Python Benchmark (2026)

Build a Python benchmarking harness to compare GPT-4o, Claude, Gemini, and Llama on quality, latency, and cost with LLM-as-judge and radar charts.

Written by Selva Prabhakaran | 30 min read

Leaderboards say one model is “best.” Your real tasks disagree. Here’s how to run your own benchmark and pick the right LLM.

⚡ This post has interactive code — click ▶ Run or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.

Why Public Benchmarks Fall Short

LMSYS says Claude wins at coding. Artificial Analysis puts Gemini ahead. OpenAI’s blog claims GPT-4o tops everything. So who’s right?

Here’s the catch. Public benchmarks test tasks you’ll never run. Your workload is unique. Maybe you need a model that writes Python with type hints. Or one that follows a strict style guide. Or one that sums up legal briefs in plain English.

The only test that counts is the one you run on your own prompts.

Key Insight: Public leaderboards rank models on tasks you’ll never run. The model that tops MMLU might finish last on your use case. Always benchmark on YOUR data.

That’s what we’ll build here. A reusable harness that sends tasks to four LLM providers, scores quality with an LLM-as-judge, tracks latency and cost, and draws a radar chart so you see the tradeoffs at a glance.

Prerequisites

  • Python version: 3.10+
  • Required libraries: openai (1.30+), anthropic (0.25+), google-generativeai (0.5+), requests, matplotlib, tabulate
  • Install: pip install openai anthropic google-generativeai requests matplotlib tabulate python-dotenv
  • API keys: OpenAI, Anthropic, Google Gemini (setup below). Llama runs via Ollama locally.
  • Time to complete: ~30 minutes
  • Cost: Under $0.50 total for all benchmark runs

How the Benchmark Harness Works

Before we write any code, let me walk you through the big picture.

You start with a set of test tasks. Each task has a prompt, a category, and grading rules. We use five categories: summary, code, instructions, creative writing, and reasoning. These cover most real LLM use cases.

The harness sends each task to every provider. For each call, it saves three things: the text response, the time it took, and the token count (for cost math).

Raw text isn’t enough though. You need quality scores. A human can’t grade hundreds of outputs by hand. So we use an LLM-as-judge. A strong model reads each response and rates it 1-10 on a clear rubric.

Finally, we group the scores by provider, compute averages, and plot a radar chart. One look tells you which model wins — and where.

Let’s build each piece.

Set Up Provider Clients

Each provider has its own SDK. We’ll wrap them behind a shared function so the rest of the code stays clean.

First, store your API keys in a .env file. Never put them in your scripts.

# .env file (create this in your project root)
# OPENAI_API_KEY=sk-your-key-here
# ANTHROPIC_API_KEY=sk-ant-your-key-here
# GOOGLE_API_KEY=your-google-key-here

The setup code loads these keys and creates a client for each provider. We also store the model name for each one in a simple dict.

import os
import time
import json
import requests
from dotenv import load_dotenv
from openai import OpenAI
import anthropic
import google.generativeai as genai

load_dotenv()

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
anthropic_client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
gemini_model = genai.GenerativeModel("gemini-2.0-flash")

OLLAMA_URL = "http://localhost:11434/api/generate"

PROVIDER_MODELS = {
    "openai": "gpt-4o",
    "anthropic": "claude-sonnet-4-20250514",
    "gemini": "gemini-2.0-flash",
    "llama": "llama3.1:8b",
}

Now the key piece: a single call_llm function. You pass it a provider name and a prompt. It calls the right API and returns a dict with the response text, latency, and token counts. Every provider returns the same shape — so all code downstream just works.

Here’s how it handles OpenAI and Anthropic. Both use chat-style APIs, but the SDK syntax differs slightly.

def call_llm(provider: str, prompt: str,
             system: str = "You are a helpful assistant.") -> dict:
    start = time.time()

    if provider == "openai":
        resp = openai_client.chat.completions.create(
            model=PROVIDER_MODELS["openai"],
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": prompt}],
            temperature=0.3,
        )
        text = resp.choices[0].message.content
        tokens_in = resp.usage.prompt_tokens
        tokens_out = resp.usage.completion_tokens

    elif provider == "anthropic":
        resp = anthropic_client.messages.create(
            model=PROVIDER_MODELS["anthropic"],
            max_tokens=1024, system=system,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
        )
        text = resp.content[0].text
        tokens_in = resp.usage.input_tokens
        tokens_out = resp.usage.output_tokens

Gemini and Llama follow the same pattern. Gemini uses its own SDK. Llama talks to Ollama’s REST API on localhost.

    elif provider == "gemini":
        resp = gemini_model.generate_content(prompt)
        text = resp.text
        tokens_in = resp.usage_metadata.prompt_token_count
        tokens_out = resp.usage_metadata.candidates_token_count

    elif provider == "llama":
        resp = requests.post(OLLAMA_URL, json={
            "model": PROVIDER_MODELS["llama"],
            "prompt": f"{system}\n\nUser: {prompt}",
            "stream": False,
        })
        data = resp.json()
        text = data["response"]
        tokens_in = data.get("prompt_eval_count", 0)
        tokens_out = data.get("eval_count", 0)

    else:
        raise ValueError(f"Unknown provider: {provider}")

    latency = time.time() - start
    return {"provider": provider, "text": text,
            "latency": latency, "tokens_in": tokens_in,
            "tokens_out": tokens_out}

Notice how every branch returns the same dict keys. That’s the key design choice. Nothing downstream cares which provider made the call.
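Because every branch emits the same keys, downstream code never branches on provider. Here's a tiny illustration with stubbed results (no API calls, illustrative values only):

```python
# Stubbed results in the shape call_llm returns (values are made up)
fake_results = [
    {"provider": "openai", "text": "hi", "latency": 0.8,
     "tokens_in": 12, "tokens_out": 3},
    {"provider": "llama", "text": "hello", "latency": 2.1,
     "tokens_in": 12, "tokens_out": 4},
]

# One loop handles every provider -- no special cases
for r in fake_results:
    print(f"{r['provider']}: {r['latency']:.1f}s, {r['tokens_out']} tokens out")
```

Any new provider you add later only has to honor this contract, and the rest of the harness picks it up for free.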

Tip: Set `temperature=0.3` for all providers when benchmarking. Lower values make outputs more stable. Your quality scores won’t bounce between runs. Test higher values later for creative tasks.

Define Your Benchmark Tasks

Good benchmarks start with good tasks. Generic prompts give generic results. Tasks that match your real work give answers you can trust.

We’ll write tasks across five categories. Each one is a dict with a category, a prompt, and grading criteria for the judge.

BENCHMARK_TASKS = [
    {
        "id": "summ-1",
        "category": "summarization",
        "prompt": (
            "Summarize in 2-3 sentences:\n\n"
            "Transfer learning lets a model trained on one task "
            "be reused on a related task. You start with a pretrained "
            "model and fine-tune it on your data. This cuts training "
            "time and data needs. Common methods include feature "
            "extraction (freeze pretrained layers, train a new head) "
            "and full fine-tuning (update all weights slowly)."
        ),
        "criteria": "Accurate, concise, hits all key points in 2-3 sentences.",
    },
    {
        "id": "code-1",
        "category": "code_generation",
        "prompt": (
            "Write a Python function called `find_duplicates` that "
            "takes a list and returns items that appear more than "
            "once, in the order they first repeat. Add type hints "
            "and a docstring."
        ),
        "criteria": "Correct Python, type hints, docstring, handles edge cases.",
    },
    {
        "id": "inst-1",
        "category": "instruction_following",
        "prompt": (
            "List exactly 5 benefits of unit testing. Each item: "
            "one sentence, starts with a verb, numbered 1-5. "
            "No intro or outro."
        ),
        "criteria": "Exactly 5 items, numbered, verb-first, no fluff.",
    },
    {
        "id": "creat-1",
        "category": "creative_writing",
        "prompt": "Write a 4-line poem about debugging at 2 AM. Rhyme: ABAB.",
        "criteria": "4 lines, ABAB rhyme, on-topic, creative quality.",
    },
    {
        "id": "reason-1",
        "category": "reasoning",
        "prompt": (
            "A farmer has 17 sheep. All but 9 die. "
            "How many are left? Show your steps."
        ),
        "criteria": "Correct answer (9), clear step-by-step logic.",
    },
]

Five tasks is a good start. For real decisions, I’d go with 5-10 per category. More tasks means less noise in the averages.
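You can see the noise-reduction effect with a quick simulation. This sketch models judge scores as noisy draws around a fixed "true" quality (synthetic numbers, purely illustrative) and compares how much 5-task and 20-task averages wobble:

```python
import random
import statistics

random.seed(0)  # reproducible simulation

def avg_score(n_tasks):
    # Each task score: "true" quality 7.5 plus judge/sampling noise
    return statistics.mean(random.gauss(7.5, 1.5) for _ in range(n_tasks))

# Spread of the category average across 200 repeated benchmark runs
spread_5 = statistics.stdev(avg_score(5) for _ in range(200))
spread_20 = statistics.stdev(avg_score(20) for _ in range(200))
print(f"5 tasks: +/-{spread_5:.2f}  |  20 tasks: +/-{spread_20:.2f}")
```

The 20-task averages cluster noticeably tighter, which is exactly why more tasks make provider rankings more trustworthy.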

Why these five types? They cover the full range of what teams ask LLMs to do. Summary tests compression. Code tests structured output. Instructions test strict rule-following. Creative tests flair. Reasoning tests pure logic.

Warning: Don’t use tasks from public datasets like MMLU or HumanEval. Providers train on those. Your scores won’t match real-world results. Write tasks that look like YOUR actual prompts.

Run the Benchmark Loop

The loop is simple. For each task, call each provider. Save the result. Print progress. If a call fails, log the error and move on.

PROVIDERS = ["openai", "anthropic", "gemini", "llama"]

def run_benchmark(tasks: list, providers: list) -> list:
    results = []
    total = len(tasks) * len(providers)
    count = 0

    for task in tasks:
        for provider in providers:
            count += 1
            print(f"[{count}/{total}] {provider} | {task['id']}...")
            try:
                result = call_llm(provider, task["prompt"])
                result["task_id"] = task["id"]
                result["category"] = task["category"]
                result["criteria"] = task["criteria"]
                results.append(result)
            except Exception as e:
                print(f"  ERROR: {e}")
                results.append({
                    "provider": provider,
                    "task_id": task["id"],
                    "category": task["category"],
                    "text": f"ERROR: {e}",
                    "latency": 0, "tokens_in": 0,
                    "tokens_out": 0,
                    "criteria": task["criteria"],
                })
    return results

benchmark_results = run_benchmark(BENCHMARK_TASKS, PROVIDERS)
print(f"\nDone: {len(benchmark_results)} calls completed.")

The try/except block matters. API calls fail — rate limits, timeouts, network blips. One bad call shouldn’t kill your whole run. The error gets logged, and the loop keeps going.
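For transient failures like rate limits, you can go one step further than log-and-skip and retry with backoff before giving up. A minimal sketch (not part of any provider SDK; you'd wrap `call_llm` with it or fold the loop into `run_benchmark`):

```python
import random
import time

def call_with_retry(fn, *args, retries=3, base_delay=1.0, **kwargs):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            # 1s, 2s, 4s, ... plus a little jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))

# Usage inside the loop would look like:
# result = call_with_retry(call_llm, provider, task["prompt"])
```

The existing try/except in `run_benchmark` still catches whatever survives the retries, so a persistently failing provider still can't kill the run.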

Score Quality with LLM-as-Judge

Reading 20 outputs by hand is slow. With 10 tasks and 4 providers, that’s 40 responses. Scale to 50 tasks and it’s 200. You need a judge that works fast.

LLM-as-judge does exactly that. You hand a strong model the task prompt, the response, and a rubric. It returns a 1-10 score with a short reason. Research shows top judges agree with humans 80-90% of the time. That’s on par with how well two human graders agree with each other.

Here’s the judge prompt. It tells the model exactly what to check and how to score.

JUDGE_PROMPT = """Score this LLM response from 1-10.

**Task:** {task_prompt}
**Criteria:** {criteria}

**Response:**
{response}

**Rubric:**
- 9-10: Excellent. Meets all criteria.
- 7-8: Good. Minor gaps.
- 5-6: OK. Basic needs met, clear gaps.
- 3-4: Poor. Big issues.
- 1-2: Fails the task.

Reply as JSON: {{"score": <int>, "reasoning": "<brief>"}}
"""

The judge_response function sends this to GPT-4o with temperature=0.0 for stable scores. The response_format flag forces JSON output so parsing never breaks.

def judge_response(task_prompt: str, criteria: str,
                   response: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        task_prompt=task_prompt,
        criteria=criteria,
        response=response,
    )
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

Key Insight: LLM-as-judge isn’t perfect, but it’s steady. A human might score the same response 7 on Monday and 8 on Friday. The judge gives 7 every time. For benchmarks, that matters more than being right on every call.
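Even with `response_format` forcing JSON, defensive parsing is cheap insurance on a long scoring run. A hedged helper sketch (`safe_parse_judgment` is my name, not an SDK function) that degrades a bad judge reply to a zero score instead of crashing:

```python
import json

def safe_parse_judgment(raw: str) -> dict:
    """Parse the judge's JSON reply, degrading gracefully on bad output."""
    try:
        data = json.loads(raw)
        return {"score": int(data.get("score", 0)),
                "reasoning": str(data.get("reasoning", ""))}
    except (json.JSONDecodeError, TypeError, ValueError):
        # One malformed reply gets a zero score instead of killing the run
        return {"score": 0,
                "reasoning": f"Unparseable judge output: {raw[:80]}"}

print(safe_parse_judgment('{"score": 8, "reasoning": "solid"}'))
print(safe_parse_judgment("not json at all"))
```

You could call it on `resp.choices[0].message.content` inside `judge_response` in place of the bare `json.loads`.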

Now let’s score all results. The loop calls the judge for each response and adds the score to the result dict.

def score_all_results(results: list, tasks: list) -> list:
    task_map = {t["id"]: t for t in tasks}

    for i, r in enumerate(results):
        if r["text"].startswith("ERROR"):
            r["score"] = 0
            r["reasoning"] = "API call failed"
            continue

        task = task_map[r["task_id"]]
        print(f"Judging [{i+1}/{len(results)}] "
              f"{r['provider']} | {r['task_id']}...")
        judgment = judge_response(
            task["prompt"], r["criteria"], r["text"])
        r["score"] = judgment["score"]
        r["reasoning"] = judgment["reasoning"]
    return results

scored_results = score_all_results(benchmark_results, BENCHMARK_TASKS)
print("All responses scored.")

Exercise 1: Add a New Benchmark Task (beginner)

Add a new "reasoning" task to the benchmark. It should ask why 0.1 + 0.2 != 0.3 in Python. Include clear grading criteria. Print the task dict.

Starter code:

# Add a floating-point reasoning task
new_task = {
    "id": "reason-2",
    "category": "reasoning",
    "prompt": ...,  # YOUR CODE: write the prompt
    "criteria": ...,  # YOUR CODE: write grading criteria
}
print(new_task["id"], "|", new_task["category"])
print("Prompt:", new_task["prompt"][:60], "...")
print("Criteria:", new_task["criteria"])

Hints:

  • Ask the model to explain the floating-point issue. Something like "Explain why 0.1 + 0.2 does not equal 0.3 in Python."
  • Full version: prompt="Explain why 0.1 + 0.2 != 0.3 in Python. Cover IEEE 754 and how to compare floats safely.", criteria="Explains IEEE 754, shows actual result, suggests math.isclose."

Solution:

new_task = {
    "id": "reason-2",
    "category": "reasoning",
    "prompt": "Explain why 0.1 + 0.2 != 0.3 in Python. Cover floating-point representation and how to compare floats safely.",
    "criteria": "Explains IEEE 754, shows the actual result, suggests math.isclose or rounding.",
}
print(new_task["id"], "|", new_task["category"])
print("Prompt:", new_task["prompt"][:60], "...")
print("Criteria:", new_task["criteria"])

Why it works: the task checks if an LLM can explain a classic Python gotcha. The criteria tell the judge to look for three things: the technical reason (IEEE 754), a demo (print 0.1 + 0.2), and a fix (math.isclose).

Calculate Cost Per Task

Quality alone doesn’t pick a winner. A model that scores 9/10 but costs 50x more might not be worth it. We need to factor in price.

Each provider charges per token. We store pricing in a dict and compute cost from the token counts we already captured. Llama is free — you just pay for electricity.

# Prices per 1M tokens (March 2026 -- update as needed)
PRICING = {
    "openai":    {"input": 2.50, "output": 10.00},
    "anthropic": {"input": 3.00, "output": 15.00},
    "gemini":    {"input": 0.10, "output": 0.40},
    "llama":     {"input": 0.00, "output": 0.00},
}

def calculate_cost(result: dict) -> float:
    pricing = PRICING[result["provider"]]
    cost_in = (result["tokens_in"] / 1_000_000) * pricing["input"]
    cost_out = (result["tokens_out"] / 1_000_000) * pricing["output"]
    return cost_in + cost_out

for r in scored_results:
    r["cost"] = calculate_cost(r)

Let’s print a quick cost table. It shows total spend per provider across all tasks.

from tabulate import tabulate

cost_data = {}
for r in scored_results:
    p = r["provider"]
    if p not in cost_data:
        cost_data[p] = {"total": 0, "tasks": 0}
    cost_data[p]["total"] += r["cost"]
    cost_data[p]["tasks"] += 1

rows = [[p, f"${d['total']:.6f}", d["tasks"]]
        for p, d in cost_data.items()]
print(tabulate(rows, headers=["Provider", "Total Cost", "Tasks"]))

Gemini Flash tends to cost 10-25x less than GPT-4o or Claude for the same tasks. But cost means nothing without quality context.

Tip: Compare cost per quality point, but only among models that clear your quality bar. A model at $0.001 that scores 5/10 looks cheap per point, yet if 5/10 output is unusable you are paying for work you can't ship. At $0.005 for a 9/10, you're buying output that actually passes. We'll build this ratio next.

Build the Report and Radar Chart

Time to pull it all together. The build_report function groups results by provider and category, then computes average score, latency, and cost for each group.

import statistics

def build_report(results: list) -> dict:
    report = {}
    for r in results:
        p, cat = r["provider"], r["category"]
        if p not in report:
            report[p] = {}
        if cat not in report[p]:
            report[p][cat] = {"scores": [], "latencies": [], "costs": []}
        report[p][cat]["scores"].append(r["score"])
        report[p][cat]["latencies"].append(r["latency"])
        report[p][cat]["costs"].append(r["cost"])

    for p in report:
        for cat in report[p]:
            d = report[p][cat]
            d["avg_score"] = statistics.mean(d["scores"])
            d["avg_latency"] = statistics.mean(d["latencies"])
            d["avg_cost"] = statistics.mean(d["costs"])
    return report

report = build_report(scored_results)

Here’s the summary table. It shows each provider’s score per category plus an overall average.

categories = ["summarization", "code_generation",
              "instruction_following", "creative_writing", "reasoning"]

header = ["Provider"] + [c.replace("_", " ").title()
          for c in categories] + ["Overall"]
rows = []
for provider in PROVIDERS:
    row = [provider]
    scores = []
    for cat in categories:
        if cat in report.get(provider, {}):
            avg = report[provider][cat]["avg_score"]
            row.append(f"{avg:.1f}")
            scores.append(avg)
        else:
            row.append("--")
    overall = statistics.mean(scores) if scores else 0
    row.append(f"{overall:.1f}")
    rows.append(row)

print(tabulate(rows, headers=header, tablefmt="grid"))

Numbers are useful. But a chart shows tradeoffs faster. The radar chart plots one polygon per provider on five axes — you instantly see strengths and gaps.

import matplotlib.pyplot as plt
import numpy as np

def plot_radar(report: dict, providers: list, categories: list):
    labels = [c.replace("_", " ").title() for c in categories]
    angles = np.linspace(0, 2 * np.pi, len(categories),
                         endpoint=False).tolist()
    angles += angles[:1]

    fig, ax = plt.subplots(figsize=(8, 8),
                           subplot_kw=dict(polar=True))
    colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

    for prov, color in zip(providers, colors):
        vals = [report.get(prov, {}).get(c, {}).get("avg_score", 0)
                for c in categories]
        vals += vals[:1]
        ax.plot(angles, vals, "o-", linewidth=2,
                label=prov, color=color)
        ax.fill(angles, vals, alpha=0.1, color=color)

    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels, size=11)
    ax.set_ylim(0, 10)
    ax.set_yticks([2, 4, 6, 8, 10])
    ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
    ax.set_title("LLM Provider Benchmark — Quality by Category",
                 size=14, pad=20)
    plt.tight_layout()
    plt.show()

plot_radar(report, PROVIDERS, categories)

The chart makes tradeoffs pop. GPT-4o and Claude tend to lead on code and instructions. Gemini Flash holds up on summaries at a tiny fraction of the cost. Llama 3.1 8B is solid on reasoning but weaker on creative tasks.

The Model Selection Framework

Here’s a practical framework for picking the right model.

Step 1: Set your quality bar. What’s the lowest score you’d accept? For customer-facing text, aim for 8+. For internal logs, 6 might be fine.

Step 2: Filter on quality. Drop any provider that falls below your bar on must-have categories. If code gen is your core use case and a model scores 5, it’s out.
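Step 2 is easy to express in code. A minimal sketch (the `QUALITY_BAR` value, `MUST_HAVE` list, and `passes_bar` helper are mine, chosen for illustration) that filters the report from `build_report`:

```python
# Hedged sketch: drop providers below a quality bar on must-have categories
QUALITY_BAR = 7.0
MUST_HAVE = ["code_generation"]

def passes_bar(report, provider):
    return all(
        report.get(provider, {}).get(c, {}).get("avg_score", 0) >= QUALITY_BAR
        for c in MUST_HAVE
    )

# Toy report with made-up averages
sample_report = {
    "openai": {"code_generation": {"avg_score": 8.2}},
    "llama": {"code_generation": {"avg_score": 5.1}},
}
survivors = [p for p in sample_report if passes_bar(sample_report, p)]
print(survivors)  # -> ['openai']
```

Only the survivors move on to the cost comparison in Step 3.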

Step 3: Compare cost among survivors. Of the models that pass, which one gives the best score per dollar?

def cost_efficiency(report: dict, providers: list) -> list:
    rows = []
    for p in providers:
        total_score, total_cost, count = 0, 0, 0
        for cat in report.get(p, {}):
            d = report[p][cat]
            total_score += d["avg_score"]
            total_cost += d["avg_cost"]
            count += 1
        if count > 0 and total_cost > 0:
            avg_s = total_score / count
            eff = avg_s / (total_cost * 1000)
            rows.append([p, f"{avg_s:.1f}", f"${total_cost:.6f}",
                         f"{eff:.1f}"])
        elif total_cost == 0:
            avg_s = total_score / count if count else 0
            rows.append([p, f"{avg_s:.1f}", "Free (local)", "N/A"])

    print(tabulate(rows, headers=["Provider", "Avg Score",
                                  "Total Cost", "Score/$0.001"]))
    return rows

cost_efficiency(report, PROVIDERS)

Step 4: Factor in latency for live use cases. A chatbot can’t wait 10 seconds. Batch jobs don’t care about speed.

Use Case             What Matters Most
Customer chatbot     Quality > Latency > Cost
Batch processing     Quality > Cost > Latency
Quick prototyping    Cost > Quality > Latency
Code generation CI   Quality > Cost > Latency

Key Insight: There’s no single “best” LLM. There’s only the best one for your use case, your budget, and your speed needs. The benchmark gives you real data so you stop guessing.

Exercise 2: Build a Provider Ranking Function (intermediate)

Write rank_providers(report, category) that returns provider names sorted by avg score (highest first). Break ties alphabetically.

Starter code:

def rank_providers(report: dict, category: str) -> list:
    # YOUR CODE: get avg_score per provider for this category
    # Sort by score desc, then name asc for ties
    pass

# Test data
sample = {
    "openai": {"summarization": {"avg_score": 8.5}},
    "anthropic": {"summarization": {"avg_score": 9.0}},
    "gemini": {"summarization": {"avg_score": 8.5}},
    "llama": {"summarization": {"avg_score": 7.0}},
}
print(rank_providers(sample, "summarization"))
# Expected: ['anthropic', 'gemini', 'openai', 'llama']

Hints:

  • Build (provider, score) pairs with a list comprehension, then sort with a key that negates the score.
  • pairs = [(p, report[p][cat]["avg_score"]) for p in report if cat in report[p]]; sort by (-score, name).

Solution:

def rank_providers(report: dict, category: str) -> list:
    pairs = [(p, report[p][category]["avg_score"])
             for p in report if category in report[p]]
    pairs.sort(key=lambda x: (-x[1], x[0]))
    return [p for p, _ in pairs]

Why it works: we build (name, score) tuples, but only for providers that have data in the given category. The sort key uses -score for descending order and the name for tiebreaking. Then we pull out just the names.

Make Your Benchmarks Better

One round is a start. Here’s what turns a toy test into a real eval.

Add more tasks. Five per category is a floor. With fewer, one odd response skews the average. Aim for 10+ per category when making real choices.

Run multiple trials. Even at low temperature, outputs vary a bit. Run each task 3 times and average the scores. This smooths out noise.
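Trial averaging is a small wrapper around the judge. A sketch, assuming `judge_fn` matches the `judge_response` signature above and returns a dict with a "score" key:

```python
import statistics

def averaged_score(judge_fn, task_prompt, criteria, response, trials=3):
    """Run the judge several times and average, smoothing sampling noise."""
    scores = [judge_fn(task_prompt, criteria, response)["score"]
              for _ in range(trials)]
    return statistics.mean(scores)

# Demo with a stub judge that wobbles between 7 and 8 across trials
wobble = iter([7, 8, 7])
stub_judge = lambda *args: {"score": next(wobble)}
print(averaged_score(stub_judge, "task", "criteria", "response"))
```

Swap `stub_judge` for the real `judge_response` and the per-task scores stop depending on a single lucky or unlucky sample.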

Rotate the judge. Using GPT-4o as judge adds bias — it may score its own style higher. Run the same eval with Claude as judge too. If rankings flip, you have a bias problem.

Warning: Judge bias is real. Research shows models rate their own outputs 0.5-1.0 points higher than a rival judge would. Validate with a second judge or spot-check by hand.

Pin model versions. Providers update models without notice. Record the full model name (with date suffix like gpt-4o-2024-08-06) so you can rerun later.

Test at your scale. A model that nails a 500-word summary might choke on 5,000 words. Include tasks at different lengths.

When NOT to Build a Custom Benchmark

This harness isn’t always the right call.

Skip it for prototyping. If you’re just exploring what LLMs can do, pick any good model and start building. Benchmarking before you have a clear use case wastes time.

Skip it for simple tasks. Basic Q&A, short summaries, text classification — the gap between top models is tiny. Pick the cheapest one and move on.

Use it when mistakes cost money. Customer-facing content, code gen in CI/CD, legal summaries, regulatory docs. That’s when the benchmark pays for itself.

Common Mistakes and How to Fix Them

Mistake 1: Same raw prompt for all providers

Wrong:

# Identical prompt to every provider
prompt = "Summarize this text"

Why it’s wrong: Each provider handles prompts differently. Claude likes XML tags. GPT likes role-based system messages. Identical prompts test “how well does this provider handle lazy input” — not “what’s the best each one can do.”

Correct:

# Either optimize per provider (fair ceiling test)
# or use identical prompts on purpose (robustness test)
# Be clear about which one you're measuring

Mistake 2: Counting the first call’s latency

Wrong: Using the first API call’s speed as your latency number.

Why it’s wrong: The first call warms up the connection. It’s often 2-5x slower. Your data will look wrong.

Correct: Run a throwaway call per provider first. Or drop the first result and start from the second.
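The warm-up fix is a few lines. A sketch (the `warm_up` helper is mine; it assumes `call_fn` matches `call_llm`'s signature):

```python
def warm_up(providers, call_fn, ping="Reply with the word OK."):
    """Fire one throwaway call per provider so the first measured call
    doesn't pay connection-setup cost."""
    for p in providers:
        try:
            call_fn(p, ping)  # response and timing are discarded
        except Exception as e:
            print(f"warm-up failed for {p}: {e}")
```

Call `warm_up(PROVIDERS, call_llm)` once before `run_benchmark` and your latency numbers reflect steady-state performance.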

Mistake 3: Judge model = benchmark contestant

Wrong: Using GPT-4o as both judge AND contestant.

Why it’s wrong: Self-eval bias. The model knows its own style and scores it higher. Your benchmark becomes “which model writes most like GPT-4o?”

Correct: Use a judge from a different provider. Or spot-check 10% of scores by hand.
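For the spot-check, a reproducible random sample beats eyeballing whatever comes first. A small sketch (`spot_check_sample` is my helper name, not a library function):

```python
import random

def spot_check_sample(results, frac=0.1, seed=42):
    """Pick a reproducible random subset of results for manual review."""
    rng = random.Random(seed)  # fixed seed: same sample every run
    k = max(1, int(len(results) * frac))
    return rng.sample(results, k)

# 10% of 50 scored results -> 5 items to grade by hand
picked = spot_check_sample(list(range(50)))
print(len(picked))  # -> 5
```

The fixed seed means you and a teammate review the same subset, so your manual scores are directly comparable to the judge's.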

Exercise 3: Add Latency Reporting (intermediate)

Write latency_summary(report) that returns a tabulate-formatted string showing avg latency (seconds, 2 decimals) per provider per category.

Starter code:

from tabulate import tabulate

def latency_summary(report: dict) -> str:
    cats = sorted({c for p in report for c in report[p]})
    # YOUR CODE: build rows with provider + latency per category
    pass

# Test
sample = {
    "openai": {"summarization": {"avg_latency": 1.23},
               "code_generation": {"avg_latency": 2.45}},
    "gemini": {"summarization": {"avg_latency": 0.89},
               "code_generation": {"avg_latency": 1.12}},
}
print(latency_summary(sample))

Hints:

  • Loop providers, then for each category check if it exists. Build a row list. Use tabulate to format.
  • rows = [[p] + [f"{report[p][c]['avg_latency']:.2f}" if c in report[p] else "--" for c in cats] for p in sorted(report)]

Solution:

from tabulate import tabulate

def latency_summary(report: dict) -> str:
    cats = sorted({c for p in report for c in report[p]})
    rows = []
    for p in sorted(report):
        row = [p]
        for c in cats:
            if c in report[p]:
                row.append(f"{report[p][c]['avg_latency']:.2f}")
            else:
                row.append("--")
        rows.append(row)
    hdrs = ["Provider"] + [c.replace("_", " ").title() for c in cats]
    return tabulate(rows, headers=hdrs)

Why it works: collect all unique categories, then for each provider grab the latency (or "--" if missing). tabulate handles the formatting.

Summary

You now have a working benchmark harness that does four things: sends tasks to multiple LLM providers through one function, scores quality with an LLM-as-judge, tracks cost from token pricing, and plots a radar chart for visual comparison.

Here’s what to remember:

  • Public benchmarks don’t predict how models do on YOUR tasks. Run your own.
  • LLM-as-judge matches human graders 80-90% of the time. Good enough for model picks.
  • Score per dollar beats raw quality or raw cost as a metric.
  • Use a different model for judging than the ones you’re testing.
  • Pin versions. Re-run monthly. Rankings shift fast.

Practice exercise: Add a sixth category called “data_analysis.” Write 3 tasks that give the model a small inline dataset and ask it to draw conclusions. Run the full pipeline, score everything, and update the radar chart with 6 axes.

Click to see a solution outline
data_tasks = [
    {
        "id": "data-1",
        "category": "data_analysis",
        "prompt": (
            "Monthly revenue: Jan $45K, Feb $42K, Mar $58K, "
            "Apr $51K, May $67K, Jun $63K. "
            "Find the trend, best month, and predict July."
        ),
        "criteria": "Spots upward trend, names May, gives a "
                    "reasonable July prediction with reasoning.",
    },
    {
        "id": "data-2",
        "category": "data_analysis",
        "prompt": (
            "Developer survey (100 people, multi-select): "
            "Python 68, JavaScript 52, Rust 15, Go 23, Java 31. "
            "Give your top 3 insights."
        ),
        "criteria": "Notes Python dominance, that total > 100 "
                    "(multi-select), and one non-obvious insight.",
    },
    {
        "id": "data-3",
        "category": "data_analysis",
        "prompt": (
            "A/B test: A had 1200 visits, 48 conversions. "
            "B had 1150 visits, 62 conversions. "
            "Which wins? Is it significant? Explain."
        ),
        "criteria": "Correct rates (A: 4.0%, B: 5.4%), discusses "
                    "significance, mentions sample size.",
    },
]

# Add to BENCHMARK_TASKS and re-run the full pipeline
BENCHMARK_TASKS.extend(data_tasks)
# benchmark_results = run_benchmark(...)
# scored_results = score_all_results(...)
# report = build_report(...)
# Update categories list, then plot_radar(...)

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code from: Benchmark LLM Providers in Python
# Requires: pip install openai anthropic google-generativeai requests matplotlib tabulate python-dotenv
# Python 3.10+

import os
import time
import json
import statistics
import requests
import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from openai import OpenAI
import anthropic
import google.generativeai as genai
from tabulate import tabulate

# --- Setup ---
load_dotenv()
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
anthropic_client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
gemini_model = genai.GenerativeModel("gemini-2.0-flash")
OLLAMA_URL = "http://localhost:11434/api/generate"

PROVIDER_MODELS = {
    "openai": "gpt-4o",
    "anthropic": "claude-sonnet-4-20250514",
    "gemini": "gemini-2.0-flash",
    "llama": "llama3.1:8b",
}
PRICING = {
    "openai":    {"input": 2.50, "output": 10.00},
    "anthropic": {"input": 3.00, "output": 15.00},
    "gemini":    {"input": 0.10, "output": 0.40},
    "llama":     {"input": 0.00, "output": 0.00},
}

# --- Unified Provider Call ---
def call_llm(provider, prompt, system="You are a helpful assistant."):
    start = time.time()
    if provider == "openai":
        resp = openai_client.chat.completions.create(
            model=PROVIDER_MODELS["openai"],
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": prompt}],
            temperature=0.3)
        text = resp.choices[0].message.content
        tokens_in = resp.usage.prompt_tokens
        tokens_out = resp.usage.completion_tokens
    elif provider == "anthropic":
        resp = anthropic_client.messages.create(
            model=PROVIDER_MODELS["anthropic"], max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3)
        text = resp.content[0].text
        tokens_in = resp.usage.input_tokens
        tokens_out = resp.usage.output_tokens
    elif provider == "gemini":
        # Gemini: fold the system prompt into the user prompt
        resp = gemini_model.generate_content(f"{system}\n\n{prompt}")
        text = resp.text
        tokens_in = resp.usage_metadata.prompt_token_count
        tokens_out = resp.usage_metadata.candidates_token_count
    elif provider == "llama":
        resp = requests.post(OLLAMA_URL, json={
            "model": PROVIDER_MODELS["llama"],
            "prompt": f"{system}\n\nUser: {prompt}",
            "stream": False})
        data = resp.json()
        text = data["response"]
        tokens_in = data.get("prompt_eval_count", 0)
        tokens_out = data.get("eval_count", 0)
    else:
        raise ValueError(f"Unknown provider: {provider}")
    latency = time.time() - start
    return {"provider": provider, "text": text, "latency": latency,
            "tokens_in": tokens_in, "tokens_out": tokens_out}

# --- Tasks ---
BENCHMARK_TASKS = [
    {"id": "summ-1", "category": "summarization",
     "prompt": "Summarize in 2-3 sentences:\n\nTransfer learning lets a model trained on one task be reused on a related task. You start with a pretrained model and fine-tune it on your data. This cuts training time and data needs. Common methods include feature extraction and full fine-tuning.",
     "criteria": "Accurate, concise, captures key points."},
    {"id": "code-1", "category": "code_generation",
     "prompt": "Write a Python function `find_duplicates(items: list) -> list` that returns elements appearing more than once, in order of first repeat. Add a docstring.",
     "criteria": "Correct Python, type hints, docstring, edge cases."},
    {"id": "inst-1", "category": "instruction_following",
     "prompt": "List exactly 5 benefits of unit testing. Each: one sentence, starts with a verb, numbered 1-5. No intro or outro.",
     "criteria": "Exactly 5, numbered, verb-first, no fluff."},
    {"id": "creat-1", "category": "creative_writing",
     "prompt": "Write a 4-line poem about debugging at 2 AM. ABAB rhyme.",
     "criteria": "4 lines, ABAB rhyme, on-topic, creative."},
    {"id": "reason-1", "category": "reasoning",
     "prompt": "A farmer has 17 sheep. All but 9 die. How many left? Show steps.",
     "criteria": "Answer is 9, clear reasoning."},
]

# --- Benchmark Loop ---
PROVIDERS = ["openai", "anthropic", "gemini", "llama"]

def run_benchmark(tasks, providers):
    results = []
    for i, (task, prov) in enumerate(
            [(t, p) for t in tasks for p in providers], 1):
        print(f"[{i}/{len(tasks)*len(providers)}] {prov} | {task['id']}...")
        try:
            r = call_llm(prov, task["prompt"])
            r.update(task_id=task["id"], category=task["category"],
                     criteria=task["criteria"])
            results.append(r)
        except Exception as e:
            print(f"  ERROR: {e}")
            results.append({"provider": prov, "task_id": task["id"],
                "category": task["category"], "text": f"ERROR: {e}",
                "latency": 0, "tokens_in": 0, "tokens_out": 0,
                "criteria": task["criteria"]})
    return results

# --- Judge ---
JUDGE_PROMPT = """Score this LLM response 1-10.
**Task:** {task_prompt}
**Criteria:** {criteria}
**Response:** {response}
**Rubric:** 9-10 Excellent | 7-8 Good | 5-6 OK | 3-4 Poor | 1-2 Fail
Reply JSON: {{"score": <int>, "reasoning": "<brief>"}}"""

def judge_response(task_prompt, criteria, response):
    prompt = JUDGE_PROMPT.format(
        task_prompt=task_prompt, criteria=criteria, response=response)
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0, response_format={"type": "json_object"})
    return json.loads(resp.choices[0].message.content)

def score_all_results(results, tasks):
    task_map = {t["id"]: t for t in tasks}
    for i, r in enumerate(results):
        if r["text"].startswith("ERROR"):
            r["score"], r["reasoning"] = 0, "API call failed"
            continue
        task = task_map[r["task_id"]]
        print(f"Judging [{i+1}/{len(results)}] {r['provider']}|{r['task_id']}")
        j = judge_response(task["prompt"], r["criteria"], r["text"])
        r["score"], r["reasoning"] = j["score"], j["reasoning"]
    return results

# --- Cost ---
def calculate_cost(r):
    pr = PRICING[r["provider"]]
    return (r["tokens_in"]/1e6)*pr["input"] + (r["tokens_out"]/1e6)*pr["output"]

# --- Report ---
def build_report(results):
    report = {}
    for r in results:
        p, cat = r["provider"], r["category"]
        report.setdefault(p, {}).setdefault(cat,
            {"scores": [], "latencies": [], "costs": []})
        report[p][cat]["scores"].append(r["score"])
        report[p][cat]["latencies"].append(r["latency"])
        report[p][cat]["costs"].append(r["cost"])
    for p in report:
        for cat in report[p]:
            d = report[p][cat]
            d["avg_score"] = statistics.mean(d["scores"])
            d["avg_latency"] = statistics.mean(d["latencies"])
            d["avg_cost"] = statistics.mean(d["costs"])
    return report

# --- Radar ---
def plot_radar(report, providers, categories):
    labels = [c.replace("_", " ").title() for c in categories]
    angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False).tolist()
    angles += angles[:1]
    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
    colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]
    for prov, col in zip(providers, colors):
        vals = [report.get(prov,{}).get(c,{}).get("avg_score",0)
                for c in categories] + [0]
        vals[-1] = vals[0]
        ax.plot(angles, vals, "o-", lw=2, label=prov, color=col)
        ax.fill(angles, vals, alpha=0.1, color=col)
    ax.set_xticks(angles[:-1]); ax.set_xticklabels(labels, size=11)
    ax.set_ylim(0, 10); ax.set_yticks([2,4,6,8,10])
    ax.legend(loc="upper right", bbox_to_anchor=(1.3,1.1))
    ax.set_title("LLM Provider Benchmark", size=14, pad=20)
    plt.tight_layout(); plt.show()

# --- Main ---
if __name__ == "__main__":
    results = run_benchmark(BENCHMARK_TASKS, PROVIDERS)
    for r in results: r["cost"] = calculate_cost(r)
    results = score_all_results(results, BENCHMARK_TASKS)
    report = build_report(results)
    cats = ["summarization","code_generation","instruction_following",
            "creative_writing","reasoning"]
    hdr = ["Provider"]+[c.replace("_"," ").title() for c in cats]+["Overall"]
    rows = []
    for p in PROVIDERS:
        row = [p]
        sc = [report[p][c]["avg_score"] for c in cats if c in report.get(p,{})]
        row += [f"{report[p][c]['avg_score']:.1f}" if c in report.get(p,{})
                else "--" for c in cats]
        row.append(f"{statistics.mean(sc):.1f}" if sc else "--")
        rows.append(row)
    print(tabulate(rows, headers=hdr, tablefmt="grid"))
    plot_radar(report, PROVIDERS, cats)
    print("\nScript completed successfully.")

Frequently Asked Questions

Can I use Claude or Gemini as the judge instead of GPT-4o?

Yes. Any strong model works. Using a judge from a different provider actually reduces bias. Swap the judge_response function to call anthropic_client.messages.create() or gemini_model.generate_content() and parse the JSON the same way.
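Whichever judge you pick, parse its reply defensively: some models wrap the JSON in a markdown fence or add prose around it. A small stdlib-only helper along these lines (hypothetical, not part of the script above) keeps the scoring loop from crashing:

```python
import json
import re

def parse_judge_json(raw: str) -> dict:
    """Extract the first JSON object from a judge reply,
    tolerating markdown code fences and surrounding prose."""
    # Prefer a ```json ... ``` fenced block if one exists
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    if not fenced:
        # Fall back to the first {...} span in the text
        brace = re.search(r"\{.*\}", candidate, re.DOTALL)
        if brace:
            candidate = brace.group(0)
    return json.loads(candidate)

print(parse_judge_json('```json\n{"score": 8, "reasoning": "solid"}\n```'))
print(parse_judge_json('Sure! {"score": 6, "reasoning": "ok"}'))
```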

How many tasks do I need for good results?

Five per category is the bare minimum. For real decisions, aim for 10-20. With 5 tasks, one outlier shifts the average by 2 points. With 20, it shifts by 0.5.
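You can verify that outlier math yourself. On a 1-10 scale, one task scored 0 against otherwise-perfect 10s moves a 5-task average four times as far as a 20-task average:

```python
from statistics import mean

perfect = 10.0
for n in (5, 20):
    scores = [perfect] * (n - 1) + [0.0]  # one zero-scored outlier
    shift = perfect - mean(scores)
    print(f"{n} tasks: one outlier shifts the average by {shift:.1f} points")
    # 5 tasks  -> 2.0 points
    # 20 tasks -> 0.5 points
```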

How often should I re-run benchmarks?

At least once a month. Providers update models quietly. GPT-4o from January may differ from March’s version. Pin model names with date tags and re-test after each update.

Does task order affect scores?

No. Each API call is standalone — no chat history carries over. But rate limits can slow later calls. Add time.sleep(1) between calls if you hit limits.
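Rather than sleeping after every call, you can sleep only when the previous call finished too recently. A minimal throttle sketch (the `min_interval` value is an assumption, tune it to your provider's rate limits):

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds between successive calls."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=0.05)  # tiny interval just for the demo
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # call_llm(...) would go here
print(f"3 throttled calls took {time.monotonic() - start:.2f}s")
```

In the benchmark loop you would create one `Throttle` per provider and call `throttle.wait()` right before `call_llm`, so a slow judge or a long response already counts toward the interval.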

How do I test a custom fine-tuned model?

Add a new branch in call_llm. The function just needs a prompt in and text + token counts out. If your model sits behind a REST endpoint, use requests.post() like we do for Ollama. The judge and report code works as-is.
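If your endpoint returns JSON, the only provider-specific work is mapping its fields onto the unified dict the harness expects. A sketch, assuming a hypothetical response shape of `{"text": ..., "input_tokens": ..., "output_tokens": ...}` (your endpoint's field names will differ):

```python
def adapt_custom_response(data: dict, latency: float) -> dict:
    """Map a hypothetical custom-endpoint payload onto the
    unified result shape that call_llm returns."""
    return {
        "provider": "custom",
        "text": data["text"],
        "latency": latency,
        "tokens_in": data.get("input_tokens", 0),
        "tokens_out": data.get("output_tokens", 0),
    }

# Inside call_llm, the new branch would look roughly like:
# resp = requests.post(CUSTOM_URL, json={"prompt": prompt}, timeout=60)
# result = adapt_custom_response(resp.json(), time.time() - start)

sample = {"text": "Hello!", "input_tokens": 12, "output_tokens": 3}
print(adapt_custom_response(sample, 0.42))
```

Set the custom provider's pricing to zero in `PRICING` (or to your hosting cost per million tokens) and everything downstream, judging, cost, and the radar chart, works unchanged.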

References

  1. OpenAI API Docs — Chat Completions.
  2. Anthropic API Docs — Messages.
  3. Google Gemini API Docs.
  4. Zheng, L. et al. — “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023.
  5. Artificial Analysis — LLM Leaderboard.
  6. Ollama Docs — Local LLM Inference.
  7. LMSYS Chatbot Arena.
  8. Evidently AI — “LLM-as-a-Judge: A Complete Guide.”

Reviewed: March 2026 | Python: 3.10+ | openai: 1.30+ | anthropic: 0.25+ | google-generativeai: 0.5+

