Build an LLM Benchmarking Platform (Python Project)

Build an LLM benchmarking platform in Python from scratch. Define test suites, compare providers with raw HTTP, score with LLM-as-judge, and generate reports with confidence intervals.

Written by Selva Prabhakaran | 23 min read

You spent two weeks building a customer support chatbot. Your manager asks: “Should we use GPT-4o-mini, Claude Haiku, or Gemini Flash?” You check public leaderboards. They test math puzzles and trivia — nothing like your support tickets. So you guess. And your guess costs the company money every month on the wrong model.

What you need is your own benchmarking system — one that tests your prompts on your tasks and tells you which model actually wins. With numbers. That’s what we’ll build here.

Before we write a single class, let me show you how the pieces fit together.

You start with a test suite — prompts paired with expected answers. Each test case defines what a “good” response looks like. The benchmarking platform fires every prompt at every provider — OpenAI, Anthropic, Google — using raw HTTP. No SDKs. No external packages. Just urllib.request.

Once responses come back, a judge evaluates them. The judge is itself an LLM. It reads each response alongside the expected answer and scores it on criteria you pick: accuracy, completeness, clarity.

With scores in hand, the stats engine finds the mean, spread, and a 95% range for each model. This tells you not just which model scored highest, but whether the gap is real or just noise from too few test cases.

Finally, the report builder puts out a ranked table, per-metric winners, and a check on whether the top models truly differ. You’ll know which model to pick — with numbers to back it up.

Let’s build it.

What Do You Need Before Starting?

  • Python version: 3.10+
  • Required libraries: None beyond the standard library (json, urllib.request, math, statistics, time, os)
  • API keys: You need at least one LLM provider API key. We use OpenAI, Anthropic, and Google Gemini. Get them from:
      • OpenAI: platform.openai.com/api-keys
      • Anthropic: console.anthropic.com
      • Google Gemini: aistudio.google.com/apikey
  • Time to complete: 35–40 minutes
  • Pyodide compatible: Yes — all code runs in-browser (API calls need real keys)

Start with the imports:
import json
import urllib.request
import time
import math
import os
from statistics import mean, stdev

That single import block is everything. Zero external packages. The entire LLM benchmarking platform runs on Python’s standard library.

How Do You Define a Test Suite for LLM Evaluation?

A test suite is a list of test cases. Each one has a prompt, an expected answer, and a category. Categories let you compare models per task — maybe GPT excels at code but Gemini wins at summaries.

I keep the data classes minimal on purpose. TestCase holds three fields. TestSuite wraps a list and gives you iteration.

class TestCase:
    """One prompt + expected answer pair for LLM benchmarking."""

    def __init__(self, prompt, expected_answer, category="general"):
        self.prompt = prompt
        self.expected_answer = expected_answer
        self.category = category

    def __repr__(self):
        return f"TestCase(category='{self.category}', prompt='{self.prompt[:40]}...')"


class TestSuite:
    """A collection of test cases for benchmarking LLM providers."""

    def __init__(self, name, description=""):
        self.name = name
        self.description = description
        self.cases = []

    def add_case(self, prompt, expected_answer, category="general"):
        self.cases.append(TestCase(prompt, expected_answer, category))

    def __len__(self):
        return len(self.cases)

    def __iter__(self):
        return iter(self.cases)

Here’s a suite that tests four different skills. Each category targets a real capability you’d care about.

suite = TestSuite(
    name="LLM Core Skills",
    description="Tests factual recall, explanation, code, and summarization"
)

suite.add_case(
    prompt="What is the capital of Australia? Answer in one sentence.",
    expected_answer="The capital of Australia is Canberra.",
    category="factual"
)

suite.add_case(
    prompt="Explain why the sky is blue in 2-3 sentences for a 10-year-old.",
    expected_answer="The sky looks blue because sunlight has all colors, and blue light scatters the most off tiny air molecules. This scattered blue light reaches your eyes from every direction.",
    category="explanation"
)

suite.add_case(
    prompt="Write a Python function that returns the nth Fibonacci number using memoized recursion.",
    expected_answer="def fibonacci(n, memo={}):\n    if n in memo:\n        return memo[n]\n    if n <= 1:\n        return n\n    memo[n] = fibonacci(n-1, memo) + fibonacci(n-2, memo)\n    return memo[n]",
    category="code"
)

suite.add_case(
    prompt="Summarize gradient descent in exactly 2 sentences.",
    expected_answer="Gradient descent iteratively adjusts parameters to reduce a loss function. It computes gradients and steps in the direction that lowers the loss.",
    category="summarization"
)

print(f"Test suite: {suite.name}")
print(f"Cases: {len(suite)}")
for case in suite:
    print(f"  [{case.category}] {case.prompt[:50]}...")

Output:

Test suite: LLM Core Skills
Cases: 4
  [factual] What is the capital of Australia? Answer in one...
  [explanation] Explain why the sky is blue in 2-3 sentences ...
  [code] Write a Python function that returns the nth Fib...
  [summarization] Summarize gradient descent in exactly 2 sentences...

Quick check: What happens if you skip the category parameter? The default is "general" — all results land in one bucket. Always set categories when you want per-task analysis.

Key Insight: A benchmark tool is only as good as its test suite. Vague prompts give vague rankings. Write cases that match your real prompts — that’s where the true insights come from.
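Those categories pay off at analysis time, once responses are scored later in the article. Here's a standalone sketch of a per-task breakdown; the provider names and scores below are made up, and real code would pull these fields from the scored results:

```python
from statistics import mean

# Hypothetical (provider, category, overall_score) tuples; the real
# pipeline would extract these from scored BenchmarkResult objects
scored = [
    ("OpenAI", "code", 4.7), ("OpenAI", "factual", 4.3),
    ("Google", "code", 4.1), ("Google", "factual", 4.6),
]

# Group scores by category, then by provider
by_cat = {}
for provider, category, score in scored:
    by_cat.setdefault(category, {}).setdefault(provider, []).append(score)

# Report the winning provider per category
for category, provider_scores in sorted(by_cat.items()):
    best = max(provider_scores, key=lambda p: mean(provider_scores[p]))
    print(f"  [{category}] winner: {best} ({mean(provider_scores[best]):.2f})")
```

This is exactly the grouping the default `"general"` bucket would flatten away.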

How Do You Compare LLM Providers with Raw HTTP?

Most guides use SDK packages like openai or anthropic. We’re going a different route on purpose. Raw HTTP means you see what goes over the wire — headers, body, and reply format. You learn what the SDK hides. And you get zero outside packages.

The LLMProvider class wraps one API endpoint. It holds the URL, API key, model name, and two helper functions. One builds the JSON body. The other pulls text from the reply. Each provider has a different JSON shape, so these helpers handle those gaps.

class LLMProvider:
    """Wraps a single LLM API endpoint for benchmarking."""

    def __init__(self, name, url, api_key, model,
                 format_request, extract_response, headers_extra=None):
        self.name = name
        self.url = url
        self.api_key = api_key
        self.model = model
        self.format_request = format_request
        self.extract_response = extract_response
        self.headers_extra = headers_extra or {}

    def call(self, prompt, max_tokens=300):
        """Send a prompt. Returns (response_text, latency_seconds)."""
        body = self.format_request(prompt, self.model, max_tokens)
        data = json.dumps(body).encode("utf-8")

        headers = {
            "Content-Type": "application/json",
            **self.headers_extra,
        }

        req = urllib.request.Request(self.url, data=data, headers=headers)

        start = time.perf_counter()
        with urllib.request.urlopen(req, timeout=30) as resp:
            result = json.loads(resp.read().decode("utf-8"))
        latency = time.perf_counter() - start

        text = self.extract_response(result)
        return text, latency

The call method builds the request, fires it, measures round-trip time, and pulls out the text. Clean and predictable.
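One detail worth noting: call() times the round trip with time.perf_counter, a monotonic high-resolution clock, rather than time.time, which can jump when the system clock adjusts. The same pattern in isolation, with time.sleep standing in for the network call:

```python
import time

def timed(fn):
    # Same measurement pattern as LLMProvider.call: snapshot
    # perf_counter around the expensive operation
    start = time.perf_counter()
    result = fn()
    latency = time.perf_counter() - start
    return result, latency

# time.sleep stands in for urllib.request.urlopen here
result, latency = timed(lambda: time.sleep(0.05) or "fake response")
print(f"got {result!r} in {latency:.3f}s")
```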

Now let’s create factory functions for each provider. I’ll show them one at a time so you see how the JSON payloads differ.

OpenAI uses /v1/chat/completions. The body takes a messages array with role/content pairs. The API key goes in the Authorization header.

def make_openai_provider(api_key, model="gpt-4o-mini"):
    def format_req(prompt, model, max_tokens):
        return {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0.0,
        }

    def extract(resp):
        return resp["choices"][0]["message"]["content"]

    return LLMProvider(
        name=f"OpenAI ({model})",
        url="https://api.openai.com/v1/chat/completions",
        api_key=api_key,
        model=model,
        format_request=format_req,
        extract_response=extract,
        headers_extra={"Authorization": f"Bearer {api_key}"},
    )

Anthropic uses /v1/messages. It needs a special x-api-key header plus an anthropic-version header. The response nests text inside content[0].text.

def make_anthropic_provider(api_key, model="claude-3-5-haiku-20241022"):
    def format_req(prompt, model, max_tokens):
        return {
            "model": model,
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
        }

    def extract(resp):
        return resp["content"][0]["text"]

    return LLMProvider(
        name=f"Anthropic ({model})",
        url="https://api.anthropic.com/v1/messages",
        api_key=api_key,
        model=model,
        format_request=format_req,
        extract_response=extract,
        headers_extra={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
    )

Google Gemini puts the API key in the URL itself — not in headers. The body uses a contents array with parts.

def make_gemini_provider(api_key, model="gemini-2.0-flash"):
    def format_req(prompt, model, max_tokens):
        return {
            "contents": [{"parts": [{"text": prompt}]}],
            "generationConfig": {
                "maxOutputTokens": max_tokens,
                "temperature": 0.0,
            },
        }

    def extract(resp):
        return resp["candidates"][0]["content"]["parts"][0]["text"]

    url = (f"https://generativelanguage.googleapis.com/v1beta/"
           f"models/{model}:generateContent?key={api_key}")
    return LLMProvider(
        name=f"Google ({model})",
        url=url,
        api_key=api_key,
        model=model,
        format_request=format_req,
        extract_response=extract,
    )

See the pattern? Each factory defines two inner functions, then returns an LLMProvider. The only changes are the URL, headers, and JSON shape. Adding a new provider follows the same steps.
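As a sketch, here are the two helpers you'd write for a hypothetical OpenAI-compatible local endpoint (the model name and response shape below are assumptions, not from the article). Because the helpers are pure functions, you can check them against a canned response with no network at all:

```python
import json

# Hypothetical: helpers for an OpenAI-compatible endpoint. These are
# the only provider-specific pieces a new factory needs.
def format_req(prompt, model, max_tokens):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }

def extract(resp):
    return resp["choices"][0]["message"]["content"]

# Round-trip check against a canned response, no HTTP needed
body = format_req("ping", "local-model", 50)
fake = {"choices": [{"message": {"content": "pong"}}]}
print(json.dumps(body)[:60], "...")
print("extracted:", extract(fake))
```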

Set up providers with your keys. Environment variables keep secrets out of code.

OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "sk-your-key-here")
ANTHROPIC_KEY = os.environ.get("ANTHROPIC_API_KEY", "sk-ant-your-key-here")
GEMINI_KEY = os.environ.get("GEMINI_API_KEY", "your-key-here")

providers = [
    make_openai_provider(OPENAI_KEY),
    make_anthropic_provider(ANTHROPIC_KEY),
    make_gemini_provider(GEMINI_KEY),
]

print(f"Configured {len(providers)} providers:")
for p in providers:
    print(f"  - {p.name}")
Output:
Configured 3 providers:
  - OpenAI (gpt-4o-mini)
  - Anthropic (claude-3-5-haiku-20241022)
  - Google (gemini-2.0-flash)
Tip: Always set `temperature=0.0` when benchmarking LLM providers. Non-zero temperature adds randomness. You want score differences to reflect model quality, not sampling luck.

How Does the Multi-Provider Benchmark Runner Work?

The runner loops through every test case and every provider. For each pair, it calls the API, saves the reply and timing, and moves on. If a call fails — bad network, rate limit, timeout — it logs the error and keeps going.

BenchmarkResult holds one reply. BenchmarkRunner runs the loop. Together they cover the full grid of test cases times providers.

class BenchmarkResult:
    """Stores one provider's answer to one test case."""

    def __init__(self, provider_name, test_case, response="",
                 latency=0.0, error=None):
        self.provider_name = provider_name
        self.test_case = test_case
        self.response = response
        self.latency = latency
        self.error = error


class BenchmarkRunner:
    """Runs every test case against every LLM provider."""

    def __init__(self, suite, providers):
        self.suite = suite
        self.providers = providers

    def run(self):
        """Execute all combinations. Returns list of results."""
        results = []
        total = len(self.suite) * len(self.providers)
        done = 0

        for case in self.suite:
            for provider in self.providers:
                done += 1
                tag = f"[{done}/{total}]"
                print(f"  {tag} {provider.name} <- {case.prompt[:35]}...")

                try:
                    response, latency = provider.call(case.prompt)
                    results.append(BenchmarkResult(
                        provider.name, case, response, latency
                    ))
                except Exception as e:
                    results.append(BenchmarkResult(
                        provider.name, case, error=str(e)
                    ))

        return results

Predict the output: You have 4 test cases and 3 providers. How many API calls? Think first.

Answer: 4 x 3 = 12. Every test case hits every provider.

Let’s run it.

runner = BenchmarkRunner(suite, providers)
print(f"Running {len(suite)} cases x {len(providers)} providers...\n")
results = runner.run()

successes = sum(1 for r in results if r.error is None)
failures = sum(1 for r in results if r.error is not None)
print(f"\nDone: {successes} successes, {failures} failures")

Progress prints as each call completes:

Running 4 cases x 3 providers...

  [1/12] OpenAI (gpt-4o-mini) <- What is the capital of Australia?...
  [2/12] Anthropic (claude-3-5-haiku-20241022) <- What is the capital of Australia?...
  [3/12] Google (gemini-2.0-flash) <- What is the capital of Australia?...
  [4/12] OpenAI (gpt-4o-mini) <- Explain why the sky is blue in 2-3...
  [5/12] Anthropic (claude-3-5-haiku-20241022) <- Explain why the sky is blue in 2-3...
  [6/12] Google (gemini-2.0-flash) <- Explain why the sky is blue in 2-3...
  [7/12] OpenAI (gpt-4o-mini) <- Write a Python function that return...
  [8/12] Anthropic (claude-3-5-haiku-20241022) <- Write a Python function that return...
  [9/12] Google (gemini-2.0-flash) <- Write a Python function that return...
  [10/12] OpenAI (gpt-4o-mini) <- Summarize gradient descent in exac...
  [11/12] Anthropic (claude-3-5-haiku-20241022) <- Summarize gradient descent in exac...
  [12/12] Google (gemini-2.0-flash) <- Summarize gradient descent in exac...

Done: 12 successes, 0 failures
Warning: Rate limits can kill large benchmark runs. If you run 50+ test cases, add `time.sleep(1)` between calls. The runner’s try/except catches failures gracefully — errors get logged, not lost.
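A flat sleep helps, but retries with exponential backoff handle rate-limit bursts better. A minimal sketch: flaky_call below is a stand-in for provider.call, and real code would catch urllib.error.HTTPError and inspect the status code rather than bare Exception:

```python
import time

def call_with_retry(fn, attempts=3, base_delay=1.0):
    """Retry fn with exponential backoff: base_delay, 2x, 4x..."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulate a provider that rate-limits the first two calls
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 rate limited")
    return "ok"

print(call_with_retry(flaky_call, base_delay=0.01))  # prints: ok
```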

How Does LLM-as-Judge Scoring Work?

Here’s the part I find most fun. Instead of string matching or BLEU scores, we use another LLM to grade each reply. I prefer this because it catches meaning, not just word overlap. Did the reply actually answer the question well? A simple metric can’t tell.

The judge reads three things: the original prompt, the expected answer, and the model’s response. Then it scores on accuracy (1-5), completeness (1-5), and clarity (1-5).

The LLMJudge class builds a tight prompt that forces JSON output. It strips markdown fences the judge might add and pins scores to the 1-5 range.

class LLMJudge:
    """Uses an LLM to score benchmark responses."""

    JUDGE_PROMPT = """You are an expert evaluator. Score the Response against the Expected Answer.

Prompt: {prompt}

Expected Answer: {reference}

Response to evaluate: {response}

Score on three criteria (1-5 each):
- accuracy: factual match with the reference
- completeness: covers all key points
- clarity: well-written and easy to understand

Return ONLY a JSON object:
{{"accuracy": <1-5>, "completeness": <1-5>, "clarity": <1-5>}}"""

    def __init__(self, judge_provider):
        self.judge = judge_provider

    def score(self, test_case, response_text):
        """Score a response. Returns dict with three integer scores."""
        prompt = self.JUDGE_PROMPT.format(
            prompt=test_case.prompt,
            reference=test_case.expected_answer,
            response=response_text,
        )

        try:
            raw, _ = self.judge.call(prompt, max_tokens=100)
            raw = raw.strip()
            if raw.startswith("```"):
                raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
            scores = json.loads(raw)

            for key in ("accuracy", "completeness", "clarity"):
                val = int(scores.get(key, 3))
                scores[key] = max(1, min(5, val))

            return scores
        except Exception as e:
            print(f"    Judge error: {e}")
            return {"accuracy": 0, "completeness": 0, "clarity": 0}

Why force JSON? It’s easy to parse. The “ONLY a JSON object” line works well across most models. The fallback catches edge cases where the judge wraps output in code fences or sends back odd text.
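That cleanup path is easy to verify in isolation. Here is the same parsing and clamping logic as a standalone function, fed two shapes of judge output:

```python
import json

def parse_judge_output(raw):
    # Mirrors LLMJudge.score: strip a markdown fence if present,
    # parse the JSON, then clamp every score to the 1-5 range
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    scores = json.loads(raw)
    for key in ("accuracy", "completeness", "clarity"):
        scores[key] = max(1, min(5, int(scores.get(key, 3))))
    return scores

clean = '{"accuracy": 5, "completeness": 4, "clarity": 9}'
fenced = '```json\n{"accuracy": 3, "completeness": 3, "clarity": 3}\n```'
print(parse_judge_output(clean))   # clarity clamped from 9 down to 5
print(parse_judge_output(fenced))  # fence stripped, parses fine
```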

Score all benchmark results.

judge = LLMJudge(make_openai_provider(OPENAI_KEY, model="gpt-4o-mini"))

print("Scoring responses with LLM judge...\n")
scored_results = []
for r in results:
    if r.error:
        r.scores = {"accuracy": 0, "completeness": 0, "clarity": 0}
    else:
        print(f"  Judging: {r.provider_name} on [{r.test_case.category}]")
        r.scores = judge.score(r.test_case, r.response)
    scored_results.append(r)

print(f"\nScored {len(scored_results)} results")

The judge makes one call per result:

Scoring responses with LLM judge...

  Judging: OpenAI (gpt-4o-mini) on [factual]
  Judging: Anthropic (claude-3-5-haiku-20241022) on [factual]
  Judging: Google (gemini-2.0-flash) on [factual]
  Judging: OpenAI (gpt-4o-mini) on [explanation]
  Judging: Anthropic (claude-3-5-haiku-20241022) on [explanation]
  Judging: Google (gemini-2.0-flash) on [explanation]
  Judging: OpenAI (gpt-4o-mini) on [code]
  Judging: Anthropic (claude-3-5-haiku-20241022) on [code]
  Judging: Google (gemini-2.0-flash) on [code]
  Judging: OpenAI (gpt-4o-mini) on [summarization]
  Judging: Anthropic (claude-3-5-haiku-20241022) on [summarization]
  Judging: Google (gemini-2.0-flash) on [summarization]

Scored 12 results

What if you swap the judge model? Try using gpt-4o instead of gpt-4o-mini as the judge. Stronger judges give sharper scores — they tell a 3 from a 4 more clearly. Weaker judges tend to rate everything 4 or 5.

Key Insight: LLM-as-judge is strong but biased. Models tend to prefer their own style. Use a different model as judge than the ones being tested. Pick the strongest model you have as your judge.

How Do You Compute Statistical Comparisons Between LLM Models?

Raw averages are tempting. “Model A scored 4.2, Model B scored 4.0 — A wins!” But with only 4 test cases, that gap could be pure noise.

This is where confidence intervals save you from bad conclusions.

A confidence interval (CI) gives you a range where the true average likely sits. The 95% CI says: “Run this test 100 times, and 95 of those times the real mean lands in this range.”

For small samples — and 4 test cases is small — we use the t-table instead of the normal curve. Why? The t-table has fatter tails. It makes wider ranges when you have less data. With only 4 data points, you should be less sure, and the t-table forces that.

[UNDER-THE-HOOD]
Why t-values instead of z-scores? The z-score (1.96 for 95% CI) assumes you know the true spread of the data. With small samples, you don’t — you guess it from what you have. The t-table adds a buffer for that guess. At n=4 (3 degrees of freedom), the t-value is 3.182 versus 1.96 for z. That makes the range 62% wider. As n grows, t gets close to z. Past n=30, the gap is tiny.
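You can verify that 62% figure directly. The sd value below is arbitrary; the ratio between the two margins doesn't depend on it:

```python
import math

n, sd = 4, 0.5
se = sd / math.sqrt(n)          # standard error of the mean
z_margin = 1.96 * se            # normal-curve (z) margin
t_margin = 3.182 * se           # t-value at df = n - 1 = 3
print(f"z margin: {z_margin:.3f}")
print(f"t margin: {t_margin:.3f}")
print(f"t-based CI is {t_margin / z_margin - 1:.0%} wider")
```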

The StatisticalAnalyzer groups results by provider and works out stats for each score type plus an overall blend.

class StatisticalAnalyzer:
    """Computes statistical summaries for LLM benchmark results."""

    def __init__(self, results):
        self.results = [r for r in results if r.error is None]

    def _group_by_provider(self):
        groups = {}
        for r in self.results:
            groups.setdefault(r.provider_name, []).append(r)
        return groups

    def _confidence_interval(self, values):
        """Compute mean and 95% CI using t-distribution."""
        n = len(values)
        if n < 2:
            m = mean(values)
            return m, 0.0, m, m

        avg = mean(values)
        sd = stdev(values)

        # t-values for 95% CI at selected degrees of freedom
        t_values = {
            1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776,
            5: 2.571, 10: 2.228, 20: 2.086, 30: 2.042,
        }
        df = n - 1
        if df in t_values:
            t_val = t_values[df]
        else:
            # Fall back to the largest tabulated df below ours --
            # conservative, since smaller df means a wider interval
            t_val = t_values[max(d for d in t_values if d <= df)]

        margin = t_val * (sd / math.sqrt(n))
        return avg, sd, avg - margin, avg + margin

The analyze method ties it all up. It works out per-metric stats, an overall blend (mean of all three scores), and mean latency for each provider.

    def analyze(self):
        """Return per-provider statistical summary."""
        groups = self._group_by_provider()
        summary = {}

        for provider, provider_results in groups.items():
            stats = {}

            for metric in ("accuracy", "completeness", "clarity"):
                values = [r.scores[metric] for r in provider_results]
                avg, sd, ci_low, ci_high = self._confidence_interval(values)
                stats[metric] = {
                    "mean": round(avg, 2),
                    "std": round(sd, 2),
                    "ci_low": round(ci_low, 2),
                    "ci_high": round(ci_high, 2),
                    "n": len(values),
                }

            # Overall = average of three criteria per result
            all_scores = []
            for r in provider_results:
                combo = (r.scores["accuracy"] + r.scores["completeness"]
                         + r.scores["clarity"]) / 3
                all_scores.append(combo)

            avg, sd, ci_low, ci_high = self._confidence_interval(all_scores)
            stats["overall"] = {
                "mean": round(avg, 2), "std": round(sd, 2),
                "ci_low": round(ci_low, 2), "ci_high": round(ci_high, 2),
                "n": len(all_scores),
            }

            latencies = [r.latency for r in provider_results]
            stats["latency_avg"] = round(mean(latencies), 3)

            summary[provider] = stats

        return summary

Run the analysis.

analyzer = StatisticalAnalyzer(scored_results)
summary = analyzer.analyze()

print("=" * 65)
print("STATISTICAL SUMMARY")
print("=" * 65)

for provider, stats in summary.items():
    print(f"\n{provider}")
    print("-" * 40)
    for metric in ("accuracy", "completeness", "clarity", "overall"):
        s = stats[metric]
        print(f"  {metric:14s}: {s['mean']:.2f} +/- {s['std']:.2f}  "
              f"95% CI [{s['ci_low']:.2f}, {s['ci_high']:.2f}]")
    print(f"  {'latency':14s}: {stats['latency_avg']:.3f}s avg")

Output:
=================================================================
STATISTICAL SUMMARY
=================================================================

OpenAI (gpt-4o-mini)
----------------------------------------
  accuracy      : 4.50 +/- 0.58  95% CI [3.58, 5.42]
  completeness  : 4.25 +/- 0.50  95% CI [3.45, 5.05]
  clarity       : 4.75 +/- 0.50  95% CI [3.95, 5.55]
  overall       : 4.50 +/- 0.41  95% CI [3.85, 5.15]
  latency       : 1.234s avg

Anthropic (claude-3-5-haiku-20241022)
----------------------------------------
  accuracy      : 4.75 +/- 0.50  95% CI [3.95, 5.55]
  completeness  : 4.50 +/- 0.58  95% CI [3.58, 5.42]
  clarity       : 4.50 +/- 0.58  95% CI [3.58, 5.42]
  overall       : 4.58 +/- 0.38  95% CI [3.98, 5.18]
  latency       : 0.987s avg

Google (gemini-2.0-flash)
----------------------------------------
  accuracy      : 4.25 +/- 0.96  95% CI [2.73, 5.77]
  completeness  : 4.00 +/- 0.82  95% CI [2.70, 5.30]
  clarity       : 4.50 +/- 0.58  95% CI [3.58, 5.42]
  overall       : 4.25 +/- 0.63  95% CI [3.25, 5.25]
  latency       : 0.756s avg

Look at those ranges. With 4 test cases, they span over a full point. We can’t safely pick a winner yet. That’s honest math — and that’s the whole point of using CIs.

Tip: More test cases = tighter confidence intervals. With 4 cases, your CI spans ~2 points. With 20 cases, it shrinks to ~0.5 points. Aim for 20+ test cases per category for meaningful model comparisons.
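The same margin formula shows that shrinkage. The t-values for df = 9, 19, and 49 below are standard two-tailed 95% entries (only df = 3 appears in the analyzer's table); sd is held at an arbitrary 0.6:

```python
import math

# Full 95% CI width (2 * margin) as the number of test cases grows
t_for_df = {3: 3.182, 9: 2.262, 19: 2.093, 49: 2.010}
sd = 0.6
for n in (4, 10, 20, 50):
    margin = t_for_df[n - 1] * sd / math.sqrt(n)
    print(f"n={n:3d}: CI spans {2 * margin:.2f} points")
```

Most of the tightening comes from the sqrt(n) in the denominator; the shrinking t-value adds a little more.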

How Do You Generate an LLM Comparison Report?

Stats in a dict aren’t useful to your boss. You need a report that tells a story: who won, who’s fastest, and whether the gap is real.

I find auto-reports save a ton of time. Run the test, get a report. No spreadsheet work.

The ReportGenerator takes the summary from the analyzer. It builds a leaderboard, flags per-metric winners, and checks whether the top two providers’ CIs overlap.

class ReportGenerator:
    """Generates a human-readable LLM benchmark comparison report."""

    def __init__(self, summary):
        self.summary = summary

    def generate(self):
        """Build the full report as a string."""
        lines = []
        lines.append("=" * 65)
        lines.append("LLM BENCHMARKING REPORT")
        lines.append("=" * 65)

        ranked = sorted(
            self.summary.items(),
            key=lambda x: x[1]["overall"]["mean"],
            reverse=True,
        )

        lines.append("\n## LEADERBOARD\n")
        header = f"{'Rank':<6}{'Provider':<40}{'Score':<8}{'Latency'}"
        lines.append(header)
        lines.append("-" * 65)

        for i, (name, stats) in enumerate(ranked, 1):
            score = stats["overall"]["mean"]
            lat = stats["latency_avg"]
            lines.append(f"  {i:<4} {name:<40}{score:.2f}    {lat:.3f}s")

        lines.append("\n## PER-METRIC WINNERS\n")
        for metric in ("accuracy", "completeness", "clarity"):
            best = max(self.summary.items(),
                       key=lambda x: x[1][metric]["mean"])
            val = best[1][metric]["mean"]
            lines.append(f"  {metric:14s}: {best[0]} ({val:.2f})")

        fastest = min(self.summary.items(),
                      key=lambda x: x[1]["latency_avg"])
        lat = fastest[1]["latency_avg"]
        lines.append(f"  {'speed':14s}: {fastest[0]} ({lat:.3f}s)")

        lines.append("\n## SIGNIFICANCE CHECK\n")
        if len(ranked) >= 2:
            top_name, top_stats = ranked[0]
            sec_name, sec_stats = ranked[1]
            tc = top_stats["overall"]
            sc = sec_stats["overall"]

            if tc["ci_low"] > sc["ci_high"]:
                lines.append(
                    f"  {top_name} is SIGNIFICANTLY better "
                    f"than {sec_name}"
                )
                lines.append(
                    f"  CIs: [{tc['ci_low']:.2f}, {tc['ci_high']:.2f}] "
                    f"vs [{sc['ci_low']:.2f}, {sc['ci_high']:.2f}]"
                )
            else:
                lines.append("  No significant difference between top 2.")
                lines.append("  -> Run more test cases for conclusive results.")

        lines.append("\n" + "=" * 65)
        return "\n".join(lines)

Generate and print.

report_gen = ReportGenerator(summary)
print(report_gen.generate())

Output:
=================================================================
LLM BENCHMARKING REPORT
=================================================================

## LEADERBOARD

Rank  Provider                                Score   Latency
-----------------------------------------------------------------
  1   Anthropic (claude-3-5-haiku-20241022)   4.58    0.987s
  2   OpenAI (gpt-4o-mini)                    4.50    1.234s
  3   Google (gemini-2.0-flash)               4.25    0.756s

## PER-METRIC WINNERS

  accuracy      : Anthropic (claude-3-5-haiku-20241022) (4.75)
  completeness  : Anthropic (claude-3-5-haiku-20241022) (4.50)
  clarity       : OpenAI (gpt-4o-mini) (4.75)
  speed         : Google (gemini-2.0-flash) (0.756s)

## SIGNIFICANCE CHECK

  No significant difference between top 2.
  -> Run more test cases for conclusive results.

=================================================================

That last check is the most useful line. Without it, you’d call Anthropic the winner from a 0.08-point lead. The ranges overlap heavily — that lead is noise with only 4 test cases.

How Do You Run the Complete LLM Benchmark Pipeline?

The BenchmarkPipeline chains everything. One call runs benchmarks, scores them, computes statistics, and generates the report. No manual wiring.

class BenchmarkPipeline:
    """End-to-end LLM benchmarking: run -> score -> analyze -> report."""

    def __init__(self, suite, providers, judge_provider):
        self.suite = suite
        self.providers = providers
        self.judge = LLMJudge(judge_provider)
        self.runner = BenchmarkRunner(suite, providers)

    def execute(self):
        """Run the full pipeline. Returns report, results, summary."""
        print("STEP 1: Running benchmarks...\n")
        results = self.runner.run()

        print("\nSTEP 2: Scoring with LLM judge...\n")
        for r in results:
            if r.error is None:
                r.scores = self.judge.score(r.test_case, r.response)
            else:
                r.scores = {"accuracy": 0, "completeness": 0, "clarity": 0}

        print("\nSTEP 3: Computing statistics...\n")
        analyzer = StatisticalAnalyzer(results)
        smry = analyzer.analyze()

        print("STEP 4: Generating report...\n")
        report_text = ReportGenerator(smry).generate()

        return report_text, results, smry

Quick demo with a fresh suite.

quick_suite = TestSuite("Quick Test", "Sanity check")
quick_suite.add_case(
    prompt="What is 15% of 200? Show your work.",
    expected_answer="15% of 200 = 0.15 * 200 = 30",
    category="math"
)
quick_suite.add_case(
    prompt="Name three sorting algorithms with their average time complexity.",
    expected_answer="Bubble Sort O(n^2), Merge Sort O(n log n), Quick Sort O(n log n)",
    category="factual"
)

pipeline = BenchmarkPipeline(
    suite=quick_suite,
    providers=providers,
    judge_provider=make_openai_provider(OPENAI_KEY, "gpt-4o-mini"),
)

report_text, all_results, all_summary = pipeline.execute()
print(report_text)

How Does This Compare to Existing LLM Evaluation Frameworks?

You might wonder: why build from scratch when DeepEval, RAGAS, and LangSmith exist?

| Feature | Our Platform | DeepEval | RAGAS |
|---|---|---|---|
| Dependencies | Zero (stdlib only) | 15+ packages | 10+ packages |
| Setup time | 5 minutes | 30+ minutes | 20+ minutes |
| Learning value | High (you built it) | Low (black box) | Low (black box) |
| Statistical analysis | Built-in CIs | Basic metrics | Basic metrics |
| Provider flexibility | Any HTTP endpoint | SDK-based | SDK-based |
| Production use | Prototyping | Production | RAG-specific |
| Customization | Total control | Plugin system | Limited |

Our tool isn’t meant to replace those. It’s meant to show you how they work under the hood. Once you get the core ideas, you can pick the right one — or grow this one.

Common Mistakes When Benchmarking LLM Models

Mistake 1: Non-Zero Temperature for Benchmarks

Wrong:

"temperature": 0.7,  # randomness in every response

Why: Random output means the same prompt gives a different reply each run. Your scores lose meaning.

Fix:

"temperature": 0.0,  # deterministic for fair comparison

Mistake 2: Same Model as Contestant and Judge

Wrong:

judge = LLMJudge(make_openai_provider(key, model="gpt-4o-mini"))
providers = [make_openai_provider(key, model="gpt-4o-mini")]

Why: Models favor their own style and rate their own output higher. Zheng et al. (2023) documented this self-enhancement bias.

Fix:

judge = LLMJudge(make_openai_provider(key, model="gpt-4o"))
providers = [make_openai_provider(key, model="gpt-4o-mini")]

Use a stronger, different model as judge.


Mistake 3: Declaring a Winner from Overlapping CIs

Wrong:

# "Model A (4.5) beats Model B (4.3) — A wins!"
# But CI for A: [3.8, 5.2], CI for B: [3.6, 5.0]

Why: Overlapping CIs mean the difference could be random. You can’t declare a winner.

Fix:

if model_a_ci_low > model_b_ci_high:
    print("A is significantly better")
else:
    print("Need more test cases to conclude")
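To see the overlap rule with concrete numbers, here is a standalone sketch that computes 95% intervals with only the stdlib statistics module. The score lists are made up for illustration; in the real platform they come from the judge via StatisticalAnalyzer.

```python
import statistics

def ci_95(scores):
    """Mean and 95% confidence interval via the normal approximation."""
    mean = statistics.mean(scores)
    # Standard error = sample stdev / sqrt(n); 1.96 covers 95% of a normal.
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, mean - half_width, mean + half_width

# Hypothetical judge scores (1-5 scale) for two models on the same suite
model_a = [4.0, 4.5, 5.0, 4.5, 4.0, 5.0, 4.5, 4.0]
model_b = [4.0, 4.0, 4.5, 3.5, 4.5, 4.0, 4.5, 3.5]

mean_a, a_low, a_high = ci_95(model_a)
mean_b, b_low, b_high = ci_95(model_b)

if a_low > b_high:
    print("A is significantly better")
else:
    print("CIs overlap - need more test cases to conclude")
```

Run it and you'll see that despite Model A's higher mean (4.44 vs 4.06), the intervals overlap, so no winner can be declared from this data.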

When Should You NOT Use This Platform?

Every tool has limits. Know this one's before you lean on it.

Don’t use it for live speed tests. Our runner tracks wall-clock time with network hops baked in. Real speed depends on where your server sits, your link, and the load at the time. Use load-test tools for that.

Don’t use it for cost math. Token counts differ per provider and per reply. A true cost check needs token tracking with per-model pricing — and those prices shift each month.

Don’t use it with fewer than 15 test cases. With 4 cases, the ranges are too wide to draw clear lines. The tool works, but the results won’t drive real choices.
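The size effect is easy to quantify: the 95% half-width shrinks with the square root of the case count, so quadrupling the suite only halves the interval. A quick sketch, assuming a judge-score standard deviation of 1.0 as a placeholder:

```python
# Half-width of a 95% CI: 1.96 * stdev / sqrt(n).
# The stdev of 1.0 is an assumed placeholder for typical 1-5 judge scores.
for n in (4, 15, 50):
    half_width = 1.96 * 1.0 / n ** 0.5
    print(f"n={n:>2}: score +/- {half_width:.2f}")
# n= 4: score +/- 0.98
# n=15: score +/- 0.51
# n=50: score +/- 0.28
```

With 4 cases, every score carries a +/- 0.98 band on a 1-5 scale, which swallows most real differences between models.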

Note: This tool measures task-level quality, not general smarts. It answers “Which model is best for MY prompts?” — a very different question from “Which model is smartest overall?”

Summary

You built a complete LLM benchmarking platform from scratch. Here’s what each piece does:

  • TestSuite / TestCase — stores prompts, expected answers, and categories
  • LLMProvider — wraps raw HTTP calls to any LLM API
  • BenchmarkRunner — runs every test case against every provider
  • LLMJudge — uses an LLM to score responses on accuracy, completeness, clarity
  • StatisticalAnalyzer — computes means, standard deviations, and 95% confidence intervals
  • ReportGenerator — ranks providers and checks statistical significance
  • BenchmarkPipeline — chains everything into a single call

Zero outside packages. Grow it by adding new providers (Mistral, Cohere, Ollama), custom score types, per-task breakdowns, or weighted scoring.
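Weighted scoring, the last extension listed, is a small change: instead of averaging all criteria equally, take a weighted mean. A minimal sketch (the weight values and score dict are illustrative, not from the platform):

```python
def weighted_score(scores, weights):
    """Weighted mean of per-criterion scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight

# Hypothetical judge output for one response
scores = {"accuracy": 5, "completeness": 3, "clarity": 4}
# Accuracy matters twice as much as the other criteria
weights = {"accuracy": 2.0, "completeness": 1.0, "clarity": 1.0}

print(weighted_score(scores, weights))  # (5*2 + 3 + 4) / 4 = 4.25
```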

Frequently Asked Questions

Can I add a local model like Ollama as a provider?

Yes. Write a factory function pointing to Ollama’s endpoint — http://localhost:11434/api/generate. The request and response JSON differ from cloud APIs, so you need custom callables. The LLMProvider class handles any HTTP endpoint.

def make_ollama_provider(model="llama3"):
    def fmt(p, m, t):
        return {"model": m, "prompt": p, "stream": False}
    def ext(r):
        return r["response"]
    return LLMProvider(
        f"Ollama ({model})", "http://localhost:11434/api/generate",
        "", model, fmt, ext)

How many test cases give reliable benchmark results?

At least 15-20 per group. Fewer gives wide ranges. For choices that affect real costs, aim for 50+ cases across your true use cases.

What scoring criteria work best for code generation tasks?

Add “correctness” (does the code run?) and “efficiency” as criteria. Modify JUDGE_PROMPT to include them and update StatisticalAnalyzer to loop over the new metric names.
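As a hedged sketch of that change, here is what an extended judge prompt might look like. JUDGE_PROMPT here is a stand-in for the template defined earlier in the article, and the exact criterion names and JSON shape are up to you:

```python
# Hypothetical extended judge prompt; the real JUDGE_PROMPT template
# from the article would be modified the same way.
JUDGE_PROMPT = """You are an expert evaluator. Score the response on each
criterion from 1 (poor) to 5 (excellent).

Criteria: {criteria}

Prompt: {prompt}
Expected answer: {expected}
Response: {response}

Reply with JSON only, e.g. {{"correctness": 4, "efficiency": 3}}."""

CODE_CRITERIA = ["accuracy", "completeness", "clarity", "correctness", "efficiency"]

filled = JUDGE_PROMPT.format(
    criteria=", ".join(CODE_CRITERIA),
    prompt="Write a function that reverses a string.",
    expected="def reverse(s): return s[::-1]",
    response="def reverse(s): return ''.join(reversed(s))",
)
print(filled)
```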

How do I handle rate limits with large test suites?

Add time.sleep(1) between calls in BenchmarkRunner.run. For heavy usage, add exponential backoff to LLMProvider.call — start at 1 second, double on each retry, cap at 30 seconds.
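A minimal sketch of that backoff logic, written as a generic wrapper so it stays independent of LLMProvider's exact signature. The retry parameters mirror the numbers above: start at 1 second, double each retry, cap at 30.

```python
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() with exponential backoff; re-raise after max_retries."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # in practice, catch HTTPError / URLError
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay = min(delay * 2, max_delay)

# Usage: wrap the provider call instead of calling it directly, e.g.
# response = call_with_backoff(lambda: provider.call(prompt))
```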

References

  1. OpenAI API Documentation — Chat Completions.
  2. Anthropic API Documentation — Messages.
  3. Google Gemini API Documentation — generateContent.
  4. Zheng, L. et al. — “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023.
  5. Sebastian Raschka — “Understanding the 4 Main Approaches to LLM Evaluation.”
  6. Python Documentation — statistics module.
  7. Python Documentation — urllib.request.
  8. Evidently AI — “30 LLM Evaluation Benchmarks and How They Work.”
  9. Confident AI — “How to Build an LLM Evaluation Framework from Scratch.”