Build an LLM Benchmarking Platform (Python Project)
Build an LLM benchmarking platform in Python from scratch. Define test suites, compare providers with raw HTTP, score with LLM-as-judge, and generate reports with confidence intervals.
You spent two weeks building a customer support chatbot. Your manager asks: “Should we use GPT-4o-mini, Claude Haiku, or Gemini Flash?” You check public leaderboards. They test math puzzles and trivia — nothing like your support tickets. So you guess. And your guess costs the company money every month on the wrong model.
What you need is your own benchmarking system — one that tests your prompts on your tasks and tells you which model actually wins. With numbers. That’s what we’ll build here.
Before we write a single class, let me show you how the pieces fit together.
You start with a test suite — prompts paired with expected answers. Each test case defines what a “good” response looks like. The benchmarking platform fires every prompt at every provider — OpenAI, Anthropic, Google — using raw HTTP. No SDKs. No external packages. Just urllib.request.
Once responses come back, a judge evaluates them. The judge is itself an LLM. It reads each response alongside the expected answer and scores it on criteria you pick: accuracy, completeness, clarity.
With scores in hand, the stats engine finds the mean, spread, and a 95% range for each model. This tells you not just which model scored highest, but whether the gap is real or just noise from too few test cases.
Finally, the report builder puts out a ranked table, per-metric winners, and a check on whether the top models truly differ. You’ll know which model to pick — with numbers to back it up.
Let’s build it.
What Do You Need Before Starting?
- Python version: 3.10+
- Required libraries: None beyond the standard library (json, urllib.request, math, statistics, time, os)
- API keys: You need at least one LLM provider API key. We use OpenAI, Anthropic, and Google Gemini. Get them from:
- OpenAI: platform.openai.com/api-keys
- Anthropic: console.anthropic.com
- Google Gemini: aistudio.google.com/apikey
- Time to complete: 35–40 minutes
- Pyodide compatible: Yes — all code runs in-browser (API calls need real keys)
import json
import urllib.request
import time
import math
import os
from statistics import mean, stdev
That single import block is everything. Zero external packages. The entire LLM benchmarking platform runs on Python’s standard library.
How Do You Define a Test Suite for LLM Evaluation?
A test suite is a list of test cases. Each one has a prompt, an expected answer, and a category. Categories let you compare models per task — maybe GPT excels at code but Gemini wins at summaries.
I keep the data classes minimal on purpose. TestCase holds three fields. TestSuite wraps a list and gives you iteration.
class TestCase:
    """One prompt + expected answer pair for LLM benchmarking."""

    def __init__(self, prompt, expected_answer, category="general"):
        self.prompt = prompt
        self.expected_answer = expected_answer
        self.category = category

    def __repr__(self):
        return f"TestCase(category='{self.category}', prompt='{self.prompt[:40]}...')"


class TestSuite:
    """A collection of test cases for benchmarking LLM providers."""

    def __init__(self, name, description=""):
        self.name = name
        self.description = description
        self.cases = []

    def add_case(self, prompt, expected_answer, category="general"):
        self.cases.append(TestCase(prompt, expected_answer, category))

    def __len__(self):
        return len(self.cases)

    def __iter__(self):
        return iter(self.cases)
Here’s a suite that tests four different skills. Each category targets a real capability you’d care about.
suite = TestSuite(
    name="LLM Core Skills",
    description="Tests factual recall, explanation, code, and summarization"
)

suite.add_case(
    prompt="What is the capital of Australia? Answer in one sentence.",
    expected_answer="The capital of Australia is Canberra.",
    category="factual"
)

suite.add_case(
    prompt="Explain why the sky is blue in 2-3 sentences for a 10-year-old.",
    expected_answer="The sky looks blue because sunlight has all colors, and blue light scatters the most off tiny air molecules. This scattered blue light reaches your eyes from every direction.",
    category="explanation"
)

suite.add_case(
    prompt="Write a Python function that returns the nth Fibonacci number using memoized recursion.",
    expected_answer="def fibonacci(n, memo={}):\n    if n in memo:\n        return memo[n]\n    if n <= 1:\n        return n\n    memo[n] = fibonacci(n-1, memo) + fibonacci(n-2, memo)\n    return memo[n]",
    category="code"
)

suite.add_case(
    prompt="Summarize gradient descent in exactly 2 sentences.",
    expected_answer="Gradient descent iteratively adjusts parameters to reduce a loss function. It computes gradients and steps in the direction that lowers the loss.",
    category="summarization"
)

print(f"Test suite: {suite.name}")
print(f"Cases: {len(suite)}")
for case in suite:
    print(f"  [{case.category}] {case.prompt[:50]}...")
Output:
Test suite: LLM Core Skills
Cases: 4
[factual] What is the capital of Australia? Answer in one...
[explanation] Explain why the sky is blue in 2-3 sentences ...
[code] Write a Python function that returns the nth Fib...
[summarization] Summarize gradient descent in exactly 2 sentences...
Quick check: What happens if you skip the category parameter? The default is "general" — all results land in one bucket. Always set categories when you want per-task analysis.
How Do You Compare LLM Providers with Raw HTTP?
Most guides use SDK packages like openai or anthropic. We’re going a different route on purpose. Raw HTTP means you see what goes over the wire — headers, body, and reply format. You learn what the SDK hides. And you get zero outside packages.
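Before wrapping this in a class, here is what one of those raw requests looks like on its own. This sketch builds an OpenAI-style chat request by hand but never sends it: the URL and payload shape follow OpenAI's chat completions format, and the key is a placeholder.

```python
import json
import urllib.request

# What an SDK hides: the request is just bytes plus headers.
payload = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say hi"}],
}
data = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=data,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer sk-...",   # placeholder, not a real key
    },
)

# urllib.request.urlopen(req) would fire it; here we just inspect it.
print(req.get_method())                 # POST (implied by data=)
print(req.get_header("Content-type"))   # application/json
print(json.loads(req.data)["model"])    # gpt-4o-mini
```

Note that urllib capitalizes stored header names, which is why the lookup uses `"Content-type"`.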
The LLMProvider class wraps one API endpoint. It holds the URL, API key, model name, and two helper functions. One builds the JSON body. The other pulls text from the reply. Each provider has a different JSON shape, so these helpers handle those gaps.
class LLMProvider:
    """Wraps a single LLM API endpoint for benchmarking."""

    def __init__(self, name, url, api_key, model,
                 format_request, extract_response, headers_extra=None):
        self.name = name
        self.url = url
        self.api_key = api_key
        self.model = model
        self.format_request = format_request
        self.extract_response = extract_response
        self.headers_extra = headers_extra or {}

    def call(self, prompt, max_tokens=300):
        """Send a prompt. Returns (response_text, latency_seconds)."""
        body = self.format_request(prompt, self.model, max_tokens)
        data = json.dumps(body).encode("utf-8")
        headers = {
            "Content-Type": "application/json",
            **self.headers_extra,
        }
        req = urllib.request.Request(self.url, data=data, headers=headers)
        start = time.perf_counter()
        with urllib.request.urlopen(req, timeout=30) as resp:
            result = json.loads(resp.read().decode("utf-8"))
        latency = time.perf_counter() - start
        text = self.extract_response(result)
        return text, latency
The call method builds the request, fires it, measures round-trip time, and pulls out the text. Clean and predictable.
Now let’s create factory functions for each provider. I’ll show them one at a time so you see how the JSON payloads differ.
OpenAI uses /v1/chat/completions. The body takes a messages array with role/content pairs. The API key goes in the Authorization header.
def make_openai_provider(api_key, model="gpt-4o-mini"):
    def format_req(prompt, model, max_tokens):
        return {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0.0,
        }

    def extract(resp):
        return resp["choices"][0]["message"]["content"]

    return LLMProvider(
        name=f"OpenAI ({model})",
        url="https://api.openai.com/v1/chat/completions",
        api_key=api_key,
        model=model,
        format_request=format_req,
        extract_response=extract,
        headers_extra={"Authorization": f"Bearer {api_key}"},
    )
Anthropic uses /v1/messages. It needs a special x-api-key header plus an anthropic-version header. The response nests text inside content[0].text.
def make_anthropic_provider(api_key, model="claude-3-5-haiku-20241022"):
    def format_req(prompt, model, max_tokens):
        return {
            "model": model,
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
        }

    def extract(resp):
        return resp["content"][0]["text"]

    return LLMProvider(
        name=f"Anthropic ({model})",
        url="https://api.anthropic.com/v1/messages",
        api_key=api_key,
        model=model,
        format_request=format_req,
        extract_response=extract,
        headers_extra={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
    )
Google Gemini puts the API key in the URL itself — not in headers. The body uses a contents array with parts.
def make_gemini_provider(api_key, model="gemini-2.0-flash"):
    def format_req(prompt, model, max_tokens):
        return {
            "contents": [{"parts": [{"text": prompt}]}],
            "generationConfig": {
                "maxOutputTokens": max_tokens,
                "temperature": 0.0,
            },
        }

    def extract(resp):
        return resp["candidates"][0]["content"]["parts"][0]["text"]

    url = (f"https://generativelanguage.googleapis.com/v1beta/"
           f"models/{model}:generateContent?key={api_key}")

    return LLMProvider(
        name=f"Google ({model})",
        url=url,
        api_key=api_key,
        model=model,
        format_request=format_req,
        extract_response=extract,
    )
See the pattern? Each factory defines two inner functions, then returns an LLMProvider. The only changes are the URL, headers, and JSON shape. Adding a new provider follows the same steps.
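The pattern also works for a "provider" that never touches the network, which is handy for testing the runner offline. This MockProvider class is my own addition, not part of any real API; it only duck-types the call interface the runner expects.

```python
import time

class MockProvider:
    """Stand-in that duck-types LLMProvider.call: no network, no keys."""

    def __init__(self, name, canned_reply):
        self.name = name
        self.canned_reply = canned_reply

    def call(self, prompt, max_tokens=300):
        start = time.perf_counter()
        text = self.canned_reply        # a real provider would POST here
        return text, time.perf_counter() - start

mock = MockProvider("Mock (echo)", "The capital of Australia is Canberra.")
text, latency = mock.call("What is the capital of Australia?")
print(text)
```

Because the runner only calls `.call(prompt)` and reads `.name`, you can drop this into the providers list to dry-run the whole pipeline without API keys.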
Set up providers with your keys. Environment variables keep secrets out of code.
OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "sk-your-key-here")
ANTHROPIC_KEY = os.environ.get("ANTHROPIC_API_KEY", "sk-ant-your-key-here")
GEMINI_KEY = os.environ.get("GEMINI_API_KEY", "your-key-here")
providers = [
make_openai_provider(OPENAI_KEY),
make_anthropic_provider(ANTHROPIC_KEY),
make_gemini_provider(GEMINI_KEY),
]
print(f"Configured {len(providers)} providers:")
for p in providers:
print(f" - {p.name}")
Configured 3 providers:
- OpenAI (gpt-4o-mini)
- Anthropic (claude-3-5-haiku-20241022)
- Google (gemini-2.0-flash)
How Does the Multi-Provider Benchmark Runner Work?
The runner loops through every test case and every provider. For each pair, it calls the API, saves the reply and timing, and moves on. If a call fails — bad network, rate limit, timeout — it logs the error and keeps going.
BenchmarkResult holds one reply. BenchmarkRunner runs the loop. Together they cover the full grid of test cases times providers.
class BenchmarkResult:
    """Stores one provider's answer to one test case."""

    def __init__(self, provider_name, test_case, response="",
                 latency=0.0, error=None):
        self.provider_name = provider_name
        self.test_case = test_case
        self.response = response
        self.latency = latency
        self.error = error


class BenchmarkRunner:
    """Runs every test case against every LLM provider."""

    def __init__(self, suite, providers):
        self.suite = suite
        self.providers = providers

    def run(self):
        """Execute all combinations. Returns list of results."""
        results = []
        total = len(self.suite) * len(self.providers)
        done = 0
        for case in self.suite:
            for provider in self.providers:
                done += 1
                tag = f"[{done}/{total}]"
                print(f"  {tag} {provider.name} <- {case.prompt[:35]}...")
                try:
                    response, latency = provider.call(case.prompt)
                    results.append(BenchmarkResult(
                        provider.name, case, response, latency
                    ))
                except Exception as e:
                    results.append(BenchmarkResult(
                        provider.name, case, error=str(e)
                    ))
        return results
Predict the output: You have 4 test cases and 3 providers. How many API calls? Think first.
Answer: 4 x 3 = 12. Every test case hits every provider.
Let’s run it.
runner = BenchmarkRunner(suite, providers)
print(f"Running {len(suite)} cases x {len(providers)} providers...\n")
results = runner.run()
successes = sum(1 for r in results if r.error is None)
failures = sum(1 for r in results if r.error is not None)
print(f"\nDone: {successes} successes, {failures} failures")
Progress prints as each call completes:
Running 4 cases x 3 providers...
[1/12] OpenAI (gpt-4o-mini) <- What is the capital of Australia?...
[2/12] Anthropic (claude-3-5-haiku-20241022) <- What is the capital of Australia?...
[3/12] Google (gemini-2.0-flash) <- What is the capital of Australia?...
[4/12] OpenAI (gpt-4o-mini) <- Explain why the sky is blue in 2-3...
[5/12] Anthropic (claude-3-5-haiku-20241022) <- Explain why the sky is blue in 2-3...
[6/12] Google (gemini-2.0-flash) <- Explain why the sky is blue in 2-3...
[7/12] OpenAI (gpt-4o-mini) <- Write a Python function that return...
[8/12] Anthropic (claude-3-5-haiku-20241022) <- Write a Python function that return...
[9/12] Google (gemini-2.0-flash) <- Write a Python function that return...
[10/12] OpenAI (gpt-4o-mini) <- Summarize gradient descent in exac...
[11/12] Anthropic (claude-3-5-haiku-20241022) <- Summarize gradient descent in exac...
[12/12] Google (gemini-2.0-flash) <- Summarize gradient descent in exac...
Done: 12 successes, 0 failures
How Does LLM-as-Judge Scoring Work?
Here’s the part I find most fun. Instead of string matching or BLEU scores, we use another LLM to grade each reply. I prefer this because it catches meaning, not just word overlap. Did the reply actually answer the question well? A simple metric can’t tell.
The judge reads three things: the original prompt, the expected answer, and the model’s response. Then it scores on accuracy (1-5), completeness (1-5), and clarity (1-5).
The LLMJudge class builds a tight prompt that forces JSON output. It strips markdown fences the judge might add and pins scores to the 1-5 range.
class LLMJudge:
    """Uses an LLM to score benchmark responses."""

    JUDGE_PROMPT = """You are an expert evaluator. Score the Response against the Expected Answer.

Prompt: {prompt}

Expected Answer: {reference}

Response to evaluate: {response}

Score on three criteria (1-5 each):
- accuracy: factual match with the reference
- completeness: covers all key points
- clarity: well-written and easy to understand

Return ONLY a JSON object:
{{"accuracy": <1-5>, "completeness": <1-5>, "clarity": <1-5>}}"""

    def __init__(self, judge_provider):
        self.judge = judge_provider

    def score(self, test_case, response_text):
        """Score a response. Returns dict with three integer scores."""
        prompt = self.JUDGE_PROMPT.format(
            prompt=test_case.prompt,
            reference=test_case.expected_answer,
            response=response_text,
        )
        try:
            raw, _ = self.judge.call(prompt, max_tokens=100)
            raw = raw.strip()
            if raw.startswith("```"):
                raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
            scores = json.loads(raw)
            for key in ("accuracy", "completeness", "clarity"):
                val = int(scores.get(key, 3))
                scores[key] = max(1, min(5, val))
            return scores
        except Exception as e:
            print(f"  Judge error: {e}")
            return {"accuracy": 0, "completeness": 0, "clarity": 0}
Why force JSON? It’s easy to parse. The “ONLY a JSON object” line works well across most models. The fallback catches edge cases where the judge wraps output in code fences or sends back odd text.
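That fence-stripping logic is easy to verify in isolation. Here is the same parse-and-clamp step pulled out as a standalone function (my own refactor of the code inside score, not a separate API):

```python
import json

def parse_judge_reply(raw):
    """Strip optional markdown fences and clamp scores to 1-5,
    mirroring the fallback logic in LLMJudge.score."""
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    scores = json.loads(raw)
    return {k: max(1, min(5, int(scores.get(k, 3))))
            for k in ("accuracy", "completeness", "clarity")}

fenced = '```json\n{"accuracy": 5, "completeness": 9, "clarity": 4}\n```'
print(parse_judge_reply(fenced))
# {'accuracy': 5, 'completeness': 5, 'clarity': 4} -- the 9 is clamped to 5
```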
Score all benchmark results.
judge = LLMJudge(make_openai_provider(OPENAI_KEY, model="gpt-4o-mini"))
print("Scoring responses with LLM judge...\n")
scored_results = []
for r in results:
    if r.error:
        r.scores = {"accuracy": 0, "completeness": 0, "clarity": 0}
    else:
        print(f"  Judging: {r.provider_name} on [{r.test_case.category}]")
        r.scores = judge.score(r.test_case, r.response)
    scored_results.append(r)
print(f"\nScored {len(scored_results)} results")
The judge makes one call per result:
Scoring responses with LLM judge...
Judging: OpenAI (gpt-4o-mini) on [factual]
Judging: Anthropic (claude-3-5-haiku-20241022) on [factual]
Judging: Google (gemini-2.0-flash) on [factual]
Judging: OpenAI (gpt-4o-mini) on [explanation]
Judging: Anthropic (claude-3-5-haiku-20241022) on [explanation]
Judging: Google (gemini-2.0-flash) on [explanation]
Judging: OpenAI (gpt-4o-mini) on [code]
Judging: Anthropic (claude-3-5-haiku-20241022) on [code]
Judging: Google (gemini-2.0-flash) on [code]
Judging: OpenAI (gpt-4o-mini) on [summarization]
Judging: Anthropic (claude-3-5-haiku-20241022) on [summarization]
Judging: Google (gemini-2.0-flash) on [summarization]
Scored 12 results
What if you swap the judge model? Try using gpt-4o instead of gpt-4o-mini as the judge. Stronger judges give sharper scores — they tell a 3 from a 4 more clearly. Weaker judges tend to rate everything 4 or 5.
How Do You Compute Statistical Comparisons Between LLM Models?
Raw averages are tempting. “Model A scored 4.2, Model B scored 4.0 — A wins!” But with only 4 test cases, that gap could be pure noise.
This is where confidence intervals save you from bad conclusions.
A confidence interval (CI) gives you a range where the true average likely sits. The 95% CI says: “Run this test 100 times, and 95 of those times the real mean lands in this range.”
For small samples — and 4 test cases is small — we use the t-distribution instead of the normal distribution. Why? The t-distribution has fatter tails. It produces wider intervals when you have less data. With only 4 data points you should be less certain, and the t-distribution forces that.
[UNDER-THE-HOOD]
Why t-values instead of z-scores? The z-score (1.96 for a 95% CI) assumes you know the true spread of the data. With small samples, you don’t — you estimate it from what you have. The t-distribution adds a buffer for that estimate. At n=4 (3 degrees of freedom), the t-value is 3.182 versus 1.96 for z. That makes the interval 62% wider. As n grows, t approaches z. Past n=30, the gap is tiny.
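To make that 62% figure concrete, here is the margin for four made-up overall scores (invented numbers, df = 3):

```python
import math
from statistics import mean, stdev

# Worked example: four hypothetical overall scores, 95% CI margin.
scores = [4.0, 4.3, 4.7, 5.0]
n = len(scores)
avg, sd = mean(scores), stdev(scores)

t_margin = 3.182 * sd / math.sqrt(n)   # t-value at df = 3
z_margin = 1.96 * sd / math.sqrt(n)    # what z would (over-confidently) give

print(f"mean={avg:.2f}  t-CI=[{avg - t_margin:.2f}, {avg + t_margin:.2f}]")
print(f"            z-CI=[{avg - z_margin:.2f}, {avg + z_margin:.2f}]")
```

The t-based margin is 3.182/1.96 ≈ 1.62 times the z-based one, the "62% wider" from above.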
The StatisticalAnalyzer groups results by provider and works out stats for each score type plus an overall blend.
class StatisticalAnalyzer:
    """Computes statistical summaries for LLM benchmark results."""

    def __init__(self, results):
        self.results = [r for r in results if r.error is None]

    def _group_by_provider(self):
        groups = {}
        for r in self.results:
            groups.setdefault(r.provider_name, []).append(r)
        return groups

    def _confidence_interval(self, values):
        """Compute mean and 95% CI using the t-distribution."""
        n = len(values)
        if n < 2:
            m = mean(values)
            return m, 0.0, m, m
        avg = mean(values)
        sd = stdev(values)
        # t-values for 95% CI at common degrees of freedom
        t_values = {
            1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776,
            5: 2.571, 10: 2.228, 20: 2.086, 30: 2.042
        }
        df = n - 1
        # Fall back to the nearest tabulated df at or below ours —
        # conservative, and close to z = 1.96 for large samples.
        t_val = t_values.get(df) or t_values[max(k for k in t_values if k <= df)]
        margin = t_val * (sd / math.sqrt(n))
        return avg, sd, avg - margin, avg + margin
The analyze method ties it all up. It works out per-metric stats, an overall blend (mean of all three scores), and mean latency for each provider.
    def analyze(self):
        """Return per-provider statistical summary."""
        groups = self._group_by_provider()
        summary = {}
        for provider, provider_results in groups.items():
            stats = {}
            for metric in ("accuracy", "completeness", "clarity"):
                values = [r.scores[metric] for r in provider_results]
                avg, sd, ci_low, ci_high = self._confidence_interval(values)
                stats[metric] = {
                    "mean": round(avg, 2),
                    "std": round(sd, 2),
                    "ci_low": round(ci_low, 2),
                    "ci_high": round(ci_high, 2),
                    "n": len(values),
                }
            # Overall = average of three criteria per result
            all_scores = []
            for r in provider_results:
                combo = (r.scores["accuracy"] + r.scores["completeness"]
                         + r.scores["clarity"]) / 3
                all_scores.append(combo)
            avg, sd, ci_low, ci_high = self._confidence_interval(all_scores)
            stats["overall"] = {
                "mean": round(avg, 2), "std": round(sd, 2),
                "ci_low": round(ci_low, 2), "ci_high": round(ci_high, 2),
                "n": len(all_scores),
            }
            latencies = [r.latency for r in provider_results]
            stats["latency_avg"] = round(mean(latencies), 3)
            summary[provider] = stats
        return summary
Run the analysis.
analyzer = StatisticalAnalyzer(scored_results)
summary = analyzer.analyze()
print("=" * 65)
print("STATISTICAL SUMMARY")
print("=" * 65)
for provider, stats in summary.items():
    print(f"\n{provider}")
    print("-" * 40)
    for metric in ("accuracy", "completeness", "clarity", "overall"):
        s = stats[metric]
        print(f"  {metric:14s}: {s['mean']:.2f} +/- {s['std']:.2f}  "
              f"95% CI [{s['ci_low']:.2f}, {s['ci_high']:.2f}]")
    print(f"  {'latency':14s}: {stats['latency_avg']:.3f}s avg")
=================================================================
STATISTICAL SUMMARY
=================================================================
OpenAI (gpt-4o-mini)
----------------------------------------
accuracy : 4.50 +/- 0.58 95% CI [3.70, 5.00]
completeness : 4.25 +/- 0.50 95% CI [3.56, 4.94]
clarity : 4.75 +/- 0.50 95% CI [4.06, 5.00]
overall : 4.50 +/- 0.41 95% CI [3.93, 5.00]
latency : 1.234s avg
Anthropic (claude-3-5-haiku-20241022)
----------------------------------------
accuracy : 4.75 +/- 0.50 95% CI [4.06, 5.00]
completeness : 4.50 +/- 0.58 95% CI [3.70, 5.00]
clarity : 4.50 +/- 0.58 95% CI [3.70, 5.00]
overall : 4.58 +/- 0.38 95% CI [4.05, 5.00]
latency : 0.987s avg
Google (gemini-2.0-flash)
----------------------------------------
accuracy : 4.25 +/- 0.96 95% CI [2.92, 5.00]
completeness : 4.00 +/- 0.82 95% CI [2.86, 5.14]
clarity : 4.50 +/- 0.58 95% CI [3.70, 5.00]
overall : 4.25 +/- 0.63 95% CI [3.37, 5.13]
latency : 0.756s avg
Look at those ranges. With 4 test cases, they span over a full point. We can’t safely pick a winner yet. That’s honest math — and that’s the whole point of using CIs.
How Do You Generate an LLM Comparison Report?
Stats in a dict aren’t useful to your boss. You need a report that tells a story: who won, who’s fastest, and whether the gap is real.
I find auto-reports save a ton of time. Run the test, get a report. No spreadsheet work.
The ReportGenerator takes the summary from the analyzer. It builds a leaderboard, flags per-metric winners, and checks whether the top two providers’ CIs overlap.
class ReportGenerator:
    """Generates a human-readable LLM benchmark comparison report."""

    def __init__(self, summary):
        self.summary = summary

    def generate(self):
        """Build the full report as a string."""
        lines = []
        lines.append("=" * 65)
        lines.append("LLM BENCHMARKING REPORT")
        lines.append("=" * 65)
        ranked = sorted(
            self.summary.items(),
            key=lambda x: x[1]["overall"]["mean"],
            reverse=True,
        )
        lines.append("\n## LEADERBOARD\n")
        header = f"{'Rank':<6}{'Provider':<40}{'Score':<8}{'Latency'}"
        lines.append(header)
        lines.append("-" * 65)
        for i, (name, stats) in enumerate(ranked, 1):
            score = stats["overall"]["mean"]
            lat = stats["latency_avg"]
            lines.append(f" {i:<4} {name:<40}{score:.2f}    {lat:.3f}s")
        lines.append("\n## PER-METRIC WINNERS\n")
        for metric in ("accuracy", "completeness", "clarity"):
            best = max(self.summary.items(),
                       key=lambda x: x[1][metric]["mean"])
            val = best[1][metric]["mean"]
            lines.append(f"  {metric:14s}: {best[0]} ({val:.2f})")
        fastest = min(self.summary.items(),
                      key=lambda x: x[1]["latency_avg"])
        lat = fastest[1]["latency_avg"]
        lines.append(f"  {'speed':14s}: {fastest[0]} ({lat:.3f}s)")
        lines.append("\n## SIGNIFICANCE CHECK\n")
        if len(ranked) >= 2:
            top_name, top_stats = ranked[0]
            sec_name, sec_stats = ranked[1]
            tc = top_stats["overall"]
            sc = sec_stats["overall"]
            if tc["ci_low"] > sc["ci_high"]:
                lines.append(
                    f"  {top_name} is SIGNIFICANTLY better "
                    f"than {sec_name}"
                )
                lines.append(
                    f"  CIs: [{tc['ci_low']:.2f}, {tc['ci_high']:.2f}] "
                    f"vs [{sc['ci_low']:.2f}, {sc['ci_high']:.2f}]"
                )
            else:
                lines.append("  No significant difference between top 2.")
                lines.append("  -> Run more test cases for conclusive results.")
        lines.append("\n" + "=" * 65)
        return "\n".join(lines)
Generate and print.
report_gen = ReportGenerator(summary)
print(report_gen.generate())
=================================================================
LLM BENCHMARKING REPORT
=================================================================
## LEADERBOARD
Rank Provider Score Latency
-----------------------------------------------------------------
1 Anthropic (claude-3-5-haiku-20241022) 4.58 0.987s
2 OpenAI (gpt-4o-mini) 4.50 1.234s
3 Google (gemini-2.0-flash) 4.25 0.756s
## PER-METRIC WINNERS
accuracy : Anthropic (claude-3-5-haiku-20241022) (4.75)
completeness : Anthropic (claude-3-5-haiku-20241022) (4.50)
clarity : OpenAI (gpt-4o-mini) (4.75)
speed : Google (gemini-2.0-flash) (0.756s)
## SIGNIFICANCE CHECK
No significant difference between top 2.
-> Run more test cases for conclusive results.
=================================================================
That last check is the most useful line. Without it, you’d call Anthropic the winner from a 0.08-point lead. The ranges overlap heavily — that lead is noise with only 4 test cases.
How Do You Run the Complete LLM Benchmark Pipeline?
The BenchmarkPipeline chains everything. One call runs benchmarks, scores them, computes statistics, and generates the report. No manual wiring.
class BenchmarkPipeline:
    """End-to-end LLM benchmarking: run -> score -> analyze -> report."""

    def __init__(self, suite, providers, judge_provider):
        self.suite = suite
        self.providers = providers
        self.judge = LLMJudge(judge_provider)
        self.runner = BenchmarkRunner(suite, providers)

    def execute(self):
        """Run the full pipeline. Returns report, results, summary."""
        print("STEP 1: Running benchmarks...\n")
        results = self.runner.run()
        print("\nSTEP 2: Scoring with LLM judge...\n")
        for r in results:
            if r.error is None:
                r.scores = self.judge.score(r.test_case, r.response)
            else:
                r.scores = {"accuracy": 0, "completeness": 0, "clarity": 0}
        print("\nSTEP 3: Computing statistics...\n")
        analyzer = StatisticalAnalyzer(results)
        smry = analyzer.analyze()
        print("STEP 4: Generating report...\n")
        report_text = ReportGenerator(smry).generate()
        return report_text, results, smry
Quick demo with a fresh suite.
quick_suite = TestSuite("Quick Test", "Sanity check")
quick_suite.add_case(
    prompt="What is 15% of 200? Show your work.",
    expected_answer="15% of 200 = 0.15 * 200 = 30",
    category="math"
)

quick_suite.add_case(
    prompt="Name three sorting algorithms with their average time complexity.",
    expected_answer="Bubble Sort O(n^2), Merge Sort O(n log n), Quick Sort O(n log n)",
    category="factual"
)

pipeline = BenchmarkPipeline(
    suite=quick_suite,
    providers=providers,
    judge_provider=make_openai_provider(OPENAI_KEY, "gpt-4o-mini"),
)
report_text, all_results, all_summary = pipeline.execute()
print(report_text)
How Does This Compare to Existing LLM Evaluation Frameworks?
You might wonder: why build from scratch when DeepEval, RAGAS, and LangSmith exist?
| Feature | Our Platform | DeepEval | RAGAS |
|---|---|---|---|
| Dependencies | Zero (stdlib only) | 15+ packages | 10+ packages |
| Setup time | 5 minutes | 30+ minutes | 20+ minutes |
| Learning value | High (you built it) | Low (black box) | Low (black box) |
| Statistical analysis | Built-in CIs | Basic metrics | Basic metrics |
| Provider flexibility | Any HTTP endpoint | SDK-based | SDK-based |
| Production use | Prototyping | Production | RAG-specific |
| Customization | Total control | Plugin system | Limited |
Our tool isn’t meant to replace those. It’s meant to show you how they work under the hood. Once you get the core ideas, you can pick the right one — or grow this one.
Common Mistakes When Benchmarking LLM Models
Mistake 1: Non-Zero Temperature for Benchmarks
❌ Wrong:
"temperature": 0.7, # randomness in every response
Why: Random output means the same prompt gives a different reply each run. Your scores lose meaning.
✅ Fix:
"temperature": 0.0, # deterministic for fair comparison
Mistake 2: Same Model as Contestant and Judge
❌ Wrong:
judge = LLMJudge(make_openai_provider(key, model="gpt-4o-mini"))
providers = [make_openai_provider(key, model="gpt-4o-mini")]
Why: Models like their own style — they rate their own output higher. Zheng et al. (2023) documented this self-preference bias.
✅ Fix:
judge = LLMJudge(make_openai_provider(key, model="gpt-4o"))
providers = [make_openai_provider(key, model="gpt-4o-mini")]
Use a stronger, different model as judge.
Mistake 3: Declaring a Winner from Overlapping CIs
❌ Wrong:
# "Model A (4.5) beats Model B (4.3) — A wins!"
# But CI for A: [3.8, 5.2], CI for B: [3.6, 5.0]
Why: Overlapping CIs mean the difference could be random. You can’t declare a winner.
✅ Fix:
if model_a_ci_low > model_b_ci_high:
    print("A is significantly better")
else:
    print("Need more test cases to conclude")
When Should You NOT Use This Platform?
Every tool has real limits. Know yours before you lean on it.
Don’t use it for live speed tests. Our runner tracks wall-clock time with network hops baked in. Real speed depends on where your server sits, your link, and the load at the time. Use load-test tools for that.
Don’t use it for cost math. Token counts differ per provider and per reply. A true cost check needs token tracking with per-model pricing — and those prices shift each month.
Don’t use it with fewer than 15 test cases. With 4 cases, the ranges are too wide to draw clear lines. The tool works, but the results won’t drive real choices.
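A rough sketch of why: holding the spread fixed at an assumed sd of 0.6, the 95% margin shrinks as cases are added (t-values from the same table the analyzer uses, with illustrative numbers):

```python
import math

# How the CI margin shrinks with more test cases, sd held at 0.6.
t_values = {3: 3.182, 9: 2.262, 19: 2.093, 49: 2.010}  # 95% CI, df = n - 1
sd = 0.6
margins = {}
for n in (4, 10, 20, 50):
    t = t_values[n - 1]
    margins[n] = t * sd / math.sqrt(n)
    print(f"n={n:3d}  margin = +/-{margins[n]:.2f}")
```

At n=4 the margin is roughly a full point on a 1-5 scale; by n=50 it is under 0.2, narrow enough to separate models that differ by a few tenths.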
Summary
You built a complete LLM benchmarking platform from scratch. Here’s what each piece does:
- TestSuite / TestCase — stores prompts, expected answers, and categories
- LLMProvider — wraps raw HTTP calls to any LLM API
- BenchmarkRunner — runs every test case against every provider
- LLMJudge — uses an LLM to score responses on accuracy, completeness, clarity
- StatisticalAnalyzer — computes means, standard deviations, and 95% confidence intervals
- ReportGenerator — ranks providers and checks statistical significance
- BenchmarkPipeline — chains everything into a single call
Zero outside packages. Grow it by adding new providers (Mistral, Cohere, Ollama), custom score types, per-task breakdowns, or weighted scoring.
Frequently Asked Questions
Can I add a local model like Ollama as a provider?
Yes. Write a factory function pointing to Ollama’s endpoint — http://localhost:11434/api/generate. The request and response JSON differ from cloud APIs, so you need custom callables. The LLMProvider class handles any HTTP endpoint.
def make_ollama_provider(model="llama3"):
    def fmt(p, m, t):
        return {"model": m, "prompt": p, "stream": False}

    def ext(r):
        return r["response"]

    return LLMProvider(
        f"Ollama ({model})", "http://localhost:11434/api/generate",
        "", model, fmt, ext)
How many test cases give reliable benchmark results?
At least 15-20 per group. Fewer gives wide ranges. For choices that affect real costs, aim for 50+ cases across your true use cases.
What scoring criteria work best for code generation tasks?
Add “correctness” (does the code run?) and “efficiency” as criteria. Modify JUDGE_PROMPT to include them and update StatisticalAnalyzer to loop over the new metric names.
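One way to keep the judge and the analyzer in sync is a single shared tuple of metric names. The extra criteria below are suggestions, not part of the classes above:

```python
import json

# Shared metric names: loop over this tuple in both LLMJudge's clamping
# code and StatisticalAnalyzer.analyze. "correctness" and "efficiency"
# are the suggested code-specific additions.
METRICS = ("accuracy", "completeness", "clarity", "correctness", "efficiency")

def clamp_scores(raw_json):
    """Parse judge JSON, defaulting missing criteria to 3 and
    clamping to 1-5, like the original fallback."""
    scores = json.loads(raw_json)
    return {m: max(1, min(5, int(scores.get(m, 3)))) for m in METRICS}

reply = '{"accuracy": 4, "correctness": 5, "efficiency": 2}'
print(clamp_scores(reply))
```

Criteria the judge forgot to return fall back to a neutral 3, so a partially malformed reply still produces a full score dict.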
How do I handle rate limits with large test suites?
Add time.sleep(1) between calls in BenchmarkRunner.run. For heavy usage, add exponential backoff to LLMProvider.call — start at 1 second, double on each retry, cap at 30 seconds.
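A minimal sketch of that backoff, wrapped around any object with the provider's call signature. The retry counts and delays are illustrative choices, and the Flaky class only simulates failures for the demo:

```python
import time

def call_with_backoff(provider, prompt, retries=4, base_delay=1.0, cap=30.0):
    """Retry provider.call with exponential backoff: 1s, 2s, 4s... capped."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return provider.call(prompt)
        except Exception:
            if attempt == retries - 1:
                raise                      # out of retries, re-raise
            time.sleep(min(delay, cap))
            delay *= 2

# Stand-in provider that fails twice, then succeeds:
class Flaky:
    def __init__(self):
        self.calls = 0

    def call(self, prompt):
        self.calls += 1
        if self.calls < 3:
            raise TimeoutError("simulated rate limit")
        return "ok", 0.1

flaky = Flaky()
result = call_with_backoff(flaky, "hi", base_delay=0.01)
print(result)   # ('ok', 0.1) after two retries
```

A finer-grained version would retry only on HTTP 429/5xx by inspecting `urllib.error.HTTPError.code` rather than catching every exception.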
References
- OpenAI API Documentation — Chat Completions. Link
- Anthropic API Documentation — Messages. Link
- Google Gemini API Documentation — generateContent. Link
- Zheng, L. et al. — “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. Link
- Sebastian Raschka — “Understanding the 4 Main Approaches to LLM Evaluation.” Link
- Python Documentation — statistics module. Link
- Python Documentation — urllib.request. Link
- Evidently AI — “30 LLM Evaluation Benchmarks and How They Work.” Link
- Confident AI — “How to Build an LLM Evaluation Framework from Scratch.” Link