How to Evaluate LLMs — Metrics, Benchmarks & Python Code
You just deployed an LLM-powered feature. Users are complaining — some answers are wrong, some are great, and you can’t tell which is which. Why? Because you never set up proper evaluation.
I’ve seen this pattern dozens of times. Teams spend weeks on prompt engineering and fine-tuning, then ship without measuring whether any of it actually worked.
Evaluation isn’t optional. It’s the difference between “we think the model is good” and “we know the model scores 78% on our test set, up from 71% last week.”
In this guide you’ll learn:
- What the major LLM benchmarks (MMLU, HumanEval, MT-Bench) actually measure
- How to compute perplexity, BLEU, ROUGE, and BERTScore from scratch in Python
- How to build an LLM-as-judge pipeline for open-ended evaluation
- When to use human evaluation and how to measure annotator agreement
- How to evaluate RAG systems and avoid common evaluation mistakes
- Which evaluation frameworks (DeepEval, RAGAS) to use in production
Before we write any code, here’s how the evaluation landscape fits together.
There are three layers to LLM evaluation. The first is benchmarks — standardized tests like MMLU and HumanEval that let you compare models against each other. They answer: “How does this model rank overall?”
The second layer is metrics — quantitative scores you compute on your own data. BLEU, ROUGE, perplexity, pass@k. They answer: “How well does this model perform on my specific task?”
The third layer is judgment — human evaluation and LLM-as-judge. These catch what automated metrics miss: tone, helpfulness, safety. They answer: “Is this output actually good?”
Most teams need all three. Benchmarks for model selection. Metrics for tracking improvement. Judgment for catching edge cases.
We’ll build each layer from scratch. By the end, you’ll have a working evaluation pipeline you can adapt to any LLM project.
Prerequisites
- Python version: 3.9+
- Required libraries: numpy (1.24+); collections, re (stdlib)
- Optional libraries: openai (1.0+), rouge-score (0.1.2+), nltk (3.8+)
- Install:
pip install numpy openai rouge-score nltk
- Time to complete: ~45 minutes
- Last reviewed: March 2026
What Is LLM Evaluation?
LLM evaluation is the process of measuring how well a language model performs on tasks you care about. It moves beyond “does the output look reasonable” to “can we quantify quality and track it over time.”
Think of it like grading student essays. You could eyeball a few and say “looks fine.” Or you could define rubrics, score consistently, and track improvement. LLM evaluation is the rubric approach — and I prefer it because it turns gut feelings into data.
There are three core questions every evaluation answers:
- Correctness. Is the output factually accurate?
- Relevance. Does it actually address what was asked?
- Quality. Is it well-written, safe, and appropriate?
Different evaluation methods tackle different questions. Benchmarks handle correctness at scale. Metrics quantify specific quality dimensions. Human judgment catches everything else.
import numpy as np
from collections import Counter
import re
print("LLM Evaluation toolkit loaded")
print("We'll build metrics from scratch -- no black boxes")
LLM Evaluation toolkit loaded
We'll build metrics from scratch -- no black boxes
The evaluation landscape has exploded since 2023. Dozens of benchmarks, metrics, and frameworks compete for attention. I find the trick is knowing which ones matter for your use case — and ignoring the rest.
Key LLM Benchmarks Explained
Benchmarks are standardized tests. They present the model with predefined inputs, check the outputs against known answers, and produce a score.
You’ve probably seen benchmark numbers thrown around in model announcements. Here’s what the most important ones actually measure.
MMLU (Massive Multitask Language Understanding)
MMLU tests knowledge across 57 subjects — from abstract algebra to world religions. Each question is multiple-choice with four options.
# What an MMLU question looks like
mmlu_example = {
"question": "What is the capital of Australia?",
"choices": ["Sydney", "Melbourne", "Canberra", "Brisbane"],
"correct_answer": "C", # Canberra
"subject": "geography",
}
print(f"Subject: {mmlu_example['subject']}")
print(f"Q: {mmlu_example['question']}")
for i, choice in enumerate(mmlu_example['choices']):
label = chr(65 + i)
marker = " <-- correct" if label == mmlu_example['correct_answer'] else ""
print(f" {label}. {choice}{marker}")
Subject: geography
Q: What is the capital of Australia?
A. Sydney
B. Melbourne
C. Canberra <-- correct
D. Brisbane
The metric is simple: accuracy — what percentage of questions did the model get right. A score of 86% on MMLU means the model answered 86 out of 100 questions correctly across all 57 subjects. I like MMLU as a starting point, but it tells you less than you’d think.
Why it matters: MMLU is the most widely cited benchmark for general knowledge. But it has limits. Multiple-choice doesn’t test generation ability, and some questions have leaked into training data.
HumanEval (Code Generation)
HumanEval tests whether a model can write working Python code. It gives 164 programming problems and checks if the generated code passes unit tests.
# What a HumanEval problem looks like
humaneval_example = {
"task_id": "HumanEval/0",
"prompt": "def has_close_elements(numbers, threshold):\n"
" \"\"\"Check if any two numbers are closer "
"than the given threshold.\"\"\"\n",
"test": "assert has_close_elements([1.0, 2.0, 3.9, 4.0], 0.3)\n"
"assert not has_close_elements([1.0, 2.0, 3.0], 0.25)",
}
print("Task:", humaneval_example['task_id'])
print("The model must complete the function")
print("Then it runs against hidden unit tests")
print(f"\nSample test:\n{humaneval_example['test']}")
Task: HumanEval/0
The model must complete the function
Then it runs against hidden unit tests
Sample test:
assert has_close_elements([1.0, 2.0, 3.9, 4.0], 0.3)
assert not has_close_elements([1.0, 2.0, 3.0], 0.25)
The metric is pass@k: generate k code samples, and check if at least one passes all tests. Pass@1 means one attempt. Pass@10 means ten attempts — did any of them work?
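In practice you estimate pass@k from more samples than k. Here's a minimal sketch of the unbiased estimator introduced with HumanEval; the sample counts below are made up for illustration:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (from the HumanEval paper).

    n: samples generated per problem, c: samples that passed, k: budget.
    Returns the probability that at least one of k samples drawn at
    random from the n generated passes the tests.
    """
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 200 samples generated, 30 passed the unit tests
print(f"pass@1:  {pass_at_k(200, 30, 1):.3f}")  # -> 0.150
print(f"pass@10: {pass_at_k(200, 30, 10):.3f}")
```

Note how pass@10 comes out far higher than pass@1: given enough attempts, a model solves many problems it usually fails, which is why the two numbers are reported separately.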
Other Important Benchmarks
Here’s a quick reference for benchmarks you’ll see in model comparisons.
benchmarks = {
"MMLU": {
"tests": "General knowledge (57 subjects)",
"metric": "Accuracy (%)",
"format": "Multiple choice",
},
"HumanEval": {
"tests": "Code generation (164 tasks)",
"metric": "Pass@k",
"format": "Code completion",
},
"HellaSwag": {
"tests": "Commonsense reasoning",
"metric": "Accuracy (%)",
"format": "Sentence completion",
},
"TruthfulQA": {
"tests": "Factual accuracy vs misconceptions",
"metric": "% truthful + informative",
"format": "Open-ended + MC",
},
"MT-Bench": {
"tests": "Multi-turn conversation quality",
"metric": "GPT-4 score (1-10)",
"format": "Multi-turn dialogue",
},
"Chatbot Arena": {
"tests": "Overall preference (crowdsourced)",
"metric": "Elo rating",
"format": "Pairwise human voting",
},
}
print(f"{'Benchmark':<15} {'What It Tests':<40} {'Metric'}")
print("-" * 80)
for name, info in benchmarks.items():
print(f"{name:<15} {info['tests']:<40} {info['metric']}")
Benchmark What It Tests Metric
--------------------------------------------------------------------------------
MMLU General knowledge (57 subjects) Accuracy (%)
HumanEval Code generation (164 tasks) Pass@k
HellaSwag Commonsense reasoning Accuracy (%)
TruthfulQA Factual accuracy vs misconceptions % truthful + informative
MT-Bench Multi-turn conversation quality GPT-4 score (1-10)
Chatbot Arena Overall preference (crowdsourced) Elo rating
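The Elo rating in the Chatbot Arena row works like a chess rating: each pairwise vote nudges the winner up and the loser down, weighted by how surprising the result was. A minimal sketch of the standard update rule (the K-factor of 32 and the starting ratings here are illustrative assumptions, not Arena's exact configuration):

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo update after a pairwise vote.

    winner: 'A', 'B', or 'tie'. Returns the new (r_a, r_b).
    """
    # Expected score of A against B (logistic curve, 400-point scale)
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    s_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    # Winner gains what the loser loses; upsets move ratings more
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Two models start equal; model A wins one crowdsourced vote
a, b = elo_update(1000, 1000, "A")
print(f"A: {a:.0f}, B: {b:.0f}")  # -> A: 1016, B: 984
```

Run over thousands of votes, these updates converge to a ranking without any model ever being scored in isolation.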
The Limitations of Benchmarks
Benchmarks are useful but imperfect. Before you rely on them too heavily, know their failure modes.
Data contamination. If benchmark questions leaked into training data, scores are inflated. The model memorized answers instead of reasoning.
Narrow format. Most benchmarks use multiple-choice or short-answer formats. Real-world tasks involve long-form generation, multi-step reasoning, and ambiguous inputs.
Benchmark gaming. Models can be optimized specifically for benchmark performance without improving general capability. High MMLU scores don’t guarantee good chat behavior.
Saturation. As models improve, scores cluster at the ceiling. When every model scores 90%+, the benchmark stops being useful. MMLU-Pro was created because MMLU became too easy.
# Benchmark saturation example
models_2023 = {"GPT-4": 86.4, "Claude 2": 78.5, "Llama 2 70B": 68.9}
models_2025 = {"GPT-4o": 88.7, "Claude 3.5": 88.7, "Llama 3.1 405B": 88.6}
print("MMLU Scores (approximate)")
print("-" * 40)
print("2023:")
for m, s in models_2023.items():
print(f" {m:<20} {s}%")
print("\n2025:")
for m, s in models_2025.items():
print(f" {m:<20} {s}%")
print("\nBy 2025, top models are nearly identical")
print("MMLU no longer differentiates them")
MMLU Scores (approximate)
----------------------------------------
2023:
GPT-4 86.4%
Claude 2 78.5%
Llama 2 70B 68.9%
2025:
GPT-4o 88.7%
Claude 3.5 88.7%
Llama 3.1 405B 88.6%
By 2025, top models are nearly identical
MMLU no longer differentiates them
Evaluation Metrics You Can Compute in Python
Benchmarks compare models. Metrics measure specific quality dimensions on your own data. Here are the ones that matter most — and we’ll implement each one from scratch so you understand exactly what they compute.
I think building metrics yourself is the fastest way to develop intuition for what they actually capture — and more importantly, what they miss.
Perplexity — How Surprised Is the Model?
Perplexity measures how well a language model predicts text. Lower perplexity means the model finds the text less surprising — it assigns higher probability to the actual words.
The formula: perplexity = exp(-average log probability per token). A model with perplexity 10 is, roughly speaking, “choosing between 10 equally likely options” at each step.
def compute_perplexity(log_probs):
"""Compute perplexity from token log-probabilities.
Args:
log_probs: list of log probabilities (one per token)
Returns:
perplexity score (lower is better)
"""
avg_log_prob = np.mean(log_probs)
return np.exp(-avg_log_prob)
# Simulated: a good model assigns high probability to each token
good_model_logprobs = [-0.5, -0.3, -0.8, -0.2, -0.6, -0.4, -0.5]
bad_model_logprobs = [-2.5, -3.1, -1.8, -2.9, -2.2, -3.5, -2.7]
print(f"Good model perplexity: {compute_perplexity(good_model_logprobs):.2f}")
print(f"Bad model perplexity: {compute_perplexity(bad_model_logprobs):.2f}")
print(f"\nLower = better (model is less 'surprised' by the text)")
Good model perplexity: 1.60
Bad model perplexity: 14.46
Lower = better (model is less 'surprised' by the text)
The good model has perplexity 1.60 — it predicted the text with high confidence. The bad model’s perplexity of 14.46 means it was much more uncertain.
When to use perplexity: Comparing language models on the same text, measuring fine-tuning impact, detecting domain mismatch (perplexity spikes on unfamiliar text).
BLEU Score — Translation and Text Similarity
BLEU (Bilingual Evaluation Understudy) measures how much a generated text overlaps with a reference text. Originally designed for machine translation, it’s now used broadly for text generation tasks.
BLEU counts matching n-grams (sequences of n words) between the candidate and reference. A perfect match scores 1.0. No overlap scores 0.0.
Here’s BLEU computed from scratch. The core idea: count overlapping n-grams between candidate and reference, then combine them.
Two details make BLEU more nuanced. First, a brevity penalty punishes candidates shorter than the reference — you can’t game a high score by outputting only the words you’re confident about. Second, BLEU takes the geometric mean of n-gram precisions, so a zero at any n-gram level zeros out the whole score.
def compute_bleu(reference, candidate, max_n=4):
"""Compute BLEU score between reference and candidate texts."""
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()
# Brevity penalty: penalize candidates shorter than reference
bp = min(1.0, np.exp(1 - len(ref_tokens) / max(len(cand_tokens), 1)))
scores = []
for n in range(1, max_n + 1):
# Count n-grams in reference and candidate
ref_ngrams = Counter(
tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens)-n+1)
)
cand_ngrams = Counter(
tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens)-n+1)
)
# Count matches (clipped to reference counts)
matches = sum(
min(count, ref_ngrams[ng])
for ng, count in cand_ngrams.items()
)
total = max(sum(cand_ngrams.values()), 1)
scores.append(matches / total)
# Geometric mean: if ANY n-gram precision is 0, BLEU = 0
if any(s == 0 for s in scores):
return 0.0
log_avg = np.mean([np.log(s) for s in scores])
return bp * np.exp(log_avg)
Watch how the geometric mean makes BLEU strict — even partial overlap isn’t enough if higher-order n-grams don’t match.
reference = "the cat sat on the mat"
candidate1 = "the cat sat on the mat"
candidate2 = "a cat was sitting on a mat"
candidate3 = "the dog played in the park"
print(f"Reference: '{reference}'\n")
print(f"Candidate 1 (exact match): BLEU = {compute_bleu(reference, candidate1):.4f}")
print(f"Candidate 2 (paraphrase): BLEU = {compute_bleu(reference, candidate2):.4f}")
print(f"Candidate 3 (unrelated): BLEU = {compute_bleu(reference, candidate3):.4f}")
Reference: 'the cat sat on the mat'
Candidate 1 (exact match): BLEU = 1.0000
Candidate 2 (paraphrase): BLEU = 0.0000
Candidate 3 (unrelated): BLEU = 0.0000
The exact match gets 1.0. Both the paraphrase and unrelated text score 0 — they share too few higher-order n-grams with the reference.
You might be wondering: “Why does the paraphrase score zero?” Because while it shares some unigrams (“cat”, “on”, “mat”), it has zero matching bigrams or trigrams. The geometric mean kills it. This is BLEU’s biggest weakness — it’s purely lexical and can’t recognize synonyms or paraphrases.
ROUGE Score — Summarization Quality
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap from the reference side. While BLEU asks “how much of the candidate appears in the reference,” ROUGE asks “how much of the reference appears in the candidate.”
ROUGE-1 counts unigram overlap. ROUGE-L finds the longest common subsequence.
def rouge_1(reference, candidate):
"""Compute ROUGE-1 (unigram) F1 score."""
ref_tokens = set(reference.lower().split())
cand_tokens = set(candidate.lower().split())
overlap = ref_tokens & cand_tokens
if not overlap:
return {"precision": 0, "recall": 0, "f1": 0}
precision = len(overlap) / len(cand_tokens)
recall = len(overlap) / len(ref_tokens)
f1 = 2 * precision * recall / (precision + recall)
return {"precision": round(precision, 4),
"recall": round(recall, 4),
"f1": round(f1, 4)}
reference = "the president spoke about climate change and economic policy"
summary1 = "the president discussed climate change"
summary2 = "weather and money topics were covered by the leader"
r1 = rouge_1(reference, summary1)
r2 = rouge_1(reference, summary2)
print(f"Reference: '{reference}'\n")
print(f"Summary 1: '{summary1}'")
print(f" ROUGE-1: P={r1['precision']}, R={r1['recall']}, F1={r1['f1']}")
print(f"\nSummary 2: '{summary2}'")
print(f" ROUGE-1: P={r2['precision']}, R={r2['recall']}, F1={r2['f1']}")
Reference: 'the president spoke about climate change and economic policy'
Summary 1: 'the president discussed climate change'
ROUGE-1: P=0.8, R=0.4444, F1=0.5714
Summary 2: 'weather and money topics were covered by the leader'
ROUGE-1: P=0.2222, R=0.2222, F1=0.2222
Summary 1 captures key terms (“president”, “climate”, “change”) and scores well. Summary 2 paraphrases everything — only “the” and “and” overlap with the reference, giving it a low 0.22 F1. ROUGE can’t recognize that “leader” means “president” or “weather” relates to “climate.”
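ROUGE-L, mentioned above, replaces unigram overlap with the longest common subsequence, so it rewards matching words in the right order rather than as a bag. A sketch using the standard LCS dynamic program:

```python
def rouge_l(reference, candidate):
    """ROUGE-L F1 via longest common subsequence (order-sensitive)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # dp[i][j] = LCS length of ref[:i] and cand[:j]
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if r == c else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "the president spoke about climate change"
cand = "the president discussed climate change"
print(f"ROUGE-L F1: {rouge_l(ref, cand):.4f}")  # LCS = 4 words
```

Because the LCS skips over the mismatched verb but keeps word order, ROUGE-L distinguishes a reordered jumble from a fluent near-match in a way ROUGE-1 can't.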
BERTScore — Semantic Similarity Beyond Surface Overlap
BLEU and ROUGE both fail at the same thing: they can’t detect paraphrases. “The president spoke” and “the leader talked” share zero tokens, so they score zero — even though they mean the same thing.
BERTScore fixes this by comparing texts at the meaning level. Instead of matching exact words, it converts each token into a contextual embedding (using a pretrained model like BERT) and measures cosine similarity between the best-matching pairs.
# BERTScore concept: cosine similarity between token embeddings
# Full implementation requires a transformer model.
# Here's the core logic for understanding:
def cosine_similarity(vec_a, vec_b):
"""Compute cosine similarity between two vectors."""
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = sum(a ** 2 for a in vec_a) ** 0.5
norm_b = sum(b ** 2 for b in vec_b) ** 0.5
return dot / (norm_a * norm_b) if norm_a * norm_b > 0 else 0.0
# Simulated embeddings (in practice, from a BERT model)
# "president" and "leader" have similar embeddings
president_emb = [0.8, 0.3, 0.5, 0.1]
leader_emb = [0.75, 0.35, 0.45, 0.15]
cat_emb = [0.1, 0.9, 0.2, 0.7]
sim_related = cosine_similarity(president_emb, leader_emb)
sim_unrelated = cosine_similarity(president_emb, cat_emb)
print(f"'president' vs 'leader': cosine similarity = {sim_related:.4f}")
print(f"'president' vs 'cat': cosine similarity = {sim_unrelated:.4f}")
print(f"\nBERTScore catches synonyms that BLEU/ROUGE miss")
'president' vs 'leader': cosine similarity = 0.9956
'president' vs 'cat': cosine similarity = 0.4498
BERTScore catches synonyms that BLEU/ROUGE miss
In practice, you’d use the bert-score library: pip install bert-score. But understanding the cosine similarity core helps you interpret the scores — high BERTScore with low ROUGE means good paraphrasing, while low BERTScore with high ROUGE means surface copying without understanding.
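To make that interpretation concrete, here is a sketch of BERTScore's greedy-matching step: each reference token pairs with its most similar candidate token (giving recall), and each candidate token with its most similar reference token (giving precision). The embeddings below are made up for illustration; real BERTScore gets them from a transformer model.

```python
def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na * nb else 0.0

def bertscore_f1(cand_embs, ref_embs):
    """Greedy-matching BERTScore: each token pairs with its best match."""
    recall = sum(max(cosine(r, c) for c in cand_embs)
                 for r in ref_embs) / len(ref_embs)
    precision = sum(max(cosine(c, r) for r in ref_embs)
                    for c in cand_embs) / len(cand_embs)
    return 2 * precision * recall / (precision + recall)

# Toy embeddings (invented): "the president spoke" vs "the leader talked"
ref = [[1.0, 0.1, 0.1], [0.8, 0.3, 0.5], [0.2, 0.9, 0.3]]
cand = [[1.0, 0.1, 0.1], [0.75, 0.35, 0.45], [0.25, 0.85, 0.35]]
print(f"BERTScore F1 (toy): {bertscore_f1(cand, ref):.4f}")
```

Even though the surface tokens differ, the near-identical embeddings produce an F1 close to 1.0 — exactly the paraphrase case where BLEU and ROUGE return zero.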
Quick Comparison: Which Metric When?
metrics_comparison = {
"Perplexity": {
"measures": "Model confidence on text",
"best_for": "Model comparison, fine-tuning tracking",
"limitation": "Doesn't measure output quality",
},
"BLEU": {
"measures": "N-gram overlap (precision-focused)",
"best_for": "Translation, structured generation",
"limitation": "Misses paraphrases entirely",
},
"ROUGE": {
"measures": "N-gram overlap (recall-focused)",
"best_for": "Summarization, extraction tasks",
"limitation": "Misses paraphrases entirely",
},
"BERTScore": {
"measures": "Semantic similarity via embeddings",
"best_for": "Open-ended generation, QA",
"limitation": "Requires GPU, slower to compute",
},
"Pass@k": {
"measures": "Functional correctness of code",
"best_for": "Code generation tasks",
"limitation": "Only works for code with tests",
},
}
print(f"{'Metric':<12} {'Best For':<35} {'Key Limitation'}")
print("-" * 80)
for metric, info in metrics_comparison.items():
print(f"{metric:<12} {info['best_for'][:35]:<35} {info['limitation']}")
Metric Best For Key Limitation
--------------------------------------------------------------------------------
Perplexity Model comparison, fine-tuning track Doesn't measure output quality
BLEU Translation, structured generation Misses paraphrases entirely
ROUGE Summarization, extraction tasks Misses paraphrases entirely
BERTScore Open-ended generation, QA Requires GPU, slower to compute
Pass@k Code generation tasks Only works for code with tests
Exercise 1: Compute ROUGE-1 for a summarization task.
Given a reference summary and three candidate summaries, compute the ROUGE-1 F1 score for each. Which candidate is the best summary?
# Starter code
def rouge_1_f1(reference, candidate):
"""Compute ROUGE-1 F1 score."""
ref_tokens = set(reference.lower().split())
cand_tokens = set(candidate.lower().split())
# YOUR CODE: compute overlap, precision, recall, F1
pass
ref = "machine learning models require large datasets for training"
c1 = "ml models need big data to train"
c2 = "machine learning requires large datasets"
c3 = "deep neural networks process information"
# Compute and print F1 for each candidate
# YOUR CODE HERE
Hints:
1. Overlap is the intersection of the two token sets. Precision = overlap/candidate_size. Recall = overlap/reference_size.
2. F1 = 2 * precision * recall / (precision + recall). Handle the case where precision + recall == 0.
Click to reveal solution
def rouge_1_f1(reference, candidate):
ref_tokens = set(reference.lower().split())
cand_tokens = set(candidate.lower().split())
overlap = ref_tokens & cand_tokens
if not overlap:
return 0.0
precision = len(overlap) / len(cand_tokens)
recall = len(overlap) / len(ref_tokens)
return 2 * precision * recall / (precision + recall)
ref = "machine learning models require large datasets for training"
c1 = "ml models need big data to train"
c2 = "machine learning requires large datasets"
c3 = "deep neural networks process information"
print(f"Candidate 1: {rouge_1_f1(ref, c1):.4f}")
print(f"Candidate 2: {rouge_1_f1(ref, c2):.4f}")
print(f"Candidate 3: {rouge_1_f1(ref, c3):.4f}")
print(f"\nCandidate 2 is best -- it captures 4 key terms from the reference")
Candidate 1: 0.1333
Candidate 2: 0.6154
Candidate 3: 0.0000
Candidate 2 is best -- it captures 4 key terms from the reference
**Explanation:** Candidate 2 shares “machine”, “learning”, “large”, “datasets” with the reference — 4 out of 5 candidate words match, giving high precision (0.8) and decent recall (4 out of 8 reference words = 0.5). Candidate 1 uses synonyms (“ml”, “big data”) that ROUGE can’t detect — only “models” overlaps. Candidate 3 shares zero words.
LLM-as-Judge — Using AI to Evaluate AI
Here’s where things get interesting. Automated metrics have a fundamental problem: they can’t judge qualities like helpfulness, tone, or reasoning quality. These require understanding, not pattern matching.
The solution? Use a strong LLM to evaluate outputs from the model you’re testing. This is the “LLM-as-judge” pattern, and it’s become the standard approach for evaluating open-ended generation. I use it on nearly every project now.
The idea is simple. You give a judge model (typically GPT-4 or Claude) the prompt, the model’s response, and a scoring rubric. The judge returns a structured score.
# LLM-as-Judge prompt template
judge_prompt = """You are an expert evaluator. Score the following response
on a scale of 1-5 for each criterion.
**Prompt:** {prompt}
**Response:** {response}
**Criteria:**
1. Relevance (1-5): Does the response address the question?
2. Accuracy (1-5): Is the information factually correct?
3. Clarity (1-5): Is the response clear and well-organized?
4. Completeness (1-5): Does it cover the topic adequately?
Return ONLY a JSON object:
{{"relevance": X, "accuracy": X, "clarity": X, "completeness": X}}
"""
# Example usage (simulated -- would normally call an API)
example = judge_prompt.format(
prompt="What causes rain?",
response="Rain forms when water vapor in clouds condenses into "
"droplets heavy enough to fall. This happens when warm, "
"moist air rises and cools at higher altitudes."
)
print("=== Judge Prompt (first 200 chars) ===")
print(example[:200] + "...")
print("\n=== Simulated Judge Response ===")
scores = {"relevance": 5, "accuracy": 5, "clarity": 5, "completeness": 4}
print(scores)
print(f"Average: {np.mean(list(scores.values())):.1f}/5.0")
=== Judge Prompt (first 200 chars) ===
You are an expert evaluator. Score the following response
on a scale of 1-5 for each criterion.
**Prompt:** What causes rain?
**Response:** Rain forms when water vapor in clouds condenses into ...
=== Simulated Judge Response ===
{'relevance': 5, 'accuracy': 5, 'clarity': 5, 'completeness': 4}
Average: 4.8/5.0
Pairwise Comparison — A Better Judge Pattern
Single-score judging is noisy. The judge might give a 4 one time and a 3 the next for the same response. Pairwise comparison is more reliable: show two responses and ask “which is better?”
pairwise_prompt = """Compare these two responses to the same prompt.
Which one is better? Explain briefly, then state your choice.
**Prompt:** {prompt}
**Response A:** {response_a}
**Response B:** {response_b}
Your answer must end with exactly one of:
[[A]] or [[B]] or [[Tie]]
"""
prompt = "Explain what an API is to a non-technical person."
response_a = ("An API is like a waiter at a restaurant. "
"You tell the waiter what you want, the waiter "
"goes to the kitchen, and brings back your food. "
"You never interact with the kitchen directly.")
response_b = ("An API (Application Programming Interface) is a "
"set of protocols and tools for building software "
"applications. It specifies how software components "
"should interact.")
print("Response A uses an analogy (conversational)")
print("Response B uses technical definition (formal)")
print("\nFor a non-technical audience, Response A is better")
print("A judge model would likely output: [[A]]")
Response A uses an analogy (conversational)
Response B uses technical definition (formal)
For a non-technical audience, Response A is better
A judge model would likely output: [[A]]
Running LLM-as-Judge at Scale
Here’s a complete evaluation function that calls the OpenAI API. In production, you’d run this on hundreds of test cases.
# NOTE: Requires OpenAI API key. Shown for reference.
# Set OPENAI_API_KEY environment variable before running.
"""
import openai
import json
def llm_judge(prompt, response, model="gpt-4o"):
judge_msg = f'''Score this response (1-5) on:
relevance, accuracy, clarity, completeness.
Return JSON only.
Prompt: {prompt}
Response: {response}'''
result = openai.chat.completions.create(
model=model,
messages=[{"role": "user", "content": judge_msg}],
temperature=0,
)
return json.loads(result.choices[0].message.content)
# Example:
# scores = llm_judge("What is Python?", "Python is a programming language.")
# print(scores)
"""
print("LLM-as-Judge: use GPT-4o or Claude as evaluator")
print("Key parameters:")
print(" temperature=0 -> deterministic scoring")
print(" JSON output -> structured, parseable results")
print(" Run twice with swapped positions for pairwise")
LLM-as-Judge: use GPT-4o or Claude as evaluator
Key parameters:
temperature=0 -> deterministic scoring
JSON output -> structured, parseable results
Run twice with swapped positions for pairwise
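That last point matters because judge models often favor whichever response appears first (position bias). Here's a minimal sketch of combining the two swapped-position runs, assuming the [[A]]/[[B]]/[[Tie]] convention from the pairwise prompt above; counting only consistent wins is one common policy, not the only one.

```python
def combined_verdict(verdict_ab, verdict_ba):
    """Combine two pairwise judgments run with swapped positions.

    verdict_ab: judge output with response A shown first ('A', 'B', 'Tie')
    verdict_ba: judge output with response B shown first, still naming
                the ORIGINAL responses ('A', 'B', 'Tie')
    Counts a win only when both orderings agree; otherwise it's a tie.
    """
    if verdict_ab == verdict_ba and verdict_ab in ("A", "B"):
        return verdict_ab
    return "Tie"

print(combined_verdict("A", "A"))    # consistent -> A
print(combined_verdict("A", "B"))    # flipped with position -> Tie
print(combined_verdict("Tie", "A"))  # not consistent -> Tie
```

If a large share of your pairs flip when positions swap, that's a signal the judge is guessing — tighten the rubric or use a stronger judge model before trusting the win rates.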
Exercise 2: Design a judge prompt for customer support quality.
Write a judge prompt that evaluates customer support responses on four criteria relevant to that domain. Include the scoring rubric.
# Starter code
def create_support_judge_prompt(customer_query, agent_response):
"""Create a judge prompt for customer support evaluation.
Define 4 criteria specific to support quality.
Return the complete prompt string.
"""
# YOUR CODE: define criteria and build the prompt
# Think about what matters: empathy, accuracy, actionability...
prompt = ""
return prompt
# Test
query = "My order hasn't arrived and it's been 2 weeks"
response = "I apologize for the delay. Let me check your order status."
print(create_support_judge_prompt(query, response)[:300])
Hints:
1. Good criteria for support: empathy/tone, problem resolution, actionability (did they give next steps?), accuracy of information provided.
2. Use a 1-5 scale and define what each score means. “1 = dismissive, 5 = warm and understanding” for empathy.
Click to reveal solution
def create_support_judge_prompt(customer_query, agent_response):
return f"""Score this customer support response (1-5) on each criterion.
**Customer Query:** {customer_query}
**Agent Response:** {agent_response}
**Criteria:**
1. Empathy (1-5): Does the agent acknowledge the customer's frustration?
1=Dismissive, 3=Neutral, 5=Warm and understanding
2. Resolution (1-5): Does the response work toward solving the problem?
1=Ignores issue, 3=Partial, 5=Clear path to resolution
3. Actionability (1-5): Are concrete next steps provided?
1=Vague, 3=Some steps, 5=Clear actions with timeline
4. Accuracy (1-5): Is the information correct and relevant?
1=Wrong info, 3=Partially correct, 5=Fully accurate
Return JSON: {{"empathy": X, "resolution": X, "actionability": X, "accuracy": X}}"""
query = "My order hasn't arrived and it's been 2 weeks"
response = "I apologize for the delay. Let me check your order status."
print(create_support_judge_prompt(query, response)[:400])
Score this customer support response (1-5) on each criterion.
**Customer Query:** My order hasn't arrived and it's been 2 weeks
**Agent Response:** I apologize for the delay. Let me check your order status.
**Criteria:**
1. Empathy (1-5): Does the agent acknowledge the customer's frustration?
1=Dismissive, 3=Neutral, 5=Warm and understanding
2. Resolution (1-5): Does the response work toward sol
**Explanation:** The four criteria map to what actually matters in support: emotional tone (empathy), problem-solving (resolution), giving next steps (actionability), and correctness (accuracy). Each criterion has anchor descriptions so the judge scores consistently.
Human Evaluation — When Automated Metrics Aren’t Enough
Automated metrics and LLM judges are scalable. But they systematically miss certain failure modes. Human evaluation catches what machines can’t — and skipping it is a mistake I’ve watched teams regret.
When you need humans:
- Safety-critical applications (medical, legal, financial advice)
- Subjective quality (brand voice, creativity, humor)
- Edge cases that LLM judges handle poorly
- Calibrating your automated evaluation pipeline
Setting Up a Human Evaluation Protocol
A good human eval has five components.
human_eval_protocol = {
"1. Sample Selection": "Random sample of 50-200 outputs from production",
"2. Rubric Definition": "3-5 criteria with clear anchor descriptions",
"3. Annotator Training": "Show 10 examples with expected scores",
"4. Double Annotation": "2+ annotators per example, measure agreement",
"5. Agreement Metric": "Cohen's kappa > 0.6 means acceptable agreement",
}
print("=== Human Evaluation Protocol ===\n")
for step, desc in human_eval_protocol.items():
print(f"{step}")
print(f" {desc}\n")
=== Human Evaluation Protocol ===
1. Sample Selection
Random sample of 50-200 outputs from production
2. Rubric Definition
3-5 criteria with clear anchor descriptions
3. Annotator Training
Show 10 examples with expected scores
4. Double Annotation
2+ annotators per example, measure agreement
5. Agreement Metric
Cohen's kappa > 0.6 means acceptable agreement
The most critical step is the rubric. Vague criteria like “quality” lead to inconsistent scores. Each criterion needs anchor examples.
But how do you measure whether your annotators actually agree? Raw agreement percentage is misleading — if 90% of examples are “Good,” two random annotators will agree most of the time by luck. Cohen’s Kappa corrects for this by subtracting the expected chance agreement from the observed agreement.
def cohens_kappa(annotator1, annotator2):
"""Compute Cohen's Kappa for inter-annotator agreement."""
assert len(annotator1) == len(annotator2)
n = len(annotator1)
# Observed agreement
agree = sum(a == b for a, b in zip(annotator1, annotator2))
po = agree / n
# Expected agreement (by chance)
labels = set(annotator1) | set(annotator2)
pe = sum(
(annotator1.count(l) / n) * (annotator2.count(l) / n)
for l in labels
)
if pe == 1.0:
return 1.0
return (po - pe) / (1 - pe)
# Two annotators scored 20 responses as Good/Bad
ann1 = ["Good"]*12 + ["Bad"]*8
ann2 = ["Good"]*10 + ["Bad"]*2 + ["Good"]*2 + ["Bad"]*6
kappa = cohens_kappa(ann1, ann2)
agreement = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)
print(f"Raw agreement: {agreement:.0%}")
print(f"Cohen's Kappa: {kappa:.4f}")
print(f"\nInterpretation:")
print(f" < 0.2 = Poor")
print(f" 0.2-0.4 = Fair")
print(f" 0.4-0.6 = Moderate")
print(f" 0.6-0.8 = Substantial")
print(f" > 0.8 = Almost perfect")
Raw agreement: 80%
Cohen's Kappa: 0.5833
Interpretation:
< 0.2 = Poor
0.2-0.4 = Fair
0.4-0.6 = Moderate
0.6-0.8 = Substantial
> 0.8 = Almost perfect
The annotators agree 80% of the time, but Cohen’s Kappa is only 0.58 (moderate). That’s because some agreement is due to chance. If you see kappa below 0.6, your rubric is probably too vague — tighten the criteria.
Building Your Evaluation Pipeline
Now let’s put everything together. This three-stage pipeline is the structure I reach for on any LLM-powered product.
The Three-Stage Pipeline
pipeline_stages = {
"Stage 1: Offline Eval (before deploy)": {
"what": "Run benchmarks + task-specific tests",
"tools": "Test dataset, automated metrics, LLM judge",
"frequency": "Every model change or prompt update",
"example": "Score 200 test cases, compute ROUGE + judge scores",
},
"Stage 2: Shadow Eval (during rollout)": {
"what": "Run new model alongside old, compare outputs",
"tools": "A/B test framework, pairwise LLM judge",
"frequency": "During model rollout period",
"example": "Both models answer same queries, judge picks winner",
},
"Stage 3: Online Eval (in production)": {
"what": "Monitor live outputs, catch regressions",
"tools": "User feedback, automated checks, spot human eval",
"frequency": "Continuous",
"example": "Track thumbs up/down, flag low-confidence responses",
},
}
for stage, info in pipeline_stages.items():
print(f"\n{stage}")
print(f" What: {info['what']}")
print(f" Tools: {info['tools']}")
print(f" When: {info['frequency']}")
Stage 1: Offline Eval (before deploy)
What: Run benchmarks + task-specific tests
Tools: Test dataset, automated metrics, LLM judge
When: Every model change or prompt update
Stage 2: Shadow Eval (during rollout)
What: Run new model alongside old, compare outputs
Tools: A/B test framework, pairwise LLM judge
When: During model rollout period
Stage 3: Online Eval (in production)
What: Monitor live outputs, catch regressions
Tools: User feedback, automated checks, spot human eval
When: Continuous
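Stage 2’s pairwise comparison reduces to a simple win-rate computation once the judge has voted. Here’s a sketch assuming the judge returns "A" (old model wins), "B" (new model wins), or "tie" per query — that verdict format is an assumption for illustration, not a fixed API:

```python
from collections import Counter

def win_rate(verdicts):
    """Summarize pairwise judge verdicts from a shadow rollout.

    verdicts: list of "A" (old model), "B" (new model), or "tie".
    Returns the new model's win rate, counting ties as half a win.
    """
    counts = Counter(verdicts)
    return (counts["B"] + 0.5 * counts["tie"]) / len(verdicts)

# Example: the judge compared both models on 10 shadow queries
verdicts = ["B", "B", "tie", "A", "B", "B", "tie", "B", "A", "B"]
print(f"New model win rate: {win_rate(verdicts):.0%}")  # prints: New model win rate: 70%
```

A win rate meaningfully above 50% supports promoting the new model; near 50%, collect more verdicts before deciding.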
Building an Offline Evaluation Script
Here’s a complete evaluation harness you can adapt. It takes a test dataset, runs metrics, and produces a report.
def evaluate_model_outputs(test_cases, predictions):
"""Run a full offline evaluation on model outputs.
Args:
test_cases: list of dicts with 'prompt', 'reference'
predictions: list of model output strings
Returns:
dict with aggregate metrics
"""
assert len(test_cases) == len(predictions)
rouge_scores = []
length_ratios = []
for tc, pred in zip(test_cases, predictions):
ref = tc["reference"]
r1 = rouge_1(ref, pred)
rouge_scores.append(r1["f1"])
length_ratios.append(len(pred.split()) / max(len(ref.split()), 1))
return {
"n_samples": len(test_cases),
"avg_rouge1_f1": round(np.mean(rouge_scores), 4),
"min_rouge1_f1": round(np.min(rouge_scores), 4),
"max_rouge1_f1": round(np.max(rouge_scores), 4),
"avg_length_ratio": round(np.mean(length_ratios), 2),
}
# Demo: evaluate a "model" on 5 test cases
test_cases = [
{"prompt": "Summarize Python", "reference": "Python is a versatile programming language"},
{"prompt": "Summarize ML", "reference": "Machine learning finds patterns in data"},
{"prompt": "Summarize APIs", "reference": "APIs let applications communicate with each other"},
{"prompt": "Summarize cloud", "reference": "Cloud computing provides on-demand computing resources"},
{"prompt": "Summarize AI", "reference": "Artificial intelligence simulates human reasoning"},
]
predictions = [
"Python is a popular programming language for data science",
"ML uses algorithms to learn from data patterns",
"APIs enable software systems to talk to each other",
"Cloud services offer scalable computing on demand",
"AI mimics human intelligence and decision making",
]
report = evaluate_model_outputs(test_cases, predictions)
print("=== Evaluation Report ===")
for key, val in report.items():
print(f" {key}: {val}")
=== Evaluation Report ===
n_samples: 5
avg_rouge1_f1: 0.4038
min_rouge1_f1: 0.2857
max_rouge1_f1: 0.6667
avg_length_ratio: 1.34
The average ROUGE-1 F1 of 0.40 isn’t great — but that’s expected since the predictions paraphrase rather than copy the reference text. In practice, you’d combine this with an LLM judge for a fuller picture.
Exercise 3: Add length and vocabulary diversity checks to the evaluation pipeline.
Extend the evaluate_model_outputs function to also compute average response length (in words) and vocabulary diversity (unique words / total words).
# Starter code
def evaluate_extended(test_cases, predictions):
"""Extended evaluation with length and diversity metrics."""
rouge_scores = []
lengths = []
diversities = []
for tc, pred in zip(test_cases, predictions):
# ROUGE-1 (already implemented above)
r1 = rouge_1(tc["reference"], pred)
rouge_scores.append(r1["f1"])
# YOUR CODE: compute word count for this prediction
# YOUR CODE: compute vocabulary diversity (unique / total)
return {
"avg_rouge1_f1": round(np.mean(rouge_scores), 4),
# YOUR CODE: add avg_length and avg_diversity
}
result = evaluate_extended(test_cases, predictions)
print(result)
Hints:
1. Word count: len(pred.split()). Vocabulary diversity: len(set(pred.lower().split())) / len(pred.lower().split()).
2. A diversity score near 1.0 means every word is unique (good). Near 0.5 means lots of repetition (may indicate degenerate output).
Click to reveal solution
def evaluate_extended(test_cases, predictions):
rouge_scores, lengths, diversities = [], [], []
for tc, pred in zip(test_cases, predictions):
r1 = rouge_1(tc["reference"], pred)
rouge_scores.append(r1["f1"])
words = pred.lower().split()
lengths.append(len(words))
diversities.append(len(set(words)) / max(len(words), 1))
return {
"avg_rouge1_f1": round(np.mean(rouge_scores), 4),
"avg_length": round(np.mean(lengths), 1),
"avg_diversity": round(np.mean(diversities), 4),
}
result = evaluate_extended(test_cases, predictions)
print("=== Extended Report ===")
for k, v in result.items():
print(f" {k}: {v}")
=== Extended Report ===
avg_rouge1_f1: 0.4038
avg_length: 8.0
avg_diversity: 0.9778
Explanation: Average length of 8 words per response — these are short summaries. Diversity of 0.98 means nearly every word is unique (one prediction repeats “to”). If diversity dropped below 0.7, it could signal degenerate or repetitive outputs.
Evaluating RAG Systems
RAG (Retrieval-Augmented Generation) adds retrieval to the evaluation challenge. You’re no longer just evaluating the LLM — you’re evaluating the retriever, the context, AND the generation. This is where I see most teams struggle, because there are more moving parts to test.
Three metrics matter for RAG:
Faithfulness. Does the response only contain information from the retrieved context? An unfaithful response hallucinates facts not in the source.
Answer relevance. Does the response actually answer the question? A response can be faithful (everything it says is in the context) but irrelevant (it doesn’t address the query).
Context relevance. Did the retriever pull the right documents? Irrelevant context leads to irrelevant answers.
rag_metrics = {
"Faithfulness": {
"question": "Is every claim in the response supported by context?",
"fail_example": "Response says 'founded in 2020' but context says 2018",
"check_with": "LLM judge: extract claims, verify each against context",
},
"Answer Relevance": {
"question": "Does the response address what was asked?",
"fail_example": "Asked about pricing, response discusses features",
"check_with": "LLM judge: does the response answer the query?",
},
"Context Relevance": {
"question": "Did the retriever find the right documents?",
"fail_example": "Query about Python, retrieved Java documentation",
"check_with": "Embedding similarity or LLM judge",
},
}
print("=== RAG Evaluation Metrics ===\n")
for metric, info in rag_metrics.items():
print(f"{metric}")
print(f" Question: {info['question']}")
print(f" Fail case: {info['fail_example']}")
print(f" Method: {info['check_with']}\n")
=== RAG Evaluation Metrics ===
Faithfulness
Question: Is every claim in the response supported by context?
Fail case: Response says 'founded in 2020' but context says 2018
Method: LLM judge: extract claims, verify each against context
Answer Relevance
Question: Does the response address what was asked?
Fail case: Asked about pricing, response discusses features
Method: LLM judge: does the response answer the query?
Context Relevance
Question: Did the retriever find the right documents?
Fail case: Query about Python, retrieved Java documentation
Method: Embedding similarity or LLM judge
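A production faithfulness check uses an LLM judge to verify each claim against the context, but a cheap first-pass heuristic is word overlap: flag any response sentence whose content words barely appear in the retrieved context. This is a rough screen for hallucination review, not a substitute for claim-level verification — the threshold and the “words longer than 3 characters” filter are arbitrary choices for illustration:

```python
def unsupported_sentences(response, context, threshold=0.5):
    """Flag response sentences with low word overlap against the context.

    Crude faithfulness screen: a sentence whose content words mostly
    don't appear in the context is a candidate for hallucination review.
    """
    context_words = set(context.lower().split())
    flagged = []
    for sentence in response.split(". "):
        # Keep only longer words as a rough proxy for "content words"
        words = [w for w in sentence.lower().rstrip(".").split() if len(w) > 3]
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < threshold:
            flagged.append((sentence, round(support, 2)))
    return flagged

context = "Acme Corp was founded in 2018 and sells developer tools."
response = "Acme Corp was founded in 2018. The company has 500 employees."
for sentence, support in unsupported_sentences(response, context):
    print(f"FLAG ({support:.0%} supported): {sentence}")
```

The second sentence gets flagged: nothing in the context mentions employee count, which is exactly the kind of invented detail a faithfulness check exists to catch.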
Common Evaluation Mistakes
I keep seeing the same mistakes across projects. Each one silently undermines your evaluation results — and you won’t realize it until production breaks.
Mistake 1: Evaluating on training data. If the model saw the test cases during training, your scores are inflated. Always use a held-out test set that was never part of training or fine-tuning.
Mistake 2: Using a single metric. ROUGE alone misses semantic similarity. Perplexity alone misses factual errors. Use 2-3 complementary metrics.
Mistake 3: Ignoring edge cases. Average scores hide failures. A model with 90% average accuracy might fail 100% on a critical subcategory. Break down scores by category.
Mistake 4: Not versioning your evaluation. When you change your test set or rubric, old scores become incomparable. Version everything: test data, prompts, scoring criteria, model checkpoints.
Here’s the full list with fixes — I keep this pinned on every project.
mistakes = [
("Evaluating on training data", "Use held-out test set", "Critical"),
("Single metric reliance", "Use 2-3 complementary metrics", "High"),
("Ignoring edge cases", "Break down by category", "High"),
("No evaluation versioning", "Version test sets and rubrics", "Medium"),
("Too few test cases", "Minimum 50, ideally 200+", "Medium"),
("Not calibrating LLM judges", "Human eval on 10% sample", "Medium"),
]
print(f"{'Mistake':<30} {'Fix':<35} {'Severity'}")
print("-" * 75)
for mistake, fix, sev in mistakes:
print(f"{mistake:<30} {fix:<35} {sev}")
Mistake Fix Severity
---------------------------------------------------------------------------
Evaluating on training data Use held-out test set Critical
Single metric reliance Use 2-3 complementary metrics High
Ignoring edge cases Break down by category High
No evaluation versioning Version test sets and rubrics Medium
Too few test cases Minimum 50, ideally 200+ Medium
Not calibrating LLM judges Human eval on 10% sample Medium
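Mistake 3 is worth a concrete demonstration. In the sketch below, the overall average looks acceptable while one category fails badly — the category names and scores are invented for illustration:

```python
import numpy as np

# Hypothetical per-example scores tagged with a query category
results = [
    ("billing", 0.2), ("billing", 0.1), ("billing", 0.3),
    ("features", 0.9), ("features", 0.95), ("features", 0.85),
    ("setup", 0.8), ("setup", 0.9), ("setup", 0.85),
]

overall = np.mean([score for _, score in results])
print(f"Overall: {overall:.2f}")  # prints: Overall: 0.65

# Group scores by category to expose hidden failures
by_category = {}
for category, score in results:
    by_category.setdefault(category, []).append(score)
for category, scores in sorted(by_category.items()):
    print(f"  {category}: {np.mean(scores):.2f}")
# prints:
#   billing: 0.20
#   features: 0.90
#   setup: 0.85
```

An overall 0.65 hides that billing queries are almost always wrong. If billing is your highest-stakes category, the aggregate number is actively misleading.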
Choosing the Right Evaluation Strategy
Not sure where to start? Different tasks need different evaluation approaches. Here’s the decision framework I use.
print("=== Evaluation Strategy Decision Tree ===\n")
decisions = [
("What type of task?",
"Classification/QA -> Accuracy, F1, Exact Match",
"Generation/Summary -> ROUGE + LLM Judge",
"Code generation -> Pass@k (unit tests)"),
("How critical is safety?",
"Low risk -> Automated metrics + spot checks",
"High risk -> Full human evaluation pipeline"),
("How much budget?",
"Low -> ROUGE/BLEU + free LLM judge (Llama)",
"Medium -> GPT-4 judge + 10% human eval",
"High -> Full human eval + LLM judge + metrics"),
("New model or prompt change?",
"New model -> Full benchmark suite + task eval",
"Prompt tweak -> Task-specific eval only"),
]
for i, parts in enumerate(decisions, 1):
question = parts[0]
answers = parts[1:]
print(f"{i}. {question}")
for a in answers:
print(f" {a}")
print()
=== Evaluation Strategy Decision Tree ===
1. What type of task?
Classification/QA -> Accuracy, F1, Exact Match
Generation/Summary -> ROUGE + LLM Judge
Code generation -> Pass@k (unit tests)
2. How critical is safety?
Low risk -> Automated metrics + spot checks
High risk -> Full human evaluation pipeline
3. How much budget?
Low -> ROUGE/BLEU + free LLM judge (Llama)
Medium -> GPT-4 judge + 10% human eval
High -> Full human eval + LLM judge + metrics
4. New model or prompt change?
New model -> Full benchmark suite + task eval
Prompt tweak -> Task-specific eval only
Evaluation Frameworks Worth Knowing
Once you understand how metrics work under the hood, you’ll probably want a framework that handles the boilerplate. Here are the three I’d recommend looking at.
DeepEval is the most comprehensive open-source option. It gives you 14+ built-in metrics (hallucination, toxicity, answer relevance, faithfulness) and integrates with pytest for test-driven evaluation.
# DeepEval quickstart (pip install deepeval)
# from deepeval import evaluate
# from deepeval.metrics import AnswerRelevancyMetric
# from deepeval.test_case import LLMTestCase
# metric = AnswerRelevancyMetric(threshold=0.7)
# test_case = LLMTestCase(
# input="What is Python?",
# actual_output="Python is a programming language.",
# )
# metric.measure(test_case)
# print(f"Score: {metric.score}, Passed: {metric.is_successful()}")
print("DeepEval: pip install deepeval")
print(" 14+ metrics out of the box")
print(" pytest integration for CI/CD")
print(" Best for: comprehensive evaluation pipelines")
DeepEval: pip install deepeval
14+ metrics out of the box
pytest integration for CI/CD
Best for: comprehensive evaluation pipelines
RAGAS specializes in RAG evaluation. If you’re building retrieval-augmented systems, RAGAS gives you faithfulness, answer relevance, and context precision metrics designed specifically for that architecture.
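RAGAS follows a dataset-in, scores-out pattern. The sketch below mirrors the general shape of its API, but RAGAS’s imports and signatures have changed across versions, so treat the commented code as orientation and check the current docs before relying on it:

```python
# RAGAS quickstart sketch (pip install ragas) -- the API varies by
# version; verify these imports against the current documentation.
# from ragas import evaluate
# from ragas.metrics import faithfulness, answer_relevancy
# from datasets import Dataset
# data = Dataset.from_dict({
#     "question": ["What is Python?"],
#     "answer": ["Python is a programming language."],
#     "contexts": [["Python is a high-level programming language."]],
# })
# result = evaluate(data, metrics=[faithfulness, answer_relevancy])
# print(result)
lines = [
    "RAGAS: pip install ragas",
    "  Faithfulness, answer relevancy, context precision",
    "  Best for: RAG-specific evaluation pipelines",
]
for line in lines:
    print(line)
```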
Promptfoo takes a different approach — it’s a CLI tool for A/B testing prompts across multiple models. Think of it as unit testing for your prompts.
frameworks = {
"DeepEval": {
"focus": "General LLM evaluation",
"metrics": "14+ (hallucination, toxicity, relevancy...)",
"install": "pip install deepeval",
},
"RAGAS": {
"focus": "RAG pipeline evaluation",
"metrics": "Faithfulness, context precision, answer relevancy",
"install": "pip install ragas",
},
"Promptfoo": {
"focus": "Prompt A/B testing across models",
"metrics": "Custom assertions + LLM graders",
"install": "npx promptfoo@latest init",
},
}
print(f"{'Framework':<12} {'Focus':<30} {'Install'}")
print("-" * 70)
for name, info in frameworks.items():
print(f"{name:<12} {info['focus']:<30} {info['install']}")
Framework Focus Install
----------------------------------------------------------------------
DeepEval General LLM evaluation pip install deepeval
RAGAS RAG pipeline evaluation pip install ragas
Promptfoo Prompt A/B testing across models npx promptfoo@latest init
Summary
LLM evaluation has three layers: benchmarks for model selection, metrics for quality measurement, and judgment (human or LLM) for nuanced assessment.
Here’s what to remember:
- Benchmarks (MMLU, HumanEval, etc.) compare models but don’t predict performance on your task
- Metrics (BLEU, ROUGE, perplexity, BERTScore) quantify specific quality dimensions — always use 2-3, not just one
- LLM-as-judge scales evaluation for open-ended tasks — calibrate with human scores
- Human evaluation remains the gold standard for safety-critical and subjective tasks
- RAG evaluation requires testing retriever and generator separately
- Frameworks (DeepEval, RAGAS) handle boilerplate once you understand the underlying metrics
- Version everything — test sets, rubrics, and model checkpoints
print("=" * 55)
print(" LLM Evaluation: The Complete Toolkit")
print("=" * 55)
print()
print(" Layer 1: Benchmarks -> Model selection")
print(" Layer 2: Metrics -> Quality measurement")
print(" Layer 3: Judgment -> Nuanced assessment")
print()
print(" Start with: ROUGE + LLM judge + 50 test cases")
print(" Scale to: Category breakdowns + human eval")
print("=" * 55)
=======================================================
LLM Evaluation: The Complete Toolkit
=======================================================
Layer 1: Benchmarks -> Model selection
Layer 2: Metrics -> Quality measurement
Layer 3: Judgment -> Nuanced assessment
Start with: ROUGE + LLM judge + 50 test cases
Scale to: Category breakdowns + human eval
=======================================================
Click to expand the full script (copy-paste and run)
# Complete code from: How to Evaluate LLMs
# Requires: pip install numpy
# Python 3.9+
import numpy as np
from collections import Counter
# --- Perplexity ---
def compute_perplexity(log_probs):
return np.exp(-np.mean(log_probs))
# --- BLEU ---
def compute_bleu(reference, candidate, max_n=4):
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()
bp = min(1.0, np.exp(1 - len(ref_tokens) / max(len(cand_tokens), 1)))
scores = []
for n in range(1, max_n + 1):
ref_ng = Counter(tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens)-n+1))
cand_ng = Counter(tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens)-n+1))
matches = sum(min(c, ref_ng[ng]) for ng, c in cand_ng.items())
total = max(sum(cand_ng.values()), 1)
scores.append(matches / total)
if any(s == 0 for s in scores):
return 0.0
return bp * np.exp(np.mean([np.log(s) for s in scores]))
# --- ROUGE-1 ---
def rouge_1(reference, candidate):
ref_t = set(reference.lower().split())
cand_t = set(candidate.lower().split())
overlap = ref_t & cand_t
if not overlap:
return {"precision": 0, "recall": 0, "f1": 0}
p = len(overlap) / len(cand_t)
r = len(overlap) / len(ref_t)
return {"precision": round(p, 4), "recall": round(r, 4),
"f1": round(2*p*r/(p+r), 4)}
# --- Cosine Similarity (BERTScore core) ---
def cosine_similarity(vec_a, vec_b):
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = sum(a ** 2 for a in vec_a) ** 0.5
norm_b = sum(b ** 2 for b in vec_b) ** 0.5
return dot / (norm_a * norm_b) if norm_a * norm_b > 0 else 0.0
# --- Cohen's Kappa ---
def cohens_kappa(a1, a2):
n = len(a1)
po = sum(x == y for x, y in zip(a1, a2)) / n
labels = set(a1) | set(a2)
pe = sum((a1.count(l)/n) * (a2.count(l)/n) for l in labels)
return (po - pe) / (1 - pe) if pe != 1 else 1.0
# --- Evaluation Harness ---
def evaluate_model_outputs(test_cases, predictions):
scores = []
for tc, pred in zip(test_cases, predictions):
r1 = rouge_1(tc["reference"], pred)
scores.append(r1["f1"])
return {"n": len(test_cases), "avg_rouge1": round(np.mean(scores), 4)}
print("All evaluation functions loaded. Script completed successfully.")
FAQ
Q: How many test cases do I need for reliable evaluation?
Minimum 50 for basic signals. 200+ for statistically significant comparisons between models. For category-level breakdowns, you need 20-30 per category.
Q: Can I use a weaker model as an LLM judge?
You can, but quality drops. GPT-4 and Claude 3.5 are the standard judges. Open models like Llama 3.1 70B work for simple rubrics but struggle with nuanced quality assessment.
Q: How often should I update my evaluation test set?
Quarterly at minimum. More often if your product scope changes. The test set should always reflect current user queries and edge cases.
Q: Is BLEU still relevant in 2026?
For translation and structured generation, yes. For open-ended generation, prefer ROUGE + BERTScore or an LLM judge. BLEU’s inability to handle paraphrasing makes it insufficient as a sole metric.
Q: What’s the fastest way to set up evaluation from scratch?
Start with 50 representative test cases from your actual use case. Compute ROUGE-1 and run an LLM-as-judge with a simple rubric. That gives you two complementary signals in under a day.
Q: Should I use an evaluation framework like DeepEval or build my own?
Both. Understand the metrics by building them from scratch first (as this guide teaches). Then use a framework like DeepEval or RAGAS for production pipelines — they handle scaling, CI/CD integration, and metric aggregation that you don’t want to maintain yourself.
Related Topics
- Prompt Engineering — optimizing prompts before evaluating their impact
- Fine-Tuning LLMs — when to fine-tune vs prompt and how to measure improvement
- RAG (Retrieval-Augmented Generation) — building and evaluating retrieval pipelines
- LLM Deployment — serving models in production with monitoring
- AI Safety and Alignment — evaluating model behavior for safety
- DPO (Direct Preference Optimization) — training models on human preferences
- Vector Databases — the retrieval layer that feeds RAG systems
References
- Hendrycks, D. et al. (2021). “Measuring Massive Multitask Language Understanding.” ICLR 2021. arXiv:2009.03300
- Chen, M. et al. (2021). “Evaluating Large Language Models Trained on Code.” arXiv:2107.03374. (HumanEval)
- Papineni, K. et al. (2002). “BLEU: a Method for Automatic Evaluation of Machine Translation.” ACL 2002.
- Lin, C-Y. (2004). “ROUGE: A Package for Automatic Evaluation of Summaries.” ACL 2004 Workshop.
- Zheng, L. et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. arXiv:2306.05685
- Zhang, T. et al. (2020). “BERTScore: Evaluating Text Generation with BERT.” ICLR 2020. arXiv:1904.09675
- Liang, P. et al. (2023). “Holistic Evaluation of Language Models.” TMLR 2023. (HELM)
- Zellers, R. et al. (2019). “HellaSwag: Can a Machine Really Finish Your Sentence?” ACL 2019.
- Lin, S. et al. (2022). “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” ACL 2022.
- Evidently AI. “30 LLM Evaluation Benchmarks and How They Work.”