How to Evaluate LLMs — Metrics, Benchmarks & Python Code
You just deployed an LLM-powered feature. Users are complaining — some answers are wrong, some are great, and you can’t tell which is which. Why? Because you never set up proper evaluation.
I’ve seen this pattern dozens of times. Teams spend weeks on prompt engineering and fine-tuning, then ship without measuring whether any of it actually worked.
Evaluation isn’t optional. It’s the difference between “we think the model is good” and “we know the model scores 78% on our test set, up from 71% last week.”
In this guide you’ll learn:
- What the major LLM benchmarks (MMLU, HumanEval, MT-Bench) actually measure
- How to compute perplexity, BLEU, ROUGE, and BERTScore from scratch in Python
- How to build an LLM-as-judge pipeline for open-ended evaluation
- When to use human evaluation and how to measure annotator agreement
- How to evaluate RAG systems and avoid common evaluation mistakes
- Which evaluation frameworks (DeepEval, RAGAS) to use in production
Before we write any code, here’s how the evaluation landscape fits together.
There are three layers to LLM evaluation. The first is benchmarks — standardized tests like MMLU and HumanEval that let you compare models against each other. They answer: “How does this model rank overall?”
The second layer is metrics — quantitative scores you compute on your own data. BLEU, ROUGE, perplexity, pass@k. They answer: “How well does this model perform on my specific task?”
The third layer is judgment — human evaluation and LLM-as-judge. These catch what automated metrics miss: tone, helpfulness, safety. They answer: “Is this output actually good?”
Most teams need all three. Benchmarks for model selection. Metrics for tracking improvement. Judgment for catching edge cases.
We’ll build each layer from scratch. By the end, you’ll have a working evaluation pipeline you can adapt to any LLM project.
Prerequisites
- Python version: 3.9+
- Required libraries: numpy (1.24+); collections, re (stdlib)
- Optional libraries: openai (1.0+), rouge-score (0.1.2+), nltk (3.8+)
- Install:
pip install numpy openai rouge-score nltk
- Time to complete: ~45 minutes
- Last reviewed: March 2026
What Is LLM Evaluation?
LLM evaluation is the process of measuring how well a language model performs on tasks you care about. It moves beyond “does the output look reasonable” to “can we quantify quality and track it over time.”
Think of it like grading student essays. You could eyeball a few and say “looks fine.” Or you could define rubrics, score consistently, and track improvement. LLM evaluation is the rubric approach — and I prefer it because it turns gut feelings into data.
There are three core questions every evaluation answers:
- Correctness. Is the output factually accurate?
- Relevance. Does it actually address what was asked?
- Quality. Is it well-written, safe, and appropriate?
Different evaluation methods tackle different questions. Benchmarks handle correctness at scale. Metrics quantify specific quality dimensions. Human judgment catches everything else.
import numpy as np
from collections import Counter
import re
print("LLM Evaluation toolkit loaded")
print("We'll build metrics from scratch -- no black boxes")
LLM Evaluation toolkit loaded
We'll build metrics from scratch -- no black boxes
The evaluation landscape has exploded since 2023. Dozens of benchmarks, metrics, and frameworks compete for attention. I find the trick is knowing which ones matter for your use case — and ignoring the rest.
Key LLM Benchmarks Explained
Benchmarks are standardized tests. They present the model with predefined inputs, check the outputs against known answers, and produce a score.
You’ve probably seen benchmark numbers thrown around in model announcements. Here’s what the most important ones actually measure.
MMLU (Massive Multitask Language Understanding)
MMLU tests knowledge across 57 subjects — from abstract algebra to world religions. Each question is multiple-choice with four options.
# What an MMLU question looks like
mmlu_example = {
"question": "What is the capital of Australia?",
"choices": ["Sydney", "Melbourne", "Canberra", "Brisbane"],
"correct_answer": "C", # Canberra
"subject": "geography",
}
print(f"Subject: {mmlu_example['subject']}")
print(f"Q: {mmlu_example['question']}")
for i, choice in enumerate(mmlu_example['choices']):
label = chr(65 + i)
marker = " <-- correct" if label == mmlu_example['correct_answer'] else ""
print(f" {label}. {choice}{marker}")
Subject: geography
Q: What is the capital of Australia?
A. Sydney
B. Melbourne
C. Canberra <-- correct
D. Brisbane
The metric is simple: accuracy — what percentage of questions did the model get right. A score of 86% on MMLU means the model answered 86 out of 100 questions correctly across all 57 subjects. I like MMLU as a starting point, but it tells you less than you’d think.
Why it matters: MMLU is the most widely cited benchmark for general knowledge. But it has limits. Multiple-choice doesn’t test generation ability, and some questions have leaked into training data.
HumanEval (Code Generation)
HumanEval tests whether a model can write working Python code. It gives 164 programming problems and checks if the generated code passes unit tests.
# What a HumanEval problem looks like
humaneval_example = {
"task_id": "HumanEval/0",
"prompt": "def has_close_elements(numbers, threshold):\n"
" \"\"\"Check if any two numbers are closer "
"than the given threshold.\"\"\"\n",
"test": "assert has_close_elements([1.0, 2.0, 3.9, 4.0], 0.3)\n"
"assert not has_close_elements([1.0, 2.0, 3.0], 0.25)",
}
print("Task:", humaneval_example['task_id'])
print("The model must complete the function")
print("Then it runs against hidden unit tests")
print(f"\nSample test:\n{humaneval_example['test']}")
Task: HumanEval/0
The model must complete the function
Then it runs against hidden unit tests
Sample test:
assert has_close_elements([1.0, 2.0, 3.9, 4.0], 0.3)
assert not has_close_elements([1.0, 2.0, 3.0], 0.25)
The metric is pass@k: generate k code samples, and check if at least one passes all tests. Pass@1 means one attempt. Pass@10 means ten attempts — did any of them work?
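In practice you estimate pass@k from more samples than k. Here's a minimal sketch of the unbiased estimator introduced with HumanEval; the sample counts below are made up for illustration:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (from the HumanEval paper).

    n: samples generated per problem, c: samples that passed, k: budget.
    Returns the probability that at least one of k samples drawn at
    random from the n generated passes the tests.
    """
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 200 samples generated, 30 passed the unit tests
print(f"pass@1:  {pass_at_k(200, 30, 1):.3f}")  # -> 0.150
print(f"pass@10: {pass_at_k(200, 30, 10):.3f}")
```

Note how pass@10 comes out far higher than pass@1: given enough attempts, a model solves many problems it usually fails, which is why the two numbers are reported separately.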
Other Important Benchmarks
Here’s a quick reference for benchmarks you’ll see in model comparisons.
benchmarks = {
"MMLU": {
"tests": "General knowledge (57 subjects)",
"metric": "Accuracy (%)",
"format": "Multiple choice",
},
"HumanEval": {
"tests": "Code generation (164 tasks)",
"metric": "Pass@k",
"format": "Code completion",
},
"HellaSwag": {
"tests": "Commonsense reasoning",
"metric": "Accuracy (%)",
"format": "Sentence completion",
},
"TruthfulQA": {
"tests": "Factual accuracy vs misconceptions",
"metric": "% truthful + informative",
"format": "Open-ended + MC",
},
"MT-Bench": {
"tests": "Multi-turn conversation quality",
"metric": "GPT-4 score (1-10)",
"format": "Multi-turn dialogue",
},
"Chatbot Arena": {
"tests": "Overall preference (crowdsourced)",
"metric": "Elo rating",
"format": "Pairwise human voting",
},
}
print(f"{'Benchmark':<15} {'What It Tests':<40} {'Metric'}")
print("-" * 80)
for name, info in benchmarks.items():
print(f"{name:<15} {info['tests']:<40} {info['metric']}")
Benchmark What It Tests Metric
--------------------------------------------------------------------------------
MMLU General knowledge (57 subjects) Accuracy (%)
HumanEval Code generation (164 tasks) Pass@k
HellaSwag Commonsense reasoning Accuracy (%)
TruthfulQA Factual accuracy vs misconceptions % truthful + informative
MT-Bench Multi-turn conversation quality GPT-4 score (1-10)
Chatbot Arena Overall preference (crowdsourced) Elo rating
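The Elo rating in the Chatbot Arena row works like a chess rating: each pairwise vote nudges the winner up and the loser down, weighted by how surprising the result was. A minimal sketch of the standard update rule (the K-factor of 32 and the starting ratings here are illustrative assumptions, not Arena's exact configuration):

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo update after a pairwise vote.

    winner: 'A', 'B', or 'tie'. Returns the new (r_a, r_b).
    """
    # Expected score of A against B (logistic curve, 400-point scale)
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    s_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    # Winner gains what the loser loses; upsets move ratings more
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Two models start equal; model A wins one crowdsourced vote
a, b = elo_update(1000, 1000, "A")
print(f"A: {a:.0f}, B: {b:.0f}")  # -> A: 1016, B: 984
```

Run over thousands of votes, these updates converge to a ranking without any model ever being scored in isolation.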
The Limitations of Benchmarks
Benchmarks are useful but imperfect. Before you rely on them too heavily, know their failure modes.
Data contamination. If benchmark questions leaked into training data, scores are inflated. The model memorized answers instead of reasoning.
Narrow format. Most benchmarks use multiple-choice or short-answer formats. Real-world tasks involve long-form generation, multi-step reasoning, and ambiguous inputs.
Benchmark gaming. Models can be optimized specifically for benchmark performance without improving general capability. High MMLU scores don’t guarantee good chat behavior.
Saturation. As models improve, scores cluster at the ceiling. When every model scores 90%+, the benchmark stops being useful. MMLU-Pro was created because MMLU became too easy.
# Benchmark saturation example
models_2023 = {"GPT-4": 86.4, "Claude 2": 78.5, "Llama 2 70B": 68.9}
models_2025 = {"GPT-4o": 88.7, "Claude 3.5": 88.7, "Llama 3.1 405B": 88.6}
print("MMLU Scores (approximate)")
print("-" * 40)
print("2023:")
for m, s in models_2023.items():
print(f" {m:<20} {s}%")
print("\n2025:")
for m, s in models_2025.items():
print(f" {m:<20} {s}%")
print("\nBy 2025, top models are nearly identical")
print("MMLU no longer differentiates them")
MMLU Scores (approximate)
----------------------------------------
2023:
GPT-4 86.4%
Claude 2 78.5%
Llama 2 70B 68.9%
2025:
GPT-4o 88.7%
Claude 3.5 88.7%
Llama 3.1 405B 88.6%
By 2025, top models are nearly identical
MMLU no longer differentiates them
Evaluation Metrics You Can Compute in Python
Benchmarks compare models. Metrics measure specific quality dimensions on your own data. Here are the ones that matter most — and we’ll implement each one from scratch so you understand exactly what they compute.
I think building metrics yourself is the fastest way to develop intuition for what they actually capture — and more importantly, what they miss.
Perplexity — How Surprised Is the Model?
Perplexity measures how well a language model predicts text. Lower perplexity means the model finds the text less surprising — it assigns higher probability to the actual words.
The formula: perplexity = exp(-average log probability per token). A model with perplexity 10 is, roughly speaking, “choosing between 10 equally likely options” at each step.
def compute_perplexity(log_probs):
"""Compute perplexity from token log-probabilities.
Args:
log_probs: list of log probabilities (one per token)
Returns:
perplexity score (lower is better)
"""
avg_log_prob = np.mean(log_probs)
return np.exp(-avg_log_prob)
# Simulated: a good model assigns high probability to each token
good_model_logprobs = [-0.5, -0.3, -0.8, -0.2, -0.6, -0.4, -0.5]
bad_model_logprobs = [-2.5, -3.1, -1.8, -2.9, -2.2, -3.5, -2.7]
print(f"Good model perplexity: {compute_perplexity(good_model_logprobs):.2f}")
print(f"Bad model perplexity: {compute_perplexity(bad_model_logprobs):.2f}")
print(f"\nLower = better (model is less 'surprised' by the text)")
Good model perplexity: 1.60
Bad model perplexity: 14.46
Lower = better (model is less 'surprised' by the text)
The good model has perplexity 1.60 — it predicted the text with high confidence. The bad model’s perplexity of 14.46 means it was much more uncertain.
When to use perplexity: Comparing language models on the same text, measuring fine-tuning impact, detecting domain mismatch (perplexity spikes on unfamiliar text).
BLEU Score — Translation and Text Similarity
BLEU (Bilingual Evaluation Understudy) measures how much a generated text overlaps with a reference text. Originally designed for machine translation, it’s now used broadly for text generation tasks.
BLEU counts matching n-grams (sequences of n words) between the candidate and reference. A perfect match scores 1.0. No overlap scores 0.0.
Here’s BLEU computed from scratch. The core idea: count overlapping n-grams between candidate and reference, then combine them.
Two details make BLEU more nuanced. First, a brevity penalty punishes candidates shorter than the reference — you can’t game a high score by outputting only the words you’re confident about. Second, BLEU takes the geometric mean of n-gram precisions, so a zero at any n-gram level zeros out the whole score.
def compute_bleu(reference, candidate, max_n=4):
"""Compute BLEU score between reference and candidate texts."""
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()
# Brevity penalty: penalize candidates shorter than reference
bp = min(1.0, np.exp(1 - len(ref_tokens) / max(len(cand_tokens), 1)))
scores = []
for n in range(1, max_n + 1):
# Count n-grams in reference and candidate
ref_ngrams = Counter(
tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens)-n+1)
)
cand_ngrams = Counter(
tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens)-n+1)
)
# Count matches (clipped to reference counts)
matches = sum(
min(count, ref_ngrams[ng])
for ng, count in cand_ngrams.items()
)
total = max(sum(cand_ngrams.values()), 1)
scores.append(matches / total)
# Geometric mean: if ANY n-gram precision is 0, BLEU = 0
if any(s == 0 for s in scores):
return 0.0
log_avg = np.mean([np.log(s) for s in scores])
return bp * np.exp(log_avg)
Watch how the geometric mean makes BLEU strict — even partial overlap isn’t enough if higher-order n-grams don’t match.
reference = "the cat sat on the mat"
candidate1 = "the cat sat on the mat"
candidate2 = "a cat was sitting on a mat"
candidate3 = "the dog played in the park"
print(f"Reference: '{reference}'\n")
print(f"Candidate 1 (exact match): BLEU = {compute_bleu(reference, candidate1):.4f}")
print(f"Candidate 2 (paraphrase): BLEU = {compute_bleu(reference, candidate2):.4f}")
print(f"Candidate 3 (unrelated): BLEU = {compute_bleu(reference, candidate3):.4f}")
Reference: 'the cat sat on the mat'
Candidate 1 (exact match): BLEU = 1.0000
Candidate 2 (paraphrase): BLEU = 0.0000
Candidate 3 (unrelated): BLEU = 0.0000
The exact match gets 1.0. Both the paraphrase and unrelated text score 0 — they share too few higher-order n-grams with the reference.
You might be wondering: “Why does the paraphrase score zero?” Because while it shares some unigrams (“cat”, “on”, “mat”), it has zero matching bigrams or trigrams. The geometric mean kills it. This is BLEU’s biggest weakness — it’s purely lexical and can’t recognize synonyms or paraphrases.
ROUGE Score — Summarization Quality
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap from the reference side. While BLEU asks “how much of the candidate appears in the reference,” ROUGE asks “how much of the reference appears in the candidate.”
ROUGE-1 counts unigram overlap. ROUGE-L finds the longest common subsequence.
def rouge_1(reference, candidate):
"""Compute ROUGE-1 (unigram) F1 score."""
ref_tokens = set(reference.lower().split())
cand_tokens = set(candidate.lower().split())
overlap = ref_tokens & cand_tokens
if not overlap:
return {"precision": 0, "recall": 0, "f1": 0}
precision = len(overlap) / len(cand_tokens)
recall = len(overlap) / len(ref_tokens)
f1 = 2 * precision * recall / (precision + recall)
return {"precision": round(precision, 4),
"recall": round(recall, 4),
"f1": round(f1, 4)}
reference = "the president spoke about climate change and economic policy"
summary1 = "the president discussed climate change"
summary2 = "weather and money topics were covered by the leader"
r1 = rouge_1(reference, summary1)
r2 = rouge_1(reference, summary2)
print(f"Reference: '{reference}'\n")
print(f"Summary 1: '{summary1}'")
print(f" ROUGE-1: P={r1['precision']}, R={r1['recall']}, F1={r1['f1']}")
print(f"\nSummary 2: '{summary2}'")
print(f" ROUGE-1: P={r2['precision']}, R={r2['recall']}, F1={r2['f1']}")
Reference: 'the president spoke about climate change and economic policy'
Summary 1: 'the president discussed climate change'
ROUGE-1: P=0.8, R=0.4444, F1=0.5714
Summary 2: 'weather and money topics were covered by the leader'
ROUGE-1: P=0.2222, R=0.2222, F1=0.2222
Summary 1 captures key terms (“president”, “climate”, “change”) and scores well. Summary 2 paraphrases everything — only “the” and “and” overlap with the reference, giving it a low 0.22 F1. ROUGE can’t recognize that “leader” means “president” or “weather” relates to “climate.”
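ROUGE-L, mentioned above, replaces unigram overlap with the longest common subsequence, so it rewards matching words in the right order rather than as a bag. A sketch using the standard LCS dynamic program:

```python
def rouge_l(reference, candidate):
    """ROUGE-L F1 via longest common subsequence (order-sensitive)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # dp[i][j] = LCS length of ref[:i] and cand[:j]
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if r == c else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "the president spoke about climate change"
cand = "the president discussed climate change"
print(f"ROUGE-L F1: {rouge_l(ref, cand):.4f}")  # LCS = 4 words
```

Because the LCS skips over the mismatched verb but keeps word order, ROUGE-L distinguishes a reordered jumble from a fluent near-match in a way ROUGE-1 can't.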
BERTScore — Semantic Similarity Beyond Surface Overlap
BLEU and ROUGE both fail at the same thing: they can’t detect paraphrases. “The president spoke” and “the leader talked” share zero tokens, so they score zero — even though they mean the same thing.
BERTScore fixes this by comparing texts at the meaning level. Instead of matching exact words, it converts each token into a contextual embedding (using a pretrained model like BERT) and measures cosine similarity between the best-matching pairs.
# BERTScore concept: cosine similarity between token embeddings
# Full implementation requires a transformer model.
# Here's the core logic for understanding:
def cosine_similarity(vec_a, vec_b):
"""Compute cosine similarity between two vectors."""
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = sum(a ** 2 for a in vec_a) ** 0.5
norm_b = sum(b ** 2 for b in vec_b) ** 0.5
return dot / (norm_a * norm_b) if norm_a * norm_b > 0 else 0.0
# Simulated embeddings (in practice, from a BERT model)
# "president" and "leader" have similar embeddings
president_emb = [0.8, 0.3, 0.5, 0.1]
leader_emb = [0.75, 0.35, 0.45, 0.15]
cat_emb = [0.1, 0.9, 0.2, 0.7]
sim_related = cosine_similarity(president_emb, leader_emb)
sim_unrelated = cosine_similarity(president_emb, cat_emb)
print(f"'president' vs 'leader': cosine similarity = {sim_related:.4f}")
print(f"'president' vs 'cat': cosine similarity = {sim_unrelated:.4f}")
print(f"\nBERTScore catches synonyms that BLEU/ROUGE miss")
'president' vs 'leader': cosine similarity = 0.9956
'president' vs 'cat': cosine similarity = 0.4498
BERTScore catches synonyms that BLEU/ROUGE miss
In practice, you’d use the bert-score library: pip install bert-score. But understanding the cosine similarity core helps you interpret the scores — high BERTScore with low ROUGE means good paraphrasing, while low BERTScore with high ROUGE means surface copying without understanding.
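To make that interpretation concrete, here is a sketch of BERTScore's greedy-matching step: each reference token pairs with its most similar candidate token (giving recall), and each candidate token with its most similar reference token (giving precision). The embeddings below are made up for illustration; real BERTScore gets them from a transformer model.

```python
def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na * nb else 0.0

def bertscore_f1(cand_embs, ref_embs):
    """Greedy-matching BERTScore: each token pairs with its best match."""
    recall = sum(max(cosine(r, c) for c in cand_embs)
                 for r in ref_embs) / len(ref_embs)
    precision = sum(max(cosine(c, r) for r in ref_embs)
                    for c in cand_embs) / len(cand_embs)
    return 2 * precision * recall / (precision + recall)

# Toy embeddings (invented): "the president spoke" vs "the leader talked"
ref = [[1.0, 0.1, 0.1], [0.8, 0.3, 0.5], [0.2, 0.9, 0.3]]
cand = [[1.0, 0.1, 0.1], [0.75, 0.35, 0.45], [0.25, 0.85, 0.35]]
print(f"BERTScore F1 (toy): {bertscore_f1(cand, ref):.4f}")
```

Even though the surface tokens differ, the near-identical embeddings produce an F1 close to 1.0 — exactly the paraphrase case where BLEU and ROUGE return zero.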
Quick Comparison: Which Metric When?
metrics_comparison = {
"Perplexity": {
"measures": "Model confidence on text",
"best_for": "Model comparison, fine-tuning tracking",
"limitation": "Doesn't measure output quality",
},
"BLEU": {
"measures": "N-gram overlap (precision-focused)",
"best_for": "Translation, structured generation",
"limitation": "Misses paraphrases entirely",
},
"ROUGE": {
"measures": "N-gram overlap (recall-focused)",
"best_for": "Summarization, extraction tasks",
"limitation": "Misses paraphrases entirely",
},
"BERTScore": {
"measures": "Semantic similarity via embeddings",
"best_for": "Open-ended generation, QA",
"limitation": "Requires GPU, slower to compute",
},
"Pass@k": {
"measures": "Functional correctness of code",
"best_for": "Code generation tasks",
"limitation": "Only works for code with tests",
},
}
print(f"{'Metric':<12} {'Best For':<35} {'Key Limitation'}")
print("-" * 80)
for metric, info in metrics_comparison.items():
print(f"{metric:<12} {info['best_for'][:35]:<35} {info['limitation']}")
Metric Best For Key Limitation
--------------------------------------------------------------------------------
Perplexity Model comparison, fine-tuning track Doesn't measure output quality
BLEU Translation, structured generation Misses paraphrases entirely
ROUGE Summarization, extraction tasks Misses paraphrases entirely
BERTScore Open-ended generation, QA Requires GPU, slower to compute
Pass@k Code generation tasks Only works for code with tests
Exercise 1: Compute ROUGE-1 for a summarization task.
Given a reference summary and three candidate summaries, compute the ROUGE-1 F1 score for each. Which candidate is the best summary?
# Starter code
def rouge_1_f1(reference, candidate):
"""Compute ROUGE-1 F1 score."""
ref_tokens = set(reference.lower().split())
cand_tokens = set(candidate.lower().split())
# YOUR CODE: compute overlap, precision, recall, F1
pass
ref = "machine learning models require large datasets for training"
c1 = "ml models need big data to train"
c2 = "machine learning requires large datasets"
c3 = "deep neural networks process information"
# Compute and print F1 for each candidate
# YOUR CODE HERE
Hints:
1. Overlap is the intersection of the two token sets. Precision = overlap/candidate_size. Recall = overlap/reference_size.
2. F1 = 2 * precision * recall / (precision + recall). Handle the case where precision + recall == 0.
Click to reveal solution
def rouge_1_f1(reference, candidate):
ref_tokens = set(reference.lower().split())
cand_tokens = set(candidate.lower().split())
overlap = ref_tokens & cand_tokens
if not overlap:
return 0.0
precision = len(overlap) / len(cand_tokens)
recall = len(overlap) / len(ref_tokens)
return 2 * precision * recall / (precision + recall)
ref = "machine learning models require large datasets for training"
c1 = "ml models need big data to train"
c2 = "machine learning requires large datasets"
c3 = "deep neural networks process information"
print(f"Candidate 1: {rouge_1_f1(ref, c1):.4f}")
print(f"Candidate 2: {rouge_1_f1(ref, c2):.4f}")
print(f"Candidate 3: {rouge_1_f1(ref, c3):.4f}")
print(f"\nCandidate 2 is best -- it captures 4 key terms from the reference")
Candidate 1: 0.1333
Candidate 2: 0.6154
Candidate 3: 0.0000
Candidate 2 is best -- it captures 4 key terms from the reference
**Explanation:** Candidate 2 shares “machine”, “learning”, “large”, “datasets” with the reference — 4 out of 5 candidate words match, giving high precision (0.8) and decent recall (4 out of 8 reference words = 0.5). Candidate 1 uses synonyms (“ml”, “big data”) that ROUGE can’t detect — only “models” overlaps. Candidate 3 shares zero words.
LLM-as-Judge — Using AI to Evaluate AI
Here’s where things get interesting. Automated metrics have a fundamental problem: they can’t judge qualities like helpfulness, tone, or reasoning quality. These require understanding, not pattern matching.
The solution? Use a strong LLM to evaluate outputs from the model you’re testing. This is the “LLM-as-judge” pattern, and it’s become the standard approach for evaluating open-ended generation. I use it on nearly every project now.
The idea is simple. You give a judge model (typically GPT-4 or Claude) the prompt, the model’s response, and a scoring rubric. The judge returns a structured score.
# LLM-as-Judge prompt template
judge_prompt = """You are an expert evaluator. Score the following response
on a scale of 1-5 for each criterion.
**Prompt:** {prompt}
**Response:** {response}
**Criteria:**
1. Relevance (1-5): Does the response address the question?
2. Accuracy (1-5): Is the information factually correct?
3. Clarity (1-5): Is the response clear and well-organized?
4. Completeness (1-5): Does it cover the topic adequately?
Return ONLY a JSON object:
{{"relevance": X, "accuracy": X, "clarity": X, "completeness": X}}
"""
# Example usage (simulated -- would normally call an API)
example = judge_prompt.format(
prompt="What causes rain?",
response="Rain forms when water vapor in clouds condenses into "
"droplets heavy enough to fall. This happens when warm, "
"moist air rises and cools at higher altitudes."
)
print("=== Judge Prompt (first 200 chars) ===")
print(example[:200] + "...")
print("\n=== Simulated Judge Response ===")
scores = {"relevance": 5, "accuracy": 5, "clarity": 5, "completeness": 4}
print(scores)
print(f"Average: {np.mean(list(scores.values())):.1f}/5.0")
=== Judge Prompt (first 200 chars) ===
You are an expert evaluator. Score the following response
on a scale of 1-5 for each criterion.
**Prompt:** What causes rain?
**Response:** Rain forms when water vapor in clouds condenses into ...
=== Simulated Judge Response ===
{'relevance': 5, 'accuracy': 5, 'clarity': 5, 'completeness': 4}
Average: 4.8/5.0
Pairwise Comparison — A Better Judge Pattern
Single-score judging is noisy. The judge might give a 4 one time and a 3 the next for the same response. Pairwise comparison is more reliable: show two responses and ask “which is better?”
pairwise_prompt = """Compare these two responses to the same prompt.
Which one is better? Explain briefly, then state your choice.
**Prompt:** {prompt}
**Response A:** {response_a}
**Response B:** {response_b}
Your answer must end with exactly one of:
[[A]] or [[B]] or [[Tie]]
"""
prompt = "Explain what an API is to a non-technical person."
response_a = ("An API is like a waiter at a restaurant. "
"You tell the waiter what you want, the waiter "
"goes to the kitchen, and brings back your food. "
"You never interact with the kitchen directly.")
response_b = ("An API (Application Programming Interface) is a "
"set of protocols and tools for building software "
"applications. It specifies how software components "
"should interact.")
print("Response A uses an analogy (conversational)")
print("Response B uses technical definition (formal)")
print("\nFor a non-technical audience, Response A is better")
print("A judge model would likely output: [[A]]")
Response A uses an analogy (conversational)
Response B uses technical definition (formal)
For a non-technical audience, Response A is better
A judge model would likely output: [[A]]
Running LLM-as-Judge at Scale
Here’s a complete evaluation function that calls the OpenAI API. In production, you’d run this on hundreds of test cases.
# NOTE: Requires OpenAI API key. Shown for reference.
# Set OPENAI_API_KEY environment variable before running.
"""
import openai
import json
def llm_judge(prompt, response, model="gpt-4o"):
judge_msg = f'''Score this response (1-5) on:
relevance, accuracy, clarity, completeness.
Return JSON only.
Prompt: {prompt}
Response: {response}'''
result = openai.chat.completions.create(
model=model,
messages=[{"role": "user", "content": judge_msg}],
temperature=0,
)
return json.loads(result.choices[0].message.content)
# Example:
# scores = llm_judge("What is Python?", "Python is a programming language.")
# print(scores)
"""
print("LLM-as-Judge: use GPT-4o or Claude as evaluator")
print("Key parameters:")
print(" temperature=0 -> deterministic scoring")
print(" JSON output -> structured, parseable results")
print(" Run twice with swapped positions for pairwise")
LLM-as-Judge: use GPT-4o or Claude as evaluator
Key parameters:
temperature=0 -> deterministic scoring
JSON output -> structured, parseable results
Run twice with swapped positions for pairwise
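That last point matters because judge models often favor whichever response appears first (position bias). Here's a minimal sketch of combining the two swapped-position runs, assuming the [[A]]/[[B]]/[[Tie]] convention from the pairwise prompt above; counting only consistent wins is one common policy, not the only one.

```python
def combined_verdict(verdict_ab, verdict_ba):
    """Combine two pairwise judgments run with swapped positions.

    verdict_ab: judge output with response A shown first ('A', 'B', 'Tie')
    verdict_ba: judge output with response B shown first, still naming
                the ORIGINAL responses ('A', 'B', 'Tie')
    Counts a win only when both orderings agree; otherwise it's a tie.
    """
    if verdict_ab == verdict_ba and verdict_ab in ("A", "B"):
        return verdict_ab
    return "Tie"

print(combined_verdict("A", "A"))    # consistent -> A
print(combined_verdict("A", "B"))    # flipped with position -> Tie
print(combined_verdict("Tie", "A"))  # not consistent -> Tie
```

If a large share of your pairs flip when positions swap, that's a signal the judge is guessing — tighten the rubric or use a stronger judge model before trusting the win rates.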
Exercise 2: Design a judge prompt for customer support quality.
Write a judge prompt that evaluates customer support responses on four criteria relevant to that domain. Include the scoring rubric.
# Starter code
def create_support_judge_prompt(customer_query, agent_response):
"""Create a judge prompt for customer support evaluation.
Define 4 criteria specific to support quality.
Return the complete prompt string.
"""
# YOUR CODE: define criteria and build the prompt
# Think about what matters: empathy, accuracy, actionability...
prompt = ""
return prompt
# Test
query = "My order hasn't arrived and it's been 2 weeks"
response = "I apologize for the delay. Let me check your order status."
print(create_support_judge_prompt(query, response)[:300])
Hints:
1. Good criteria for support: empathy/tone, problem resolution, actionability (did they give next steps?), accuracy of information provided.
2. Use a 1-5 scale and define what each score means. “1 = dismissive, 5 = warm and understanding” for empathy.
Click to reveal solution
def create_support_judge_prompt(customer_query, agent_response):
return f"""Score this customer support response (1-5) on each criterion.
**Customer Query:** {customer_query}
**Agent Response:** {agent_response}
**Criteria:**
1. Empathy (1-5): Does the agent acknowledge the customer's frustration?
1=Dismissive, 3=Neutral, 5=Warm and understanding
2. Resolution (1-5): Does the response work toward solving the problem?
1=Ignores issue, 3=Partial, 5=Clear path to resolution
3. Actionability (1-5): Are concrete next steps provided?
1=Vague, 3=Some steps, 5=Clear actions with timeline
4. Accuracy (1-5): Is the information correct and relevant?
1=Wrong info, 3=Partially correct, 5=Fully accurate
Return JSON: {{"empathy": X, "resolution": X, "actionability": X, "accuracy": X}}"""
query = "My order hasn't arrived and it's been 2 weeks"
response = "I apologize for the delay. Let me check your order status."
print(create_support_judge_prompt(query, response)[:400])
Score this customer support response (1-5) on each criterion.
**Customer Query:** My order hasn't arrived and it's been 2 weeks
**Agent Response:** I apologize for the delay. Let me check your order status.
**Criteria:**
1. Empathy (1-5): Does the agent acknowledge the customer's frustration?
1=Dismissive, 3=Neutral, 5=Warm and understanding
2. Resolution (1-5): Does the response work toward sol
**Explanation:** The four criteria map to what actually matters in support: emotional tone (empathy), problem-solving (resolution), giving next steps (actionability), and correctness (accuracy). Each criterion has anchor descriptions so the judge scores consistently.
Human Evaluation — When Automated Metrics Aren’t Enough
Automated metrics and LLM judges are scalable. But they systematically miss certain failure modes. Human evaluation catches what machines can’t — and skipping it is a mistake I’ve watched teams regret.
When you need humans:
- Safety-critical applications (medical, legal, financial advice)
- Subjective quality (brand voice, creativity, humor)
- Edge cases that LLM judges handle poorly
- Calibrating your automated evaluation pipeline
Setting Up a Human Evaluation Protocol
A good human eval has five components.
human_eval_protocol = {
"1. Sample Selection": "Random sample of 50-200 outputs from production",
"2. Rubric Definition": "3-5 criteria with clear anchor descriptions",
"3. Annotator Training": "Show 10 examples with expected scores",
"4. Double Annotation": "2+ annotators per example, measure agreement",
"5. Agreement Metric": "Cohen's kappa > 0.6 means acceptable agreement",
}
print("=== Human Evaluation Protocol ===\n")
for step, desc in human_eval_protocol.items():
print(f"{step}")
print(f" {desc}\n")
=== Human Evaluation Protocol ===
1. Sample Selection
Random sample of 50-200 outputs from production
2. Rubric Definition
3-5 criteria with clear anchor descriptions
3. Annotator Training
Show 10 examples with expected scores
4. Double Annotation
2+ annotators per example, measure agreement
5. Agreement Metric
Cohen's kappa > 0.6 means acceptable agreement
The most critical step is the rubric. Vague criteria like “quality” lead to inconsistent scores. Each criterion needs anchor examples.
But how do you measure whether your annotators actually agree? Raw agreement percentage is misleading — if 90% of examples are “Good,” two random annotators will agree most of the time by luck. Cohen’s Kappa corrects for this by subtracting the expected chance agreement from the observed agreement.
def cohens_kappa(annotator1, annotator2):
"""Compute Cohen's Kappa for inter-annotator agreement."""
assert len(annotator1) == len(annotator2)
n = len(annotator1)
# Observed agreement
agree = sum(a == b for a, b in zip(annotator1, annotator2))
po = agree / n
# Expected agreement (by chance)
labels = set(annotator1) | set(annotator2)
pe = sum(
(annotator1.count(l) / n) * (annotator2.count(l) / n)
for l in labels
)
if pe == 1.0:
return 1.0
return (po - pe) / (1 - pe)
# Two annotators scored 20 responses as Good/Bad
ann1 = ["Good"]*12 + ["Bad"]*8
ann2 = ["Good"]*10 + ["Bad"]*2 + ["Good"]*2 + ["Bad"]*6
kappa = cohens_kappa(ann1, ann2)
agreement = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)
print(f"Raw agreement: {agreement:.0%}")
print(f"Cohen's Kappa: {kappa:.4f}")
print(f"\nInterpretation:")
print(f" < 0.2 = Poor")
print(f" 0.2-0.4 = Fair")
print(f" 0.4-0.6 = Moderate")
print(f" 0.6-0.8 = Substantial")
print(f" > 0.8 = Almost perfect")
Raw agreement: 80%
Cohen's Kappa: 0.5833
Interpretation:
< 0.2 = Poor
0.2-0.4 = Fair
0.4-0.6 = Moderate
0.6-0.8 = Substantial
> 0.8 = Almost perfect
The annotators agree 80% of the time, but Cohen’s Kappa is only 0.58 (moderate). That’s because some agreement is due to chance. If you see kappa below 0.6, your rubric is probably too vague — tighten the criteria.
Building Your Evaluation Pipeline
Now let’s put everything together. This three-stage pipeline is the structure I reach for on any LLM-powered product.
The Three-Stage Pipeline
pipeline_stages = {
"Stage 1: Offline Eval (before deploy)": {
"what": "Run benchmarks + task-specific tests",
"tools": "Test dataset, automated metrics, LLM judge",
"frequency": "Every model change or prompt update",
"example": "Score 200 test cases, compute ROUGE + judge scores",
},
"Stage 2: Shadow Eval (during rollout)": {
"what": "Run new model alongside old, compare outputs",
"tools": "A/B test framework, pairwise LLM judge",
"frequency": "During model rollout period",
"example": "Both models answer same queries, judge picks winner",
},
"Stage 3: Online Eval (in production)": {
"what": "Monitor live outputs, catch regressions",
"tools": "User feedback, automated checks, spot human eval",
"frequency": "Continuous",
"example": "Track thumbs up/down, flag low-confidence responses",
},
}
for stage, info in pipeline_stages.items():
print(f"\n{stage}")
print(f" What: {info['what']}")
print(f" Tools: {info['tools']}")
print(f" When: {info['frequency']}")
Stage 1: Offline Eval (before deploy)
What: Run benchmarks + task-specific tests
Tools: Test dataset, automated metrics, LLM judge
When: Every model change or prompt update
Stage 2: Shadow Eval (during rollout)
What: Run new model alongside old, compare outputs
Tools: A/B test framework, pairwise LLM judge
When: During model rollout period
Stage 3: Online Eval (in production)
What: Monitor live outputs, catch regressions
Tools: User feedback, automated checks, spot human eval
When: Continuous
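Stage 2’s pairwise comparison reduces to a simple win-rate computation once the judge has voted. Here’s a sketch assuming the judge returns "A" (old model wins), "B" (new model wins), or "tie" per query — that verdict format is an assumption for illustration, not a fixed API:

```python
from collections import Counter

def win_rate(verdicts):
    """Summarize pairwise judge verdicts from a shadow rollout.

    verdicts: list of "A" (old model), "B" (new model), or "tie".
    Returns the new model's win rate, counting ties as half a win.
    """
    counts = Counter(verdicts)
    return (counts["B"] + 0.5 * counts["tie"]) / len(verdicts)

# Example: the judge compared both models on 10 shadow queries
verdicts = ["B", "B", "tie", "A", "B", "B", "tie", "B", "A", "B"]
print(f"New model win rate: {win_rate(verdicts):.0%}")  # prints: New model win rate: 70%
```

A win rate meaningfully above 50% supports promoting the new model; near 50%, collect more verdicts before deciding.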
Building an Offline Evaluation Script
Here’s a complete evaluation harness you can adapt. It takes a test dataset, runs metrics, and produces a report.
def evaluate_model_outputs(test_cases, predictions):
"""Run a full offline evaluation on model outputs.
Args:
test_cases: list of dicts with 'prompt', 'reference'
predictions: list of model output strings
Returns:
dict with aggregate metrics
"""
assert len(test_cases) == len(predictions)
rouge_scores = []
length_ratios = []
for tc, pred in zip(test_cases, predictions):
ref = tc["reference"]
r1 = rouge_1(ref, pred)
rouge_scores.append(r1["f1"])
length_ratios.append(len(pred.split()) / max(len(ref.split()), 1))
return {
"n_samples": len(test_cases),
"avg_rouge1_f1": round(np.mean(rouge_scores), 4),
"min_rouge1_f1": round(np.min(rouge_scores), 4),
"max_rouge1_f1": round(np.max(rouge_scores), 4),
"avg_length_ratio": round(np.mean(length_ratios), 2),
}
# Demo: evaluate a "model" on 5 test cases
test_cases = [
{"prompt": "Summarize Python", "reference": "Python is a versatile programming language"},
{"prompt": "Summarize ML", "reference": "Machine learning finds patterns in data"},
{"prompt": "Summarize APIs", "reference": "APIs let applications communicate with each other"},
{"prompt": "Summarize cloud", "reference": "Cloud computing provides on-demand computing resources"},
{"prompt": "Summarize AI", "reference": "Artificial intelligence simulates human reasoning"},
]
predictions = [
"Python is a popular programming language for data science",
"ML uses algorithms to learn from data patterns",
"APIs enable software systems to talk to each other",
"Cloud services offer scalable computing on demand",
"AI mimics human intelligence and decision making",
]
report = evaluate_model_outputs(test_cases, predictions)
print("=== Evaluation Report ===")
for key, val in report.items():
print(f" {key}: {val}")
=== Evaluation Report ===
n_samples: 5
avg_rouge1_f1: 0.4038
min_rouge1_f1: 0.2857
max_rouge1_f1: 0.6667
avg_length_ratio: 1.34
The average ROUGE-1 F1 of 0.40 isn’t great — but that’s expected since the predictions paraphrase rather than copy the reference text. In practice, you’d combine this with an LLM judge for a fuller picture.
Exercise 3: Add length and vocabulary diversity checks to the evaluation pipeline.
Extend the evaluate_model_outputs function to also compute average response length (in words) and vocabulary diversity (unique words / total words).
# Starter code
def evaluate_extended(test_cases, predictions):
"""Extended evaluation with length and diversity metrics."""
rouge_scores = []
lengths = []
diversities = []
for tc, pred in zip(test_cases, predictions):
# ROUGE-1 (already implemented above)
r1 = rouge_1(tc["reference"], pred)
rouge_scores.append(r1["f1"])
# YOUR CODE: compute word count for this prediction
# YOUR CODE: compute vocabulary diversity (unique / total)
return {
"avg_rouge1_f1": round(np.mean(rouge_scores), 4),
# YOUR CODE: add avg_length and avg_diversity
}
result = evaluate_extended(test_cases, predictions)
print(result)
Hints:
1. Word count: len(pred.split()). Vocabulary diversity: len(set(pred.lower().split())) / len(pred.lower().split()).
2. A diversity score near 1.0 means every word is unique (good). Near 0.5 means lots of repetition (may indicate degenerate output).
Click to reveal solution
def evaluate_extended(test_cases, predictions):
rouge_scores, lengths, diversities = [], [], []
for tc, pred in zip(test_cases, predictions):
r1 = rouge_1(tc["reference"], pred)
rouge_scores.append(r1["f1"])
words = pred.lower().split()
lengths.append(len(words))
diversities.append(len(set(words)) / max(len(words), 1))
return {
"avg_rouge1_f1": round(np.mean(rouge_scores), 4),
"avg_length": round(np.mean(lengths), 1),
"avg_diversity": round(np.mean(diversities), 4),
}
result = evaluate_extended(test_cases, predictions)
print("=== Extended Report ===")
for k, v in result.items():
print(f" {k}: {v}")
=== Extended Report ===
avg_rouge1_f1: 0.4038
avg_length: 8.0
avg_diversity: 0.9778
Explanation: Average length of 8 words per response — these are short summaries. Diversity of 0.98 means nearly every word is unique (one prediction repeats “to”). If diversity dropped below 0.7, it could signal degenerate or repetitive outputs.
Evaluating RAG Systems
RAG (Retrieval-Augmented Generation) adds retrieval to the evaluation challenge. You’re no longer just evaluating the LLM — you’re evaluating the retriever, the context, AND the generation. This is where I see most teams struggle, because there are more moving parts to test.
Three metrics matter for RAG:
Faithfulness. Does the response only contain information from the retrieved context? An unfaithful response hallucinates facts not in the source.
Answer relevance. Does the response actually answer the question? A response can be faithful (everything it says is in the context) but irrelevant (it doesn’t address the query).
Context relevance. Did the retriever pull the right documents? Irrelevant context leads to irrelevant answers.
rag_metrics = {
"Faithfulness": {
"question": "Is every claim in the response supported by context?",
"fail_example": "Response says 'founded in 2020' but context says 2018",
"check_with": "LLM judge: extract claims, verify each against context",
},
"Answer Relevance": {
"question": "Does the response address what was asked?",
"fail_example": "Asked about pricing, response discusses features",
"check_with": "LLM judge: does the response answer the query?",
},
"Context Relevance": {
"question": "Did the retriever find the right documents?",
"fail_example": "Query about Python, retrieved Java documentation",
"check_with": "Embedding similarity or LLM judge",
},
}
print("=== RAG Evaluation Metrics ===\n")
for metric, info in rag_metrics.items():
print(f"{metric}")
print(f" Question: {info['question']}")
print(f" Fail case: {info['fail_example']}")
print(f" Method: {info['check_with']}\n")
=== RAG Evaluation Metrics ===
Faithfulness
Question: Is every claim in the response supported by context?
Fail case: Response says 'founded in 2020' but context says 2018
Method: LLM judge: extract claims, verify each against context
Answer Relevance
Question: Does the response address what was asked?
Fail case: Asked about pricing, response discusses features
Method: LLM judge: does the response answer the query?
Context Relevance
Question: Did the retriever find the right documents?
Fail case: Query about Python, retrieved Java documentation
Method: Embedding similarity or LLM judge
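A production faithfulness check uses an LLM judge to verify each claim against the context, but a cheap first-pass heuristic is word overlap: flag any response sentence whose content words barely appear in the retrieved context. This is a rough screen for hallucination review, not a substitute for claim-level verification — the threshold and the “words longer than 3 characters” filter are arbitrary choices for illustration:

```python
def unsupported_sentences(response, context, threshold=0.5):
    """Flag response sentences with low word overlap against the context.

    Crude faithfulness screen: a sentence whose content words mostly
    don't appear in the context is a candidate for hallucination review.
    """
    context_words = set(context.lower().split())
    flagged = []
    for sentence in response.split(". "):
        # Keep only longer words as a rough proxy for "content words"
        words = [w for w in sentence.lower().rstrip(".").split() if len(w) > 3]
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < threshold:
            flagged.append((sentence, round(support, 2)))
    return flagged

context = "Acme Corp was founded in 2018 and sells developer tools."
response = "Acme Corp was founded in 2018. The company has 500 employees."
for sentence, support in unsupported_sentences(response, context):
    print(f"FLAG ({support:.0%} supported): {sentence}")
```

The second sentence gets flagged: nothing in the context mentions employee count, which is exactly the kind of invented detail a faithfulness check exists to catch.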
Common Evaluation Mistakes
I keep seeing the same mistakes across projects. Each one silently undermines your evaluation results — and you won’t realize it until production breaks.
Mistake 1: Evaluating on training data. If the model saw the test cases during training, your scores are inflated. Always use a held-out test set that was never part of training or fine-tuning.
Mistake 2: Using a single metric. ROUGE alone misses semantic similarity. Perplexity alone misses factual errors. Use 2-3 complementary metrics.
Mistake 3: Ignoring edge cases. Average scores hide failures. A model with 90% average accuracy might fail 100% on a critical subcategory. Break down scores by category.
Mistake 4: Not versioning your evaluation. When you change your test set or rubric, old scores become incomparable. Version everything: test data, prompts, scoring criteria, model checkpoints.
Here’s the full list with fixes — I keep this pinned on every project.
mistakes = [
("Evaluating on training data", "Use held-out test set", "Critical"),
("Single metric reliance", "Use 2-3 complementary metrics", "High"),
("Ignoring edge cases", "Break down by category", "High"),
("No evaluation versioning", "Version test sets and rubrics", "Medium"),
("Too few test cases", "Minimum 50, ideally 200+", "Medium"),
("Not calibrating LLM judges", "Human eval on 10% sample", "Medium"),
]
print(f"{'Mistake':<30} {'Fix':<35} {'Severity'}")
print("-" * 75)
for mistake, fix, sev in mistakes:
print(f"{mistake:<30} {fix:<35} {sev}")
Mistake Fix Severity
---------------------------------------------------------------------------
Evaluating on training data Use held-out test set Critical
Single metric reliance Use 2-3 complementary metrics High
Ignoring edge cases Break down by category High
No evaluation versioning Version test sets and rubrics Medium
Too few test cases Minimum 50, ideally 200+ Medium
Not calibrating LLM judges Human eval on 10% sample Medium
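Mistake 3 is worth a concrete demonstration. In the sketch below, the overall average looks acceptable while one category fails badly — the category names and scores are invented for illustration:

```python
import numpy as np

# Hypothetical per-example scores tagged with a query category
results = [
    ("billing", 0.2), ("billing", 0.1), ("billing", 0.3),
    ("features", 0.9), ("features", 0.95), ("features", 0.85),
    ("setup", 0.8), ("setup", 0.9), ("setup", 0.85),
]

overall = np.mean([score for _, score in results])
print(f"Overall: {overall:.2f}")  # prints: Overall: 0.65

# Group scores by category to expose hidden failures
by_category = {}
for category, score in results:
    by_category.setdefault(category, []).append(score)
for category, scores in sorted(by_category.items()):
    print(f"  {category}: {np.mean(scores):.2f}")
# prints:
#   billing: 0.20
#   features: 0.90
#   setup: 0.85
```

An overall 0.65 hides that billing queries are almost always wrong. If billing is your highest-stakes category, the aggregate number is actively misleading.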
Choosing the Right Evaluation Strategy
Not sure where to start? Different tasks need different evaluation approaches. Here’s the decision framework I use.
print("=== Evaluation Strategy Decision Tree ===\n")
decisions = [
("What type of task?",
"Classification/QA -> Accuracy, F1, Exact Match",
"Generation/Summary -> ROUGE + LLM Judge",
"Code generation -> Pass@k (unit tests)"),
("How critical is safety?",
"Low risk -> Automated metrics + spot checks",
"High risk -> Full human evaluation pipeline"),
("How much budget?",
"Low -> ROUGE/BLEU + free LLM judge (Llama)",
"Medium -> GPT-4 judge + 10% human eval",
"High -> Full human eval + LLM judge + metrics"),
("New model or prompt change?",
"New model -> Full benchmark suite + task eval",
"Prompt tweak -> Task-specific eval only"),
]
for i, parts in enumerate(decisions, 1):
question = parts[0]
answers = parts[1:]
print(f"{i}. {question}")
for a in answers:
print(f" {a}")
print()
=== Evaluation Strategy Decision Tree ===
1. What type of task?
Classification/QA -> Accuracy, F1, Exact Match
Generation/Summary -> ROUGE + LLM Judge
Code generation -> Pass@k (unit tests)
2. How critical is safety?
Low risk -> Automated metrics + spot checks
High risk -> Full human evaluation pipeline
3. How much budget?
Low -> ROUGE/BLEU + free LLM judge (Llama)
Medium -> GPT-4 judge + 10% human eval
High -> Full human eval + LLM judge + metrics
4. New model or prompt change?
New model -> Full benchmark suite + task eval
Prompt tweak -> Task-specific eval only
Evaluation Frameworks Worth Knowing
Once you understand how metrics work under the hood, you’ll probably want a framework that handles the boilerplate. Here are the three I’d recommend looking at.
DeepEval is the most comprehensive open-source option. It gives you 14+ built-in metrics (hallucination, toxicity, answer relevance, faithfulness) and integrates with pytest for test-driven evaluation.
# DeepEval quickstart (pip install deepeval)
# from deepeval import evaluate
# from deepeval.metrics import AnswerRelevancyMetric
# from deepeval.test_case import LLMTestCase
# metric = AnswerRelevancyMetric(threshold=0.7)
# test_case = LLMTestCase(
# input="What is Python?",
# actual_output="Python is a programming language.",
# )
# metric.measure(test_case)
# print(f"Score: {metric.score}, Passed: {metric.is_successful()}")
print("DeepEval: pip install deepeval")
print(" 14+ metrics out of the box")
print(" pytest integration for CI/CD")
print(" Best for: comprehensive evaluation pipelines")
DeepEval: pip install deepeval
14+ metrics out of the box
pytest integration for CI/CD
Best for: comprehensive evaluation pipelines
RAGAS specializes in RAG evaluation. If you’re building retrieval-augmented systems, RAGAS gives you faithfulness, answer relevance, and context precision metrics designed specifically for that architecture.
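RAGAS follows a dataset-in, scores-out pattern. The sketch below mirrors the general shape of its API, but RAGAS’s imports and signatures have changed across versions, so treat the commented code as orientation and check the current docs before relying on it:

```python
# RAGAS quickstart sketch (pip install ragas) -- the API varies by
# version; verify these imports against the current documentation.
# from ragas import evaluate
# from ragas.metrics import faithfulness, answer_relevancy
# from datasets import Dataset
# data = Dataset.from_dict({
#     "question": ["What is Python?"],
#     "answer": ["Python is a programming language."],
#     "contexts": [["Python is a high-level programming language."]],
# })
# result = evaluate(data, metrics=[faithfulness, answer_relevancy])
# print(result)
lines = [
    "RAGAS: pip install ragas",
    "  Faithfulness, answer relevancy, context precision",
    "  Best for: RAG-specific evaluation pipelines",
]
for line in lines:
    print(line)
```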
Promptfoo takes a different approach — it’s a CLI tool for A/B testing prompts across multiple models. Think of it as unit testing for your prompts.
frameworks = {
"DeepEval": {
"focus": "General LLM evaluation",
"metrics": "14+ (hallucination, toxicity, relevancy...)",
"install": "pip install deepeval",
},
"RAGAS": {
"focus": "RAG pipeline evaluation",
"metrics": "Faithfulness, context precision, answer relevancy",
"install": "pip install ragas",
},
"Promptfoo": {
"focus": "Prompt A/B testing across models",
"metrics": "Custom assertions + LLM graders",
"install": "npx promptfoo@latest init",
},
}
print(f"{'Framework':<12} {'Focus':<30} {'Install'}")
print("-" * 70)
for name, info in frameworks.items():
print(f"{name:<12} {info['focus']:<30} {info['install']}")
Framework Focus Install
----------------------------------------------------------------------
DeepEval General LLM evaluation pip install deepeval
RAGAS RAG pipeline evaluation pip install ragas
Promptfoo Prompt A/B testing across models npx promptfoo@latest init
Summary
LLM evaluation has three layers: benchmarks for model selection, metrics for quality measurement, and judgment (human or LLM) for nuanced assessment.
Here’s what to remember:
- Benchmarks (MMLU, HumanEval, etc.) compare models but don’t predict performance on your task
- Metrics (BLEU, ROUGE, perplexity, BERTScore) quantify specific quality dimensions — always use 2-3, not just one
- LLM-as-judge scales evaluation for open-ended tasks — calibrate with human scores
- Human evaluation remains the gold standard for safety-critical and subjective tasks
- RAG evaluation requires testing retriever and generator separately
- Frameworks (DeepEval, RAGAS) handle boilerplate once you understand the underlying metrics
- Version everything — test sets, rubrics, and model checkpoints
print("=" * 55)
print(" LLM Evaluation: The Complete Toolkit")
print("=" * 55)
print()
print(" Layer 1: Benchmarks -> Model selection")
print(" Layer 2: Metrics -> Quality measurement")
print(" Layer 3: Judgment -> Nuanced assessment")
print()
print(" Start with: ROUGE + LLM judge + 50 test cases")
print(" Scale to: Category breakdowns + human eval")
print("=" * 55)
=======================================================
LLM Evaluation: The Complete Toolkit
=======================================================
Layer 1: Benchmarks -> Model selection
Layer 2: Metrics -> Quality measurement
Layer 3: Judgment -> Nuanced assessment
Start with: ROUGE + LLM judge + 50 test cases
Scale to: Category breakdowns + human eval
=======================================================
Click to expand the full script (copy-paste and run)
# Complete code from: How to Evaluate LLMs
# Requires: pip install numpy
# Python 3.9+
import numpy as np
from collections import Counter
# --- Perplexity ---
def compute_perplexity(log_probs):
return np.exp(-np.mean(log_probs))
# --- BLEU ---
def compute_bleu(reference, candidate, max_n=4):
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()
bp = min(1.0, np.exp(1 - len(ref_tokens) / max(len(cand_tokens), 1)))
scores = []
for n in range(1, max_n + 1):
ref_ng = Counter(tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens)-n+1))
cand_ng = Counter(tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens)-n+1))
matches = sum(min(c, ref_ng[ng]) for ng, c in cand_ng.items())
total = max(sum(cand_ng.values()), 1)
scores.append(matches / total)
if any(s == 0 for s in scores):
return 0.0
return bp * np.exp(np.mean([np.log(s) for s in scores]))
# --- ROUGE-1 ---
def rouge_1(reference, candidate):
ref_t = set(reference.lower().split())
cand_t = set(candidate.lower().split())
overlap = ref_t & cand_t
if not overlap:
return {"precision": 0, "recall": 0, "f1": 0}
p = len(overlap) / len(cand_t)
r = len(overlap) / len(ref_t)
return {"precision": round(p, 4), "recall": round(r, 4),
"f1": round(2*p*r/(p+r), 4)}
# --- Cosine Similarity (BERTScore core) ---
def cosine_similarity(vec_a, vec_b):
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = sum(a ** 2 for a in vec_a) ** 0.5
norm_b = sum(b ** 2 for b in vec_b) ** 0.5
return dot / (norm_a * norm_b) if norm_a * norm_b > 0 else 0.0
# --- Cohen's Kappa ---
def cohens_kappa(a1, a2):
n = len(a1)
po = sum(x == y for x, y in zip(a1, a2)) / n
labels = set(a1) | set(a2)
pe = sum((a1.count(l)/n) * (a2.count(l)/n) for l in labels)
return (po - pe) / (1 - pe) if pe != 1 else 1.0
# --- Evaluation Harness ---
def evaluate_model_outputs(test_cases, predictions):
scores = []
for tc, pred in zip(test_cases, predictions):
r1 = rouge_1(tc["reference"], pred)
scores.append(r1["f1"])
return {"n": len(test_cases), "avg_rouge1": round(np.mean(scores), 4)}
print("All evaluation functions loaded. Script completed successfully.")
FAQ
Q: How many test cases do I need for reliable evaluation?
Minimum 50 for basic signals. 200+ for statistically significant comparisons between models. For category-level breakdowns, you need 20-30 per category.
Q: Can I use a weaker model as an LLM judge?
You can, but quality drops. GPT-4 and Claude 3.5 are the standard judges. Open models like Llama 3.1 70B work for simple rubrics but struggle with nuanced quality assessment.
Q: How often should I update my evaluation test set?
Quarterly at minimum. More often if your product scope changes. The test set should always reflect current user queries and edge cases.
Q: Is BLEU still relevant in 2026?
For translation and structured generation, yes. For open-ended generation, prefer ROUGE + BERTScore or an LLM judge. BLEU’s inability to handle paraphrasing makes it insufficient as a sole metric.
Q: What’s the fastest way to set up evaluation from scratch?
Start with 50 representative test cases from your actual use case. Compute ROUGE-1 and run an LLM-as-judge with a simple rubric. That gives you two complementary signals in under a day.
Q: Should I use an evaluation framework like DeepEval or build my own?
Both. Understand the metrics by building them from scratch first (as this guide teaches). Then use a framework like DeepEval or RAGAS for production pipelines — they handle scaling, CI/CD integration, and metric aggregation that you don’t want to maintain yourself.
Related Topics
- Prompt Engineering — optimizing prompts before evaluating their impact
- Fine-Tuning LLMs — when to fine-tune vs prompt and how to measure improvement
- RAG (Retrieval-Augmented Generation) — building and evaluating retrieval pipelines
- LLM Deployment — serving models in production with monitoring
- AI Safety and Alignment — evaluating model behavior for safety
- DPO (Direct Preference Optimization) — training models on human preferences
- Vector Databases — the retrieval layer that feeds RAG systems
References
- Hendrycks, D. et al. (2021). “Measuring Massive Multitask Language Understanding.” ICLR 2021. arXiv:2009.03300
- Chen, M. et al. (2021). “Evaluating Large Language Models Trained on Code.” arXiv:2107.03374. (HumanEval)
- Papineni, K. et al. (2002). “BLEU: a Method for Automatic Evaluation of Machine Translation.” ACL 2002.
- Lin, C-Y. (2004). “ROUGE: A Package for Automatic Evaluation of Summaries.” ACL 2004 Workshop.
- Zheng, L. et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. arXiv:2306.05685
- Zhang, T. et al. (2020). “BERTScore: Evaluating Text Generation with BERT.” ICLR 2020. arXiv:1904.09675
- Liang, P. et al. (2023). “Holistic Evaluation of Language Models.” TMLR 2023. (HELM)
- Zellers, R. et al. (2019). “HellaSwag: Can a Machine Really Finish Your Sentence?” ACL 2019.
- Lin, S. et al. (2022). “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” ACL 2022.
- Evidently AI. “30 LLM Evaluation Benchmarks and How They Work.”