
How to Evaluate an LLM: Benchmarks, Metrics, and Practical Workflows

Written by Selva Prabhakaran | 25 min read

You spent a week fine-tuning a model. Loss converged. The adapter saved cleanly. And now you’re staring at a model you have no idea how to measure. That’s the evaluation problem — and it’s the part most teams skip, usually until something goes wrong in production.

Evaluating an LLM is fundamentally different from evaluating a classifier. There’s no single accuracy score. Good responses depend on intent, style, and context. A response can be grammatically perfect and completely wrong. This article gives you a practical framework: four distinct evaluation approaches, runnable code for each, and clear guidance on which method fits your task.

Here’s how the evaluation landscape maps out. Automated benchmarks (MMLU, HellaSwag, ARC) measure general capability using multiple-choice questions — fast, reproducible, but disconnected from your specific use case. LLM-as-judge uses a frontier model like GPT-4 to grade open-ended responses — flexible and task-relevant, but costs money and inherits GPT-4’s biases. Task-specific metrics (ROUGE, BLEU, exact match) measure output quality against reference answers — reliable for structured tasks, useless for open-ended ones. Human evaluation is ground truth — but slow and expensive. We’ll cover the first three with code, and show you when human eval is worth the cost.


Why Evaluation Is the Part Most Teams Skip

I’ve watched teams spend weeks optimising training pipelines and then declare their model “done” because the loss converged. Loss convergence tells you the model learned something. It doesn’t tell you if it learned the right thing. A model can memorise your training set, produce perfect training loss, and fail completely on real prompts.

The stakes are higher with fine-tuned models. You’re not just asking “is this model good?” — you’re asking “is this model better than the base model for my specific task?” That requires a baseline comparison. Without one, you have no way to know if your fine-tuning helped, hurt, or did nothing.

Three things make LLM evaluation hard:

  1. Open-endedness. For a given prompt, many different responses are correct. “What is machine learning?” has thousands of valid answers. Accuracy metrics that require exact matches are nearly useless here.

  2. Task specificity. A model that scores 70% on MMLU might be terrible at your customer support task. General benchmarks measure general capability, not task performance.

  3. Evaluation-training contamination. Standard benchmarks like MMLU and HellaSwag have been in the training data of most large models. High scores might reflect memorisation, not capability.

None of this means evaluation is impossible. It means you need the right tool for your task.


The 4 Types of LLM Evaluation

Before picking a metric, pick the right category.

| Type | Best for | Speed | Cost | Task-relevance |
|---|---|---|---|---|
| Automated benchmarks | General capability, baseline comparison | Fast | Free | Low |
| LLM-as-judge | Open-ended quality, instruction following | Medium | $$$ | High |
| Task-specific metrics | Structured tasks (summarisation, extraction) | Fast | Free | Medium |
| Human evaluation | Final validation, high-stakes decisions | Slow | $$$$ | Highest |

A quick decision rule: Start with automated benchmarks to catch regressions. Add task-specific metrics if your task is structured. Use LLM-as-judge for instruction-following quality. Reserve human eval for final validation before production.

Most fine-tuning evaluations need at least two types: one automated benchmark to catch capability regressions, and one task-specific or LLM-as-judge evaluation to measure actual task improvement.
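The decision rule above can be sketched as a small routing helper. This is a hypothetical illustration, not part of any library — the function name and flags are made up for this article:

```python
def pick_eval_methods(structured_output: bool, open_ended: bool,
                      high_stakes: bool) -> list[str]:
    """Map task properties to evaluation types, following the decision rule above."""
    methods = ["automated_benchmark"]            # always: catches capability regressions
    if structured_output:
        methods.append("task_specific_metric")   # ROUGE / BLEU / exact match
    if open_ended:
        methods.append("llm_as_judge")           # rubric-graded open-ended quality
    if high_stakes:
        methods.append("human_eval")             # final validation before production
    return methods

print(pick_eval_methods(structured_output=True, open_ended=False, high_stakes=False))
# → ['automated_benchmark', 'task_specific_metric']
```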


Automated Benchmarks: MMLU, HellaSwag, ARC, TruthfulQA

Automated benchmarks use multiple-choice questions to measure model capability without human involvement. They’re fast, reproducible, and the standard for comparing models publicly.

Here are the four benchmarks most useful for fine-tuned LLM evaluation:

MMLU (Massive Multitask Language Understanding) — 57 subjects from elementary math to law, 15,000+ questions. Measures general knowledge breadth. Use this to verify your fine-tuned model hasn’t degraded on general capability. A drop of more than roughly 2–3 points after fine-tuning is a warning sign; the exact threshold varies with how much training data you used.

HellaSwag — Sentence completion tasks requiring commonsense reasoning. Multiple-choice with four options, one correct. Strong baseline test for reasoning.

ARC (AI2 Reasoning Challenge) — Grade-school science questions in two tiers: ARC-Easy and ARC-Challenge. The Challenge set filters out questions that simple retrieval or word-matching can answer — it tests actual reasoning.

TruthfulQA — 817 questions designed to elicit false but plausible-sounding answers. Measures whether the model generates falsehoods. Particularly important if you’re fine-tuning for factual tasks.

If you’ve seen frontier model leaderboards, you’ve probably noticed MMLU scores clustering above 88%. So why run it at all? Because your goal isn’t leaderboard comparison — it’s regression detection. You’re asking one question: did fine-tuning break something? For that, MMLU is still the right tool, even when frontier scores have saturated it.

I’ve stopped paying attention to MMLU leaderboard numbers for the same reason I stopped trusting loss curves alone — both measure something real, but neither tells you what you actually care about for your specific task.

Note: **Benchmark contamination caveat:** MMLU, HellaSwag, and ARC are likely in the training data of most public models. High scores reflect memorisation as much as capability. These benchmarks are most useful for measuring *regression* after fine-tuning — not for comparing to published results. Newer benchmarks like GPQA (graduate-level science questions) and ARC-AGI are harder to contaminate and are increasingly used for frontier model comparisons.

One metric not in the table above is perplexity — the model’s average uncertainty over a held-out text corpus. Lower perplexity means the model assigns higher probability to the test text, which correlates roughly with general language quality. It’s useful for comparing models on a fixed corpus without any task structure. Its weakness: it has no intuitive ceiling and can’t tell you whether the model is actually helpful.
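Perplexity is just the exponential of the average negative log-likelihood per token. A minimal sketch with made-up token probabilities (in practice you would take these from the model’s logits over a held-out corpus):

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability over a token sequence."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns probability 0.25 to every token has perplexity 4 —
# it is, on average, as uncertain as a uniform choice over 4 tokens:
print(f"{perplexity([0.25, 0.25, 0.25, 0.25]):.2f}")  # → 4.00
```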


Running Benchmarks with lm-evaluation-harness

The EleutherAI lm-evaluation-harness is the standard tool for running automated benchmarks. It powers the Hugging Face Open LLM Leaderboard. Install it and point it at any Hugging Face model.

Note: **Tested versions:** lm-eval 0.4+, Python 3.9+. The `--model_args` syntax and JSON output format below match lm-eval v0.4. Older versions use slightly different argument names.

Install the harness:

bash
pip install lm-eval

The most important pattern: run the benchmark on both your base model and your fine-tuned model, then compare. A single score in isolation is meaningless — the delta is what matters.

Run MMLU on the base model first to establish your baseline:

bash
lm_eval \
  --model hf \
  --model_args pretrained=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size 8 \
  --output_path ./eval-results/base-model
text
# Results summary (example — actual results depend on model and hardware)
|  Task  |Version|Filter|n-shot|Metric|Value|   |Stderr|
|--------|------:|------|-----:|------|----:|---|-----:|
|mmlu    |      1|none  |     5|acc   | 0.52|±  |0.0035|

Now run the same benchmark on your fine-tuned model. If you saved a merged model, point directly at the directory. If you saved only the adapter, load it with peft:

bash
# Option 1: Merged model
lm_eval \
  --model hf \
  --model_args pretrained=./merged-model \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size 8 \
  --output_path ./eval-results/fine-tuned-model

# Option 2: Adapter (PEFT) — lm-eval loads the adapter separately from the base
lm_eval \
  --model hf \
  --model_args pretrained=TinyLlama/TinyLlama-1.1B-Chat-v1.0,peft=./lora-adapter \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size 8 \
  --output_path ./eval-results/fine-tuned-model

Parse both result files and compute the delta. lm-eval v0.4 saves results as JSON with the score at results[task]["acc,none"]:

python
import json
from pathlib import Path

def load_result(result_dir: str, task: str = "mmlu") -> float:
    """Load accuracy score from an lm-eval v0.4 output directory."""
    result_files = list(Path(result_dir).glob("results_*.json"))
    if not result_files:
        raise FileNotFoundError(f"No result files in {result_dir}")
    with open(result_files[0]) as f:
        data = json.load(f)
    # lm-eval v0.4 format: results[task]["acc,none"]
    return data["results"][task]["acc,none"]

base_score = load_result("./eval-results/base-model")
ft_score   = load_result("./eval-results/fine-tuned-model")

print(f"Base model MMLU:       {base_score:.4f}")
print(f"Fine-tuned model MMLU: {ft_score:.4f}")
print(f"Delta:                 {ft_score - base_score:+.4f}")

if ft_score < base_score - 0.03:
    print("WARNING: Fine-tuning degraded general capability by >3 points.")
elif ft_score < base_score:
    print("Minor degradation — acceptable for task-specific fine-tuning.")
else:
    print("No regression detected.")

Tip: **Run multiple tasks in one command:** Add `--tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2` to evaluate all four benchmarks in a single pass. This typically takes 10–30 minutes on a consumer GPU for a 1B model.

Exercise: Evaluate a Model on ARC-Challenge

Write a bash command that evaluates TinyLlama/TinyLlama-1.1B-Chat-v1.0 on arc_challenge using 5-shot prompting, outputs results to ./eval-results/arc/, and uses batch size 4.

bash
# Starter: complete the missing arguments
lm_eval \
  --model hf \
  --model_args pretrained=___ \
  --tasks ___ \
  --num_fewshot ___ \
  --batch_size ___ \
  --output_path ___

Hint 1: The task name is arc_challenge and the model path is the full HuggingFace repo ID.
Hint 2: For 5-shot prompting, use --num_fewshot 5. Batch size 4 and the output path fill the remaining blanks.

Solution
bash
lm_eval \
  --model hf \
  --model_args pretrained=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --tasks arc_challenge \
  --num_fewshot 5 \
  --batch_size 4 \
  --output_path ./eval-results/arc

The `--num_fewshot 5` argument is important — few-shot prompting gives the model examples of the task format before each question, which stabilises answers and produces more reliable scores.


MT-Bench: Evaluating Instruction-Following Quality

MMLU measures knowledge. It doesn’t measure whether your model follows instructions well — which is what you actually care about after instruction fine-tuning. MT-Bench fills that gap.

MT-Bench is an 80-question benchmark covering 8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Questions are multi-turn — each question has a follow-up that requires the model to remember context. A frontier model (typically GPT-4) grades each response on a 1–10 scale.

The multi-turn design is what makes MT-Bench worth the setup effort. In my experience, it’s the multi-turn questions that expose instruction-following failures that single-turn benchmarks never surface — a model that answers question one brilliantly but ignores the follow-up constraint is a model that will frustrate real users.

Note: **Prerequisites for MT-Bench:** You’ll need (1) Python 3.9+, (2) an OpenAI API key with GPT-4 access (roughly $3–8 per evaluation run in 2025 pricing), and (3) Git to clone the FastChat repository. If you don’t have API access, skip to the LLM-as-judge section — you can build a custom evaluator with any frontier model.

The original MT-Bench implementation lives in the FastChat repository. Here’s a minimal setup:

bash
pip install fschat
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"

Generate answers from your model. This runs all 80 questions and saves responses to a JSONL file:

bash
python -m fastchat.llm_judge.gen_model_answer \
  --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --model-id tinyllama-base \
  --bench-name mt_bench \
  --answer-file ./mt-bench-answers/tinyllama-base.jsonl

Now run GPT-4 judgment on the answers. This command reads each JSONL response file, batches the questions to GPT-4, and saves judgment results alongside the answers. Expect it to take 5–10 minutes and make roughly 160 API calls (2 turns × 80 questions):

bash
export OPENAI_API_KEY=your_key_here

python -m fastchat.llm_judge.gen_judgment \
  --model-list tinyllama-base \
  --bench-name mt_bench \
  --judge-model gpt-4 \
  --answer-dir ./mt-bench-answers

Show the final scores:

bash
python -m fastchat.llm_judge.show_result \
  --bench-name mt_bench \
  --model-list tinyllama-base
text
Model         | MT-Bench Score
tinyllama-base| 4.2

Key Insight: **MT-Bench scores don’t compare across model scales.** A 1B model scoring 4.5 and a 7B model scoring 6.5 can’t be directly compared — they’re not competing for the same tasks. Use MT-Bench to compare your fine-tuned model against *its own baseline*, not against published leaderboard scores.

LLM-as-Judge: Grading Open-Ended Responses

MT-Bench uses a fixed question set. But your task probably isn’t one of the 8 MT-Bench categories. LLM-as-judge lets you grade your model on any prompt using a frontier model as the evaluator.

You might be wondering whether using GPT-4 to grade your model just swaps one black box for another. It does, partially — but the key difference is that you control the rubric. You can write criteria that match your actual task, which is something MT-Bench can’t do for you.

The core idea: for each test prompt, generate a response from your model, then ask GPT-4 (or Claude Opus) to rate that response on a 1–10 scale using a structured rubric. The function below takes a question and a model response, sends both to GPT-4o-mini with an explicit rubric, and returns a structured score. temperature=0 makes the grading near-deterministic — the same response gets the same score on almost every run:

python
from openai import OpenAI
import json

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a language model response. Score it on a 1-10 scale.

**Question:** {question}

**Model response:** {response}

**Scoring rubric:**
- 9-10: Accurate, complete, well-formatted, no hallucinations
- 7-8: Mostly accurate with minor gaps, good format
- 5-6: Partially correct or poorly formatted
- 3-4: Significant factual errors or major format issues
- 1-2: Wrong or completely off-topic

Respond with valid JSON only:
{{"score": <int 1-10>, "reasoning": "<one sentence>"}}"""

def judge_response(question: str, response: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o-mini",     # cheaper than gpt-4, still reliable for grading
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(
                question=question, response=response
            )}
        ],
        temperature=0,           # deterministic grading
        response_format={"type": "json_object"}
    )
    return json.loads(result.choices[0].message.content)

I’ve found this works best when the rubric is explicit and narrow. Don’t ask “is this a good response?” — ask “does this response answer the question accurately (1-5), use the correct format (1-3), and stay under 150 words (1-2)?” Decomposed criteria produce more consistent scores.

To compare base vs. fine-tuned, generate responses from each model separately — load them in sequence, not simultaneously. Load, evaluate, unload before loading the next model to avoid memory issues:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
ADAPTER_PATH = "./lora-adapter"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

def generate_response(model, tokenizer, prompt: str, max_new_tokens: int = 200) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Slice off the input tokens — only decode the generated portion
    return tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

test_prompts = [
    "Explain what a neural network is in two sentences.",
    "What is the difference between supervised and unsupervised learning?",
    "Describe three use cases for fine-tuning an LLM.",
]

# Step 1: Evaluate base model BEFORE loading the adapter
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
base_preds = [generate_response(base_model, tokenizer, p) for p in test_prompts]
base_judgments = [judge_response(p, r) for p, r in zip(test_prompts, base_preds)]
del base_model   # free GPU memory before loading fine-tuned

# Step 2: Load fine-tuned model and evaluate
ft_base = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
ft_model = PeftModel.from_pretrained(ft_base, ADAPTER_PATH)
ft_model.eval()
ft_preds = [generate_response(ft_model, tokenizer, p) for p in test_prompts]
ft_judgments = [judge_response(p, r) for p, r in zip(test_prompts, ft_preds)]

base_avg = sum(j["score"] for j in base_judgments) / len(base_judgments)
ft_avg   = sum(j["score"] for j in ft_judgments) / len(ft_judgments)

print(f"Base model avg score:       {base_avg:.2f}/10")
print(f"Fine-tuned model avg score: {ft_avg:.2f}/10")
print(f"Improvement:                {ft_avg - base_avg:+.2f}")

Warning: **LLM-as-judge biases:** GPT-4 tends to prefer verbose responses, markdown-formatted answers, and responses that sound confident — regardless of accuracy. Counteract this by (1) running each evaluation twice and averaging, (2) using pairwise comparison instead of absolute scoring, and (3) validating a random sample of judgments manually.
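Pairwise comparison deserves a sketch, because it has a subtlety: judges also have a position bias, preferring whichever answer appears first. The standard mitigation is to run each comparison twice with the order swapped and only count a win when both orderings agree. The prompt template and aggregation below are illustrative (the judge call itself would follow the `judge_response` pattern above):

```python
# Hypothetical pairwise template — {question}, {a}, {b} are filled per comparison.
PAIRWISE_PROMPT = """Compare two responses to the same question. Respond with valid JSON only:
{{"winner": "A" or "B" or "tie"}}

**Question:** {question}
**Response A:** {a}
**Response B:** {b}"""

def aggregate_pairwise(verdict_ab: str, verdict_ba: str) -> str:
    """Combine two judgments made with swapped answer order.

    In the second round the roles are reversed, so 'A' there refers to the
    model that was presented as B in the first round."""
    flipped = {"A": "B", "B": "A", "tie": "tie"}[verdict_ba]
    if verdict_ab == flipped:
        return verdict_ab   # both orderings picked the same underlying model
    return "tie"            # disagreement suggests position bias — call it a tie

print(aggregate_pairwise("A", "B"))  # → A   (same model won from both positions)
print(aggregate_pairwise("A", "A"))  # → tie (judge picked whichever came first)
```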

Exercise: Write a Custom LLM-as-Judge Rubric

You’re evaluating a model fine-tuned for customer support ticket classification. Write a JUDGE_PROMPT template that scores responses on three dimensions: correct category label (1–5), professional tone (1–3), and brevity under 50 words (1–2). The prompt should produce valid JSON output matching {"score": int, "category_correct": bool, "reasoning": str}.

python
# Starter: complete the JUDGE_PROMPT template
JUDGE_PROMPT = """You are evaluating a customer support classification response.

**Customer message:** {question}
**Model response:** {response}

Score on these dimensions:
- Category label: ___
- Professional tone: ___
- Brevity (under 50 words): ___

Respond with valid JSON: {{"score": <total>, "category_correct": <bool>, "reasoning": "<str>"}}"""

Hint 1: Each dimension needs a 1-sentence description of what earns full vs. partial marks.
Hint 2: The total score is the sum of all three dimensions (max 10). Be explicit about max per dimension.

Solution
python
JUDGE_PROMPT = """You are evaluating a customer support classification response.

**Customer message:** {question}
**Model response:** {response}

Score on these three dimensions and sum for total (max 10):
- Category label (1-5): 5=correct category, 3=adjacent category, 1=wrong category
- Professional tone (1-3): 3=formal and polite, 2=acceptable, 1=unprofessional
- Brevity (1-2): 2=under 50 words, 1=over 50 words

Respond with valid JSON only:
{{"score": <int 1-10>, "category_correct": <true/false>, "reasoning": "<one sentence>"}}"""

The decomposed rubric produces much more consistent GPT-4 scores than asking for a holistic “good/bad” judgment. Each dimension has a clear scale the judge can apply mechanically.


Task-Specific Metrics: ROUGE, BLEU, and Exact Match

When your task produces structured or reference-able outputs, you don’t need GPT-4 to grade them. Automated string-comparison metrics work — and they’re free.

What does a ROUGE-L score of 0.45 actually mean in practice? Honestly, very little in isolation. I treat ROUGE as a directional signal, not an absolute one — a 10-point ROUGE-L improvement after fine-tuning tells you something meaningful, but the absolute value of 0.45 doesn’t tell you whether your summaries are good. That judgment still requires human eyes, at least on a sample.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated text and a reference. ROUGE-1 counts unigram overlap, ROUGE-2 counts bigram overlap, ROUGE-L measures the longest common subsequence. Use it for summarisation.

BLEU (Bilingual Evaluation Understudy) measures precision of n-gram matches between generated text and reference translations, with a brevity penalty for outputs shorter than the reference. Use it for translation.

Exact Match (EM) checks whether the generated output exactly matches the expected answer after normalisation (lowercase, strip punctuation). Use it for extraction, classification, and structured QA where there’s a single correct answer.

Install the required library and define all three metrics before running them on your test set:

python
import re
import torch
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# pip install rouge-score nltk

def normalize(text: str) -> str:
    """Lowercase and strip punctuation for exact-match comparison."""
    return re.sub(r'[^\w\s]', '', text.lower()).strip()

def compute_rouge(predictions: list[str], references: list[str]) -> dict:
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    for pred, ref in zip(predictions, references):
        result = scorer.score(ref, pred)
        for key in scores:
            scores[key].append(result[key].fmeasure)
    return {k: sum(v) / len(v) for k, v in scores.items()}

def compute_bleu(predictions: list[str], references: list[str]) -> float:
    smoother = SmoothingFunction().method1
    scores = []
    for pred, ref in zip(predictions, references):
        scores.append(sentence_bleu([ref.lower().split()], pred.lower().split(),
                                    smoothing_function=smoother))
    return sum(scores) / len(scores)

def compute_exact_match(predictions: list[str], references: list[str]) -> float:
    return sum(normalize(p) == normalize(r) for p, r in zip(predictions, references)) / len(predictions)

Apply the metrics to a test set:

python
predictions = [
    "The model achieved 85% accuracy on the test set after fine-tuning.",
    "LoRA reduces memory by training only small adapter matrices.",
]
references = [
    "The fine-tuned model reached 85% accuracy on held-out test data.",
    "LoRA uses low-rank adapter matrices to reduce training memory requirements.",
]

rouge_scores = compute_rouge(predictions, references)
bleu_score   = compute_bleu(predictions, references)
em_score     = compute_exact_match(predictions, references)

print("ROUGE Scores:")
for metric, score in rouge_scores.items():
    print(f"  {metric}: {score:.4f}")
print(f"BLEU Score:   {bleu_score:.4f}")
print(f"Exact Match:  {em_score:.4f}")
text
ROUGE Scores:
  rouge1: 0.6341
  rouge2: 0.4615
  rougeL: 0.5854
BLEU Score:   0.4123
Exact Match:  0.0000

Tip: **Which metric to prioritise:** For summarisation, ROUGE-L is the most informative single number — it rewards sequences that preserve the original ordering. For extraction (pulling a span of text from a document), Exact Match is the most honest measure. BLEU is primarily for machine translation — it’s misleading for general text generation.

Exercise: Compute ROUGE-L for a Summarisation Task

Given the predictions and references below, use the compute_rouge function to compute ROUGE-L for each individual pair (not just the average). Then identify which prediction is closest to its reference and explain why.

python
# Starter code
from rouge_score import rouge_scorer

predictions = [
    "Python is a high-level programming language known for its simple syntax.",
    "Gradient descent updates weights by moving in the opposite direction of the gradient.",
    "Fine-tuning trains a pre-trained model on a smaller task-specific dataset.",
]
references = [
    "Python is a high-level language with readable, simple syntax ideal for beginners.",
    "Gradient descent minimises loss by adjusting weights opposite to the gradient direction.",
    "Fine-tuning adapts a pre-trained model to a specific task using a smaller dataset.",
]

# Your task: compute per-example ROUGE-L scores (not just the average)
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
for i, (pred, ref) in enumerate(zip(predictions, references)):
    score = ___
    print(f"Pair {i+1} ROUGE-L: {score:.4f}")

Hint 1: scorer.score(ref, pred) returns an object with a .fmeasure attribute.
Hint 2: Access ROUGE-L specifically with scorer.score(ref, pred)['rougeL'].fmeasure.

Solution
python
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

for i, (pred, ref) in enumerate(zip(predictions, references)):
    score = scorer.score(ref, pred)['rougeL'].fmeasure
    print(f"Pair {i+1} ROUGE-L: {score:.4f}")

The fine-tuning example (pair 3) typically scores highest because it shares multiple long matching subsequences: “pre-trained model”, “specific task”, and “dataset”. The Python example scores lower because “programming language” vs “language” and “beginners” vs “simple” break the longest common subsequence.


Evaluating a Fine-Tuned Model: A Complete Before/After Workflow

The goal is simple: show that your fine-tuned model is better than the base model on your task, without being significantly worse on general capability.

I always run this four-step evaluation before considering a fine-tuned model production-ready.

Step 1: General capability regression check (lm-evaluation-harness)

Run MMLU on both models. A drop of more than roughly 2–3 points means your fine-tuning degraded general knowledge — likely from catastrophic forgetting. This is a blocker. If you see it, reduce learning rate, reduce epochs, or switch to a smaller dataset.

Step 2: Task-specific quality measurement

Define your task metric upfront (ROUGE-L for summarisation, EM for extraction, custom rubric for generation). Measure both models on a held-out test set of at least 50 examples — 100+ is better.
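With only 50–100 examples, it’s worth checking whether a measured improvement could be noise. A quick bootstrap sketch (pure Python, per-example scores are hypothetical) resamples the per-example deltas to get a rough confidence interval on the mean improvement:

```python
import random

def bootstrap_delta_ci(base_scores, ft_scores, n_boot=2000, seed=0):
    """Rough 95% bootstrap CI for the mean per-example improvement (ft - base)."""
    rng = random.Random(seed)
    deltas = [f - b for b, f in zip(base_scores, ft_scores)]
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(deltas) for _ in deltas]   # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# Hypothetical per-example ROUGE-L scores for 10 held-out items:
base = [0.40, 0.35, 0.50, 0.42, 0.38, 0.45, 0.33, 0.48, 0.41, 0.39]
ft   = [0.55, 0.60, 0.52, 0.58, 0.50, 0.62, 0.49, 0.57, 0.54, 0.51]
low, high = bootstrap_delta_ci(base, ft)
print(f"95% CI for improvement: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the improvement is unlikely to be sampling noise; if it straddles zero, collect more test examples before drawing conclusions.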

Step 3: Instruction-following quality check (LLM-as-judge)

Run 20–30 prompts representative of real production traffic through your LLM-as-judge rubric. Compare base vs. fine-tuned. Even a small improvement here signals the fine-tuning is working.

Step 4: Qualitative review of failure cases

Look at the 10–20 lowest-scoring responses from your fine-tuned model. Do they fail in the same way across multiple prompts? Systematic failures point to dataset problems or prompt template issues.

python
# Summary comparison table — replace values with your actual results
import pandas as pd

results = {
    "Metric": ["MMLU (general capability)", "ROUGE-L (task specific)", "LLM-as-judge (1-10)"],
    "Base model": [0.52, 0.41, 5.8],
    "Fine-tuned": [0.51, 0.67, 7.2],
    "Delta": [-0.01, +0.26, +1.4],
    "Verdict": ["OK (< 2pt regression)", "Strong improvement", "Strong improvement"]
}

df = pd.DataFrame(results)
print(df.to_string(index=False))
text
                   Metric  Base model  Fine-tuned  Delta                Verdict
MMLU (general capability)        0.52        0.51  -0.01  OK (< 2pt regression)
  ROUGE-L (task specific)        0.41        0.67   0.26     Strong improvement
      LLM-as-judge (1-10)        5.80        7.20   1.40     Strong improvement


Common Evaluation Mistakes to Avoid

Mistake 1: Evaluating only on the training distribution

Wrong: Using your training examples as your test set.

Why it’s wrong: Your model will score perfectly — it memorised the data. You need a held-out test set created before training that the model has never seen.

Correct: Split your dataset before training: 80% train, 20% test. Lock the test set away. Never train on it.
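A minimal sketch of that split, assuming your dataset is a plain list of examples (a fixed seed makes the split reproducible, so the locked test set stays the same across runs):

```python
import random

def train_test_split(examples: list, test_frac: float = 0.2, seed: int = 42):
    """Shuffle once with a fixed seed, then carve off a locked test set."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy — don't mutate the caller's list
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

data = [f"example-{i}" for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))  # → 80 20
```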


Mistake 2: Trusting a single metric

Wrong: “MMLU improved 3 points, so the model is better.”

Why it’s wrong: MMLU measures general knowledge. It says nothing about instruction-following quality or your specific task. A model can improve on MMLU while getting worse at customer support.

Correct: Always measure at least two complementary metrics — one general (MMLU), one task-specific.


Mistake 3: Treating LLM-as-judge scores as ground truth

Wrong: Accepting every GPT-4 score without review.

Why it’s wrong: GPT-4 has known biases (length, formatting, confidence). It can consistently mis-score certain response types. A model that sounds confident but is wrong will outscore a humble but accurate response.

Correct: Manually review a random 10–15% sample of LLM-as-judge scores. Check for systematic scoring errors. Validate the rubric before running the full evaluation.


Mistake 4: Skipping the failure case analysis

Wrong: Looking only at average scores.

Why it’s wrong: An average score of 7.5/10 hides the 20% of cases scoring 2/10. Those are your production failures. The failures that cluster around specific input patterns are the most actionable.

Correct: After every evaluation run, sort results by score ascending. Spend 20 minutes reading the bottom 10%. One pattern in the failures is worth more than a percentage-point improvement in the average.
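A sketch of that triage step, assuming each evaluation result is a dict carrying the score and the prompt that produced it (the field names are assumptions for this example):

```python
def bottom_decile(results: list[dict]) -> list[dict]:
    """Sort ascending by score and return the worst 10% (at least one result)."""
    ranked = sorted(results, key=lambda r: r["score"])
    n = max(1, len(ranked) // 10)
    return ranked[:n]

results = [{"prompt": f"p{i}", "score": s}
           for i, s in enumerate([8, 2, 7, 9, 3, 8, 6, 2, 7, 9, 8, 5])]
for r in bottom_decile(results):
    print(r["prompt"], r["score"])  # → p1 2
```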

I started keeping a failure log — a running document of the lowest-scoring response types from each evaluation. The same three failure patterns kept showing up across different fine-tuning runs: prompts with negation, multi-part instructions, and very short input text. The log saved me from repeating the same dataset mistakes three runs in a row.


When Not to Use Automated Evaluation

Automated evaluation is fast and cheap, but it has real limits.

When your quality criterion can’t be operationalised. “Does this response feel professional?” is hard to code into a rubric. Humans have an immediate intuition here that metrics struggle to replicate. If you can’t write down exactly what you’re measuring, don’t trust automated scores.

When the task requires domain expertise. Medical, legal, and scientific accuracy require domain experts to evaluate, not a rubric. GPT-4 can score a medical response wrong because it doesn’t know the current clinical guidelines.

When you’re evaluating safety or alignment. Harmfulness, bias, and refusal quality are high-stakes and hard to measure. Use human review for any safety-critical evaluations.

Before major production decisions. Automated evaluation de-risks the iteration cycle. But before deploying to real users, always do at least a small human review of representative cases.


Frequently Asked Questions

Q: How many test examples do I need for a reliable evaluation?

For automated metrics (ROUGE, exact match), 50 examples give rough estimates, 200+ give stable results. For LLM-as-judge, 30 examples is usually enough for a directional signal — scores stabilise quickly. For MMLU and similar benchmarks, the fixed question sets are large enough by design.

Q: Can I evaluate without a GPU?

Yes for LLM-as-judge (it’s all API calls). For lm-evaluation-harness, CPU evaluation works but is very slow — plan for several hours for even a 1B model on MMLU. For inference during task-specific evaluation, CPU is feasible for models under 1B parameters.

Q: What’s the difference between MT-Bench and MMLU?

MMLU tests knowledge across 57 subjects using multiple-choice questions. MT-Bench tests instruction-following quality and conversation ability across 8 open-ended categories graded by GPT-4. Use MMLU to measure “does the model know things?” and MT-Bench to measure “does the model follow instructions well?”

Q: How do I evaluate for hallucination?

TruthfulQA is the standard automated benchmark for hallucination. For task-specific hallucination, build a small set of “trap” prompts — questions with specific factual answers you can verify. Use exact-match or LLM-as-judge with an accuracy-focused rubric.
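A trap set is just (prompt, verifiable answer) pairs plus a containment check. A minimal sketch — the facts, the normalisation, and the `fake_generate` stub standing in for a real model are all illustrative, and substring matching on short answers like "8" is deliberately loose:

```python
# Hypothetical trap set: prompts with short, verifiable answers
TRAP_SET = [
    ("What is the chemical symbol for gold?", "Au"),
    ("In what year did the first moon landing occur?", "1969"),
    ("How many bits are in a byte?", "8"),
]

def normalize(text: str) -> str:
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def hallucination_rate(generate, trap_set) -> float:
    """Fraction of trap prompts where the verifiable answer is absent from the output."""
    misses = sum(
        normalize(answer) not in normalize(generate(prompt))
        for prompt, answer in trap_set
    )
    return misses / len(trap_set)

def fake_generate(prompt: str) -> str:
    # Stand-in for your model's generate function; always answers correctly here
    if "gold" in prompt:
        return "The chemical symbol for gold is Au."
    if "moon" in prompt:
        return "The first moon landing was in 1969."
    return "There are 8 bits in a byte."

print(hallucination_rate(fake_generate, TRAP_SET))  # 0.0
```

Swap `fake_generate` for your model’s generation function and track the rate across fine-tuning runs.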

Q: Should I use GPT-4 or Claude as the judge?

Both are reasonable. GPT-4 is the default because MT-Bench and most published studies use it, making comparisons easier. Claude tends to score more conservatively. The key is consistency — pick one judge model and use it for all comparisons within a study.


Complete Code

The complete evaluation pipeline, end to end:
python
# LLM Evaluation Pipeline — Complete Script
# Requirements: pip install lm-eval rouge-score nltk openai transformers peft torch pandas
# Tested: lm-eval 0.4+, transformers 4.40+, peft 0.10+

import json
import re
import torch
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# ── Config ─────────────────────────────────────────────────────────────────────
BASE_MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
ADAPTER_PATH    = "./lora-adapter"   # Set to None to skip fine-tuned eval
OPENAI_API_KEY  = "your_key_here"

# ── Metric helpers ─────────────────────────────────────────────────────────────
def normalize(text: str) -> str:
    return re.sub(r'[^\w\s]', '', text.lower()).strip()

def compute_rouge(predictions: list, references: list) -> dict:
    scorer_obj = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    for pred, ref in zip(predictions, references):
        result = scorer_obj.score(ref, pred)
        for key in scores:
            scores[key].append(result[key].fmeasure)
    return {k: sum(v) / len(v) for k, v in scores.items()}

def compute_exact_match(predictions: list, references: list) -> float:
    return sum(normalize(p) == normalize(r) for p, r in zip(predictions, references)) / len(predictions)

def compute_bleu(predictions: list, references: list) -> float:
    smooth = SmoothingFunction().method1   # smoothing avoids zero scores on short outputs
    scores = [sentence_bleu([ref.split()], pred.split(), smoothing_function=smooth)
              for pred, ref in zip(predictions, references)]
    return sum(scores) / len(scores)

def generate_response(model, tokenizer, prompt: str, max_new_tokens: int = 200) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

# ── LLM-as-judge ──────────────────────────────────────────────────────────────
JUDGE_PROMPT = """You are evaluating a language model response. Score it 1-10.

**Question:** {question}
**Response:** {response}

Scoring: 9-10=accurate+complete, 7-8=mostly correct, 5-6=partial, 3-4=errors, 1-2=wrong.

Respond with JSON only: {{"score": <int 1-10>, "reasoning": "<one sentence>"}}"""

def judge_response(question: str, response: str, client: OpenAI) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(result.choices[0].message.content)

# ── Load tokenizer ─────────────────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# ── Test prompts and references ────────────────────────────────────────────────
test_prompts = [
    "Summarize in one sentence: LoRA fine-tunes LLMs by training only small adapter matrices.",
    "Summarize in one sentence: QLoRA adds 4-bit quantization to LoRA, cutting memory in half.",
]
references = [
    "LoRA trains small adapter matrices instead of full model weights.",
    "QLoRA uses 4-bit quantization with LoRA adapters to reduce memory usage.",
]

# ── Step 1: Evaluate base model first ─────────────────────────────────────────
print("Evaluating base model...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
base_preds = [generate_response(base_model, tokenizer, p) for p in test_prompts]
base_rouge = compute_rouge(base_preds, references)
del base_model
torch.cuda.empty_cache()   # free GPU memory before loading the fine-tuned model

# ── Step 2: Evaluate fine-tuned model ─────────────────────────────────────────
ft_rouge = {}
if ADAPTER_PATH:
    print("Evaluating fine-tuned model...")
    ft_base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
    )
    ft_model = PeftModel.from_pretrained(ft_base, ADAPTER_PATH)
    ft_model.eval()
    ft_preds = [generate_response(ft_model, tokenizer, p) for p in test_prompts]
    ft_rouge = compute_rouge(ft_preds, references)

# ── Step 3: LLM-as-judge (optional — requires API key) ────────────────────────
if OPENAI_API_KEY != "your_key_here" and ADAPTER_PATH:
    client = OpenAI(api_key=OPENAI_API_KEY)
    # Generation above is greedy, so reuse the predictions from Steps 1 and 2
    # instead of reloading models and generating again
    base_judgments = [judge_response(p, pred, client)
                      for p, pred in zip(test_prompts, base_preds)]
    ft_judgments   = [judge_response(p, pred, client)
                      for p, pred in zip(test_prompts, ft_preds)]
    base_judge_avg = sum(j["score"] for j in base_judgments) / len(base_judgments)
    ft_judge_avg   = sum(j["score"] for j in ft_judgments) / len(ft_judgments)
else:
    base_judge_avg, ft_judge_avg = None, None

# ── Step 4: Summary report ─────────────────────────────────────────────────────
print("\n=== Evaluation Summary ===")
print(f"Base  ROUGE-L: {base_rouge['rougeL']:.4f}")
if ft_rouge:
    print(f"FT    ROUGE-L: {ft_rouge['rougeL']:.4f}")
    print(f"Delta ROUGE-L: {ft_rouge['rougeL'] - base_rouge['rougeL']:+.4f}")
if base_judge_avg is not None:
    print(f"Base  LLM-judge: {base_judge_avg:.2f}/10")
    print(f"FT    LLM-judge: {ft_judge_avg:.2f}/10")
    print(f"Delta LLM-judge: {ft_judge_avg - base_judge_avg:+.2f}")



References

  1. Hendrycks, D. et al. (2020). Measuring Massive Multitask Language Understanding. arXiv:2009.03300.

  2. Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.

  3. EleutherAI. Language Model Evaluation Harness. GitHub.

  4. Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop on Text Summarization.

  5. Papineni, K. et al. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. ACL.

  6. Lin, S. et al. (2021). TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958.

  7. Raschka, S. (2023). Understanding the 4 Main Approaches to LLM Evaluation. Ahead of AI.

  8. Fourrier, C. (2023). Let’s Talk About LLM Evaluation. Hugging Face Blog.

