
How to Build a Custom Instruction Dataset for LLM Fine-Tuning

Written by Selva Prabhakaran | 28 min read

Every production fine-tuning project that fails, fails for the same reason: the dataset. Not the model, not the hyperparameters, not the compute budget. I’ve watched teams spend weeks tuning learning rates on models that behaved badly at inference time — then discovered their training data had inconsistent instructions, duplicate examples, and outputs written at three different quality levels. This guide walks you through building a custom instruction dataset from scratch: choosing the right format, collecting and generating data, filtering out bad examples, and packaging it for fine-tuning.


Why Your Dataset Determines Fine-Tuning Success

Here’s a number that surprises most practitioners: a dataset of 1,000 high-quality, carefully curated instruction-response pairs can outperform a dataset of 52,000 noisy ones. LIMA (Meta, 2023) trained on exactly 1,000 carefully selected examples and matched or exceeded Alpaca’s performance on several benchmarks. Alpaca had 52x more data.

The reason is straightforward. Fine-tuning doesn’t teach the model new knowledge — the base model already has it. Fine-tuning teaches the model how to respond: the format, the style, the level of detail. If your training examples are inconsistent, the model learns inconsistency.

Key Insight: **The quality-quantity trade-off is asymmetric.** More low-quality data actively hurts performance. More high-quality data reliably helps. Always invest in quality first, quantity second. A 500-example curated dataset beats a 5,000-example scrape-and-dump every time.

Prerequisites:
– Python 3.9+ with datasets, openai>=1.0, and huggingface-hub installed
– A HuggingFace account (free) for dataset hosting
– An OpenAI API key for synthetic data generation (Step 3, optional but recommended)
– Basic Python and JSON familiarity — no ML background needed for this article

bash
pip install datasets "openai>=1.0" huggingface-hub

Choosing Your Format: Alpaca, ShareGPT, and ChatML

Before collecting a single example, decide on your format. This choice affects every training script, every generation prompt, and every format conversion downstream. Changing formats mid-project means reformatting everything.

There are two formats you’ll encounter in every fine-tuning framework today — and one important variant for preference alignment.

Alpaca Format

Alpaca is the simplest format: a flat JSON object with three fields. Stanford released it with their Alpaca model in 2023, and it became the de facto standard for single-turn instruction tuning.

python
import json

# Alpaca format: instruction, optional input context, expected output
alpaca_example = {
    "instruction": "Summarize the following customer complaint in one sentence.",
    "input": "I ordered a laptop three weeks ago and it still hasn't arrived. "
             "Customer support hasn't responded to my emails. This is unacceptable.",
    "output": "Customer reports a three-week delivery delay and unresponsive support."
}

print(json.dumps(alpaca_example, indent=2))
print(f"\nKeys: {list(alpaca_example.keys())}")
python
{
  "instruction": "Summarize the following customer complaint in one sentence.",
  "input": "I ordered a laptop three weeks ago and it still hasn't arrived. Customer support hasn't responded to my emails. This is unacceptable.",
  "output": "Customer reports a three-week delivery delay and unresponsive support."
}

Keys: ['instruction', 'input', 'output']

The input field is optional. When your instruction is self-contained (“Write a haiku about autumn”), leave input as an empty string. When the instruction requires additional context (“Summarize the following”), put that context in input.
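At training time these fields are rendered into a single prompt string. Here is a minimal sketch of that rendering, using the two-template convention popularized by the original Stanford Alpaca repo (check your training framework's own template before relying on this exact wording):

```python
def render_alpaca_prompt(example: dict) -> str:
    """Render an Alpaca example into a training prompt.

    Two templates: one when 'input' carries context, a shorter one when it is empty.
    """
    if example.get("input", "").strip():
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        "### Response:\n"
    )


with_ctx = render_alpaca_prompt(
    {"instruction": "Summarize the following.", "input": "Some text.", "output": ""})
no_ctx = render_alpaca_prompt(
    {"instruction": "Write a haiku about autumn.", "input": "", "output": ""})
print(with_ctx)
```

The point of the two templates: an empty `input` should not leave a dangling `### Input:` section in the prompt, which would itself be an inconsistency the model learns.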

Use Alpaca for: single-turn tasks (classification, summarization, extraction, code generation) where one instruction produces one response.

ShareGPT / ChatML Format

ShareGPT uses a messages array with role-labeled turns. It’s built for multi-turn conversations and matches the OpenAI Chat Completions API structure — your training data looks exactly like your inference calls.

python
# ShareGPT format: role-labeled conversation turns
sharegpt_example = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful customer service assistant for an e-commerce company."
        },
        {
            "role": "user",
            "content": "I ordered a laptop three weeks ago and it still hasn't arrived. What can you do?"
        },
        {
            "role": "assistant",
            "content": "I'm sorry to hear about the delay. I can look up your order right now. "
                       "Could you share your order number so I can check the shipping status?"
        }
    ]
}

print(f"Turns in conversation: {len(sharegpt_example['messages'])}")
print(f"Roles: {[m['role'] for m in sharegpt_example['messages']]}")
python
Turns in conversation: 3
Roles: ['system', 'user', 'assistant']

Use ShareGPT for: chatbots, assistants, anything with system prompts or multi-turn conversation.

Format Comparison: Which to Choose

| Format | Use Case | Key Fields | Supported Frameworks |
| --- | --- | --- | --- |
| Alpaca | Single-turn instruction tuning | instruction, input, output | trl, Axolotl, LLaMA-Factory, Unsloth |
| ShareGPT / ChatML | Multi-turn conversation, system prompts | messages[{role, content}] | trl, Unsloth, LLaMA-Factory, Axolotl |
| DPO (preference) | Preference alignment, RLHF alternative | prompt, chosen, rejected | trl, Axolotl |

DPO (Direct Preference Optimization) is worth knowing even if you’re not using it yet. Instead of a single “correct” output, each example contains a preferred response and a rejected one. The model learns to prefer the chosen output. If you plan to do alignment fine-tuning after SFT, you’ll need a separate DPO dataset in this format. For now, Alpaca or ShareGPT gets you to your first fine-tuned model — DPO is a post-SFT step.
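For reference, here is what a single DPO example looks like, mirroring the customer-support scenario above. The field names follow the prompt/chosen/rejected schema in the table; the content is illustrative:

```python
# DPO format: one prompt, a preferred response, and a rejected one
dpo_example = {
    "prompt": "Summarize this customer complaint in one sentence: "
              "'My order arrived two weeks late and nobody answered my emails.'",
    "chosen": "Customer reports a two-week delivery delay and unresponsive support.",
    "rejected": "That's really frustrating! Late orders are the worst. "
                "Have you tried calling them instead?",
}

print(f"Keys: {list(dpo_example.keys())}")
```

Note that the rejected response isn't wrong in an obvious way; it just fails the task (it empathizes instead of summarizing). That kind of contrast is what preference pairs are for.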

Tip: **Match your training format to your inference format.** If your deployed model will use system prompts and multi-turn dialogue, train it on ShareGPT. If you train in Alpaca but infer in ShareGPT, the model has never seen the format it’s being asked to use — and it shows.

Try It Yourself

Exercise 1: Convert a Q&A list to Alpaca format

You have a list of raw Q&A pairs from a company FAQ. Convert each pair into a valid Alpaca-format dictionary and save the result as a JSONL file (one JSON object per line).

python
# Starter code
import json

raw_qa = [
    ("What are your business hours?", "We are open Monday–Friday, 9am–6pm EST."),
    ("How do I reset my password?", "Click 'Forgot Password' on the login page and follow the email instructions."),
    ("Do you offer refunds?", "Yes, we offer full refunds within 30 days of purchase."),
]

# TODO: Convert raw_qa to Alpaca format and write to "faq_dataset.jsonl"
# Each dict should have: instruction, input (empty string), output

def convert_to_alpaca(qa_pairs: list) -> list[dict]:
    pass  # Your code here

alpaca_data = convert_to_alpaca(raw_qa)
print(f"Converted {len(alpaca_data)} example(s)")
print(f"First example keys: {list(alpaca_data[0].keys()) if alpaca_data else 'None'}")
Solution
python
import json

raw_qa = [
    ("What are your business hours?", "We are open Monday–Friday, 9am–6pm EST."),
    ("How do I reset my password?", "Click 'Forgot Password' on the login page and follow the email instructions."),
    ("Do you offer refunds?", "Yes, we offer full refunds within 30 days of purchase."),
]

def convert_to_alpaca(qa_pairs: list) -> list[dict]:
    alpaca_data = []
    for question, answer in qa_pairs:
        alpaca_data.append({
            "instruction": question,
            "input": "",
            "output": answer,
        })
    return alpaca_data

alpaca_data = convert_to_alpaca(raw_qa)

with open("faq_dataset.jsonl", "w") as f:
    for example in alpaca_data:
        f.write(json.dumps(example) + "\n")

print(f"Converted {len(alpaca_data)} example(s)")
print(f"First example keys: {list(alpaca_data[0].keys())}")

**Output:**

python
Converted 3 example(s)
First example keys: ['instruction', 'input', 'output']

Each Q&A pair becomes an Alpaca dict with an empty `input` field — the question is self-contained so no context is needed. Writing as JSONL (one JSON object per line) is the standard format for large datasets because it’s streamable and easy to process line by line.
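A matching reader is worth keeping next to the writer, since later steps assume you can round-trip your dataset from disk. A small sketch (the helper name `load_jsonl` is my own, not a library function):

```python
import json
import os
import tempfile

def load_jsonl(path: str) -> list[dict]:
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples


# Round-trip check against a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write(json.dumps({"instruction": "Do you offer refunds?", "input": "",
                          "output": "Yes, within 30 days."}) + "\n")
    path = tmp.name

data = load_jsonl(path)
os.remove(path)
print(f"Loaded {len(data)} example(s); keys: {list(data[0].keys())}")
```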


Step 1: Collect and Source Your Raw Data

What’s the best source for a fine-tuning dataset? Whatever your model will actually be used on.

If you’re building a customer support bot, your existing support tickets are worth ten times any synthetic dataset you could generate. Real data captures the exact language, edge cases, and ambiguous requests your model will face in production. I’ve seen teams skip this and jump straight to synthetic generation when they had 8,000 real customer conversations sitting in their CRM. That’s a mistake every time.

Sources ranked by data quality for fine-tuning:

  1. Real examples from your production domain — highest quality; captures genuine user language and edge cases
  2. Curated public datasets filtered to your task — pre-cleaned but may not match your domain vocabulary
  3. LLM-generated synthetic data — fast and scalable; risks introducing LLM response patterns into training

How much data do you actually need? The answer depends heavily on task complexity. Here are practical starting targets:

| Task Type | Recommended Size | Notes |
| --- | --- | --- |
| Sentiment / intent classification | 200 – 2K | Small datasets work well for narrow label sets |
| Named entity extraction | 500 – 3K | Needs coverage of all entity types |
| Text summarization | 1K – 5K | More examples = more style consistency |
| Domain-specific Q&A | 2K – 10K | Needs to cover full knowledge range |
| Multi-turn chatbot | 5K – 20K+ | Each turn adds complexity; more is better |
| General instruction following | 5K – 50K | Broad tasks require broad coverage |

These are starting points, not ceilings. The LIMA result (1K → matched 52K) suggests the upper bound on “enough” may be lower than you expect — but only if your 1K examples are genuinely high quality.

Loading a public dataset as a foundation is often the fastest start:

python
from datasets import load_dataset

# Stanford Alpaca — 52K general instruction examples, good as a seed
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(f"Alpaca dataset size: {len(alpaca):,} examples")
print(f"Features: {list(alpaca.features.keys())}")

# Keep only examples without a context field (pure instruction tasks)
pure_instruction = alpaca.filter(lambda x: x["input"].strip() == "")
print(f"Pure instruction examples: {len(pure_instruction):,}")
Note: **The original Alpaca dataset has documented quality problems.** Several independent researchers found hallucinated facts, inconsistent formatting, and incorrect outputs in the original 52K examples. If you use it as a seed, apply quality filtering (Step 4) before training — don’t assume public datasets are clean.

The real-vs-synthetic decision doesn’t have to be either/or. Start with whatever real examples you have, use them as seeds for synthetic generation, then filter the combined set. This is how most production datasets are built.


Step 2: Design High-Quality Instruction-Response Pairs

python
# This is what the difference looks like in practice
good_example = {
    "instruction": "Classify the sentiment of this product review as Positive, Negative, or Neutral.",
    "input": "The camera quality is excellent but the battery dies after 4 hours.",
    "output": "Neutral"
}

bad_example = {
    "instruction": "What do you think of this review?",
    "input": "The camera quality is excellent but the battery dies after 4 hours.",
    "output": "This review seems mixed — the camera is great but battery life is a concern. "
              "Overall somewhat positive but with reservations. Many users find battery life frustrating."
}

print(f"Good output length: {len(good_example['output'])} chars")
print(f"Bad output length:  {len(bad_example['output'])} chars")
print(f"Good instruction specificity: explicit label format (Positive/Negative/Neutral)")
print(f"Bad instruction specificity:  open-ended ('What do you think?')")
python
Good output length: 7 chars
Bad output length:  167 chars
Good instruction specificity: explicit label format (Positive/Negative/Neutral)
Bad instruction specificity:  open-ended ('What do you think?')

Three rules cover 80% of what makes an instruction pair good or bad.

Rule 1: Instructions must be unambiguous. Every person reading your instruction should produce the same type of response. “Improve this text” is ambiguous — improve how? For clarity? Conciseness? Formality? “Rewrite this customer email to be more concise, keeping all key information” is specific. In my experience, instruction clarity is the single biggest predictor of fine-tuning quality.

Rule 2: Outputs must be consistent in format and length. If half your training examples end with a period and half don’t, your model learns to randomly add periods. Decide your conventions — bullet points or prose, sentence case or title case, short or detailed — and apply them uniformly. A model trained on 7-character outputs (“Neutral”) will not learn to write paragraphs, and vice versa.

Rule 3: Cover the full range of your task, not just the easy cases. If you’re building a support bot, include examples with ambiguous requests, examples where the answer is “I don’t have that information,” and multi-step reasoning examples. A model trained only on clean, easy cases will hallucinate on hard ones.

Warning: **Only include examples of what you want the model to do.** If you include “I can’t help with that” refusal examples for requests outside your scope, make sure those examples are consistent and deliberate. Inconsistent refusals (sometimes refusing, sometimes not, for the same type of request) teach inconsistency.
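Rule 2 can be checked mechanically before training. A rough sketch of a consistency report (field names and the choice of conventions to measure are illustrative):

```python
def format_consistency_report(examples: list[dict]) -> dict:
    """Summarize output formatting conventions so inconsistency becomes visible."""
    outputs = [ex.get("output", "") for ex in examples if ex.get("output")]
    n = len(outputs)
    if n == 0:
        return {"total": 0, "pct_ending_with_period": 0, "pct_starting_uppercase": 0}
    return {
        "total": n,
        # True counts as 1 when summed, so these are simple percentages
        "pct_ending_with_period": round(100 * sum(o.rstrip().endswith(".") for o in outputs) / n),
        "pct_starting_uppercase": round(100 * sum(o[:1].isupper() for o in outputs) / n),
    }


report = format_consistency_report([
    {"output": "Customer reports a delay."},
    {"output": "Negative"},
    {"output": "the order shipped late."},
])
print(report)
```

Percentages near 0 or 100 indicate a consistent convention; mid-range values (like the 67% here) flag mixed formatting worth fixing before training.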

Step 3: Generate Synthetic Data Using an LLM

Once you have 50–200 high-quality seed examples, you can scale using an LLM as a data generator. This is how Alpaca was built: 175 seed instructions fed to GPT-3.5 generated 52,000 examples. The technique is called self-instruct and it’s now standard practice.

The key to good synthetic generation is constraint. Vague generation prompts produce vague data.

python
from openai import OpenAI
import json

client = OpenAI()  # Requires OPENAI_API_KEY environment variable

def generate_instruction_pairs(
    topic: str,
    n: int = 5,
    style: str = "concise and factual"
) -> list[dict]:
    """Generate Alpaca-format instruction-response pairs on a topic."""

    prompt = f"""Generate {n} instruction-response pairs for fine-tuning an LLM assistant.

Topic: {topic}
Response style: {style}

Requirements:
- Each instruction must be a clear, specific task (not 'explain X' — be precise)
- Each output must directly and completely answer the instruction
- All outputs must be consistent in length and tone
- No duplicate instructions

Return ONLY a JSON object:
{{"pairs": [{{"instruction": "...", "input": "", "output": "..."}}]}}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        response_format={"type": "json_object"},
    )

    data = json.loads(response.choices[0].message.content)
    return data.get("pairs", [])


pairs = generate_instruction_pairs(
    topic="e-commerce customer support for a clothing retailer",
    n=5,
    style="professional, empathetic, and solution-oriented"
)

print(f"Generated {len(pairs)} instruction pair(s)")
if pairs:
    print(f"\nSample instruction: {pairs[0]['instruction']}")
    print(f"Sample output (first 80 chars): {pairs[0]['output'][:80]}...")

I use gpt-4o-mini for bulk generation and reserve gpt-4o for human review and spot-checking — the cost difference over 1,000 examples is more than an order of magnitude, and gpt-4o-mini quality is more than sufficient for generation.

If you need to automate large-scale pipelines — generating, annotating, and filtering thousands of examples in one workflow — Distilabel is worth looking at. It’s an open-source framework designed specifically for synthetic data generation with LLMs. You define a pipeline of generation and annotation steps; it handles batching, caching, and output validation. For datasets above 10K examples, the automation it provides saves significant manual work.

Tip: **Generate in batches of 5–10, not hundreds at once.** Large batches produce more repetition and drift. Call the function 100 times to get 500 examples, then deduplicate — you’ll get better diversity than asking for 500 at once. Also: include 2–3 of your best real examples in the prompt as demonstrations. The model will mimic their format, producing synthetic data that matches your established style.
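The batch-and-deduplicate loop from the tip looks like this in outline. The generator is injected as a callable so the sketch runs without API access; in real use you would pass a function like the OpenAI-backed `generate_instruction_pairs` above. The stub generator here exists only to exercise the loop:

```python
def generate_in_batches(generate_fn, topic: str, target: int, batch_size: int = 5) -> list[dict]:
    """Call the generator repeatedly, deduplicating by instruction as we go."""
    seen = set()
    collected = []
    while len(collected) < target:
        for pair in generate_fn(topic, n=batch_size):
            key = pair["instruction"].strip().lower()
            if key in seen:
                continue
            seen.add(key)
            collected.append(pair)
            if len(collected) >= target:
                break
    return collected


# Stub generator standing in for the API-backed one; every other
# instruction repeats, so the dedup path actually gets exercised
_call_count = {"n": 0}

def fake_generator(topic, n=5):
    batch = []
    for _ in range(n):
        _call_count["n"] += 1
        idx = _call_count["n"] // 2
        batch.append({"instruction": f"Task {idx} about {topic}",
                      "input": "", "output": "x" * 30})
    return batch


pairs = generate_in_batches(fake_generator, "refund policies", target=8)
print(f"Collected {len(pairs)} unique pair(s) from {_call_count['n']} generated")
```

One caveat for production use: add a hard cap on total calls, because a generator that keeps repeating itself would otherwise loop forever.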

Step 4: Filter Out Low-Quality Examples

python
def filter_dataset(examples: list[dict]) -> tuple[list[dict], dict]:
    """
    Apply quality filters. Returns (filtered_examples, removal_stats).

    Filters:
    - Instruction too short: < 10 chars (vague or trivial)
    - Output empty: no response at all
    - Output too short: < 20 chars (likely a stub or placeholder)
    - Output too long: > 2048 chars (likely off-topic or repetitive)
    """
    filtered = []
    removed = {"too_short_instruction": 0, "empty_output": 0,
               "too_short_output": 0, "too_long_output": 0}

    for ex in examples:
        instruction = ex.get("instruction", "")
        output = ex.get("output", "")

        if len(instruction) < 10:
            removed["too_short_instruction"] += 1
            continue
        if not output.strip():
            removed["empty_output"] += 1
            continue
        if len(output) < 20:
            removed["too_short_output"] += 1
            continue
        if len(output) > 2048:
            removed["too_long_output"] += 1
            continue

        filtered.append(ex)

    return filtered, removed


raw_examples = [
    {"instruction": "Hi", "input": "", "output": "Hello there!"},
    {
        "instruction": "Classify the sentiment of this review.",
        "input": "The package arrived broken and customer service was no help at all.",
        "output": "Negative — the reviewer reports a damaged delivery and unhelpful support."
    },
    {
        "instruction": "Explain Python list comprehensions.", "input": "",
        "output": "List comprehensions provide a concise way to create lists from iterables using a single line of code."
    },
    {"instruction": "Summarize this article.", "input": "Some text.", "output": ""},
    {"instruction": "Write a haiku.", "input": "", "output": "Ok"},
]

filtered, stats = filter_dataset(raw_examples)

print(f"Before filtering: {len(raw_examples)} example(s)")
print(f"After filtering:  {len(filtered)} example(s)")
print(f"\nRemoval reasons:")
for reason, count in stats.items():
    if count > 0:
        print(f"  {reason}: {count}")
python
Before filtering: 5 example(s)
After filtering:  2 example(s)

Removal reasons:
  too_short_instruction: 1
  empty_output: 1
  too_short_output: 1

Two examples survive: the sentiment classification and the list comprehension explanation. “Hi” fails the instruction length check. The empty article summary is caught by the empty output filter. “Ok” fails the minimum output length check.

The 20-character output minimum feels arbitrary, but I’ve never regretted setting it. It catches stub outputs (“Yes”, “Ok”, “Done.”) that would teach your model to give monosyllabic answers — which is almost never what you want.

Tip: **Add domain-specific filters on top of generic ones.** If you’re building a medical Q&A model, filter out any examples that express certainty about diagnoses. For a code assistant, add `ast.parse()` validation to ensure Python code blocks in outputs are syntactically valid. Generic filters catch generic problems — your domain filters catch domain problems.
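For the code-assistant case, that domain filter might look like the sketch below: extract fenced Python blocks from each output and reject the example if any block fails `ast.parse`. (The fence-matching regex is a simplification; real Markdown has more edge cases than this handles.)

```python
import ast
import re

FENCE = "`" * 3  # build the fence marker so this snippet doesn't contain one literally
BLOCK_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def has_valid_python(output: str) -> bool:
    """Return False if any fenced python block in the output fails to parse."""
    for block in BLOCK_RE.findall(output):
        try:
            ast.parse(block)
        except SyntaxError:
            return False
    return True  # no code blocks, or all of them parse


good = f"Use a loop:\n{FENCE}python\nfor i in range(3):\n    print(i)\n{FENCE}"
bad = f"Broken:\n{FENCE}python\nfor i in range(3)\n    print(i)\n{FENCE}"  # missing colon
print(has_valid_python(good), has_valid_python(bad))
```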

Try It Yourself

Exercise 2: Add echo detection to the quality filter

A common failure in synthetic generation: the model echoes back the input field as its “output” instead of answering. Extend filter_dataset to also remove examples where output.strip() == input.strip() (when input is non-empty).

python
# Starter code — extend this function
def filter_dataset_v2(examples: list[dict]) -> tuple[list[dict], dict]:
    filtered = []
    removed = {"too_short_instruction": 0, "empty_output": 0,
               "too_short_output": 0, "too_long_output": 0, "output_copies_input": 0}

    for ex in examples:
        instruction = ex.get("instruction", "")
        output = ex.get("output", "")
        inp = ex.get("input", "")

        if len(instruction) < 10:
            removed["too_short_instruction"] += 1; continue
        if not output.strip():
            removed["empty_output"] += 1; continue
        if len(output) < 20:
            removed["too_short_output"] += 1; continue
        if len(output) > 2048:
            removed["too_long_output"] += 1; continue
        # TODO: add echo detection — skip if output is identical to input

        filtered.append(ex)

    return filtered, removed


test_cases = [
    {"instruction": "Rewrite this sentence.", "input": "The cat sat on the mat.",
     "output": "The cat sat on the mat."},        # echo — should be removed
    {"instruction": "Rewrite this sentence.", "input": "The cat sat on the mat.",
     "output": "A feline rested upon the rug."},  # OK — should survive
]

filtered, stats = filter_dataset_v2(test_cases)
print(f"Survived: {len(filtered)}")                    # Should print: 1
print(f"Echo removed: {stats['output_copies_input']}")  # Should print: 1
Solution
python
def filter_dataset_v2(examples: list[dict]) -> tuple[list[dict], dict]:
    filtered = []
    removed = {"too_short_instruction": 0, "empty_output": 0,
               "too_short_output": 0, "too_long_output": 0, "output_copies_input": 0}

    for ex in examples:
        instruction = ex.get("instruction", "")
        output = ex.get("output", "")
        inp = ex.get("input", "")

        if len(instruction) < 10:
            removed["too_short_instruction"] += 1; continue
        if not output.strip():
            removed["empty_output"] += 1; continue
        if len(output) < 20:
            removed["too_short_output"] += 1; continue
        if len(output) > 2048:
            removed["too_long_output"] += 1; continue
        if inp.strip() and output.strip() == inp.strip():
            removed["output_copies_input"] += 1; continue

        filtered.append(ex)

    return filtered, removed


test_cases = [
    {"instruction": "Rewrite this sentence.", "input": "The cat sat on the mat.",
     "output": "The cat sat on the mat."},
    {"instruction": "Rewrite this sentence.", "input": "The cat sat on the mat.",
     "output": "A feline rested upon the rug."},
]

filtered, stats = filter_dataset_v2(test_cases)
print(f"Survived: {len(filtered)}")
print(f"Echo removed: {stats['output_copies_input']}")

**Output:**

python
Survived: 1
Echo removed: 1

The echo check only runs when `inp` is non-empty — we don’t want to accidentally flag cases where both the input and output happen to be empty strings. This filter is especially important for summarization tasks, where a low-quality LLM will sometimes just return the text it was asked to summarize.


Step 5: Deduplicate Your Dataset

Here’s a reliable signal that your dataset has duplication problems: a model fine-tuned on 1,000 examples that gives the same response to 5 different questions. Deduplication is the step that prevents this failure; skip it, and it happens constantly.

LLM-generated data is especially prone to near-duplicates. Ask GPT-4o-mini to generate 10 examples on “Python error handling” and at least 3 will be variations of “write a try/except block.” Duplicates waste training steps and bias your model toward whatever phrasing appeared most often.

python
import hashlib

def deduplicate_dataset(examples: list[dict]) -> tuple[list[dict], int]:
    """Remove duplicates using MD5 hash of instruction + input."""
    seen_hashes = set()
    unique_examples = []

    for ex in examples:
        # Hash the instruction+input pair — same question with two different outputs is a contradiction
        content = ex.get("instruction", "").strip() + "|" + ex.get("input", "").strip()
        hash_val = hashlib.md5(content.encode("utf-8")).hexdigest()

        if hash_val not in seen_hashes:
            seen_hashes.add(hash_val)
            unique_examples.append(ex)

    removed_count = len(examples) - len(unique_examples)
    return unique_examples, removed_count


examples_with_dupes = [
    {"instruction": "What is Python?", "input": "", "output": "Python is a programming language."},
    {"instruction": "What is Python?", "input": "", "output": "Python is a high-level language."},  # same instruction
    {"instruction": "Explain recursion.", "input": "", "output": "Recursion is when a function calls itself."},
    {"instruction": "Explain recursion.", "input": "", "output": "A function that calls itself is recursive."},  # same
    {"instruction": "What is a decorator?", "input": "", "output": "A decorator modifies function behavior."},
]

unique, removed_count = deduplicate_dataset(examples_with_dupes)
print(f"Before dedup: {len(examples_with_dupes)} example(s)")
print(f"After dedup:  {len(unique)} example(s)")
print(f"Removed:      {removed_count} duplicate(s)")
python
Before dedup: 5 example(s)
After dedup:  3 example(s)
Removed:      2 duplicate(s)

The hash covers instruction + input, not output. That’s intentional. When two examples have the same question but different answers, we keep the first one and discard the second — a model can’t learn to give two different answers to the same question.

Hash deduplication is fast and sufficient for datasets under 10K examples. Beyond that, consider semantic deduplication using sentence-transformers to catch near-duplicates like “What is Python?” and “Can you explain what Python is?” — exact hashing won’t catch those.
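If sentence-transformers is more machinery than your dataset needs, the standard library's `difflib.SequenceMatcher` gives a coarser stand-in: it scores surface-level string similarity rather than meaning, so it catches punctuation and phrasing variants but not true paraphrases. The 0.9 threshold below is illustrative:

```python
from difflib import SequenceMatcher

def near_dedup(examples: list[dict], threshold: float = 0.9) -> list[dict]:
    """Drop examples whose instruction is >threshold similar to one already kept.

    O(n^2) comparisons — fine for small datasets, too slow past ~10K examples.
    """
    kept = []
    for ex in examples:
        inst = ex.get("instruction", "").strip().lower()
        kept_insts = (e.get("instruction", "").strip().lower() for e in kept)
        if any(SequenceMatcher(None, inst, k).ratio() > threshold for k in kept_insts):
            continue
        kept.append(ex)
    return kept


examples = [
    {"instruction": "What is Python?", "input": "", "output": "A language."},
    {"instruction": "What is Python??", "input": "", "output": "A language."},  # near-dup
    {"instruction": "Explain recursion.", "input": "", "output": "Self-calls."},
]
deduped = near_dedup(examples)
print(f"Kept {len(deduped)} of {len(examples)} example(s)")
```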


Try It Yourself

Exercise 3: Validate required fields before upload

Before uploading, you want to catch malformed examples — dicts missing required fields that would silently break your training loop. Write validate_alpaca_examples that returns a list of (index, missing_key) tuples for any example missing instruction or output.

python
# Starter code
def validate_alpaca_examples(examples: list[dict]) -> list[tuple[int, str]]:
    """
    Returns list of (index, missing_key) for malformed examples.
    Required keys: 'instruction', 'output'
    """
    required_keys = ["instruction", "output"]
    issues = []
    # TODO: iterate examples, check for missing required keys
    return issues


test_data = [
    {"instruction": "Summarize this.", "input": "", "output": "Done summary."},  # schema OK (short output would fail the length filter, but required keys present)
    {"instruction": "Explain Python.", "input": ""},                              # missing 'output'
    {"input": "Some context.", "output": "An answer here."},                      # missing 'instruction'
    {"instruction": "What is ML?", "output": "Machine learning uses algorithms."}, # OK
]

issues = validate_alpaca_examples(test_data)
if issues:
    print(f"Found {len(issues)} issue(s):")
    for idx, key in issues:
        print(f"  Example {idx}: missing '{key}'")
else:
    print("All examples valid!")
Solution
python
def validate_alpaca_examples(examples: list[dict]) -> list[tuple[int, str]]:
    required_keys = ["instruction", "output"]
    issues = []
    for idx, ex in enumerate(examples):
        for key in required_keys:
            if key not in ex:
                issues.append((idx, key))
    return issues


test_data = [
    {"instruction": "Summarize this.", "input": "", "output": "Done summary."},
    {"instruction": "Explain Python.", "input": ""},
    {"input": "Some context.", "output": "An answer here."},
    {"instruction": "What is ML?", "output": "Machine learning uses algorithms."},
]

issues = validate_alpaca_examples(test_data)
if issues:
    print(f"Found {len(issues)} issue(s):")
    for idx, key in issues:
        print(f"  Example {idx}: missing '{key}'")
else:
    print("All examples valid!")

**Output:**

python
Found 2 issue(s):
  Example 1: missing 'output'
  Example 2: missing 'instruction'

Run this before every upload as a pre-flight assertion. Missing keys fail silently in some training frameworks — the example is skipped without an error message, which makes it very hard to diagnose why your training loss looks wrong.


Step 6: Analyze Your Dataset for Instruction Fine-Tuning

Before uploading, run a quick analysis. Distribution problems — all examples one length, one task type, or one vocabulary level — show up clearly in the statistics and are invisible in individual examples. I’ve caught datasets that looked fine in spot-checks but whose output lengths swung 400+ characters around the mean, which produced models that were unpredictably verbose.

python
import statistics

def analyze_dataset(examples: list[dict]) -> None:
    """Print key statistics about your instruction dataset."""
    if not examples:
        print("Dataset is empty.")
        return

    instruction_lengths = [len(ex.get("instruction", "")) for ex in examples]
    output_lengths = [len(ex.get("output", "")) for ex in examples]
    has_input = sum(1 for ex in examples if ex.get("input", "").strip())

    print(f"Total examples:       {len(examples)}")
    print(f"With context (input): {has_input} ({100 * has_input // len(examples)}%)")
    print(f"")
    print(f"Instruction lengths (chars):")
    print(f"  Mean:  {statistics.mean(instruction_lengths):.0f}")
    print(f"  Min:   {min(instruction_lengths)}")
    print(f"  Max:   {max(instruction_lengths)}")
    print(f"")
    print(f"Output lengths (chars):")
    print(f"  Mean:  {statistics.mean(output_lengths):.0f}")
    print(f"  Min:   {min(output_lengths)}")
    print(f"  Max:   {max(output_lengths)}")


sample_dataset = [
    {"instruction": "What is Python?", "input": "",
     "output": "Python is a high-level, interpreted programming language."},
    {"instruction": "Explain list comprehensions.", "input": "",
     "output": "List comprehensions provide a concise way to create lists from iterables."},
    {"instruction": "Summarize this text.", "input": "Python was created in 1991.",
     "output": "Python was created in 1991."},
]

analyze_dataset(sample_dataset)
python
Total examples:       3
With context (input): 1 (33%)

Instruction lengths (chars):
  Mean:  21
  Min:   15
  Max:   28

Output lengths (chars):
  Mean:  52
  Min:   27
  Max:   73
Note: **This example uses 3 examples for illustration — in practice, run this function on your full dataset of 500+ examples.** The statistics are meaningless on a handful of samples; they become actionable at scale.

What to look for in your analysis:

  • High output length variance (min=7, max=2,000): your model won’t know what length is expected and will produce randomly long or short responses.
  • Very low input ratio (e.g., 1%): if almost no examples use the input field, your model won’t learn to use context when you provide it at inference time.
  • Very low minimum output length: means short stub outputs slipped through your filter.
Note: **Estimate your training token budget before you start.** Total training tokens determine cost and training time. A rough formula: `total_tokens ≈ total_characters / 4` (for English prose; code tokenizes at fewer characters per token). For a 5,000-example dataset with mean combined instruction+output length of 400 characters, that’s roughly `5,000 × 400 / 4 = 500,000 tokens`. At typical fine-tuning costs, that’s a meaningful difference from a 5M-token dataset. Run this estimate before committing to a dataset size.
Warning: **Character counts underestimate token counts for code.** Code, special characters, and non-English text tokenize at fewer characters per token than English prose. The “1 token ≈ 4 characters” rule applies to English text only. If your dataset mixes prose and code, verify output lengths against your tokenizer’s actual token count to avoid truncation during training.
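The budget formula from the note above, as a helper you can run on your own records (the 4-characters-per-token constant is the rough English-prose heuristic, not a tokenizer; swap in your model's tokenizer for code-heavy data):

```python
def estimate_training_tokens(examples, chars_per_token=4):
    """Rough token budget: total characters across all fields / chars_per_token."""
    total_chars = sum(
        len(ex.get("instruction", "")) + len(ex.get("input", "")) + len(ex.get("output", ""))
        for ex in examples
    )
    return total_chars // chars_per_token


# 5,000 examples x 400 combined chars -> the ~500,000 tokens from the note above
sample = [{"instruction": "a" * 100, "input": "", "output": "b" * 300}] * 5000
print(estimate_training_tokens(sample))  # 500000
```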

Step 7: Package and Upload Your Instruction Dataset

The HuggingFace Hub is the standard home for fine-tuning datasets — every major training framework (trl, Axolotl, Unsloth) can load directly from it with a single call. Uploading also gives you versioning, which matters more than it sounds when you’re iterating on dataset quality.

python
from datasets import Dataset
import json

# For this example, use inline data (replace with load_jsonl("your_dataset.jsonl") for real use)
training_data = [
    {"instruction": "What is Python?", "input": "",
     "output": "Python is a high-level, interpreted programming language created in 1991."},
    {"instruction": "Explain list comprehensions.", "input": "",
     "output": "List comprehensions create lists from iterables in a single readable line."},
    {"instruction": "What is a decorator?", "input": "",
     "output": "A decorator is a function that modifies another function's behavior."},
    {"instruction": "Explain recursion.", "input": "",
     "output": "Recursion is when a function calls itself to solve a smaller version of the same problem."},
    {"instruction": "What is a generator?", "input": "",
     "output": "A generator is a function that yields values one at a time, saving memory."},
]

# Create HuggingFace Dataset and split
hf_dataset = Dataset.from_list(training_data)
split_dataset = hf_dataset.train_test_split(test_size=0.2, seed=42)

print(f"Training examples: {len(split_dataset['train'])}")
print(f"Test examples:     {len(split_dataset['test'])}")
print(f"Dataset features:  {list(split_dataset['train'].features.keys())}")

# Upload to HuggingFace Hub (run: huggingface-cli login first)
# split_dataset.push_to_hub("your-username/your-dataset-name", private=True)
print("\nTo upload: split_dataset.push_to_hub('username/dataset-name')")
text
Training examples: 4
Test examples:     1
Dataset features:  ['instruction', 'input', 'output']

To upload: split_dataset.push_to_hub('username/dataset-name')
Tip: **Always split before uploading.** If you upload an unsplit dataset, different fine-tuning frameworks apply different default split ratios — `test_size=0.1` vs `test_size=0.2` vs no split at all. Making the split explicit ensures reproducible evaluation. For datasets under 5,000 examples, use `test_size=0.1` (10%); the toy example above uses 0.2 only so that its five examples yield a nonempty test split. For larger datasets, a fixed 500–1,000 held-out examples is enough.

Authenticate with HuggingFace before uploading:

bash
huggingface-cli login

The command prompts for a token in your terminal. Paste your HuggingFace access token (huggingface.co → Settings → Access Tokens → New token → Write role).


Step 8: Validate Your Instruction Dataset Before Fine-Tuning

The last check before training is loading your dataset back and printing random examples. This catches encoding issues, formatting problems, and accidental truncation introduced during serialization — issues that are invisible in your Python objects but show up as corrupted JSON on disk.

Every time I skip this step I end up re-training after discovering a problem that was right there in the data. Two minutes of sampling saves hours of GPU time.

python
import random

random.seed(42)
train_size = len(split_dataset["train"])
sample_indices = random.sample(range(train_size), min(2, train_size))

print("=== Random sample from training set ===\n")
for idx in sample_indices:
    example = split_dataset["train"][idx]
    print(f"[Example {idx}]")
    print(f"  Instruction: {example['instruction']}")
    print(f"  Input:       {example['input'] or '(none)'}")
    print(f"  Output:      {example['output'][:80]}{'...' if len(example['output']) > 80 else ''}")
    print()
text
=== Random sample from training set ===

[Example 2]
  Instruction: What is a decorator?
  Input:       (none)
  Output:      A decorator is a function that modifies another function's behavior.

[Example 0]
  Instruction: What is Python?
  Input:       (none)
  Output:      Python is a high-level, interpreted programming language created in 1991.

Pre-training validation checklist:

  • Format consistency: all examples have the same keys; no extras.
  • No truncated outputs: outputs don’t end mid-sentence (an ellipsis in the printed sample comes from the 80-character display truncation in the sampling code, not from the data itself).
  • No encoding artifacts: no mojibake such as â€™ or Ã© — these signal UTF-8 text decoded as Latin-1.
  • Task distribution balance: no single instruction type accounts for more than 50% of examples.
  • Output length distribution: mean output length matches the response length you expect at inference.
  • Train/test independence: no test example is identical to a training example.
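The train/test independence check is easy to automate. A minimal exact-match sketch, keyed on a case-insensitive (instruction, input) pair; near-duplicates need the semantic pass discussed under common mistakes below:

```python
def find_split_leakage(train, test):
    """Return test examples whose (instruction, input) pair also appears in train."""
    def key(ex):
        return (ex["instruction"].strip().lower(), ex.get("input", "").strip().lower())

    train_keys = {key(ex) for ex in train}
    return [ex for ex in test if key(ex) in train_keys]


train = [{"instruction": "What is Python?", "input": "", "output": "A language."}]
test = [{"instruction": "what is python?", "input": "", "output": "A language."}]
print(len(find_split_leakage(train, test)))  # 1: the case-insensitive match is caught
```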

Common Mistakes When Building Instruction Datasets

Mistake 1: Adding the full Alpaca dataset “just in case.”

The 52,000-example Alpaca dataset covers everything from poetry to math to philosophy. Adding it wholesale to a domain-specific dataset dilutes your fine-tuning — the model learns to answer generic questions well and your specific task only marginally better. Filter public datasets to your topic, or skip them entirely for narrow tasks.

Fix: Filter any public dataset to your domain using keyword matching on the instruction field before merging.


Mistake 2: Inconsistent output format across generation batches.

Generating in batches with different prompts produces outputs that vary in format — some start with “Sure!”, some don’t, some end with punctuation, some don’t. The model learns this inconsistency and reproduces it.

Fix: Write one canonical generation prompt and use it for every batch. Before training, run a normalization pass: strip leading phrases (“Sure, here’s…”, “Of course!”), standardize punctuation, verify consistent formatting.
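A sketch of that normalization pass (the `LEADING_PHRASES` list is illustrative: extend it with whatever boilerplate your own generation batches actually produce):

```python
import re

# Illustrative patterns: leading filler like "Sure! ", "Sure, here's ...:", "Of course!"
LEADING_PHRASES = [
    r"^sure[,!]?\s*(here('s| is)[^:]*:?\s*)?",
    r"^of course[,!]?\s*",
]


def normalize_output(text: str) -> str:
    """Strip leading filler phrases and standardize terminal punctuation."""
    for pattern in LEADING_PHRASES:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = text.strip()
    if text and text[-1] not in ".!?":
        text += "."
    return text


print(normalize_output("Sure! Python is a high-level language"))
# Python is a high-level language.
```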


Mistake 3: No examples of refusal or uncertainty.

If your model should ever say “I don’t know” or “That’s outside my expertise,” you need training examples of that behavior. A model that has never seen a refusal will hallucinate an answer rather than admit uncertainty.

Fix: Add 5–10% “I don’t have that information” or “That’s outside my scope” examples. For each boundary you want the model to respect, include at least 3 refusal examples.
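A quick way to audit the refusal share (the `REFUSAL_MARKERS` phrases are assumptions — match them to the exact refusal wording you use in your own examples):

```python
REFUSAL_MARKERS = ("outside my scope", "don't have that information")


def refusal_ratio(examples):
    """Fraction of examples whose output contains a refusal phrase."""
    refusals = sum(
        any(marker in ex["output"].lower() for marker in REFUSAL_MARKERS)
        for ex in examples
    )
    return refusals / len(examples)


dataset = [
    {"instruction": "Do you offer refunds?", "input": "",
     "output": "Yes, we offer full refunds within 30 days of purchase."},
    {"instruction": "What's your CEO's home address?", "input": "",
     "output": "I don't have that information, and I can't share personal details."},
]
print(f"{refusal_ratio(dataset):.0%}")  # 50%
```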


Mistake 4: Forgetting the system prompt in training data.

This is probably the most expensive mistake I’ve seen — it costs teams days of re-fine-tuning. If your model will use a system prompt at inference time (“You are a helpful assistant for Company X”), every training example should include that system prompt in ShareGPT format. A model never trained with a system prompt will partially or completely ignore it at inference time.

Fix: Use ShareGPT format and add your exact production system prompt to every training example from the start.
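A single ShareGPT-format record with the system turn included (the system prompt string here is a placeholder: substitute your exact production prompt):

```python
SYSTEM_PROMPT = "You are a helpful assistant for Company X."  # placeholder: use your production prompt

# ShareGPT format: a "conversations" list of {"from", "value"} turns,
# with roles "system", "human", and "gpt"
example = {
    "conversations": [
        {"from": "system", "value": SYSTEM_PROMPT},
        {"from": "human", "value": "Do you offer refunds?"},
        {"from": "gpt", "value": "Yes, we offer full refunds within 30 days of purchase."},
    ]
}
print(example["conversations"][0]["from"])  # system
```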


Mistake 5: Relying on exact-match deduplication before the split.

Exact hashing removes identical examples, but near-duplicates (“How do I reset my password?” vs. “How can I reset my password?”) survive and can land on both sides of the train/test split. This inflates evaluation metrics: your model looks better than it is because the test questions resemble training questions.

Fix: Run semantic (near-duplicate) deduplication across the full dataset before splitting, then verify that no test example has a close match in the training set.
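A lightweight approximation of near-duplicate detection using stdlib `difflib` string similarity rather than embeddings — fine for small datasets, but it is O(n²) and misses paraphrases, so assume you would switch to embedding-based deduplication at scale:

```python
from difflib import SequenceMatcher


def near_duplicates(examples, threshold=0.9):
    """Return index pairs of examples with near-identical instructions."""
    instructions = [ex["instruction"].lower() for ex in examples]
    pairs = []
    for i in range(len(instructions)):
        for j in range(i + 1, len(instructions)):
            if SequenceMatcher(None, instructions[i], instructions[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs


data = [
    {"instruction": "How do I reset my password?"},
    {"instruction": "How do I reset my password"},  # near-duplicate: missing "?"
    {"instruction": "Do you offer refunds?"},
]
print(near_duplicates(data))  # [(0, 1)]
```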


Complete Pipeline Script

Full pipeline (copy-paste and run):
python
# Complete instruction dataset pipeline
# Requirements: pip install datasets "openai>=1.0" huggingface-hub
# Python 3.9+

import hashlib

from datasets import Dataset


# --- Step 1: Define raw Q&A pairs ---
raw_qa_pairs = [
    ("What are your business hours?", "We are open Monday–Friday, 9am–6pm EST."),
    ("How do I reset my password?", "Click 'Forgot Password' on the login page and follow the email instructions."),
    ("Do you offer refunds?", "Yes, we offer full refunds within 30 days of purchase."),
    ("What is your return policy?", "Items can be returned within 30 days in original unworn condition."),
    ("How long does shipping take?", "Standard shipping takes 3–5 business days within the continental US."),
]


# --- Step 2: Convert to Alpaca format ---
def convert_to_alpaca(qa_pairs: list) -> list[dict]:
    return [{"instruction": q, "input": "", "output": a} for q, a in qa_pairs]


# --- Step 3: Filter low-quality examples ---
def filter_dataset(examples: list[dict]) -> tuple[list[dict], dict]:
    filtered, removed = [], {"too_short_instruction": 0, "empty_output": 0,
                              "too_short_output": 0, "too_long_output": 0}
    for ex in examples:
        instruction, output = ex.get("instruction", ""), ex.get("output", "")
        if len(instruction) < 10:
            removed["too_short_instruction"] += 1; continue
        if not output.strip():
            removed["empty_output"] += 1; continue
        if len(output) < 20:
            removed["too_short_output"] += 1; continue
        if len(output) > 2048:
            removed["too_long_output"] += 1; continue
        filtered.append(ex)
    return filtered, removed


# --- Step 4: Deduplicate ---
def deduplicate_dataset(examples: list[dict]) -> tuple[list[dict], int]:
    seen, unique = set(), []
    for ex in examples:
        content = ex.get("instruction", "").strip() + "|" + ex.get("input", "").strip()
        h = hashlib.md5(content.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(ex)
    return unique, len(examples) - len(unique)


# --- Step 5: Validate schema ---
def validate_alpaca_examples(examples: list[dict]) -> list[tuple[int, str]]:
    issues = []
    for idx, ex in enumerate(examples):
        for key in ["instruction", "output"]:
            if key not in ex:
                issues.append((idx, key))
    return issues


# --- Run the full pipeline ---
print("=== Instruction Dataset Pipeline ===\n")

data = convert_to_alpaca(raw_qa_pairs)
print(f"Step 2 — Converted:  {len(data)} example(s)")

data, removed = filter_dataset(data)
print(f"Step 3 — Filtered:   {len(data)} example(s) (removed: {sum(removed.values())})")

data, dupes = deduplicate_dataset(data)
print(f"Step 4 — Deduped:    {len(data)} example(s) (removed: {dupes} duplicate(s))")

issues = validate_alpaca_examples(data)
if issues:
    print(f"Step 5 — Validation FAILED: {len(issues)} issue(s)")
    for idx, key in issues:
        print(f"  Example {idx}: missing '{key}'")
else:
    print("Step 5 — Validation: all examples have required keys")

hf_dataset = Dataset.from_list(data)
split = hf_dataset.train_test_split(test_size=0.2, seed=42)
print(f"\nFinal dataset — Train: {len(split['train'])}, Test: {len(split['test'])}")

# Uncomment to upload:
# split.push_to_hub("your-username/your-dataset-name", private=True)
print("\nScript completed successfully.")

Frequently Asked Questions

How many examples do I need for fine-tuning?

For task-specific fine-tuning (classification, extraction, formatting), 500–2,000 high-quality examples often suffice. For teaching new response styles or domain knowledge, aim for 2,000–10,000. The LIMA paper showed 1,000 carefully curated examples can match 52,000 noisy ones. Start with 1,000, evaluate, and scale only if performance falls short. More is only better if the additional examples are genuinely high-quality.

Can I use GPT-4 or Claude to generate training data for open-source models?

LLM terms of service differ on this. OpenAI prohibits using their outputs to train a model that competes with OpenAI products. For non-competing use cases (internal tooling, domain-specific assistants), LLM-generated data is widely used. Claude’s API terms and other providers have similar nuances — check the specific current terms for your data generator before deploying a commercial model trained on that data.

Should my outputs include chain-of-thought reasoning?

Match your outputs to your inference expectations. If you want your model to reason step-by-step at inference time (“First, I’ll identify… Then I’ll…”), your training outputs should demonstrate that pattern. If you want direct answers, train on direct answers. The model reproduces what it sees in training.

How do I handle PII (personally identifiable information) in my raw data?

Real data often contains names, emails, and addresses that the model can memorize and reproduce. Run a PII scrubber before uploading — the presidio-anonymizer library can detect and replace PII with placeholders ([CUSTOMER_NAME], [EMAIL]) automatically. Apply this especially to the input field, which often contains user-submitted text.
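For a first pass, a regex-based scrub catches the mechanical patterns (the patterns below are illustrative; regexes miss names, addresses, and contextual PII, which is why a dedicated tool like presidio-anonymizer is the better production choice):

```python
import re

# Illustrative patterns only: emails and US-style phone numbers
PII_PATTERNS = {
    r"[\w.+-]+@[\w-]+\.[\w.]+": "[EMAIL]",
    r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b": "[PHONE]",
}


def scrub_pii(text: str) -> str:
    """Replace matched PII patterns with bracketed placeholders."""
    for pattern, placeholder in PII_PATTERNS.items():
        text = re.sub(pattern, placeholder, text)
    return text


print(scrub_pii("Contact jane.doe@example.com or 555-123-4567 for help."))
# Contact [EMAIL] or [PHONE] for help.
```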

What is the difference between supervised fine-tuning (SFT) and RLHF?

This article covers SFT: training on instruction-response pairs where you provide the ideal response. RLHF (Reinforcement Learning from Human Feedback) uses human preference data — pairs of responses labeled “A is better than B” — to train a reward model that guides further training. SFT comes first; RLHF is a post-SFT alignment step that requires more infrastructure. If you’re just getting started, do SFT first and evaluate before adding RLHF complexity.

What is a dataset card and do I need one?

A dataset card is the README file attached to your HuggingFace Hub dataset. It documents what the dataset contains, how it was created, what tasks it’s suitable for, and any known limitations. You don’t need one to fine-tune privately, but it’s worth adding before sharing publicly. HuggingFace provides a template — at minimum, document your data source, generation method, filtering criteria, and the fine-tuning task the dataset targets. Other practitioners (and your future self) will thank you.
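A minimal dataset card sketch — every field value below is a placeholder, and the YAML header uses the Hub's standard metadata keys (license, language, task_categories, size_categories) that power search and filtering:

```markdown
---
license: mit
language:
- en
task_categories:
- text-generation
size_categories:
- n<1K
---

# Customer Support Instruction Dataset

## Source
Internal support transcripts plus synthetic augmentation (placeholder description).

## Generation and Filtering
One canonical generation prompt per batch; length filters (20–2,048 chars); exact-hash deduplication.

## Intended Task
Supervised fine-tuning of a domain-specific support assistant. Not suitable for general-purpose chat.

## Known Limitations
English only; refusal examples cover a narrow set of boundaries.
```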




References

  1. Taori, R. et al. — “Alpaca: A Strong, Replicable Instruction-Following Model.” Stanford CRFM, 2023. Link
  2. Zhou, C. et al. — “LIMA: Less Is More for Alignment.” NeurIPS 2023. arXiv:2305.11206
  3. Wang, Y. et al. — “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” ACL 2023. arXiv:2212.10560
  4. HuggingFace Datasets — Creating and Sharing Datasets. Link
  5. HuggingFace Datasets — Share a Dataset to the Hub. Link
  6. Meta AI — “How to Fine-Tune LLMs: PEFT and Dataset Curation.” Meta AI Blog, 2024. Link
  7. Unsloth Documentation — Datasets Guide for Fine-Tuning. Link
  8. Cleanlab — “How to detect bad data in your instruction tuning dataset.” Link
  9. Distilabel — Open-source framework for AI feedback and synthetic data pipelines. GitHub