How to Build a Custom Instruction Dataset for LLM Fine-Tuning
Every production fine-tuning project that fails, fails for the same reason: the dataset. Not the model, not the hyperparameters, not the compute budget. I’ve watched teams spend weeks tuning learning rates on models that behaved badly at inference time — then discovered their training data had inconsistent instructions, duplicate examples, and outputs written at three different quality levels. This guide walks you through building a custom instruction dataset from scratch: choosing the right format, collecting and generating data, filtering out bad examples, and packaging it for fine-tuning.
Why Your Dataset Determines Fine-Tuning Success
Here’s a number that surprises most practitioners: a dataset of 1,000 high-quality, carefully curated instruction-response pairs can outperform a dataset of 52,000 noisy ones. LIMA (Meta, 2023) trained on exactly 1,000 carefully selected examples and matched or exceeded Alpaca’s performance on several benchmarks. Alpaca had 52x more data.
The reason is straightforward. Fine-tuning doesn’t teach the model new knowledge — the base model already has it. Fine-tuning teaches the model how to respond: the format, the style, the level of detail. If your training examples are inconsistent, the model learns inconsistency.
Prerequisites:
– Python 3.9+ with datasets, openai>=1.0, and huggingface-hub installed
– A HuggingFace account (free) for dataset hosting
– An OpenAI API key for synthetic data generation (Step 3, optional but recommended)
– Basic Python and JSON familiarity — no ML background needed for this article
pip install datasets "openai>=1.0" huggingface-hub
Choosing Your Format: Alpaca, ShareGPT, and ChatML
Before collecting a single example, decide on your format. This choice affects every training script, every generation prompt, and every format conversion downstream. Changing formats mid-project means reformatting everything.
There are two formats you’ll encounter in every fine-tuning framework today — and one important variant for preference alignment.
Alpaca Format
Alpaca is the simplest format: a flat JSON object with three fields. Stanford released it with their Alpaca model in 2023, and it became the de facto standard for single-turn instruction tuning.
import json
# Alpaca format: instruction, optional input context, expected output
alpaca_example = {
"instruction": "Summarize the following customer complaint in one sentence.",
"input": "I ordered a laptop three weeks ago and it still hasn't arrived. "
"Customer support hasn't responded to my emails. This is unacceptable.",
"output": "Customer reports a three-week delivery delay and unresponsive support."
}
print(json.dumps(alpaca_example, indent=2))
print(f"\nKeys: {list(alpaca_example.keys())}")
{
"instruction": "Summarize the following customer complaint in one sentence.",
"input": "I ordered a laptop three weeks ago and it still hasn't arrived. Customer support hasn't responded to my emails. This is unacceptable.",
"output": "Customer reports a three-week delivery delay and unresponsive support."
}
Keys: ['instruction', 'input', 'output']
The input field is optional. When your instruction is self-contained (“Write a haiku about autumn”), leave input as an empty string. When the instruction requires additional context (“Summarize the following”), put that context in input.
Use Alpaca for: single-turn tasks (classification, summarization, extraction, code generation) where one instruction produces one response.
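At training time, frameworks render these three fields into a single prompt string. A minimal sketch of the standard Alpaca template (most frameworks ship this exact wording, but verify against the one you use):

```python
def render_alpaca(example: dict) -> str:
    """Render an Alpaca record into the standard Alpaca prompt template."""
    if example.get("input", "").strip():
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    # No-input variant: the preamble drops the mention of context
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

prompt = render_alpaca({
    "instruction": "Write a haiku about autumn.",
    "input": "",
    "output": "Crisp leaves underfoot",
})
print(prompt.splitlines()[0])  # the template preamble
```

This is why the empty-vs-missing `input` convention matters: it selects which template variant the example is rendered with.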
ShareGPT / ChatML Format
ShareGPT uses a messages array with role-labeled turns. It’s built for multi-turn conversations and matches the OpenAI Chat Completions API structure — your training data looks exactly like your inference calls.
# ShareGPT format: role-labeled conversation turns
sharegpt_example = {
"messages": [
{
"role": "system",
"content": "You are a helpful customer service assistant for an e-commerce company."
},
{
"role": "user",
"content": "I ordered a laptop three weeks ago and it still hasn't arrived. What can you do?"
},
{
"role": "assistant",
"content": "I'm sorry to hear about the delay. I can look up your order right now. "
"Could you share your order number so I can check the shipping status?"
}
]
}
print(f"Turns in conversation: {len(sharegpt_example['messages'])}")
print(f"Roles: {[m['role'] for m in sharegpt_example['messages']]}")
Turns in conversation: 3
Roles: ['system', 'user', 'assistant']
Use ShareGPT for: chatbots, assistants, anything with system prompts or multi-turn conversation.
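Converting between the two formats is mechanical, which is another reason the choice isn’t fatal if you start single-turn. A sketch of Alpaca → ShareGPT (the system prompt is a placeholder you’d replace with your own):

```python
def alpaca_to_sharegpt(ex: dict, system_prompt: str = "") -> dict:
    """Convert one Alpaca record into a ShareGPT-style messages dict."""
    # Fold the optional input context into the user turn
    user_content = ex["instruction"]
    if ex.get("input", "").strip():
        user_content += "\n\n" + ex["input"]
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": ex["output"]})
    return {"messages": messages}

converted = alpaca_to_sharegpt(
    {"instruction": "Summarize this.", "input": "Long text.", "output": "Short."},
    system_prompt="You are a helpful assistant.",
)
print([m["role"] for m in converted["messages"]])  # ['system', 'user', 'assistant']
```

The reverse direction is lossy: a multi-turn conversation doesn’t collapse cleanly into one instruction/output pair.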
Format Comparison: Which to Choose
| Format | Use Case | Key Fields | Supported Frameworks |
|---|---|---|---|
| Alpaca | Single-turn instruction tuning | instruction, input, output | trl, Axolotl, LLaMA-Factory, Unsloth |
| ShareGPT / ChatML | Multi-turn conversation, system prompts | messages[{role, content}] | trl, Unsloth, LLaMA-Factory, Axolotl |
| DPO (preference) | Preference alignment, RLHF alternative | prompt, chosen, rejected | trl, Axolotl |
DPO (Direct Preference Optimization) is worth knowing even if you’re not using it yet. Instead of a single “correct” output, each example contains a preferred response and a rejected one. The model learns to prefer the chosen output. If you plan to do alignment fine-tuning after SFT, you’ll need a separate DPO dataset in this format. For now, Alpaca or ShareGPT gets you to your first fine-tuned model — DPO is a post-SFT step.
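For reference, a single DPO record with the prompt/chosen/rejected fields from the table looks like this (the contents are illustrative):

```python
# DPO format: one prompt, a preferred response, and a dispreferred one
dpo_example = {
    "prompt": "Summarize this customer complaint in one sentence.",
    "chosen": "Customer reports a three-week delivery delay and unresponsive support.",
    "rejected": "Wow, that sounds really bad! Shipping delays are so frustrating, right?",
}
print(list(dpo_example.keys()))  # ['prompt', 'chosen', 'rejected']
```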
Exercise 1: Convert a Q&A list to Alpaca format
You have a list of raw Q&A pairs from a company FAQ. Convert each pair into a valid Alpaca-format dictionary and save the result as a JSONL file (one JSON object per line).
# Starter code
import json
raw_qa = [
("What are your business hours?", "We are open Monday–Friday, 9am–6pm EST."),
("How do I reset my password?", "Click 'Forgot Password' on the login page and follow the email instructions."),
("Do you offer refunds?", "Yes, we offer full refunds within 30 days of purchase."),
]
# TODO: Convert raw_qa to Alpaca format and write to "faq_dataset.jsonl"
# Each dict should have: instruction, input (empty string), output
def convert_to_alpaca(qa_pairs: list) -> list[dict]:
pass # Your code here
alpaca_data = convert_to_alpaca(raw_qa)
print(f"Converted {len(alpaca_data)} example(s)")
print(f"First example keys: {list(alpaca_data[0].keys()) if alpaca_data else 'None'}")
Solution
import json
raw_qa = [
("What are your business hours?", "We are open Monday–Friday, 9am–6pm EST."),
("How do I reset my password?", "Click 'Forgot Password' on the login page and follow the email instructions."),
("Do you offer refunds?", "Yes, we offer full refunds within 30 days of purchase."),
]
def convert_to_alpaca(qa_pairs: list) -> list[dict]:
alpaca_data = []
for question, answer in qa_pairs:
alpaca_data.append({
"instruction": question,
"input": "",
"output": answer,
})
return alpaca_data
alpaca_data = convert_to_alpaca(raw_qa)
with open("faq_dataset.jsonl", "w") as f:
for example in alpaca_data:
f.write(json.dumps(example) + "\n")
print(f"Converted {len(alpaca_data)} example(s)")
print(f"First example keys: {list(alpaca_data[0].keys())}")
**Output:**
Converted 3 example(s)
First example keys: ['instruction', 'input', 'output']
Each Q&A pair becomes an Alpaca dict with an empty `input` field — the question is self-contained so no context is needed. Writing as JSONL (one JSON object per line) is the standard format for large datasets because it’s streamable and easy to process line by line.
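Reading the file back is the mirror operation. A minimal loader sketch (`demo.jsonl` is a throwaway file the snippet creates itself; point the loader at your real dataset file instead):

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file back into a list of dicts, one object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip demo: write two records, read them back
records = [{"instruction": "Q1", "input": "", "output": "A1"},
           {"instruction": "Q2", "input": "", "output": "A2"}]
with open("demo.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

print(load_jsonl("demo.jsonl") == records)  # True
```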
Step 1: Collect and Source Your Raw Data
What’s the best source for a fine-tuning dataset? Whatever your model will actually be used on.
If you’re building a customer support bot, your existing support tickets are worth ten times any synthetic dataset you could generate. Real data captures the exact language, edge cases, and ambiguous requests your model will face in production. I’ve seen teams skip this and jump straight to synthetic generation when they had 8,000 real customer conversations sitting in their CRM. That’s a mistake every time.
Sources ranked by data quality for fine-tuning:
- Real examples from your production domain — highest quality; captures genuine user language and edge cases
- Curated public datasets filtered to your task — pre-cleaned but may not match your domain vocabulary
- LLM-generated synthetic data — fast and scalable; risks introducing LLM response patterns into training
How much data do you actually need? The answer depends heavily on task complexity. Here are practical starting targets:
| Task Type | Recommended Size | Notes |
|---|---|---|
| Sentiment / intent classification | 200 – 2K | Small datasets work well for narrow label sets |
| Named entity extraction | 500 – 3K | Needs coverage of all entity types |
| Text summarization | 1K – 5K | More examples = more style consistency |
| Domain-specific Q&A | 2K – 10K | Needs to cover full knowledge range |
| Multi-turn chatbot | 5K – 20K+ | Each turn adds complexity; more is better |
| General instruction following | 5K – 50K | Broad tasks require broad coverage |
These are starting points, not ceilings. The LIMA result (1K → matched 52K) suggests the upper bound on “enough” may be lower than you expect — but only if your 1K examples are genuinely high quality.
Loading a public dataset as a foundation is often the fastest start:
from datasets import load_dataset
# Stanford Alpaca — 52K general instruction examples, good as a seed
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(f"Alpaca dataset size: {len(alpaca):,} examples")
print(f"Features: {list(alpaca.features.keys())}")
# Keep only examples without a context field (pure instruction tasks)
pure_instruction = alpaca.filter(lambda x: x["input"].strip() == "")
print(f"Pure instruction examples: {len(pure_instruction):,}")
The real-vs-synthetic decision doesn’t have to be either/or. Start with whatever real examples you have, use them as seeds for synthetic generation, then filter the combined set. This is how most production datasets are built.
Step 2: Design High-Quality Instruction-Response Pairs
# This is what the difference looks like in practice
good_example = {
"instruction": "Classify the sentiment of this product review as Positive, Negative, or Neutral.",
"input": "The camera quality is excellent but the battery dies after 4 hours.",
"output": "Neutral"
}
bad_example = {
"instruction": "What do you think of this review?",
"input": "The camera quality is excellent but the battery dies after 4 hours.",
"output": "This review seems mixed — the camera is great but battery life is a concern. "
"Overall somewhat positive but with reservations. Many users find battery life frustrating."
}
print(f"Good output length: {len(good_example['output'])} chars")
print(f"Bad output length: {len(bad_example['output'])} chars")
print(f"Good instruction specificity: explicit label format (Positive/Negative/Neutral)")
print(f"Bad instruction specificity: open-ended ('What do you think?')")
Good output length: 7 chars
Bad output length: 176 chars
Good instruction specificity: explicit label format (Positive/Negative/Neutral)
Bad instruction specificity: open-ended ('What do you think?')
Three rules cover 80% of what makes an instruction pair good or bad.
Rule 1: Instructions must be unambiguous. Every person reading your instruction should produce the same type of response. “Improve this text” is ambiguous — improve how? For clarity? Conciseness? Formality? “Rewrite this customer email to be more concise, keeping all key information” is specific. In my experience, instruction clarity is the single biggest predictor of fine-tuning quality.
Rule 2: Outputs must be consistent in format and length. If half your training examples end with a period and half don’t, your model learns to randomly add periods. Decide your conventions — bullet points or prose, sentence case or title case, short or detailed — and apply them uniformly. A model trained on 7-character outputs (“Neutral”) will not learn to write paragraphs, and vice versa.
Rule 3: Cover the full range of your task, not just the easy cases. If you’re building a support bot, include examples with ambiguous requests, examples where the answer is “I don’t have that information,” and multi-step reasoning examples. A model trained only on clean, easy cases will hallucinate on hard ones.
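Rule 2 in particular can be audited mechanically. A quick sketch that flags two common output inconsistencies, mixed trailing punctuation and wide length spread (the checks and thresholds are starting points, not a standard):

```python
import statistics

def audit_consistency(examples: list[dict]) -> dict:
    """Report simple formatting-consistency signals over the outputs."""
    outputs = [ex["output"].strip() for ex in examples]
    ends_with_punct = sum(o.endswith((".", "!", "?")) for o in outputs)
    lengths = [len(o) for o in outputs]
    return {
        # Anything far from 0% or 100% means your outputs are inconsistent
        "pct_ending_with_punctuation": round(100 * ends_with_punct / len(outputs)),
        # Large spread means the model won't learn an expected response length
        "length_stdev": round(statistics.pstdev(lengths), 1),
    }

report = audit_consistency([
    {"output": "Negative."},
    {"output": "Positive."},
    {"output": "it depends on the reviewer"},  # no punctuation, different casing
])
print(report)  # {'pct_ending_with_punctuation': 67, 'length_stdev': 8.0}
```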
Step 3: Generate Synthetic Data Using an LLM
Once you have 50–200 high-quality seed examples, you can scale using an LLM as a data generator. This is how Alpaca was built: 175 seed instructions fed to GPT-3.5 generated 52,000 examples. The technique is called self-instruct and it’s now standard practice.
The key to good synthetic generation is constraint. Vague generation prompts produce vague data.
from openai import OpenAI
import json
client = OpenAI() # Requires OPENAI_API_KEY environment variable
def generate_instruction_pairs(
topic: str,
n: int = 5,
style: str = "concise and factual"
) -> list[dict]:
"""Generate Alpaca-format instruction-response pairs on a topic."""
prompt = f"""Generate {n} instruction-response pairs for fine-tuning an LLM assistant.
Topic: {topic}
Response style: {style}
Requirements:
- Each instruction must be a clear, specific task (not 'explain X' — be precise)
- Each output must directly and completely answer the instruction
- All outputs must be consistent in length and tone
- No duplicate instructions
Return ONLY a JSON object:
{{"pairs": [{{"instruction": "...", "input": "", "output": "..."}}]}}"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.7,
response_format={"type": "json_object"},
)
data = json.loads(response.choices[0].message.content)
return data.get("pairs", [])
pairs = generate_instruction_pairs(
topic="e-commerce customer support for a clothing retailer",
n=5,
style="professional, empathetic, and solution-oriented"
)
print(f"Generated {len(pairs)} instruction pair(s)")
if pairs:
print(f"\nSample instruction: {pairs[0]['instruction']}")
print(f"Sample output (first 80 chars): {pairs[0]['output'][:80]}...")
I use gpt-4o-mini for bulk generation and reserve gpt-4o for human review and spot-checking — the cost difference over 1,000 examples is an order of magnitude or more, and gpt-4o-mini quality is more than sufficient for generation.
If you need to automate large-scale pipelines — generating, annotating, and filtering thousands of examples in one workflow — Distilabel is worth looking at. It’s an open-source framework designed specifically for synthetic data generation with LLMs. You define a pipeline of generation and annotation steps; it handles batching, caching, and output validation. For datasets above 10K examples, the automation it provides saves significant manual work.
Step 4: Filter Out Low-Quality Examples
def filter_dataset(examples: list[dict]) -> tuple[list[dict], dict]:
"""
Apply quality filters. Returns (filtered_examples, removal_stats).
Filters:
- Instruction too short: < 10 chars (vague or trivial)
- Output empty: no response at all
- Output too short: < 20 chars (likely a stub or placeholder)
- Output too long: > 2048 chars (likely off-topic or repetitive)
"""
filtered = []
removed = {"too_short_instruction": 0, "empty_output": 0,
"too_short_output": 0, "too_long_output": 0}
for ex in examples:
instruction = ex.get("instruction", "")
output = ex.get("output", "")
if len(instruction) < 10:
removed["too_short_instruction"] += 1
continue
if not output.strip():
removed["empty_output"] += 1
continue
if len(output) < 20:
removed["too_short_output"] += 1
continue
if len(output) > 2048:
removed["too_long_output"] += 1
continue
filtered.append(ex)
return filtered, removed
raw_examples = [
{"instruction": "Hi", "input": "", "output": "Hello there!"},
{
"instruction": "Classify the sentiment of this review.",
"input": "The package arrived broken and customer service was no help at all.",
"output": "Negative — the reviewer reports a damaged delivery and unhelpful support."
},
{
"instruction": "Explain Python list comprehensions.", "input": "",
"output": "List comprehensions provide a concise way to create lists from iterables using a single line of code."
},
{"instruction": "Summarize this article.", "input": "Some text.", "output": ""},
{"instruction": "Write a haiku.", "input": "", "output": "Ok"},
]
filtered, stats = filter_dataset(raw_examples)
print(f"Before filtering: {len(raw_examples)} example(s)")
print(f"After filtering: {len(filtered)} example(s)")
print(f"\nRemoval reasons:")
for reason, count in stats.items():
if count > 0:
print(f" {reason}: {count}")
Before filtering: 5 example(s)
After filtering: 2 example(s)
Removal reasons:
too_short_instruction: 1
empty_output: 1
too_short_output: 1
Two examples survive: the sentiment classification and the list comprehension explanation. “Hi” fails the instruction length check. The empty article summary is caught by the empty output filter. “Ok” fails the minimum output length check.
The 20-character output minimum feels arbitrary, but I’ve never regretted setting it. It catches stub outputs (“Yes”, “Ok”, “Done.”) that would teach your model to give monosyllabic answers — which is almost never what you want.
Exercise 2: Add echo detection to the quality filter
A common failure in synthetic generation: the model echoes back the input field as its “output” instead of answering. Extend filter_dataset to also remove examples where output.strip() == input.strip() (when input is non-empty).
# Starter code — extend this function
def filter_dataset_v2(examples: list[dict]) -> tuple[list[dict], dict]:
filtered = []
removed = {"too_short_instruction": 0, "empty_output": 0,
"too_short_output": 0, "too_long_output": 0, "output_copies_input": 0}
for ex in examples:
instruction = ex.get("instruction", "")
output = ex.get("output", "")
inp = ex.get("input", "")
if len(instruction) < 10:
removed["too_short_instruction"] += 1; continue
if not output.strip():
removed["empty_output"] += 1; continue
if len(output) < 20:
removed["too_short_output"] += 1; continue
if len(output) > 2048:
removed["too_long_output"] += 1; continue
# TODO: add echo detection — skip if output is identical to input
filtered.append(ex)
return filtered, removed
test_cases = [
{"instruction": "Rewrite this sentence.", "input": "The cat sat on the mat.",
"output": "The cat sat on the mat."}, # echo — should be removed
{"instruction": "Rewrite this sentence.", "input": "The cat sat on the mat.",
"output": "A feline rested upon the rug."}, # OK — should survive
]
filtered, stats = filter_dataset_v2(test_cases)
print(f"Survived: {len(filtered)}") # Should print: 1
print(f"Echo removed: {stats['output_copies_input']}") # Should print: 1
Solution
def filter_dataset_v2(examples: list[dict]) -> tuple[list[dict], dict]:
filtered = []
removed = {"too_short_instruction": 0, "empty_output": 0,
"too_short_output": 0, "too_long_output": 0, "output_copies_input": 0}
for ex in examples:
instruction = ex.get("instruction", "")
output = ex.get("output", "")
inp = ex.get("input", "")
if len(instruction) < 10:
removed["too_short_instruction"] += 1; continue
if not output.strip():
removed["empty_output"] += 1; continue
if len(output) < 20:
removed["too_short_output"] += 1; continue
if len(output) > 2048:
removed["too_long_output"] += 1; continue
if inp.strip() and output.strip() == inp.strip():
removed["output_copies_input"] += 1; continue
filtered.append(ex)
return filtered, removed
test_cases = [
{"instruction": "Rewrite this sentence.", "input": "The cat sat on the mat.",
"output": "The cat sat on the mat."},
{"instruction": "Rewrite this sentence.", "input": "The cat sat on the mat.",
"output": "A feline rested upon the rug."},
]
filtered, stats = filter_dataset_v2(test_cases)
print(f"Survived: {len(filtered)}")
print(f"Echo removed: {stats['output_copies_input']}")
**Output:**
Survived: 1
Echo removed: 1
The echo check only runs when `inp` is non-empty — we don’t want to accidentally flag cases where both the input and output happen to be empty strings. This filter is especially important for summarization tasks, where a low-quality LLM will sometimes just return the text it was asked to summarize.
Step 5: Deduplicate Your Dataset
Here’s a reliable signal that your dataset has duplication problems: your model, fine-tuned on 1,000 examples, gives the same response to five different questions. Deduplication is the fix, and skipping it makes this failure mode almost guaranteed.
LLM-generated data is especially prone to near-duplicates. Ask GPT-4o-mini to generate 10 examples on “Python error handling” and at least 3 will be variations of “write a try/except block.” Duplicates waste training steps and bias your model toward whatever phrasing appeared most often.
import hashlib
def deduplicate_dataset(examples: list[dict]) -> tuple[list[dict], int]:
"""Remove duplicates using MD5 hash of instruction + input."""
seen_hashes = set()
unique_examples = []
for ex in examples:
# Hash the instruction+input pair — same question with two different outputs is a contradiction
content = ex.get("instruction", "").strip() + "|" + ex.get("input", "").strip()
hash_val = hashlib.md5(content.encode("utf-8")).hexdigest()
if hash_val not in seen_hashes:
seen_hashes.add(hash_val)
unique_examples.append(ex)
removed_count = len(examples) - len(unique_examples)
return unique_examples, removed_count
examples_with_dupes = [
{"instruction": "What is Python?", "input": "", "output": "Python is a programming language."},
{"instruction": "What is Python?", "input": "", "output": "Python is a high-level language."}, # same instruction
{"instruction": "Explain recursion.", "input": "", "output": "Recursion is when a function calls itself."},
{"instruction": "Explain recursion.", "input": "", "output": "A function that calls itself is recursive."}, # same
{"instruction": "What is a decorator?", "input": "", "output": "A decorator modifies function behavior."},
]
unique, removed_count = deduplicate_dataset(examples_with_dupes)
print(f"Before dedup: {len(examples_with_dupes)} example(s)")
print(f"After dedup: {len(unique)} example(s)")
print(f"Removed: {removed_count} duplicate(s)")
Before dedup: 5 example(s)
After dedup: 3 example(s)
Removed: 2 duplicate(s)
The hash covers instruction + input, not output. That’s intentional. When two examples have the same question but different answers, we keep the first one and discard the second — a model can’t learn to give two different answers to the same question.
Hash deduplication is fast and sufficient for datasets under 10K examples. Beyond that, consider semantic deduplication using sentence-transformers to catch near-duplicates like “What is Python?” and “Can you explain what Python is?” — exact hashing won’t catch those.
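The idea behind semantic deduplication can be sketched with a cheap bag-of-words cosine similarity. This is a stand-in: for real near-duplicate detection, replace bow() with embeddings from a sentence-transformers model and keep the same keep-if-dissimilar loop; the 0.8 threshold is an arbitrary starting point.

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    # Lowercase, punctuation-free token counts — a crude stand-in for embeddings
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_dedup(examples: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep an example only if it isn't too similar to any already-kept one."""
    kept, kept_vecs = [], []
    for ex in examples:
        vec = bow(ex["instruction"])
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(ex)
            kept_vecs.append(vec)
    return kept

near_dupes = [
    {"instruction": "What is Python?", "output": "..."},
    {"instruction": "What is Python exactly?", "output": "..."},  # near-duplicate
    {"instruction": "Explain recursion.", "output": "..."},
]
print(len(semantic_dedup(near_dupes)))  # 2 — the near-duplicate is dropped
```

Note this is O(n²) in the number of kept examples; at scale you’d use approximate nearest-neighbor search over embeddings instead.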
Exercise 3: Validate required fields before upload
Before uploading, you want to catch malformed examples — dicts missing required fields that would silently break your training loop. Write validate_alpaca_examples that returns a list of (index, missing_key) tuples for any example missing instruction or output.
# Starter code
def validate_alpaca_examples(examples: list[dict]) -> list[tuple[int, str]]:
"""
Returns list of (index, missing_key) for malformed examples.
Required keys: 'instruction', 'output'
"""
required_keys = ["instruction", "output"]
issues = []
# TODO: iterate examples, check for missing required keys
return issues
test_data = [
{"instruction": "Summarize this.", "input": "", "output": "Done summary."},  # schema OK (the short output would fail Step 4's length filter, but every required key is present)
{"instruction": "Explain Python.", "input": ""}, # missing 'output'
{"input": "Some context.", "output": "An answer here."}, # missing 'instruction'
{"instruction": "What is ML?", "output": "Machine learning uses algorithms."}, # OK
]
issues = validate_alpaca_examples(test_data)
if issues:
print(f"Found {len(issues)} issue(s):")
for idx, key in issues:
print(f" Example {idx}: missing '{key}'")
else:
print("All examples valid!")
Solution
def validate_alpaca_examples(examples: list[dict]) -> list[tuple[int, str]]:
required_keys = ["instruction", "output"]
issues = []
for idx, ex in enumerate(examples):
for key in required_keys:
if key not in ex:
issues.append((idx, key))
return issues
test_data = [
{"instruction": "Summarize this.", "input": "", "output": "Done summary."},
{"instruction": "Explain Python.", "input": ""},
{"input": "Some context.", "output": "An answer here."},
{"instruction": "What is ML?", "output": "Machine learning uses algorithms."},
]
issues = validate_alpaca_examples(test_data)
if issues:
print(f"Found {len(issues)} issue(s):")
for idx, key in issues:
print(f" Example {idx}: missing '{key}'")
else:
print("All examples valid!")
**Output:**
Found 2 issue(s):
Example 1: missing 'output'
Example 2: missing 'instruction'
Run this before every upload as a pre-flight assertion. Missing keys fail silently in some training frameworks — the example is skipped without an error message, which makes it very hard to diagnose why your training loss looks wrong.
Step 6: Analyze Your Dataset for Instruction Fine-Tuning
Before uploading, run a quick analysis. Distribution problems (all examples one length, one task type, or one vocabulary level) show up clearly in the statistics and are invisible in individual examples. I’ve caught datasets that looked fine in spot-checks but whose output lengths varied by hundreds of characters, which produced models that were unpredictably verbose.
import statistics
def analyze_dataset(examples: list[dict]) -> None:
"""Print key statistics about your instruction dataset."""
if not examples:
print("Dataset is empty.")
return
instruction_lengths = [len(ex.get("instruction", "")) for ex in examples]
output_lengths = [len(ex.get("output", "")) for ex in examples]
has_input = sum(1 for ex in examples if ex.get("input", "").strip())
print(f"Total examples: {len(examples)}")
print(f"With context (input): {has_input} ({100 * has_input // len(examples)}%)")
print(f"")
print(f"Instruction lengths (chars):")
print(f" Mean: {statistics.mean(instruction_lengths):.0f}")
print(f" Min: {min(instruction_lengths)}")
print(f" Max: {max(instruction_lengths)}")
print(f"")
print(f"Output lengths (chars):")
print(f" Mean: {statistics.mean(output_lengths):.0f}")
print(f" Min: {min(output_lengths)}")
print(f" Max: {max(output_lengths)}")
sample_dataset = [
{"instruction": "What is Python?", "input": "",
"output": "Python is a high-level, interpreted programming language."},
{"instruction": "Explain list comprehensions.", "input": "",
"output": "List comprehensions provide a concise way to create lists from iterables."},
{"instruction": "Summarize this text.", "input": "Python was created in 1991.",
"output": "Python was created in 1991."},
]
analyze_dataset(sample_dataset)
Total examples: 3
With context (input): 1 (33%)
Instruction lengths (chars):
Mean: 21
Min: 15
Max: 28
Output lengths (chars):
Mean: 52
Min: 27
Max: 73
What to look for in your analysis:
- High output length variance (min=7, max=2,000): your model won’t know what length is expected and will produce randomly long or short responses.
- Very low input ratio (e.g., 1%): if almost no examples use the input field, your model won’t learn to use context when you provide it at inference time.
- Very low minimum output length: short stub outputs slipped through your filter.
Step 7: Package and Upload Your Instruction Dataset
The HuggingFace Hub is the standard home for fine-tuning datasets — every major training framework (trl, Axolotl, Unsloth) can load directly from it with a single call. Uploading also gives you versioning, which matters more than it sounds when you’re iterating on dataset quality.
from datasets import Dataset
import json
# For this example, use inline data (replace with load_jsonl("your_dataset.jsonl") for real use)
training_data = [
{"instruction": "What is Python?", "input": "",
"output": "Python is a high-level, interpreted programming language created in 1991."},
{"instruction": "Explain list comprehensions.", "input": "",
"output": "List comprehensions create lists from iterables in a single readable line."},
{"instruction": "What is a decorator?", "input": "",
"output": "A decorator is a function that modifies another function's behavior."},
{"instruction": "Explain recursion.", "input": "",
"output": "Recursion is when a function calls itself to solve a smaller version of the same problem."},
{"instruction": "What is a generator?", "input": "",
"output": "A generator is a function that yields values one at a time, saving memory."},
]
# Create HuggingFace Dataset and split
hf_dataset = Dataset.from_list(training_data)
split_dataset = hf_dataset.train_test_split(test_size=0.2, seed=42)
print(f"Training examples: {len(split_dataset['train'])}")
print(f"Test examples: {len(split_dataset['test'])}")
print(f"Dataset features: {list(split_dataset['train'].features.keys())}")
# Upload to HuggingFace Hub (run: huggingface-cli login first)
# split_dataset.push_to_hub("your-username/your-dataset-name", private=True)
print("\nTo upload: split_dataset.push_to_hub('username/dataset-name')")
Training examples: 4
Test examples: 1
Dataset features: ['instruction', 'input', 'output']
To upload: split_dataset.push_to_hub('username/dataset-name')
Authenticate with HuggingFace before uploading:
huggingface-cli login
This prompts for a token in the terminal. Paste your HuggingFace access token (huggingface.co → Settings → Access Tokens → New token → Write role).
Step 8: Validate Your Instruction Dataset Before Fine-Tuning
The last check before training is loading your dataset back and printing random examples. This catches encoding issues, formatting problems, and accidental truncation introduced during serialization — issues that are invisible in your Python objects but show up as corrupted JSON on disk.
Every time I skip this step I end up re-training after discovering a problem that was right there in the data. Two minutes of sampling saves hours of GPU time.
import random
random.seed(42)
train_size = len(split_dataset["train"])
sample_indices = random.sample(range(train_size), min(2, train_size))
print("=== Random sample from training set ===\n")
for idx in sample_indices:
example = split_dataset["train"][idx]
print(f"[Example {idx}]")
print(f" Instruction: {example['instruction']}")
print(f" Input: {example['input'] or '(none)'}")
print(f" Output: {example['output'][:80]}{'...' if len(example['output']) > 80 else ''}")
print()
=== Random sample from training set ===
[Example 2]
Instruction: What is a decorator?
Input: (none)
Output: A decorator is a function that modifies another function's behavior.
[Example 0]
Instruction: What is Python?
Input: (none)
Output: Python is a high-level, interpreted programming language created in 1991.
Pre-training validation checklist:
| Check | What to look for |
|---|---|
| Format consistency | All examples have the same keys; no extras |
| No truncated outputs | Outputs don’t end mid-sentence (an ellipsis in a sample printout may just be the [:80] display slice, not real truncation in the data) |
| No encoding artifacts | No â€™ or Ã© sequences; these mean UTF-8 text was decoded with the wrong encoding |
| Task distribution balance | No single instruction type accounts for > 50% of examples |
| Output length distribution | Mean output length matches your expected response length at inference |
| Train/test independence | Test set has no examples identical to training examples |
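The last row of that checklist is easy to automate. A minimal exact-match leakage check on the instruction field (extend it with the semantic-similarity idea from Step 5 to catch paraphrases):

```python
def check_leakage(train: list[dict], test: list[dict]) -> set[str]:
    """Return instructions that appear in both train and test splits."""
    train_instructions = {ex["instruction"].strip().lower() for ex in train}
    return {ex["instruction"].strip().lower() for ex in test} & train_instructions

train = [{"instruction": "What is Python?"}, {"instruction": "Explain recursion."}]
test = [{"instruction": "what is python?"}, {"instruction": "What is a decorator?"}]

leaked = check_leakage(train, test)
print(leaked)  # {'what is python?'} — case-insensitive duplicate across splits
```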
Common Mistakes When Building Instruction Datasets
Mistake 1: Adding the full Alpaca dataset “just in case.”
The 52,000-example Alpaca dataset covers everything from poetry to math to philosophy. Adding it wholesale to a domain-specific dataset dilutes your fine-tuning — the model learns to answer generic questions well and your specific task only marginally better. Filter public datasets to your topic, or skip them entirely for narrow tasks.
✅ Fix: Filter any public dataset to your domain using keyword matching on the instruction field before merging.
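A minimal version of that keyword filter might look like this (the helper name and keyword list are illustrative, not from any framework):

```python
def filter_by_domain(examples: list[dict], keywords: list[str]) -> list[dict]:
    """Keep only examples whose instruction mentions at least one domain keyword."""
    kw = [k.lower() for k in keywords]
    return [ex for ex in examples
            if any(k in ex["instruction"].lower() for k in kw)]

# Two Alpaca-style examples standing in for the full public dataset
alpaca_sample = [
    {"instruction": "Write a poem about autumn.", "input": "", "output": "..."},
    {"instruction": "Explain how Python decorators work.", "input": "", "output": "..."},
]
domain = filter_by_domain(alpaca_sample, ["python", "decorator", "function"])
print(len(domain))  # 1 — only the Python question survives
```

Keyword matching is crude but cheap; for fuzzier domains, classify instructions with an embedding model instead.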
Mistake 2: Inconsistent output format across generation batches.
Generating in batches with different prompts produces outputs that vary in format — some start with “Sure!”, some don’t, some end with punctuation, some don’t. The model learns this inconsistency and reproduces it.
✅ Fix: Write one canonical generation prompt and use it for every batch. Before training, run a normalization pass: strip leading phrases (“Sure, here’s…”, “Of course!”), standardize punctuation, verify consistent formatting.
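A normalization pass can be a few regexes. This sketch (the phrase list is a starting point you would extend from your own data, not an exhaustive set) strips common lead-ins and enforces terminal punctuation:

```python
import re

# Leading assistant phrases to strip — extend this list from your own batches
LEADING_PHRASES = [
    r"^sure[,!.]?\s*here('s| is)?\s*",
    r"^of course[,!.]?\s*",
    r"^certainly[,!.]?\s*",
]

def normalize_output(text: str) -> str:
    """Strip common assistant lead-ins and enforce ending punctuation."""
    cleaned = text.strip()
    for pattern in LEADING_PHRASES:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    if cleaned:
        cleaned = cleaned[0].upper() + cleaned[1:]
        if cleaned[-1] not in ".!?":
            cleaned += "."
    return cleaned

result = normalize_output("Sure! Here's the answer: use a list comprehension")
print(result)  # The answer: use a list comprehension.
```

Run the pass over every output field before training, and spot-check a sample afterward — regexes occasionally eat a legitimate sentence opener.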
Mistake 3: No examples of refusal or uncertainty.
If your model should ever say “I don’t know” or “That’s outside my expertise,” you need training examples of that behavior. A model that has never seen a refusal will hallucinate an answer rather than admit uncertainty.
✅ Fix: Add 5–10% “I don’t have that information” or “That’s outside my scope” examples. For each boundary you want the model to respect, include at least 3 refusal examples.
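One way to generate those refusals systematically is a small template helper per boundary (the function and the refusal wording below are illustrative — use your product's actual voice):

```python
def make_refusal_examples(out_of_scope_questions: list[str], scope: str) -> list[dict]:
    """Build Alpaca-format refusal examples for one boundary."""
    refusal = f"I'm sorry, that's outside my scope — I can only help with {scope}."
    return [{"instruction": q, "input": "", "output": refusal}
            for q in out_of_scope_questions]

examples = make_refusal_examples(
    ["What's the weather today?",
     "Can you give me legal advice?",
     "Who will win the election?"],
    scope="questions about our products and services",
)
print(len(examples))  # 3 refusals for this boundary
```

Vary the refusal phrasing across examples in a real dataset — a single verbatim refusal trains the model to parrot one sentence rather than to decline gracefully.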
Mistake 4: Forgetting the system prompt in training data.
This is probably the most expensive mistake I’ve seen — it costs teams days of re-fine-tuning. If your model will use a system prompt at inference time (“You are a helpful assistant for Company X”), every training example should include that system prompt in ShareGPT format. A model never trained with a system prompt will partially or completely ignore it at inference time.
✅ Fix: Use ShareGPT format and add your exact production system prompt to every training example from the start.
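Converting existing Alpaca examples to ShareGPT with a system turn is mechanical. A sketch (the system prompt string here is a placeholder for your real production prompt):

```python
SYSTEM_PROMPT = "You are a helpful assistant for Company X."  # your exact production prompt

def alpaca_to_sharegpt(example: dict, system_prompt: str = SYSTEM_PROMPT) -> dict:
    """Convert one Alpaca example to ShareGPT format with a system turn."""
    user_turn = example["instruction"]
    if example.get("input"):
        user_turn += "\n\n" + example["input"]
    return {"conversations": [
        {"from": "system", "value": system_prompt},
        {"from": "human", "value": user_turn},
        {"from": "gpt", "value": example["output"]},
    ]}

converted = alpaca_to_sharegpt(
    {"instruction": "Do you offer refunds?", "input": "",
     "output": "Yes, within 30 days of purchase."})
print(converted["conversations"][0]["from"])  # system
```

The key detail: the system prompt must match what you deploy character-for-character — a paraphrased version at inference time weakens the association the model learned.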
Mistake 5: Deduplicating after the train/test split.
If you deduplicate the full dataset and then split, semantically similar examples can land in both train and test sets. This inflates evaluation metrics — your model looks better than it is because the test questions resemble training questions.
✅ Fix: Split first, then deduplicate each split separately. Or: use semantic deduplication across the full dataset before splitting, which removes near-duplicates that exact hashing would miss.
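For small datasets, a near-duplicate pass can use stdlib string similarity instead of embeddings. This sketch (threshold and helper name are my own choices) drops any example whose instruction is near-identical to one already kept:

```python
from difflib import SequenceMatcher

def near_deduplicate(examples: list[dict], threshold: float = 0.9) -> list[dict]:
    """Drop examples whose instruction is near-identical to one already kept.
    O(n^2) — fine for thousands of examples; use embeddings or MinHash at scale."""
    kept: list[dict] = []
    for ex in examples:
        inst = ex["instruction"].lower().strip()
        if all(SequenceMatcher(None, inst,
                               k["instruction"].lower().strip()).ratio() < threshold
               for k in kept):
            kept.append(ex)
    return kept

data = [
    {"instruction": "How do I reset my password?", "input": "", "output": "..."},
    {"instruction": "How do I reset my password!", "input": "", "output": "..."},
    {"instruction": "What is your return policy?", "input": "", "output": "..."},
]
print(len(near_deduplicate(data)))  # 2 — the near-duplicate is dropped
```

Run this across the full dataset *before* splitting, exactly as the fix above recommends, so paraphrases can't straddle the train/test boundary.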
Complete Pipeline Script
The complete pipeline below is ready to copy-paste and run:
```python
# Complete instruction dataset pipeline
# Requirements: pip install datasets "openai>=1.0" huggingface-hub
# Python 3.9+
import hashlib

from datasets import Dataset

# --- Step 1: Define raw Q&A pairs ---
raw_qa_pairs = [
    ("What are your business hours?", "We are open Monday–Friday, 9am–6pm EST."),
    ("How do I reset my password?", "Click 'Forgot Password' on the login page and follow the email instructions."),
    ("Do you offer refunds?", "Yes, we offer full refunds within 30 days of purchase."),
    ("What is your return policy?", "Items can be returned within 30 days in original unworn condition."),
    ("How long does shipping take?", "Standard shipping takes 3–5 business days within the continental US."),
]

# --- Step 2: Convert to Alpaca format ---
def convert_to_alpaca(qa_pairs: list) -> list[dict]:
    return [{"instruction": q, "input": "", "output": a} for q, a in qa_pairs]

# --- Step 3: Filter low-quality examples ---
def filter_dataset(examples: list[dict]) -> tuple[list[dict], dict]:
    filtered = []
    removed = {"too_short_instruction": 0, "empty_output": 0,
               "too_short_output": 0, "too_long_output": 0}
    for ex in examples:
        instruction, output = ex.get("instruction", ""), ex.get("output", "")
        if len(instruction) < 10:
            removed["too_short_instruction"] += 1
            continue
        if not output.strip():
            removed["empty_output"] += 1
            continue
        if len(output) < 20:
            removed["too_short_output"] += 1
            continue
        if len(output) > 2048:
            removed["too_long_output"] += 1
            continue
        filtered.append(ex)
    return filtered, removed

# --- Step 4: Deduplicate ---
def deduplicate_dataset(examples: list[dict]) -> tuple[list[dict], int]:
    seen, unique = set(), []
    for ex in examples:
        content = ex.get("instruction", "").strip() + "|" + ex.get("input", "").strip()
        h = hashlib.md5(content.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(ex)
    return unique, len(examples) - len(unique)

# --- Step 5: Validate schema ---
def validate_alpaca_examples(examples: list[dict]) -> list[tuple[int, str]]:
    issues = []
    for idx, ex in enumerate(examples):
        for key in ["instruction", "output"]:
            if key not in ex:
                issues.append((idx, key))
    return issues

# --- Run the full pipeline ---
print("=== Instruction Dataset Pipeline ===\n")

data = convert_to_alpaca(raw_qa_pairs)
print(f"Step 1 — Converted: {len(data)} example(s)")

data, removed = filter_dataset(data)
print(f"Step 2 — Filtered: {len(data)} example(s) (removed: {sum(removed.values())})")

data, dupes = deduplicate_dataset(data)
print(f"Step 3 — Deduped: {len(data)} example(s) (removed: {dupes} duplicate(s))")

issues = validate_alpaca_examples(data)
if issues:
    print(f"Step 4 — Validation FAILED: {len(issues)} issue(s)")
    for idx, key in issues:
        print(f"  Example {idx}: missing '{key}'")
else:
    print("Step 4 — Validation: all examples have required keys")

hf_dataset = Dataset.from_list(data)
split = hf_dataset.train_test_split(test_size=0.2, seed=42)
print(f"\nFinal dataset — Train: {len(split['train'])}, Test: {len(split['test'])}")

# Uncomment to upload:
# split.push_to_hub("your-username/your-dataset-name", private=True)
print("\nScript completed successfully.")
```
Frequently Asked Questions
How many examples do I need for fine-tuning?
For task-specific fine-tuning (classification, extraction, formatting), 500–2,000 high-quality examples often suffice. For teaching new response styles or domain knowledge, aim for 2,000–10,000. The LIMA paper showed 1,000 carefully curated examples can match 52,000 noisy ones. Start with 1,000, evaluate, and scale only if performance falls short. More is only better if the additional examples are genuinely high-quality.
Can I use GPT-4 or Claude to generate training data for open-source models?
LLM terms of service differ on this. OpenAI prohibits using their outputs to train a model that competes with OpenAI products. For non-competing use cases (internal tooling, domain-specific assistants), LLM-generated data is widely used. Claude’s API terms and other providers have similar nuances — check the specific current terms for your data generator before deploying a commercial model trained on that data.
Should my outputs include chain-of-thought reasoning?
Match your outputs to your inference expectations. If you want your model to reason step-by-step at inference time (“First, I’ll identify… Then I’ll…”), your training outputs should demonstrate that pattern. If you want direct answers, train on direct answers. The model reproduces what it sees in training.
How do I handle PII (personally identifiable information) in my raw data?
Real data often contains names, emails, and addresses that the model can memorize and reproduce. Run a PII scrubber before uploading — the presidio-anonymizer library can detect and replace PII with placeholders ([CUSTOMER_NAME], [EMAIL]) automatically. Apply this especially to the input field, which often contains user-submitted text.
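If pulling in presidio is overkill for a first pass, a regex scrubber covers the most mechanical PII types. This is a deliberately simplified stand-in, not a substitute for a real detector — regexes miss names, addresses, and many phone formats:

```python
import re

# Simplified stand-in for a real PII detector: regex rules for two common types.
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with bracketed placeholders."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

scrubbed = scrub_pii("Contact jane.doe@example.com or 555-867-5309 for help.")
print(scrubbed)  # Contact [EMAIL] or [PHONE] for help.
```

Whatever scrubber you use, run it before the dataset ever leaves your machine — placeholders like `[EMAIL]` also keep the model from learning to emit realistic-looking contact details.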
What is the difference between supervised fine-tuning (SFT) and RLHF?
This article covers SFT: training on instruction-response pairs where you provide the ideal response. RLHF (Reinforcement Learning from Human Feedback) uses human preference data — pairs of responses labeled “A is better than B” — to train a reward model that guides further training. SFT comes first; RLHF is a post-SFT alignment step that requires more infrastructure. If you’re just getting started, do SFT first and evaluate before adding RLHF complexity.
What is a dataset card and do I need one?
A dataset card is the README file attached to your HuggingFace Hub dataset. It documents what the dataset contains, how it was created, what tasks it’s suitable for, and any known limitations. You don’t need one to fine-tune privately, but it’s worth adding before sharing publicly. HuggingFace provides a template — at minimum, document your data source, generation method, filtering criteria, and the fine-tuning task the dataset targets. Other practitioners (and your future self) will thank you.
What to Read Next
Continue building your fine-tuning pipeline with these related guides:
- LoRA and QLoRA Fine-Tuning with Python — Fine-tune efficiently on consumer hardware using parameter-efficient methods
- Evaluating Fine-Tuned LLMs — Benchmarks, human evaluation frameworks, and automated metrics
- DPO: Direct Preference Optimization Explained — Preference alignment without a reward model, using the DPO format introduced above
- Unsloth Fine-Tuning Guide — 2–5× faster training with significantly less GPU memory
References
- Taori, R. et al. — “Alpaca: A Strong, Replicable Instruction-Following Model.” Stanford CRFM, 2023. Link
- Zhou, C. et al. — “LIMA: Less Is More for Alignment.” NeurIPS 2023. arXiv:2305.11206
- Wang, Y. et al. — “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” ACL 2023. arXiv:2212.10560
- HuggingFace Datasets — Creating and Sharing Datasets. Link
- HuggingFace Datasets — Share a Dataset to the Hub. Link
- Meta AI — “How to Fine-Tune LLMs: PEFT and Dataset Curation.” Meta AI Blog, 2024. Link
- Unsloth Documentation — Datasets Guide for Fine-Tuning. Link
- Cleanlab — “How to detect bad data in your instruction tuning dataset.” Link
- Distilabel — Open-source framework for AI feedback and synthetic data pipelines. GitHub