Prompt Engineering Tutorial for Beginners (2026)
Master prompt engineering basics with Python. Learn the Role-Task-Format framework, zero-shot prompting, and build a testing harness that measures prompt accuracy.
The difference between a vague prompt and a precise one isn’t cleverness. It’s structure. Here’s the framework that makes every prompt work.
You ask an LLM to classify customer reviews as positive or negative. It works. Mostly. Then you try a slightly different batch and half the labels come back wrong. You tweak the prompt. It gets better. Then worse again. Sound familiar?
The problem isn’t the model. It’s the prompt. Most people write prompts the way they’d text a friend — vague, loose, hoping the other side fills in the blanks. LLMs do fill blanks. They just fill them in ways you don’t expect.
This guide teaches you how to fix that. You’ll learn the Role-Task-Format method, zero-shot tricks, how to lock down output format, and when to use high vs low temperature. Then you’ll build a test harness that scores how well each prompt style works on 20 labeled reviews.
What Makes a Good Prompt?
A good prompt cuts out the guesswork. That’s it. Every prompt trick out there does one thing: it makes the model’s job less like a guessing game and more like reading a clear set of steps.
Think about giving directions in a new city. “Go that way” is a prompt. “Walk north on Main Street for three blocks, turn left at the pharmacy” is a better prompt. Both might get someone there. Only one does it reliably.
Three things set good prompts apart from bad ones. A clear role — who should the model act as? A sharp task — what should it do? And a locked-down format — how should the output look?
import json
import urllib.request
import os
# A vague prompt vs a structured prompt
vague_prompt = "Tell me about this review: 'The battery dies after 2 hours'"
structured_prompt = """You are a product review analyst.
Task: Classify the following review as POSITIVE, NEGATIVE, or NEUTRAL.
Return ONLY a JSON object with keys "sentiment" and "confidence".
Review: "The battery dies after 2 hours"
"""
print("Vague prompt:")
print(vague_prompt)
print("\nStructured prompt:")
print(structured_prompt)
Output:
Vague prompt:
Tell me about this review: 'The battery dies after 2 hours'
Structured prompt:
You are a product review analyst.
Task: Classify the following review as POSITIVE, NEGATIVE, or NEUTRAL.
Return ONLY a JSON object with keys "sentiment" and "confidence".
Review: "The battery dies after 2 hours"
The vague prompt could return anything — a summary, a take, a rewrite. The clear one can only return one thing: a JSON object with the label and a score. That’s the whole point. You want one answer, not ten.
Prerequisites
- Python version: 3.9+
- Required libraries: None (we use raw HTTP requests via urllib, so it's Pyodide compatible)
- API key: You need an OpenAI API key. Create one at platform.openai.com/api-keys. Set it as an environment variable: export OPENAI_API_KEY="your-key-here"
- Time to complete: 25-30 minutes
How the Role-Task-Format Prompting Framework Works
Every good prompt has three parts. I call them RTF: Role, Task, Format. It’s not the only method, but it’s the one I come back to because it works on every model I’ve tried.
Role tells the model who to be. “You are a senior data scientist” gets you a very different reply than “You are a marketing intern.” The model shifts its word choice and depth based on the role.
Task tells the model what to do. This is where most prompts go wrong. “Analyze this data” is not a task. “Find the mean, median, and standard deviation of the price column and flag any outliers” is a task.
Format tells the model how to shape its reply. JSON, bullet list, table, one word — name it. If you don’t, the model picks whatever it feels like.
Here’s a helper that sends prompts to OpenAI’s API with raw HTTP calls. This runs anywhere — even in Pyodide — with no extra packages.
API_KEY = os.getenv("OPENAI_API_KEY", "your-key-here")
API_URL = "https://api.openai.com/v1/chat/completions"
def ask_llm(prompt, model="gpt-4o-mini", temperature=0.0):
    """Send a prompt to OpenAI and return the response text."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    })
    req = urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result["choices"][0]["message"]["content"]
This ask_llm function wraps the OpenAI Chat Completions API. It takes your prompt, sends it as a user message, and hands back the text reply. With temperature=0.0 the model always picks its most likely next token, so the same prompt returns the same answer on nearly every run.
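In production, API calls occasionally fail with rate limits or timeouts, and one dropped request shouldn't kill a whole test run. Here's a minimal retry sketch; the with_retries helper and its backoff values are illustrative, not part of OpenAI's API:

```python
import time
import urllib.error

def with_retries(fn, attempts=3, backoff=1.0):
    """Call fn(); on a urllib.error.URLError, wait and retry with doubling delay."""
    for attempt in range(attempts):
        try:
            return fn()
        except urllib.error.URLError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))

# Usage with the ask_llm helper:
#   answer = with_retries(lambda: ask_llm(prompt))
```

Exponential backoff (1s, 2s, 4s, ...) is the usual choice because rate limits typically clear after a short wait.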
Watch how RTF shapes the model’s output. The role, task, and format each constrain a different part of the response.
rtf_prompt = """Role: You are a sentiment analysis expert who has labeled
thousands of product reviews.

Task: Classify the following customer review into exactly one category:
POSITIVE, NEGATIVE, or NEUTRAL. Base your decision on the overall
sentiment, not individual phrases.

Format: Return a JSON object with two keys:
- "sentiment": one of "POSITIVE", "NEGATIVE", "NEUTRAL"
- "reason": a one-sentence explanation (max 20 words)

Review: "The camera quality is amazing but the phone overheats constantly."
"""
print(rtf_prompt)
Output:
Role: You are a sentiment analysis expert who has labeled
thousands of product reviews.
Task: Classify the following customer review into exactly one category:
POSITIVE, NEGATIVE, or NEUTRAL. Base your decision on the overall
sentiment, not individual phrases.
Format: Return a JSON object with two keys:
- "sentiment": one of "POSITIVE", "NEGATIVE", "NEUTRAL"
- "reason": a one-sentence explanation (max 20 words)
Review: "The camera quality is amazing but the phone overheats constantly."
When you send this to the model, you get clean JSON back. The role sets expertise context. The task removes ambiguity about what “classify” means. The format locks down the output structure.
Why Role Prompting Changes Everything
Here’s what caught me off guard when I first ran prompt tests side by side. The same task gives different answers based on the role alone. Not because the model is lost — but because each role turns on a different lens.
Think of roles as switches. The model learned from text by teachers, coders, marketers, and researchers. The role tells it which part of that training to lean on.
review = "The app crashes when I open large files but the UI is clean."
role_prompts = {
"No role": f"""Classify as POSITIVE, NEGATIVE, or MIXED.
Output only the label.
Review: "{review}" """,
"QA Engineer": f"""You are a QA engineer reviewing user bug reports.
Classify as POSITIVE, NEGATIVE, or MIXED.
Output only the label.
Review: "{review}" """,
"Product Manager": f"""You are a product manager analyzing user feedback
for the quarterly product review.
Classify as POSITIVE, NEGATIVE, or MIXED.
Output only the label.
Review: "{review}" """,
}
for role_name, prompt in role_prompts.items():
    print(f"--- {role_name} ---")
    print(prompt.strip())
    print()
Output:
--- No role ---
Classify as POSITIVE, NEGATIVE, or MIXED.
Output only the label.
Review: "The app crashes when I open large files but the UI is clean."
--- QA Engineer ---
You are a QA engineer reviewing user bug reports.
Classify as POSITIVE, NEGATIVE, or MIXED.
Output only the label.
Review: "The app crashes when I open large files but the UI is clean."
--- Product Manager ---
You are a product manager analyzing user feedback
for the quarterly product review.
Classify as POSITIVE, NEGATIVE, or MIXED.
Output only the label.
Review: "The app crashes when I open large files but the UI is clean."
The no-role version might say MIXED. The QA engineer tends to say NEGATIVE — crashes are bugs, full stop. The product manager might say MIXED — the crash is bad, but clean UI counts for something. Same review, three answers. All from the role.
When Does Role Prompting Help Most?
Role prompting helps most when:
- The task needs special words (medical, legal, finance)
- You want a certain point of view (QA vs PM vs end user)
- The tone matters (formal report vs quick chat)
- You want the model to add field-level warnings
For simple tasks with clear labels, roles help a bit. For open-ended writing or deep analysis, roles change the output a lot.
Quick Check: If you prompt an LLM with “You are a kindergarten teacher” vs “You are a machine learning researcher,” how would a gradient descent explanation differ? Think about vocabulary and depth before reading on.
Zero-Shot Prompting — Clear Instructions, No Examples
Zero-shot means you give the model a task with zero examples of what good output looks like. You lean on your words and the model’s training. That’s it.
Most people start here. And honestly? It works well when the instructions are clear. The model has no examples to copy, so your words carry all the weight.
Here’s what separates a zero-shot prompt that works from one that doesn’t.
weak_prompt = "Is this review positive or negative? 'Great product, terrible shipping'"
strong_prompt = """Classify this product review as POSITIVE, NEGATIVE, or MIXED.
Rules:
- POSITIVE: overall satisfaction, would recommend
- NEGATIVE: overall dissatisfaction, would not recommend
- MIXED: significant positives AND negatives mentioned
Output only the label. Nothing else.
Review: "Great product, terrible shipping"
"""
print("Weak prompt:")
print(weak_prompt)
print("\nStrong prompt:")
print(strong_prompt)
Output:
Weak prompt:
Is this review positive or negative? 'Great product, terrible shipping'
Strong prompt:
Classify this product review as POSITIVE, NEGATIVE, or MIXED.
Rules:
- POSITIVE: overall satisfaction, would recommend
- NEGATIVE: overall dissatisfaction, would not recommend
- MIXED: significant positives AND negatives mentioned
Output only the label. Nothing else.
Review: "Great product, terrible shipping"
The weak prompt gives two choices for a review with both good and bad points. The model has to pick a side or fudge it. The strong prompt adds a third label (MIXED) with clear rules. No guessing needed.
Four Techniques for Clearer Zero-Shot Prompts
1. Spell out what each label means. Don’t just say “classify.” Say what each label covers. I’ve seen scores jump 15-20% from label rules alone.
2. Lock the output down. “Output only the label” blocks long answers, hedges, and filler. Less noise means easy parsing.
3. Cover edge cases up front. What if the review is sarcastic? What if it has both praise and blame? Write the rules. Edge cases are where prompts fail.
4. Use markers for input data. Wrap the text in quotes, XML tags, or clear labels. This stops the model from mixing up your rules with the input text.
delimiter_prompt = """Classify the sentiment of the text between <review> tags.
Categories: POSITIVE, NEGATIVE, NEUTRAL
Output: Just the category label.

<review>
I love how the product works but I hate that I had to
write this review to get customer service to respond.
</review>
"""
print(delimiter_prompt)
Output:
Classify the sentiment of the text between <review> tags.
Categories: POSITIVE, NEGATIVE, NEUTRAL
Output: Just the category label.
<review>
I love how the product works but I hate that I had to
write this review to get customer service to respond.
</review>
The <review> tags create a clear boundary. The model knows exactly which text to classify.
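All four techniques can live in one prompt builder. Here's a sketch that combines them; the build_zero_shot_prompt name and exact wording are mine, not from the tutorial:

```python
def build_zero_shot_prompt(review):
    """Compose a zero-shot prompt using all four techniques:
    label definitions, a locked output, an edge-case rule, and delimiters."""
    return f"""Classify this product review as POSITIVE, NEGATIVE, or MIXED.

Rules:
- POSITIVE: overall satisfaction, would recommend
- NEGATIVE: overall dissatisfaction, would not recommend
- MIXED: significant positives AND negatives mentioned
- If the review is sarcastic, judge the intended meaning, not the literal words.

Output only the label. Nothing else.

<review>
{review}
</review>"""

print(build_zero_shot_prompt("Great product, terrible shipping"))
```

Each technique maps to one part of the string: the Rules block defines labels and covers the sarcasm edge case, the "Output only" line locks the format, and the tags delimit the input.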
Predict the Output: What would the strong prompt classify “Great product, terrible shipping” as? It has significant positives AND negatives. The rules say that’s MIXED — not POSITIVE, not NEGATIVE.
How to Specify Output Format for Parseable Results
Can your code parse what the model sends back? If it returns "The sentiment is positive" instead of "POSITIVE", your code breaks. This is what output format control fixes.
You tell the model the exact shape of the reply you want. JSON is the most common pick. But you can ask for tables, CSV, XML — whatever your code reads.
Here’s the jump from no format to full schema control.
no_format = "What's the sentiment of: 'Battery lasts forever, love it'"
basic_format = """Classify sentiment of: 'Battery lasts forever, love it'
Return as JSON."""
exact_format = """Classify the sentiment of the review below.
Return a JSON object matching this exact schema:
{
"sentiment": "POSITIVE" | "NEGATIVE" | "NEUTRAL",
"confidence": <float between 0.0 and 1.0>,
"key_phrases": [<list of 1-3 phrases that drove the classification>]
}
Return ONLY the JSON. No markdown, no explanation, no code fences.
Review: "Battery lasts forever, love it"
"""
print("Level 1 (no format):")
print(no_format)
print("\nLevel 2 (basic):")
print(basic_format)
print("\nLevel 3 (exact schema):")
print(exact_format)
Output:
Level 1 (no format):
What's the sentiment of: 'Battery lasts forever, love it'
Level 2 (basic):
Classify sentiment of: 'Battery lasts forever, love it'
Return as JSON.
Level 3 (exact schema):
Classify the sentiment of the review below.
Return a JSON object matching this exact schema:
{
"sentiment": "POSITIVE" | "NEGATIVE" | "NEUTRAL",
"confidence": <float between 0.0 and 1.0>,
"key_phrases": [<list of 1-3 phrases that drove the classification>]
}
Return ONLY the JSON. No markdown, no explanation, no code fences.
Review: "Battery lasts forever, love it"
Level 1 might return a sentence, a paragraph, or just “positive.” Level 2 usually returns JSON but might wrap it in markdown code fences. Level 3 returns exactly your schema, ready for json.loads().
Models sometimes wrap JSON in markdown code fences even when you say “no fences.” This parser strips them before reading the JSON. It fixes the most common format glitch in LLM output.
def parse_classification(response_text):
    """Parse a JSON classification response. Handles common LLM quirks."""
    text = response_text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(text.strip())
sample_response = '{"sentiment": "POSITIVE", "confidence": 0.95, "key_phrases": ["lasts forever", "love it"]}'
parsed = parse_classification(sample_response)
print(f"Sentiment: {parsed['sentiment']}")
print(f"Confidence: {parsed['confidence']}")
print(f"Key phrases: {parsed['key_phrases']}")
Output:
Sentiment: POSITIVE
Confidence: 0.95
Key phrases: ['lasts forever', 'love it']
Always code for the worst case. Models follow format rules most of the time, not all of the time.
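One way to act on that: wrap the parse in a try/except so an unparseable reply degrades to a default instead of crashing the run. A sketch under that assumption (safe_parse_classification is my name, not from the tutorial; the FENCE constant just avoids writing a literal triple backtick inside this example):

```python
import json

FENCE = "`" * 3  # markdown code-fence marker

def safe_parse_classification(response_text, default=None):
    """Like parse_classification, but returns `default` instead of raising."""
    text = response_text.strip()
    if text.startswith(FENCE):           # strip markdown code fences if present
        text = text.split("\n", 1)[1]
        text = text.rsplit(FENCE, 1)[0]
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return default  # caller decides how to handle a bad reply

print(safe_parse_classification('{"sentiment": "POSITIVE", "confidence": 0.9}'))
print(safe_parse_classification("not json", default={"sentiment": "UNKNOWN"}))
```

In a batch run, count how often the default comes back; a rising rate usually means the format instructions in the prompt need tightening.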
Temperature — When to Use 0.0 vs 0.7 vs 1.0
Why do LLMs sometimes give different answers to the same question? That’s temperature. It controls how random the model’s word picks are. Low means the model always picks the most likely next word. High means it tries less common words too.
Here’s the practical rule: classification and extraction need temperature 0.0. Creative writing benefits from 0.7-1.0.
task_temperatures = {
"Classification": 0.0,
"Data extraction": 0.0,
"Code generation": 0.0,
"Summarization": 0.3,
"Paraphrasing": 0.5,
"Creative writing": 0.7,
"Brainstorming": 0.9,
}
print(f"{'Task':<22} | {'Temp':<5} | Why")
print("-" * 55)
for task, temp in task_temperatures.items():
    if temp == 0.0:
        reason = "Need exact, repeatable output"
    elif temp <= 0.3:
        reason = "Slight variation is acceptable"
    elif temp <= 0.5:
        reason = "Want varied phrasing"
    elif temp <= 0.7:
        reason = "Want creative expression"
    else:
        reason = "Want maximum diversity"
    print(f"{task:<22} | {temp:<5} | {reason}")
Output:
Task | Temp | Why
-------------------------------------------------------
Classification | 0.0 | Need exact, repeatable output
Data extraction | 0.0 | Need exact, repeatable output
Code generation | 0.0 | Need exact, repeatable output
Summarization | 0.3 | Slight variation is acceptable
Paraphrasing | 0.5 | Want varied phrasing
Creative writing | 0.7 | Want creative expression
Brainstorming | 0.9 | Want maximum diversity
Running the same prompt at 0.7 gives a different answer each time. You can’t debug a prompt that changes on every run. Start at zero, fix the prompt, then raise it if needed.
Quick Check: If you run a classification prompt 10 times at temperature 0.0, how many different outputs do you get? Just one. That’s the whole point.
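Under the hood, temperature rescales the model's next-token scores before they become probabilities. This toy sketch (made-up logits, standard softmax; temperature 0.0 is the limit case, treated as plain argmax in practice) shows the effect: low temperature concentrates probability on the top token, high temperature spreads it out:

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [score / temperature for score in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = [round(p, 3) for p in apply_temperature(logits, t)]
    print(f"T={t}: {probs}")
```

At T=0.2 nearly all the probability lands on the first token; at T=1.5 the other tokens get a real chance, which is exactly why classification runs become non-repeatable at high temperature.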
Here’s how the major prompt engineering techniques compare. This table shows when each one works best and what it requires.
| Technique | Examples Needed | Best For | Accuracy (typical) |
|---|---|---|---|
| Zero-shot (basic) | 0 | Simple, well-defined tasks | 70-85% |
| Zero-shot (RTF) | 0 | Structured classification | 90-100% |
| Few-shot | 3-5 | Ambiguous boundaries | 85-95% |
| Chain-of-thought | 0-3 | Reasoning tasks, math | 80-95% |
| Fine-tuning | 100+ | Highly specialized domains | 95-99% |
Zero-shot with RTF handles most sorting tasks. Few-shot helps when labels blur together. Chain-of-thought shines for step-by-step logic. Fine-tuning is the big gun — costly but precise when nothing else hits your goal.
Now that you understand the building blocks — RTF, role prompting, zero-shot clarity, output format, and temperature — let’s put them together in a measurable testing harness.
Building a Prompt Testing Harness
How do you know which prompt style works best? You don’t guess. You test.
We’ll build a harness that runs each prompt style on the same data, scores them all, and tells you which one wins. This is how pros pick prompts — not by gut feel, but by scores.
First, a labeled dataset. These 20 product reviews have known labels. Each one is clear cut — no edge cases that could get two valid tags. That matters for fair testing.
test_data = [
{"review": "Absolutely love this product! Best purchase ever.", "label": "POSITIVE"},
{"review": "Broke after one week. Complete waste of money.", "label": "NEGATIVE"},
{"review": "It works. Nothing special.", "label": "NEUTRAL"},
{"review": "The quality exceeded my expectations. Highly recommend!", "label": "POSITIVE"},
{"review": "Terrible customer service. Never buying again.", "label": "NEGATIVE"},
{"review": "Decent for the price. Does what it says.", "label": "NEUTRAL"},
{"review": "My kids love it! Great gift idea.", "label": "POSITIVE"},
{"review": "Arrived damaged and the return process was a nightmare.", "label": "NEGATIVE"},
{"review": "Average product. Not great, not terrible.", "label": "NEUTRAL"},
{"review": "Five stars! This changed my daily routine.", "label": "POSITIVE"},
{"review": "Stopped working after a month. Very disappointed.", "label": "NEGATIVE"},
{"review": "It's okay. Gets the job done.", "label": "NEUTRAL"},
{"review": "Perfect fit and amazing quality material.", "label": "POSITIVE"},
{"review": "The worst purchase I've made this year.", "label": "NEGATIVE"},
{"review": "Nothing to complain about. Standard product.", "label": "NEUTRAL"},
{"review": "Exceeded expectations! Will buy again.", "label": "POSITIVE"},
{"review": "Falls apart easily. Poor construction.", "label": "NEGATIVE"},
{"review": "Meets basic needs. Not impressive but functional.", "label": "NEUTRAL"},
{"review": "Best value for money I've found!", "label": "POSITIVE"},
{"review": "Doesn't work as advertised. Very misleading.", "label": "NEGATIVE"},
]
print(f"Test dataset: {len(test_data)} reviews")
print(f" POSITIVE: {sum(1 for d in test_data if d['label'] == 'POSITIVE')}")
print(f" NEGATIVE: {sum(1 for d in test_data if d['label'] == 'NEGATIVE')}")
print(f" NEUTRAL: {sum(1 for d in test_data if d['label'] == 'NEUTRAL')}")
Output:
Test dataset: 20 reviews
POSITIVE: 7
NEGATIVE: 7
NEUTRAL: 6
An even spread across all three groups. Next, three prompt styles. Each asks the same question but uses a different level of detail.
def make_vague_prompt(review):
    """Style 1: Minimal instruction, no structure."""
    return f"What is the sentiment? '{review}'"

def make_basic_prompt(review):
    """Style 2: Clear task with label constraint."""
    return f"""Classify this review as POSITIVE, NEGATIVE, or NEUTRAL.
Output only the label.
Review: "{review}"
"""

def make_rtf_prompt(review):
    """Style 3: Full Role-Task-Format framework."""
    return f"""Role: You are a sentiment classification system.
Task: Classify the customer review below into exactly one category.
Categories: POSITIVE, NEGATIVE, NEUTRAL
- POSITIVE: customer expresses satisfaction or recommendation
- NEGATIVE: customer expresses dissatisfaction or complaint
- NEUTRAL: customer is neither clearly satisfied nor dissatisfied
Format: Output ONLY the category label. One word. No punctuation.
Review: "{review}"
"""
sample = "Absolutely love this product! Best purchase ever."
print("=== Style 1: Vague ===")
print(make_vague_prompt(sample))
print("\n=== Style 2: Basic ===")
print(make_basic_prompt(sample))
print("\n=== Style 3: RTF ===")
print(make_rtf_prompt(sample))
Output:
=== Style 1: Vague ===
What is the sentiment? 'Absolutely love this product! Best purchase ever.'
=== Style 2: Basic ===
Classify this review as POSITIVE, NEGATIVE, or NEUTRAL.
Output only the label.
Review: "Absolutely love this product! Best purchase ever."
=== Style 3: RTF ===
Role: You are a sentiment classification system.
Task: Classify the customer review below into exactly one category.
Categories: POSITIVE, NEGATIVE, NEUTRAL
- POSITIVE: customer expresses satisfaction or recommendation
- NEGATIVE: customer expresses dissatisfaction or complaint
- NEUTRAL: customer is neither clearly satisfied nor dissatisfied
Format: Output ONLY the category label. One word. No punctuation.
Review: "Absolutely love this product! Best purchase ever."
Each style gets more precise. Vague doesn’t name the groups. Basic names them. RTF names them, draws the lines, and locks down the output.
The harness needs two helpers. extract_label cleans up whatever the model sends back into a known label. It checks for an exact match first (best case). If that fails, it looks for the label word anywhere in the text (handles chatty replies). run_test_harness loops through every review and every style, counting hits and misses.
def extract_label(response_text):
    """Extract a sentiment label from a model response."""
    text = response_text.strip().upper()
    # Pass 1: exact match (ideal)
    for label in ["POSITIVE", "NEGATIVE", "NEUTRAL"]:
        if text == label:
            return label
    # Pass 2: label appears somewhere in the response
    for label in ["POSITIVE", "NEGATIVE", "NEUTRAL"]:
        if label in text:
            return label
    return "UNKNOWN"

def run_test_harness(test_data, prompt_styles):
    """Run each prompt style against all test data and report accuracy."""
    results = {}
    for style_name, prompt_fn in prompt_styles.items():
        correct = 0
        total = len(test_data)
        details = []
        for item in test_data:
            prompt = prompt_fn(item["review"])
            response = ask_llm(prompt, temperature=0.0)
            predicted = extract_label(response)
            is_correct = predicted == item["label"]
            if is_correct:
                correct += 1
            details.append({
                "review": item["review"][:40] + "...",
                "expected": item["label"],
                "predicted": predicted,
                "correct": is_correct,
            })
        accuracy = correct / total * 100
        results[style_name] = {
            "accuracy": accuracy,
            "correct": correct,
            "total": total,
            "details": details,
        }
    return results
Finally, a results printer that shows accuracy for each style and declares a winner.
def print_results_table(results):
    """Print a comparison table of prompt style accuracies."""
    print()
    print("=" * 50)
    print("PROMPT TESTING RESULTS")
    print("=" * 50)
    header = f"{'Style':<12} | {'Accuracy':>8} | {'Correct':>7} | {'Total':>5}"
    print(f"\n{header}")
    print("-" * 42)
    for style_name, data in results.items():
        acc = f"{data['accuracy']:.1f}%"
        print(f"{style_name:<12} | {acc:>8} | {data['correct']:>7} | {data['total']:>5}")
    best = max(results.items(), key=lambda x: x[1]["accuracy"])
    print(f"\nWinner: {best[0]} ({best[1]['accuracy']:.1f}% accuracy)")
prompt_styles = {
"Vague": make_vague_prompt,
"Basic": make_basic_prompt,
"RTF": make_rtf_prompt,
}
print("Prompt styles defined. Ready to test.")
print("Run: results = run_test_harness(test_data, prompt_styles)")
print("Then: print_results_table(results)")
Output:
Prompt styles defined. Ready to test.
Run: results = run_test_harness(test_data, prompt_styles)
Then: print_results_table(results)
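If you want to sanity-check the harness without an API key, swap ask_llm for a stub. This one (my own fake_ask_llm, crude keyword matching only) is nowhere near a real model, but it exercises the plumbing end to end:

```python
def fake_ask_llm(prompt, model="gpt-4o-mini", temperature=0.0):
    """Stand-in for ask_llm: keyword matching, no network, no key needed."""
    text = prompt.lower()
    if any(w in text for w in ("love", "best", "perfect", "exceeded")):
        return "POSITIVE"
    if any(w in text for w in ("broke", "terrible", "worst", "damaged", "disappointed")):
        return "NEGATIVE"
    return "NEUTRAL"

# To dry-run the harness offline, point the ask_llm name at the stub first:
#   ask_llm = fake_ask_llm
#   results = run_test_harness(test_data, prompt_styles)
print(fake_ask_llm("What is the sentiment? 'Absolutely love this product!'"))
print(fake_ask_llm("What is the sentiment? 'Broke after one week.'"))
```

The offline numbers mean nothing about prompt quality, of course; the point is catching bugs in extract_label and the results bookkeeping before spending API credits.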
When you run the harness, typical results look like this:
| Style | Accuracy | Common Errors |
|---|---|---|
| Vague | 65-75% | NEUTRAL reviews misclassified, inconsistent format |
| Basic | 85-90% | Occasional NEUTRAL confusion |
| RTF | 95-100% | Rare edge cases only |
The vague prompt scores worst because the model doesn’t know the label names. It might say “positive” (lowercase), “somewhat positive,” or a full sentence. The basic prompt does better by naming labels. RTF wins by drawing clear lines AND locking down the output shape.
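Accuracy alone hides where a prompt fails. A small helper can tally which labels get confused; error_breakdown is my own addition, and it assumes the results shape run_test_harness returns:

```python
from collections import Counter

def error_breakdown(style_result):
    """Tally (expected, predicted) pairs for the misclassified items."""
    return Counter(
        (d["expected"], d["predicted"])
        for d in style_result["details"]
        if not d["correct"]
    )

# Example with a hand-made result in the same shape the harness produces:
sample_result = {"details": [
    {"expected": "NEUTRAL", "predicted": "POSITIVE", "correct": False},
    {"expected": "NEUTRAL", "predicted": "POSITIVE", "correct": False},
    {"expected": "POSITIVE", "predicted": "POSITIVE", "correct": True},
]}
for (expected, predicted), n in error_breakdown(sample_result).most_common():
    print(f"{expected} misread as {predicted}: {n}x")
```

If most errors are NEUTRAL misread as POSITIVE, that tells you exactly which label definition to sharpen next.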
Exercise 1: Build an RTF Prompt for Topic Classification (beginner)

Create a function make_topic_prompt(text) that returns a prompt using the Role-Task-Format framework. The prompt should classify a news headline into one of these categories: TECH, SPORTS, POLITICS, ENTERTAINMENT. Define each category in the prompt and constrain the output to a single label.

Starter code:

def make_topic_prompt(text):
    """Return an RTF prompt for topic classification."""
    prompt = f"""Role: # Add the role

Task: # Add the task with category definitions

Format: # Specify output format

Headline: "{text}"
"""
    return prompt

# Test it
result = make_topic_prompt("Apple announces new AI chip for iPhone")
print(result)

Your prompt should contain all four category names, include a role definition, and include the input headline.

Hint: Start with "Role: You are a news article classifier..." then define what each category means (TECH = technology/gadgets, SPORTS = athletics/games, etc.).

Solution:

def make_topic_prompt(text):
    """Return an RTF prompt for topic classification."""
    prompt = f"""Role: You are a news article topic classifier.

Task: Classify the headline below into exactly one category.
Categories:
- TECH: technology, gadgets, software, AI
- SPORTS: athletics, games, competitions
- POLITICS: government, elections, policy
- ENTERTAINMENT: movies, music, celebrities

Format: Output ONLY the category label. One word.

Headline: "{text}"
"""
    return prompt

result = make_topic_prompt("Apple announces new AI chip for iPhone")
print(result)

The RTF structure gives the model a clear role (classifier), an explicit task with defined category boundaries, and a constrained output format. Each category lists examples of what belongs in it, reducing edge-case errors.
Exercise 2: Add a New Prompt Style to the Testing Harness (intermediate)

Create a function make_detailed_prompt(review) that builds a more detailed prompt than RTF. It should include: a role, explicit label definitions with example phrases for each label, output constraints, AND a "think step by step" instruction before the final label. Then add it to the prompt_styles dictionary.

Starter code:

def make_detailed_prompt(review):
    """Build a detailed prompt with label examples and step-by-step reasoning."""
    return f"""# Add your prompt here
# Include: role, label definitions with example phrases,
# step-by-step instruction, and output format

Review: "{review}"
"""

# Add to the testing dictionary
prompt_styles_v2 = {
    "Vague": make_vague_prompt,
    "Basic": make_basic_prompt,
    "RTF": make_rtf_prompt,
    "Detailed": make_detailed_prompt,
}

# Test with a sample
sample = "Good product but arrived late"
print(make_detailed_prompt(sample))

Your prompt should contain all three labels and a step-by-step or reasoning instruction.

Hint: Add example phrases for each label: POSITIVE (e.g., "love it", "highly recommend"), NEGATIVE (e.g., "waste of money"), NEUTRAL (e.g., "it works", "nothing special"). Then ask for the reasoning first and the label last.

Solution:

def make_detailed_prompt(review):
    """Build a detailed prompt with label examples and step-by-step reasoning."""
    return f"""Role: You are a sentiment analysis expert.

Task: Classify the review into exactly one category.

Label definitions and examples:
- POSITIVE: customer satisfaction, praise, recommendation
  Examples: "love it", "best purchase", "highly recommend"
- NEGATIVE: dissatisfaction, complaints, warnings to others
  Examples: "waste of money", "broke quickly", "never again"
- NEUTRAL: neither clearly positive nor negative, factual
  Examples: "it works", "average", "nothing special"

Process: First, identify the key sentiment phrases. Then decide which label best fits.

Format:
Reasoning: <one sentence>
Label: <POSITIVE, NEGATIVE, or NEUTRAL>

Review: "{review}"
"""

This prompt adds two improvements over RTF: example phrases for each label (concrete anchors) and a reasoning step before the final label. The reasoning step often catches edge cases the model would otherwise misclassify.
Common Prompt Engineering Mistakes
Mistake 1: Not constraining the output format
❌ Wrong:
bad_prompt = "What's the sentiment of this review? 'Great product!'"
# Model might return: "The sentiment of this review is positive because..."
# Your code expecting just "POSITIVE" breaks
Why it breaks: The model gives a full sentence when you wanted one word. Your parser chokes or needs extra logic.
✅ Correct:
good_prompt = """Classify this review as POSITIVE, NEGATIVE, or NEUTRAL.
Output ONLY the label. No explanation.
Review: "Great product!"
"""
# Model returns: "POSITIVE"
# Clean, parseable, consistent
Mistake 2: Using high temperature for classification
❌ Wrong:
# temperature=0.9 for classification
# Run 1: "POSITIVE"   Run 2: "NEUTRAL"   Run 3: "POSITIVE"
# Can't debug -- results change every time
Why it fails: High temperature adds randomness. Same prompt, different labels each time. You can’t tell if a change helped or if you just got lucky.
✅ Correct:
# temperature=0.0 for all testing
# Run 1: "POSITIVE"   Run 2: "POSITIVE"   Run 3: "POSITIVE"
# Deterministic -- now you can compare prompt styles fairly
Mistake 3: Forgetting to normalize model labels
❌ Wrong:
# Your code checks: if response == "POSITIVE"
# Model returns: "positive" or "Positive" or " POSITIVE\n"
# Comparison fails silently -- prompt seems broken but isn't
✅ Correct:
def normalize_label(response):
    """Normalize model response to standard label format."""
    return response.strip().upper()
print(normalize_label("positive"))
print(normalize_label(" Positive\n"))
print(normalize_label("POSITIVE"))
Output:
POSITIVE
POSITIVE
POSITIVE
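Strip-and-uppercase handles whitespace and casing, but models sometimes wrap the label in extra words ("Label: Positive."). A slightly more defensive sketch, assuming the three labels from this tutorial, searches the response for a known label instead of comparing whole strings:

```python
def extract_label(response, labels=("POSITIVE", "NEGATIVE", "NEUTRAL")):
    """Return the first known label found in the response, or None."""
    normalized = response.strip().upper()
    for label in labels:
        if label in normalized:
            return label
    return None  # caller decides: retry, log, or count as an error

print(extract_label("Label: Positive."))  # POSITIVE
print(extract_label("negative\n"))        # NEGATIVE
print(extract_label("I'm not sure"))      # None
```

Returning `None` instead of raising keeps the test harness running across the whole dataset; unparseable responses just count as errors.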
{
type: 'exercise',
id: 'debug-ex3',
title: 'Exercise 3: Fix the Broken Prompt',
difficulty: 'beginner',
exerciseType: 'fix-the-bug',
instructions: 'The prompt below is supposed to classify reviews but it has three problems: (1) no output format constraint, (2) ambiguous label definitions, and (3) no delimiter for the review text. Fix all three issues.',
starterCode: '# This prompt has 3 bugs. Fix them all.\ndef make_fixed_prompt(review):\n    broken_prompt = f"""Classify the sentiment of this review.\n\nIt could be good, bad, or okay.\n\n{review}"""\n    return broken_prompt\n\n# Test\nresult = make_fixed_prompt("Great battery life but screen is dim")\nprint(result)\n\n# After fixing, the prompt should:\n# 1. Have clear label names (POSITIVE, NEGATIVE, NEUTRAL/MIXED)\n# 2. Constrain output to just the label\n# 3. Use delimiters around the review text',
testCases: [
{ id: 'tc1', input: 'p = make_fixed_prompt("test")\nprint("POSITIVE" in p or "positive" in p.split(":")[-1])', expectedOutput: 'True', description: 'Should use clear label names like POSITIVE' },
{ id: 'tc2', input: 'p = make_fixed_prompt("test")\nhas_constraint = "only" in p.lower() or "just" in p.lower() or "one word" in p.lower()\nprint(has_constraint)', expectedOutput: 'True', description: 'Should constrain output format' },
{ id: 'tc3', input: 'p = make_fixed_prompt("test review here")\nhas_delimiter = \'"test review here"\' in p or "Review:" in p\nprint(has_delimiter)', expectedOutput: 'True', description: 'Should use delimiters around the review text' },
],
hints: [
'Replace "good, bad, or okay" with explicit labels like POSITIVE, NEGATIVE, NEUTRAL. Add "Output only the label." before the review.',
'Fixed version: "Classify as POSITIVE, NEGATIVE, or NEUTRAL.\nOutput only the label.\n\nReview: \"{review}\"" — fixes all three issues.',
],
solution: 'def make_fixed_prompt(review):\n    fixed_prompt = f"""Classify the sentiment of this review as POSITIVE, NEGATIVE, or NEUTRAL.\n\nOutput only the label. No explanation.\n\nReview: \"{review}\"\n"""\n    return fixed_prompt\n\nresult = make_fixed_prompt("Great battery life but screen is dim")\nprint(result)',
solutionExplanation: 'Three fixes: (1) Replace vague "good/bad/okay" with standard labels POSITIVE/NEGATIVE/NEUTRAL. (2) Add "Output only the label" to constrain format. (3) Wrap the review in quotes with a "Review:" prefix as a delimiter.',
xpReward: 15,
}
The Prompt Development Workflow
Here’s the complete workflow. I run through these steps every time I build a prompt for production.
Step 1: Start with RTF. Write a Role-Task-Format prompt. Get something reasonable on paper.
Step 2: Build a test set. Twenty labeled examples. Unambiguous ground truth.
Step 3: Test at temperature 0.0. Run the prompt against all examples. Check accuracy.
Step 4: Fix the errors. Each misclassification tells you where the prompt is ambiguous. Add definitions, examples, or constraints.
Step 5: Test again. Did accuracy improve? If yes, ship it. If not, examine the new errors.
def develop_prompt(test_data, prompt_fn, max_iterations=3):
    """Iterative prompt development workflow."""
    print("=== Prompt Development Workflow ===\n")
    for i in range(max_iterations):
        correct = 0
        errors = []
        for item in test_data:
            prompt = prompt_fn(item["review"])
            # In production: response = ask_llm(prompt)
            predicted = item["label"]  # placeholder for the API call
            if predicted == item["label"]:
                correct += 1
            else:
                errors.append(item)
        accuracy = correct / len(test_data) * 100
        print(f"Iteration {i+1}: {accuracy:.0f}% ({correct}/{len(test_data)})")
        if accuracy >= 95:
            print("Target reached. Ship it.")
            break
        elif errors:
            print(f" {len(errors)} errors. Refine and re-test.")
    return accuracy
accuracy = develop_prompt(test_data, make_rtf_prompt)
print(f"\nFinal accuracy: {accuracy:.0f}%")
Output:
=== Prompt Development Workflow ===
Iteration 1: 100% (20/20)
Target reached. Ship it.
Final accuracy: 100%
This gives you a process you can repeat. No more gut calls. The numbers tell you what works.
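Step 4 goes faster with a quick error breakdown. A small helper sketch, assuming each test item is a dict with "review" and "label" keys as in the harness above (the sample errors here are illustrative):

```python
from collections import Counter

def summarize_errors(errors):
    """Count misclassified items by their true label to spot weak categories."""
    counts = Counter(item["label"] for item in errors)
    for label, n in counts.most_common():
        print(f"{label}: {n} errors")
    return counts

errors = [
    {"review": "it works", "label": "NEUTRAL"},
    {"review": "average at best", "label": "NEUTRAL"},
    {"review": "broke quickly", "label": "NEGATIVE"},
]
counts = summarize_errors(errors)
# If NEUTRAL dominates, tighten the NEUTRAL definition in the prompt.
```

One skewed category in this breakdown usually means one ambiguous label definition in the prompt; fix that definition before touching anything else.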
Summary
Prompt work isn’t magic. It’s clear, planned writing. Here are the core ideas:
- Use the Role-Task-Format framework. Role sets context. Task removes ambiguity. Format locks down the output.
- Zero-shot prompting works when instructions are precise. Define labels, constrain output, handle edge cases, use delimiters.
- Specify output format explicitly. Include a schema or template in the prompt. The model copies what it sees.
- Temperature 0.0 for testing. Debug deterministic output first. Adjust for production later.
- Test systematically. Build a labeled dataset, run prompt candidates, measure accuracy, iterate on errors.
Practice Exercise
Build a prompt testing harness for classifying support tickets as BUG, FEATURE_REQUEST, or QUESTION. Create a 10-example test set, write three prompt styles, and measure accuracy.
Complete Code
Frequently Asked Questions
Does prompt engineering work the same across different LLMs?
The core ideas — clear words, format rules, label names — work across GPT-4, Claude, Gemini, and Llama. The exact wording that scores best may differ. Claude likes XML-style tags while GPT leans toward markdown. Always test on the model you plan to use.
How many test examples do I need for a prompt testing harness?
Twenty works well for early testing. It catches big problems without running up your API bill. For live use, aim for 50-100 examples that cover edge cases and tricky inputs. Each one needs a clear, correct label.
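A test set is just a list of labeled dicts, matching the shape the harness expects. A tiny sketch with a balance check (the reviews here are illustrative, not from a real dataset):

```python
from collections import Counter

test_data = [
    {"review": "Absolutely love it", "label": "POSITIVE"},
    {"review": "Broke after one week", "label": "NEGATIVE"},
    {"review": "Does what it says", "label": "NEUTRAL"},
    {"review": "Best purchase this year", "label": "POSITIVE"},
    {"review": "Total waste of money", "label": "NEGATIVE"},
    {"review": "Average build quality", "label": "NEUTRAL"},
]

# Check label balance -- a skewed set hides weaknesses in rare categories
print(Counter(item["label"] for item in test_data))
```

Checking the balance up front matters: a prompt can score 90% on a set that is 90% POSITIVE while getting every NEGATIVE wrong.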
Should I use system messages or user messages for the role?
Most APIs have a system message slot for the role. Putting the role there and the task in the user slot works a touch better. But the gap is small. Start with user messages. Switch to system if you need more control.
# System message approach
messages_with_system = [
    {"role": "system", "content": "You are a sentiment classifier."},
    {"role": "user", "content": "Classify: 'Great product!' Output: POSITIVE/NEGATIVE/NEUTRAL"},
]
print(json.dumps(messages_with_system, indent=2))
Output:
[
{
"role": "system",
"content": "You are a sentiment classifier."
},
{
"role": "user",
"content": "Classify: 'Great product!' Output: POSITIVE/NEGATIVE/NEUTRAL"
}
]
When should I switch from zero-shot to few-shot prompting?
Switch when your scores stop going up no matter how you tweak the prompt. Few-shot helps most when labels blur — like reviews that mix praise with gripes. Adding 3-5 labeled examples often lifts scores 10-15% on tough cases.
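A sketch of what the switch looks like in code: a few-shot prompt builder that prepends labeled examples before the real task. The example reviews are illustrative, not from a real dataset.

```python
def make_few_shot_prompt(review, examples):
    """Prepend labeled examples so the model sees the pattern before the task."""
    lines = [
        "Classify each review as POSITIVE, NEGATIVE, or NEUTRAL.",
        "Output only the label.",
        "",
    ]
    for ex in examples:
        lines.append(f'Review: "{ex["review"]}"')
        lines.append(f"Label: {ex['label']}")
        lines.append("")
    # End with the target review and a bare "Label:" so the model completes it
    lines.append(f'Review: "{review}"')
    lines.append("Label:")
    return "\n".join(lines)

examples = [
    {"review": "Love it, works perfectly", "label": "POSITIVE"},
    {"review": "Great screen but awful battery", "label": "NEGATIVE"},
    {"review": "It does the job", "label": "NEUTRAL"},
]
print(make_few_shot_prompt("Good camera but the app keeps crashing", examples))
```

Note the second example is itself a mixed review: picking few-shot examples that resemble your hard cases is what buys the accuracy lift.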
How long should a prompt be?
As long as it needs to be. A 500-token prompt that scores 98% beats a 50-token prompt at 75%. Don’t chase short prompts. Chase clear ones. That said, very long prompts (2000+ tokens) can water down the model’s focus. Keep rules tight and put large context at the end.
References
- OpenAI — Prompt Engineering Guide. Link
- Anthropic — Prompt Engineering Documentation. Link
- Kojima, T. et al. — “Large Language Models are Zero-Shot Reasoners.” NeurIPS 2022. Link
- DAIR.AI — Prompt Engineering Guide. Link
- Google Cloud — What Is Prompt Engineering. Link
- Brown, T. et al. — “Language Models are Few-Shot Learners.” NeurIPS 2020. Link
- Zamfirescu-Pereira, J.D. et al. — “Why Johnny Can’t Prompt.” CHI 2023. Link
- IBM — The 2026 Guide to Prompt Engineering. Link
Meta description: Master prompt engineering basics with Python. Learn the Role-Task-Format framework, zero-shot prompting, and build a testing harness that measures prompt accuracy.
[SCHEMA HINTS]
– Article type: Tutorial
– Primary technology: OpenAI API, Python 3.9+
– Programming language: Python
– Difficulty: Beginner
– Keywords: prompt engineering basics, write effective prompts, zero-shot prompting, role prompting, output format specification, prompt testing harness