Prompt Engineering Fundamentals — Reliable LLM Outputs
Apply prompt engineering fundamentals — zero-shot, few-shot, chain-of-thought, and structured output — to get consistent, reliable results from any LLM.
Prompt engineering is the practice of writing clear, structured inputs so that an LLM gives you the right answer, in the right format, every time. In this post, I’ll walk you through the five core techniques — zero-shot, few-shot, role-based, chain-of-thought, and structured output — with runnable Python code you can try right now.
Here’s a scene you’ve likely lived through. You type a clear request to an LLM. What comes back isn’t broken — it’s just off. Wrong layout, wrong voice, missing half the info you asked for.
You fiddle with the prompt. Swap a word, toss in a line. Somehow it starts working. But you can’t say why. A week later the task changes, and you’re guessing all over again.
Prompt engineering closes that loop. It’s not about magic words or clever hacks. It’s about seeing how the model reads your input — and shaping that input so you get reliable results every time.
What Is Prompt Engineering?
In plain terms, prompt engineering is the art of writing inputs that guide the model toward the answer you need. No secret sauce — just clear, organized directions.
Here’s a useful way to think about it. Imagine handing a task to a brand-new intern. If you say “summarize this report,” the result is a coin flip. But tell them “write 3 bullets — focus on revenue trends” and you get exactly the output you had in mind.
LLMs behave the same way. What you get out tracks almost perfectly with how clearly you spelled out what you wanted.
python
from openai import OpenAI
import json
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def ask_llm(prompt, model="gpt-4o-mini", temperature=0.0):
    """Send a prompt to the LLM and return the response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content
This tiny helper does one job: send a prompt and return what the model wrote. I’ll call it throughout the post. The temperature=0.0 setting pins the output down — feed in the same words, get the same reply.
Key Insight: Prompt engineering is about cutting ambiguity, not tricking the model. The sharper you describe what you want, the closer the model gets to giving it to you.
Prerequisites
- Python version: 3.9+
- Required library: openai (1.0+)
- Install: pip install openai
- API key: You need an OpenAI API key. Create one at platform.openai.com/api-keys. Set it as an environment variable: export OPENAI_API_KEY="your-key-here"
- Time to complete: 20-25 minutes
What Is Zero-Shot Prompting?
If you’ve ever typed a question into an LLM without showing it an example first, you’ve already done zero-shot prompting. You give a task, provide zero samples of what the right output looks like, and let the model figure it out from its training.
Here’s a zero-shot request for sentiment tagging:
python
result = ask_llm(
    "Classify the sentiment as positive, negative, or neutral: "
    "'The food was okay but the service was terrible.'"
)
print(result)
Output:
Negative
Clean and correct. The model picked up sentiment tagging from the text it was trained on — no help needed.
But zero-shot stumbles when you need the result in a rigid format. Take a look:
python
result = ask_llm(
    "Extract the product name and price from: "
    "'The new MacBook Pro 16-inch starts at $2,499'"
)
print(result)
Output:
Product Name: MacBook Pro 16-inch
Price: $2,499
The facts are correct. But the way they’re arranged changes from run to run — colons one time, dashes the next, bullets after that. If your code tries to parse this, the inconsistency will bite you.
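To see why that inconsistency bites, here is a small stand-alone sketch (no API call; both reply strings are made up for illustration). A naive parser written for the colon layout silently finds nothing when the reply switches to dashes:

```python
import re

def parse_colon_format(text):
    """Naive parser that expects exact 'Field: value' lines."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^([\w ]+):\s*(.+)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields

run_a = "Product Name: MacBook Pro 16-inch\nPrice: $2,499"
run_b = "- Product Name - MacBook Pro 16-inch\n- Price - $2,499"

print(parse_colon_format(run_a))  # finds both fields
print(parse_colon_format(run_b))  # finds nothing: the layout changed
```

Same facts in both replies; only the second layout breaks the pipeline.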
Tip: Zero-shot shines for tasks the model already knows — sorting sentiment, translating text, writing summaries. The moment you need a locked-down format, reach for few-shot examples or structured output.
Two signs you need something stronger:
- The format matters. If the result flows into another system, you need it to be the same every time.
- The task calls for niche reasoning. Medical, legal, or domain-specific labels need extra guidance.
How Does Few-Shot Prompting Work?
Few-shot prompting tackles the number-one weakness of zero-shot: unreliable formatting. Instead of crossing your fingers that the model picks your preferred layout, you demonstrate the layout with real examples.
The recipe: place 2–5 input-output pairs above your actual question. The model reads the pattern and mirrors it.
Let me redo the sentiment task, this time with a locked format:
python
few_shot_prompt = """Classify the sentiment and confidence.
Use exactly this format:
Sentiment: [positive/negative/neutral]
Confidence: [high/medium/low]

Text: "This is the best phone I've ever owned!"
Sentiment: positive
Confidence: high

Text: "The battery life is decent but nothing special."
Sentiment: neutral
Confidence: medium

Text: "Broke after two days. Complete waste of money."
Sentiment: negative
Confidence: high

Text: "The camera quality is amazing but it overheats."
"""
result = ask_llm(few_shot_prompt)
print(result)
Output:
Sentiment: negative
Confidence: medium
The reply locks onto the template set by the examples — every run, without fail. That’s the whole point of few-shot: you draw the shape, the model colors it in.
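Because the format is now locked, downstream code can read it with a tiny helper. This is a hypothetical sketch (the function name is mine), but it shows the payoff: a predictable reply is machine-readable.

```python
import re

def parse_sentiment(reply):
    """Parse the 'Sentiment: ... / Confidence: ...' format the examples lock in."""
    sentiment = re.search(r"Sentiment:\s*(\w+)", reply)
    confidence = re.search(r"Confidence:\s*(\w+)", reply)
    if not (sentiment and confidence):
        raise ValueError(f"Unexpected reply format: {reply!r}")
    return sentiment.group(1), confidence.group(1)

print(parse_sentiment("Sentiment: negative\nConfidence: medium"))
# → ('negative', 'medium')
```

If the model ever drifts from the template, the helper raises instead of silently returning garbage.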
So how many samples should you provide? For most tasks, three is the magic number:
| Examples | Effect |
|---|---|
| 1 (one-shot) | Picks up the format but may miss edge cases |
| 2-3 | Strong pattern lock, handles tricky inputs |
| 4-5 | Diminishing returns, useful for complex tasks |
| 6+ | Wastes tokens, rarely boosts quality |
Warning: Your examples can plant bias. If every positive example is short and every negative one is long, the model may learn “short = positive.” Mix up the length and style of your samples.
How Do You Pick Good Examples?
How good your samples are matters more than how many you include. Three guidelines:
Guideline 1: Cover the tricky cases. Doing sentiment? Include one mixed-feeling example — not just clear positives and negatives.
Guideline 2: Make every example look the same. If one sample writes “Sentiment:” and the next writes “Sentiment -“, the model has no idea which format to follow.
Guideline 3: Use messy, real-life text. Skip the perfect toy sentences. Feed in inputs that look like what the model will see in the wild.
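Guideline 2 is easy to automate. Here is a sketch (a hypothetical helper, not part of any library) that flags examples missing the agreed field labels before they ever reach a prompt:

```python
def check_example_consistency(examples, required_labels=("Sentiment:", "Confidence:")):
    """Return (index, missing_labels) for every example that breaks the format."""
    problems = []
    for i, example in enumerate(examples):
        missing = [label for label in required_labels if label not in example]
        if missing:
            problems.append((i, missing))
    return problems

examples = [
    'Text: "Great!"\nSentiment: positive\nConfidence: high',
    'Text: "Meh."\nSentiment - neutral',  # wrong separator, missing Confidence
]
print(check_example_consistency(examples))
# → [(1, ['Sentiment:', 'Confidence:'])]
```

Run it on your example set whenever you edit a prompt; a non-empty result means the model is getting mixed signals.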
Quick Check: What would the model return for “It’s fine. Nothing amazing, nothing terrible.” using the few-shot prompt above? Think first. The answer should be Sentiment: neutral / Confidence: high — the text is clearly neutral with strong certainty.
With format issues solved by few-shot, let’s look at a technique that changes the quality and depth of what the model writes.
Exercise 1: Build a Few-Shot Classifier (beginner)
Create a few-shot prompt that classifies support tickets into one category: billing, technical, account, or general. Test it with the sample ticket provided. Hint: add three example tickets before the test ticket (one billing, one technical, one account).
python
few_shot_prompt = """Classify support tickets into one category: billing, technical, account, or general.

Ticket: "Why was I charged twice this month?"
Category: billing

Ticket: "The app crashes when I upload large files"
Category: technical

Ticket: "How do I update my email address?"
Category: account

Ticket: "I can't log into my account after changing my password."
Category:"""

result = ask_llm(few_shot_prompt)
print(result)
Three examples teach the model the classification pattern. Each maps a ticket to one category, and the login/password ticket correctly maps to "account."
How Does Role-Based Prompting Work?
You’ve probably seen the “You are a…” line at the top of prompts. That’s role-based prompting — and it does far more than people give it credit for.
By setting a role, you activate a specific pocket of the model’s knowledge. Tell it “You are a senior Python dev” and you get different code style than “You are a data scientist.” The role steers vocabulary, depth, and the assumptions baked into the reply.
python
def ask_with_role(system_prompt, user_prompt, temperature=0.0):
    """Send a prompt with a system role."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=temperature,
    )
    return response.choices[0].message.content
Here I’ve split the system message (the persona) from the user message (the task). OpenAI’s API reads the system message as standing instructions that color every reply.
Watch how switching the role changes the output. I’ll ask the same question with two different personas:
python
question = "How should I handle missing data in my dataset?"
generic = ask_with_role(
    "You are a helpful assistant.",
    question
)
expert = ask_with_role(
    "You are a senior data scientist at a Fortune 500 company. "
    "Give practical, opinionated advice. Be direct.",
    question
)
print("=== Generic ===")
print(generic[:200])
print("\n=== Expert ===")
print(expert[:200])
The generic reply treats every option the same. The expert takes a position, gives concrete advice, and warns about pitfalls. Exact same question — miles apart in usefulness.
Key Insight: A focused system prompt is like calling in a specialist instead of a generalist. The tighter the role, the sharper and more useful the reply.
Note: System messages differ across LLMs. OpenAI puts them in the `messages` array with `role: "system"`. Anthropic's Claude uses a separate `system` parameter. Google's Gemini uses `system_instruction`. The idea is the same — lasting context that shapes every reply — but the API call looks different. Check each provider's docs.
What Makes a Strong System Prompt?
Vague prompts produce vague results: “You are an expert” gives the model nothing to latch onto. Strong prompts add rules, scope, and a clear persona:
python
system_prompt = """You are a senior data engineer reviewing code
for a production ETL pipeline.
Rules:
- Flag code that won't scale past 1 million rows
- Suggest polars or duckdb when appropriate
- Use concrete numbers in performance estimates
- Be direct. Skip pleasantries."""
More constraints means sharper output. Constraints don’t shrink what the model can do — they point it at the right target.
Now that we’ve covered roles, let’s look at tasks where the model needs to think through several steps before giving an answer.
What Is Chain-of-Thought Prompting?
Not every problem can be solved in one mental leap. When you hand the model a multi-step math or logic puzzle, it sometimes blurts out the wrong answer because it tried to do everything in its head at once.
Chain-of-thought (CoT) prompting is the fix. You ask the model to lay out its reasoning — one step at a time. Think about the difference between shouting a guess across the room and working through the problem on a whiteboard.
python
cot_prompt = """A farmer has 3 fields.
Field A: 120 bushels/acre, 5 acres.
Field B: 95 bushels/acre, 8 acres.
Field C: 110 bushels/acre, 3 acres.
Sells at $4.50/bushel, transport costs $1.20/bushel.
What is the total profit?
Think step by step. Show your work."""
result = ask_llm(cot_prompt)
print(result)
The model lays out the work in stages: figure out each field’s harvest, total them up, compute the gross, and take away shipping costs. Each stage is easy to verify. If the final number is wrong, you can find exactly where the logic went off track.
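A nice property of chain-of-thought is that each intermediate step can be checked mechanically. Here is the farmer problem verified in plain Python, no LLM involved:

```python
# Per-field yields: (bushels per acre, acres)
fields = {"A": (120, 5), "B": (95, 8), "C": (110, 3)}

total_bushels = sum(rate * acres for rate, acres in fields.values())  # 600 + 760 + 330
profit_per_bushel = 4.50 - 1.20  # sale price minus transport cost
total_profit = total_bushels * profit_per_bushel

print(total_bushels)              # 1690
print(round(total_profit, 2))     # 5577.0
```

If the model's final number disagrees with this, comparing its intermediate lines against the per-field sums pinpoints exactly which step went wrong.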
What Is Zero-Shot CoT?
You don’t always have to spell out each step. Sometimes just tacking on “Let’s think step by step” does the trick. That’s zero-shot chain-of-thought.
python
simple = "What is 23 * 17 + 45 - 12 * 3?"
cot = simple + "\n\nLet's think step by step."
print(f"Direct: {ask_llm(simple)}")
print(f"CoT: {ask_llm(cot)}")
Kojima et al. (2022) showed that just appending “Let’s think step by step” lifts reasoning scores across arithmetic, common-sense, and symbolic tasks — no examples needed.
Tip: Reach for chain-of-thought any time the task has multiple steps. Calculations, multi-criteria choices, debugging, logic puzzles — anywhere the middle steps matter.
When Does CoT Hurt More Than Help?
Keep in mind: CoT isn’t free. It uses more tokens and takes longer. For simple labels or data pulls, the extra steps add overhead without lifting accuracy. A good rule: if a human could answer without jotting anything down, you probably don’t need CoT.
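The cost difference is easy to ballpark. The 4-characters-per-token ratio below is a rough rule of thumb for English prose, not an exact tokenizer:

```python
def estimate_tokens(text):
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

direct_answer = "400"
cot_answer = ("First, 23 * 17 = 391. Then 391 + 45 = 436. "
              "Next, 12 * 3 = 36. Finally, 436 - 36 = 400.")

print(estimate_tokens(direct_answer), estimate_tokens(cot_answer))
```

For a simple arithmetic question, the reasoning trace costs an order of magnitude more output tokens than the bare answer. Worth it for hard problems, pure overhead for easy ones.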
Think ahead: You ask: “Is 97 a prime number? Let’s think step by step.” What will the model do? It will test divisibility by 2, 3, 5, 7, then conclude 97 is prime. The step-by-step layout makes the reasoning see-through and easy to verify.
Exercise 2: Chain-of-Thought for Multi-Step Reasoning (beginner)
Write a chain-of-thought prompt for this problem: "A store offers 20% off on orders over $100. Tax is 8%. Someone buys 3 items at $45 each. What is the final price?" Make the model show each step. Hint: add "Think step by step" or list numbered steps after the problem.
python
problem = ("A store offers 20% off on orders over $100. Tax is 8%. "
           "Someone buys 3 items at $45 each. What is the final price?")

cot_prompt = f"""{problem}

Solve step by step:
1. Calculate the subtotal
2. Check if the discount applies
3. Apply the discount if applicable
4. Calculate tax on the discounted price
5. Calculate the final price

Show your work."""

result = ask_llm(cot_prompt)
print(result)
Numbered steps force the model to break the problem apart: subtotal ($135), discount check (yes, over $100), discounted price ($108), tax ($8.64), final price ($116.64). Each step is verifiable.
How Does Structured Output Work — JSON Mode and Beyond?
So far, every method has given you free-form text. That works when a person reads it. But when your code needs to parse the reply, free text is fragile.
Structured output makes the model reply in a fixed shape — usually JSON. No fluff, no prose, just clean data your program can eat.
OpenAI gives you two ways to do this: JSON mode (simple) and Structured Outputs (strict schema lock).
What Is JSON Mode?
JSON mode makes sure the reply is valid JSON. Turn it on with response_format={"type": "json_object"} and tell the model what fields you want in the prompt.
python
json_prompt = """Extract product info from this review as JSON:
"I bought the Sony WH-1000XM5 headphones for $348.
The noise cancellation is the best I've tried.
Battery lasts about 30 hours."
Fields: product_name, brand, price, key_features (list),
rating_sentiment"""
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": json_prompt}],
    response_format={"type": "json_object"},
    temperature=0.0,
)
data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))
Output:
{
  "product_name": "WH-1000XM5",
  "brand": "Sony",
  "price": 348,
  "key_features": [
    "Noise cancellation",
    "30-hour battery life"
  ],
  "rating_sentiment": "positive"
}
The output is always valid JSON. No preamble, no markdown wrapper, no “Here’s the JSON:” prefix. Just raw data you can hand to json.loads().
How Do Structured Outputs with Pydantic Work?
For real apps, JSON mode alone isn’t enough. You want the full package — locked field names, correct types, and a shape that never shifts.
That’s what Structured Outputs gives you. It works with GPT-4o and later models via the client.beta.chat.completions.parse call. (The .beta namespace signals that the SDK interface may still evolve in future versions, even though the feature itself is generally available.)
You set up a Pydantic model, and the API makes sure the reply fits it exactly:
python
from pydantic import BaseModel

class ProductReview(BaseModel):
    product_name: str
    brand: str
    price: float
    currency: str
    sentiment: str
    key_features: list[str]
    recommendation: bool

response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": (
            "Extract product info from: 'The Dyson V15 at $749 "
            "is expensive but the laser dust detection is a "
            "game-changer. Absolutely worth it.'"
        )}
    ],
    response_format=ProductReview,
    temperature=0.0,
)
review = response.choices[0].message.parsed
print(f"Product: {review.product_name}")
print(f"Price: {review.currency}{review.price}")
print(f"Sentiment: {review.sentiment}")
print(f"Recommend: {review.recommendation}")
Output:
Product: Dyson V15
Price: $749.0
Sentiment: positive
Recommend: True
All fields present. All types correct. price arrives as a float, not a string. recommendation is a Python bool, not “yes.” This is what solid LLM integration looks like.
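To make "all fields present, all types correct" concrete, here is what that guarantee looks like as a plain-Python check. This is a hypothetical validator for illustration only; the real enforcement happens server-side in the Structured Outputs API:

```python
SCHEMA = {
    "product_name": str, "brand": str, "price": float, "currency": str,
    "sentiment": str, "key_features": list, "recommendation": bool,
}

def validate(data):
    """Return (missing_fields, wrong_type_fields) against the schema."""
    missing = [k for k in SCHEMA if k not in data]
    wrong = [k for k, t in SCHEMA.items()
             if k in data and not isinstance(data[k], t)]
    return missing, wrong

good = {"product_name": "Dyson V15", "brand": "Dyson", "price": 749.0,
        "currency": "$", "sentiment": "positive",
        "key_features": ["laser dust detection"], "recommendation": True}

print(validate(good))                 # ([], []) -> every check passes
print(validate({"price": "749"})[1])  # ['price'] (string, not float)
```

With JSON mode alone, every one of these checks is your responsibility; with Structured Outputs, a reply that would fail them never reaches your code.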
Warning: Always set `temperature=0.0` for data extraction. Higher temps add randomness that can shift field values between runs. For pulling data, you want the same answer every time.
How Do Prompt Templates Help?
Prompts pasted all over your codebase turn into a mess fast. Once a prompt works well, wrap it in a reusable template.
A template is just a string with blanks you fill at runtime. Here’s an example for data extraction:
python
def create_extraction_prompt(text, fields):
    """Build a reusable extraction prompt."""
    field_list = "\n".join(f"- {field}" for field in fields)
    return f"""Extract information from the text below.
Fields to extract:
{field_list}
Rules:
- If a field is not found, use null
- Return valid JSON only
- No explanations or extra text
Text: "{text}"
"""
prompt = create_extraction_prompt(
    text="John Smith, age 34, works at Google as a Senior "
         "Engineer since 2019.",
    fields=["name", "age", "company", "job_title", "start_year"]
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0.0,
)
data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))
Output:
{
  "name": "John Smith",
  "age": 34,
  "company": "Google",
  "job_title": "Senior Engineer",
  "start_year": 2019
}
Change the text and fields values and you’ve built a new extraction pipeline — zero prompt rewriting needed.
Tip: Version your prompt templates. Keep them in a config file or database, not sprinkled through your app code. That way you can A/B test prompts without shipping new code.
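One lightweight way to do that is a registry keyed by name and version. The names and structure below are illustrative, not a specific library:

```python
# Prompt templates live in data, keyed by (name, version).
PROMPTS = {
    ("extract_person", "v1"): 'Extract name and age from: "{text}"',
    ("extract_person", "v2"): (
        "Extract name and age from the text below.\n"
        "Return valid JSON only. Use null for missing fields.\n"
        'Text: "{text}"'
    ),
}

def get_prompt(name, version, **kwargs):
    """Fetch a versioned template and fill in its blanks."""
    return PROMPTS[(name, version)].format(**kwargs)

print(get_prompt("extract_person", "v1", text="John Smith, age 34"))
```

Routing a percentage of traffic to "v2" while the rest stays on "v1" gives you an A/B test with no code deploy, just a data change.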
How Do Temperature and Parameters Shape Responses?
Temperature sets how much randomness the model adds. It pairs with your prompt in ways that aren’t always clear at first glance.
At temperature=0.0, the model grabs the most likely next token each time. Same input, same output. At temperature=1.0, the model casts a wider net, pulling in less obvious choices.
Here’s what that looks like in practice:
python
creative_prompt = (
    "Write a one-sentence tagline for a coffee shop "
    "called 'Midnight Brew'."
)
for temp in [0.0, 0.5, 1.0]:
    result = ask_llm(creative_prompt, temperature=temp)
    print(f"Temperature {temp}: {result}")
At 0.0, you’ll see the same tagline every run. At 1.0, each call gives you a fresh spin. Neither is “right” — it depends on the job.
How Do You Pick the Right Temperature?
| Task Type | Temperature | Why |
|---|---|---|
| Data extraction | 0.0 | Consistency matters, creativity doesn’t |
| Classification | 0.0 – 0.3 | Small flex for gray areas |
| Summarization | 0.3 – 0.5 | Varied wording, but stay factual |
| Creative writing | 0.7 – 1.0 | You want novel phrasing |
| Brainstorming | 0.9 – 1.2 | Maximum spread of ideas |
Warning: Don’t change temperature AND top_p at the same time. OpenAI says to tweak one or the other. They both control randomness, and stacking them leads to weird results.
What Other Parameters Are Worth Knowing?
max_tokens caps how long the reply can be. Set it too low and you’ll get cut-off sentences. For extraction tasks, 500–1000 tokens is enough. For long-form writing, go 2000–4000.
top_p (nucleus sampling) is a different knob for randomness. Instead of rescaling probabilities, it restricts sampling to the smallest set of tokens whose combined probability reaches the threshold. top_p=0.1 means only the tokens covering the top 10% of probability mass get considered.
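Both knobs are easy to see on a toy distribution. This self-contained simulation (made-up logit values, not from a real model) shows temperature reshaping the probabilities and top_p trimming the candidate pool:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then softmax: low T sharpens, high T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
print([round(x, 3) for x in softmax_with_temperature(logits, 0.2)])  # near one-hot
print([round(x, 3) for x in softmax_with_temperature(logits, 1.0)])  # spread out
print(top_p_filter(softmax_with_temperature(logits, 1.0), 0.5))      # [0]
```

At T=0.2 nearly all the probability piles onto the top token (that is why temperature=0.0 is effectively deterministic), while at T=1.0 the tail tokens stay in play; top_p=0.5 then keeps only the head of the distribution.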
Here’s the hidden link: a sharp prompt lowers the need for low temperature. A solid few-shot prompt gives steady results even at 0.5. A vague prompt needs 0.0 just to keep the model on track.
How Do You Build a Real-World Review Analyzer?
Let’s pull everything together into a production-grade example. We’ll build a review analyzer that uses role-based prompting, structured output, and explicit rules — all at once.
The analyze_review function below calls the client.beta.chat.completions.parse method from the Structured Output section. It sends a system prompt (the role) and user prompt (the review), and gets back a typed Python object:
python
from pydantic import BaseModel
class ReviewAnalysis(BaseModel):
    sentiment: str
    confidence: float
    key_themes: list[str]
    pros: list[str]
    cons: list[str]
    summary: str
    action_items: list[str]

def analyze_review(review_text):
    """Analyze a product review with structured output."""
    system_prompt = """You are a product analyst at an e-commerce
company specializing in customer feedback analysis.
Rules:
- sentiment: exactly positive, negative, mixed, or neutral
- confidence: float between 0.0 and 1.0
- key_themes: 2-5 recurring themes
- Be specific in pros/cons — quote the review
- summary: one sentence, max 20 words
- action_items: what should the product team do?"""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Analyze:\n\n{review_text}"},
        ],
        response_format=ReviewAnalysis,
        temperature=0.0,
    )
    return response.choices[0].message.parsed
Notice how three methods stack in a single function: a persona (system prompt), a locked data shape (Pydantic), and clear guardrails (the rules list).
Time to run it on a real review:
python
sample_review = """
I've been using this standing desk for 6 months now. The motor
is whisper-quiet and the height presets are fantastic. Build
quality feels solid — no wobble even at max height.
However, the cable management tray is too small for a full
setup, and the desktop surface scratches easily. Customer
support took 2 weeks to respond to a warranty question.
Overall worth the $599 price tag, but not perfect.
"""
analysis = analyze_review(sample_review)
print(f"Sentiment: {analysis.sentiment} ({analysis.confidence})")
print(f"Themes: {', '.join(analysis.key_themes)}")
print("\nPros:")
for pro in analysis.pros:
    print(f"  + {pro}")
print("\nCons:")
for con in analysis.cons:
    print(f"  - {con}")
print(f"\nSummary: {analysis.summary}")
print("\nAction items:")
for item in analysis.action_items:
    print(f"  -> {item}")
Every call gives back the exact same shape — same keys, same types, same nesting. You can push this into a database or dashboard with no glue code at all.
And that’s the line between dabbling in prompts and mastering them. The dabbler gets answers. The skilled engineer gets answers that are stable, parseable, and testable.
Which Technique Should You Pick?
With five approaches on the table, choosing matters. Here’s a quick decision tree based on what your task actually needs:
Start here: Can the model already nail this task with no help?
- Yes → Zero-shot. Don’t add layers you don’t need.
- No → Keep reading.
Does the format need to be exact?
- Yes → Few-shot (show samples) or Structured Output (enforce a schema)
- No, free text is fine → Move to the reasoning check
Does the task require multi-step logic?
- Yes → Chain-of-thought. Add “Think step by step” or list the steps out.
- No → Move to the domain check
Does the task call for special knowledge or a certain voice?
- Yes → Role-based prompting with a detailed system prompt
- No → Clear zero-shot instructions should do
Will code — not a human — read the output?
- Yes → Structured Output with Pydantic. No debate for production work.
- No → Any text-based method works
Most real prompts blend 2–3 approaches. The review analyzer above used role + structured output + rules. A complex analysis pipeline might layer role + CoT + few-shot. Start simple. Add methods when the simpler version misses your quality bar.
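The tree above can be written as a few lines of code. This is my own simplification for illustration (real prompts often combine techniques, as just noted):

```python
def pick_techniques(strict_format=False, multi_step=False,
                    needs_domain=False, machine_read=False):
    """Map the decision-tree questions to a list of techniques."""
    chosen = []
    if machine_read:
        chosen.append("structured output")   # schema wins when code reads the reply
    elif strict_format:
        chosen.append("few-shot")            # examples lock a human-readable format
    if multi_step:
        chosen.append("chain-of-thought")
    if needs_domain:
        chosen.append("role-based")
    return chosen or ["zero-shot"]           # nothing special needed

print(pick_techniques())                                    # ['zero-shot']
print(pick_techniques(machine_read=True, multi_step=True))  # ['structured output', 'chain-of-thought']
```

The second call mirrors the review analyzer: machine-readable output plus non-trivial reasoning means stacking two techniques.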
What Are the Most Common Prompt Engineering Mistakes?
These errors show up over and over in LLM-powered apps. They’re not just bad habits — they cause real failures in production.
Mistake 1: Leaving the Output Format Vague
❌ Wrong:
python
bad = "Tell me about the planets in our solar system."
result = ask_llm(bad)
print(result[:150])
What comes back is a block of unstructured text. Could be numbered, could be paragraphs, could be bullets. If your code expects a table, everything falls apart.
✅ Fixed:
python
good = """List the 8 planets in order from the Sun.
For each: name, type (rocky/gas/ice giant), moon count.
Format as a markdown table."""
result = ask_llm(good)
print(result)
Be explicit about the format you want. The model obeys instructions — but only the ones you actually give it.
Mistake 2: Cramming Too Many Tasks into One Prompt
❌ Wrong:
python
overloaded = """Analyze this review, extract the product name,
determine sentiment, identify issues, suggest a response,
and rate quality 1-10.
Review: 'The headphones sound great but the cushions wore
out after 3 months.'"""
Piling five tasks into one prompt drags down quality across the board. The model tries to juggle everything and ends up doing none of them well.
✅ Fixed: One task per prompt. Chain the results if needed.
python
# Step 1: Extract and classify
extract = """Extract from this review as JSON:
product_type, sentiment, issues (list).
Review: 'The headphones sound great but the cushions
wore out after 3 months.'"""
# Step 2 would use Step 1's output for the response
Mistake 3: Not Giving the Model an Escape Hatch
If the info isn’t in the provided text, the model will invent something. That’s hallucination — and it’s easy to prevent.
❌ Wrong:
python
no_escape = (
    "What is the CEO's favorite color? Context: "
    "'Acme Corp, founded in 2015, makes widgets.'"
)
✅ Fixed:
python
with_escape = """Answer ONLY from the context below.
If the answer isn't in the context, say "Not found."
Context: 'Acme Corp, founded in 2015, makes widgets.'
Question: What is the CEO's favorite color?"""
result = ask_llm(with_escape)
print(result)
Output:
Not found.
Always let the model say “I don’t know.” That one escape hatch cuts hallucination way down.
Mistake 4: Ignoring Token Limits and Position
Every model has a context window — GPT-4o and GPT-4o-mini cap out at 128K tokens. But longer prompts cost more, run slower, and make the model lose focus.
Research shows that models pay less attention to content buried in the middle of a long input. Material near the top and bottom gets stronger weight.
Key Insight: Put your most critical instructions at the top and bottom of long prompts. Those positions get the most attention from the model. Formatting rules belong up front, not buried in the middle.
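A small helper (hypothetical, just to make the placement advice concrete) that keeps critical rules at both ends of a long prompt:

```python
def assemble_prompt(critical_rules, middle_context, question):
    """Place critical instructions first AND last, where attention is strongest."""
    return "\n\n".join([
        critical_rules,
        middle_context,
        question,
        "Reminder of the rules:\n" + critical_rules,
    ])

prompt = assemble_prompt(
    "Return valid JSON only. Use null for missing fields.",
    "long document text goes here",
    "Extract name and age.",
)
print(prompt.startswith("Return valid JSON"))  # rules open the prompt
print("Reminder of the rules" in prompt)       # and close it
```

Repeating a one-line rule costs a handful of tokens; losing the formatting instruction in the middle of a 50K-token context costs a broken parse.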
Summary
Let’s wrap up. Everything in this post reduces to five core moves:
| Technique | When to Use | Key Benefit |
|---|---|---|
| Zero-shot | Simple, well-known tasks | No setup cost |
| Few-shot | Need a locked output format | Consistent results |
| Role-based | Need domain expertise or a certain voice | Deeper, targeted replies |
| Chain-of-thought | Multi-step reasoning | Checkable logic |
| Structured output | Output feeds into code | Schema-enforced data |
Start with zero-shot. If the output drifts, add few-shot samples. If the task needs reasoning, layer in CoT. If another program reads the output, enforce a schema. These methods layer on top of each other — and they’re strongest in combination.
Next up, we put these skills to work by building our first LLM-powered app with LangChain.
Frequently Asked Questions
Does prompt engineering transfer across different LLMs?
The core ideas — be clear, show examples, add structure — apply to every LLM out there. But the fine details shift. GPT-4o handles dense system prompts better than smaller models. Claude follows formatting rules very closely. Open-source options like Llama need more samples to lock in a pattern. Bottom line: always test on the model you plan to ship with.
How do I fix prompts that break on edge cases?
Quickest fix: turn the failing cases into new few-shot examples. If one pattern keeps tripping up, write a constraint that tackles it head-on. In production, validate every reply — confirm the JSON parses, key fields exist, and values fall in the expected range. Reject and re-run when any check fails.
python
# Example: Retry logic for structured output
import json
def safe_extract(prompt, max_retries=3):
    for attempt in range(max_retries):
        result = ask_llm(prompt)
        try:
            data = json.loads(result)
            if "name" in data and "age" in data:
                return data
        except json.JSONDecodeError:
            continue
    return None
What’s the difference between JSON mode and Structured Outputs?
JSON mode only promises valid JSON — it won’t enforce a schema. So the model might return {"answer": "yes"} one call and {"result": true} the next. Both are valid JSON, but your code expects one specific layout. Structured Outputs with Pydantic lock down both the format and the field definitions. Names and types match your model every time. For any real app, always choose Structured Outputs.
Will prompt engineering become obsolete as models get smarter?
Models are getting better at handling fuzzy input, yes. But structured prompts solve engineering problems, not just model limits. Schema enforcement, clear roles, and step-by-step traces bring a level of reliability that vague inputs will never match. Even the strongest 2026 models do better work when you hand them a well-built prompt.
References
- OpenAI — Prompt engineering guide. Link
- OpenAI — Structured Outputs documentation. Link
- Kojima, T., et al. — “Large Language Models are Zero-Shot Reasoners.” NeurIPS 2022. Link
- Wei, J., et al. — “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. Link
- Brown, T., et al. — “Language Models are Few-Shot Learners.” NeurIPS 2020. Link
- DAIR.AI — Prompt Engineering Guide. Link
- OpenAI — Chat Completions API reference. Link
- OpenAI — Best practices for prompt engineering. Link
- Anthropic — Prompt engineering documentation. Link
- Google — Gemini API prompting guide. Link