Prompt Engineering Fundamentals — Reliable LLM Outputs
You send a clear request to an LLM. What comes back is… wrong. Not broken — just off. Wrong format, wrong tone, half the details missing.
So you tinker with the prompt. Swap a word, add a line. After a few tries it works. But you don’t know why it works. And the next time the task shifts even a little, you’re guessing again.
That’s the problem prompt engineering solves. It’s not about magic words. It’s about learning how LLMs read your input — and using that knowledge to get solid results every time.
What Is Prompt Engineering?
Prompt engineering is the art of writing inputs that make LLMs give you what you actually want. No tricks, no secrets — just clear, well-structured requests.
Think of it like giving tasks to a new intern. If you say “summarize this,” you’ll get something. Maybe useful, maybe not. But if you say “write a 3-bullet summary focused on revenue changes” — now you get exactly what you need.
LLMs work the same way. What you get out depends almost entirely on what you put in.
from openai import OpenAI
import json
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def ask_llm(prompt, model="gpt-4o-mini", temperature=0.0):
"""Send a prompt to the LLM and return the response."""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
)
return response.choices[0].message.content
We’ll use this helper all through the article. It sends a prompt to OpenAI’s API and hands back the text. Setting temperature=0.0 pins the output down — same prompt, same result.
Key Insight: Prompt engineering isn’t about gaming the model. It’s about cutting out vagueness. The more clearly you spell out what you want, the better the model delivers.
Prerequisites
- Python version: 3.9+
- Required library: openai (1.0+)
- Install: pip install openai
- API key: You need an OpenAI API key. Create one at platform.openai.com/api-keys. Set it as an environment variable: export OPENAI_API_KEY="your-key-here"
- Time to complete: 20-25 minutes
Zero-Shot Prompting — The Starting Point
Any prompt you’ve written without giving examples is a zero-shot prompt. You hand the model a task with zero samples of what good output looks like. The model leans on what it picked up during training.
Here’s a zero-shot prompt for sentiment analysis:
result = ask_llm(
"Classify the sentiment as positive, negative, or neutral: "
"'The food was okay but the service was terrible.'"
)
print(result)
Negative
It works. The model already knows how to do sentiment analysis from its training data.
But zero-shot has limits. Watch what happens with a more hands-on task:
result = ask_llm(
"Extract the product name and price from: "
"'The new MacBook Pro 16-inch starts at $2,499'"
)
print(result)
Product Name: MacBook Pro 16-inch
Price: $2,499
Looks right. But the format is a coin flip. Sometimes the model uses colons, sometimes dashes, sometimes bullets. If your code has to parse this output, that kind of drift breaks things fast.
Tip: Zero-shot is best for tasks the model already knows well — things like sentiment classification, translation, or summarization. For anything that needs a locked-down output format, you’ll want few-shot examples or structured output.
Two signs it’s time to step up:
- The format matters. If the result feeds into another system, you need it to look the same every time.
- The task calls for niche knowledge. Medical, legal, or domain-specific work needs more guidance.
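When the format drifts like this, downstream parsing breaks. Here's a hypothetical parser (the function name and regex are our own, not from any library) that shows the failure mode:

```python
import re

def parse_extraction(text):
    """Parse 'Field: value' lines from free-form LLM output.

    Deliberately brittle: it assumes colon-separated lines,
    which the model only sometimes produces.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([\w ]+):\s*(.+)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields

# Works when the model happens to use colons...
print(parse_extraction("Product Name: MacBook Pro 16-inch\nPrice: $2,499"))
# ...but silently returns an empty dict when it drifts to dashes.
print(parse_extraction("Product Name - MacBook Pro 16-inch\nPrice - $2,499"))
```

The second call doesn't crash; it quietly loses the data. Silent failures like this are why format drift is dangerous in pipelines.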
Few-Shot Prompting — Teaching by Example
Few-shot prompting fixes zero-shot’s biggest gap: format drift. Instead of hoping the model guesses your layout, you show it exactly what you expect.
Give 2–5 input-output pairs before your real question. The model spots the pattern and follows it.
Here’s the sentiment classifier again, this time with examples that lock the format in place:
few_shot_prompt = """Classify the sentiment and confidence.
Use exactly this format:
Sentiment: [positive/negative/neutral]
Confidence: [high/medium/low]
Text: "This is the best phone I've ever owned!"
Sentiment: positive
Confidence: high
Text: "The battery life is decent but nothing special."
Sentiment: neutral
Confidence: medium
Text: "Broke after two days. Complete waste of money."
Sentiment: negative
Confidence: high
Text: "The camera quality is amazing but it overheats."
"""
result = ask_llm(few_shot_prompt)
print(result)
Sentiment: negative
Confidence: medium
The output matches the format from the examples. Every single time. That’s the whole point of few-shot — you set the pattern, and the model sticks to it.
How many examples do you actually need? Three hits the sweet spot for most tasks:
| Examples | Effect |
|---|---|
| 1 (one-shot) | Gets the format but may not cover edge cases |
| 2–3 | Strong pattern matching, handles tricky inputs |
| 4–5 | Helps with complex tasks, but returns start to flatten |
| 6+ | Burns tokens, rarely makes things better |
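If you collect examples as data, a small helper can assemble the prompt for you. This is a sketch with invented names, not a library function; it builds a few-shot prompt from (input, output) pairs:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from (text, label) example pairs."""
    parts = [instruction, ""]
    for text, label in examples:
        parts.append(f'Text: "{text}"')
        parts.append(label)
        parts.append("")
    parts.append(f'Text: "{query}"')
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive, negative, or neutral.",
    [("Best phone ever!", "Sentiment: positive"),
     ("Broke after two days.", "Sentiment: negative")],
    "The camera is amazing but it overheats.",
)
print(prompt)
```

Keeping examples as data also makes it easy to add a failing case later without rewriting the prompt string.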
Warning: Your examples can steer the model in ways you don’t intend. If every positive example is short and every negative one is long, the model might learn “short = positive.” Mix up the length and style of your examples.
Picking Good Examples
The quality of your examples matters more than how many you include. Three rules:
Rule 1: Cover the tricky cases. If you’re classifying sentiment, throw in one mixed-sentiment example.
Rule 2: Keep the format the same across all examples. If one uses “Sentiment:” and another says “Sentiment -”, the model won’t know which to follow.
Rule 3: Use messy, real-world text. Don’t use clean toy sentences. Use the kind of rough input the model will see in the wild.
Quick Check: What would the model say for “It’s fine. Nothing amazing, nothing terrible” using our few-shot prompt? Think first. The answer: Sentiment: neutral / Confidence: high — the text is clearly in the middle, and the model can tell with high certainty.
Now that few-shot has locked down your formats, let’s look at how giving the model a role changes the depth of its answers.
Role-Based Prompting — Setting the Context
You’ve seen the “You are a…” prefix in plenty of prompts. That’s role-based prompting — and it does more than you’d think.
When you assign a role, you’re tapping into a specific slice of the model’s knowledge. “You are a senior Python developer” gives you different code than “You are a data scientist.” The role shapes the tone, the word choices, and how deep the answer goes.
def ask_with_role(system_prompt, user_prompt, temperature=0.0):
"""Send a prompt with a system role."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
temperature=temperature,
)
return response.choices[0].message.content
This function splits the system message (the role) from the user message (the task). OpenAI’s API treats the system message as a backdrop that frames every response.
See how role context shifts the output:
question = "How should I handle missing data in my dataset?"
generic = ask_with_role(
"You are a helpful assistant.",
question
)
expert = ask_with_role(
"You are a senior data scientist at a Fortune 500 company. "
"Give practical, opinionated advice. Be direct.",
question
)
print("=== Generic ===")
print(generic[:200])
print("\n=== Expert ===")
print(expert[:200])
The plain assistant lists every option as if they’re all equal. The expert picks sides — it tells you what to do and warns you about common traps.
Key Insight: A sharp system prompt is like hiring a specialist instead of a jack-of-all-trades. The more precise the role, the more focused and useful the answer.
Note: System messages look different across providers. OpenAI uses a system role in the messages array. Anthropic’s Claude takes a separate system field. Google’s Gemini uses system_instruction. The idea is the same — set the tone up front — but the API calls differ. Check each provider’s docs.
What Makes a Good System Prompt
Bad system prompts are vague: “You are an expert.” Good ones are specific and have clear guardrails:
system_prompt = """You are a senior data engineer reviewing code
for a production ETL pipeline.
Rules:
- Flag code that won't scale past 1 million rows
- Suggest polars or duckdb when appropriate
- Use concrete numbers in performance estimates
- Be direct. Skip pleasantries."""
More rules means more predictable output. Rules don’t box the model in — they focus it.
With role context in place, let’s tackle tasks that need the model to reason through several steps.
Chain-of-Thought Prompting — Making the Model Think
Some problems can’t be solved in one jump. When you throw a multi-step math or logic question at an LLM, it sometimes leaps to the wrong answer. It tries to do everything in its head at once.
Chain-of-thought (CoT) prompting fixes this by telling the model to show its work — one step at a time. It’s the difference between asking someone to blurt out an answer and asking them to work through it on a whiteboard.
cot_prompt = """A farmer has 3 fields.
Field A: 120 bushels/acre, 5 acres.
Field B: 95 bushels/acre, 8 acres.
Field C: 110 bushels/acre, 3 acres.
Sells at $4.50/bushel, transport costs $1.20/bushel.
What is the total profit?
Think step by step. Show your work."""
result = ask_llm(cot_prompt)
print(result)
The model walks through each step: figure out each field’s output, add them up, work out the revenue, then subtract costs. Each step can be checked. If the final number is wrong, you can trace exactly where the logic broke.
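The chain can also be checked mechanically. Here's the same arithmetic in plain Python, so you can confirm the number the model should land on:

```python
# Verify the farmer problem by hand: yield per field, then net profit.
fields = {"A": (120, 5), "B": (95, 8), "C": (110, 3)}  # (bushels/acre, acres)
total_bushels = sum(rate * acres for rate, acres in fields.values())
revenue = total_bushels * 4.50    # sale price per bushel
transport = total_bushels * 1.20  # transport cost per bushel
profit = revenue - transport
print(f"{total_bushels} bushels, profit ${profit:,.2f}")
# 1690 bushels, profit $5,577.00
```

A unit-test like this next to a CoT prompt catches regressions when you change models or wording.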
Zero-Shot CoT — The Quick Version
You don’t always need to spell the steps out. Sometimes just adding “Let’s think step by step” does the job. This is zero-shot chain-of-thought.
simple = "What is 23 * 17 + 45 - 12 * 3?"
cot = simple + "\n\nLet's think step by step."
print(f"Direct: {ask_llm(simple)}")
print(f"CoT: {ask_llm(cot)}")
Kojima et al. (2022) showed that “Let’s think step by step” lifts reasoning scores across math, common sense, and logic tasks — with no examples needed.
Tip: Reach for chain-of-thought any time the task has more than one step. Calculations, multi-part decisions, debugging, logical puzzles — anything where the middle steps matter.
When CoT Doesn’t Help
Chain-of-thought isn’t free. It burns more tokens and takes longer. For simple labeling or data extraction, CoT adds overhead with no payoff. If a person could answer without writing anything down, you probably don’t need CoT either.
Predict the Output: You ask: “Is 97 a prime number? Let’s think step by step.” What does the model do? It checks whether 2, 3, 5, and 7 divide evenly into 97, finds none of them do, and says yes. The step-by-step layout makes the answer easy to verify.
Structured Output — JSON Mode and Beyond
Every method so far gives you free-form text. That’s fine when a person reads it. But when your code has to parse the response, free text is fragile.
Structured output makes the model reply in a fixed format — usually JSON. No filler, no intro, just clean data you can parse.
OpenAI gives you two paths: JSON mode (the simple one) and Structured Outputs (strict schema checks).
JSON Mode
JSON mode makes sure the response is valid JSON. Turn it on with response_format={"type": "json_object"} and describe the shape you want in the prompt.
json_prompt = """Extract product info from this review as JSON:
"I bought the Sony WH-1000XM5 headphones for $348.
The noise cancellation is the best I've tried.
Battery lasts about 30 hours."
Fields: product_name, brand, price, key_features (list),
rating_sentiment"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": json_prompt}],
response_format={"type": "json_object"},
temperature=0.0,
)
data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))
{
"product_name": "WH-1000XM5",
"brand": "Sony",
"price": 348,
"key_features": [
"Noise cancellation",
"30-hour battery life"
],
"rating_sentiment": "positive"
}
The response is always valid JSON. No intro text, no markdown wrapper, no “Here’s the JSON:” prefix. Just data you can feed straight into json.loads().
Structured Outputs with Pydantic
For production systems, JSON mode alone isn’t enough. You want the schema locked down — field names, types, and shape all checked for you.
OpenAI’s Structured Outputs feature does this. It works with GPT-4o and later models through client.beta.chat.completions.parse. (The .beta namespace signals that the SDK interface may still change in future versions.)
You define a Pydantic model, and the API makes sure the response fits it:
from pydantic import BaseModel
from typing import Optional
class ProductReview(BaseModel):
product_name: str
brand: str
price: float
currency: str
sentiment: str
key_features: list[str]
recommendation: bool
response = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": (
"Extract product info from: 'The Dyson V15 at $749 "
"is expensive but the laser dust detection is a "
"game-changer. Absolutely worth it.'"
)}
],
response_format=ProductReview,
)
review = response.choices[0].message.parsed
print(f"Product: {review.product_name}")
print(f"Price: {review.currency}{review.price}")
print(f"Sentiment: {review.sentiment}")
print(f"Recommend: {review.recommendation}")
Product: Dyson V15
Price: $749.0
Sentiment: positive
Recommend: True
Every field is there. Every type is right. The price is a float, not a string. The recommendation is a boolean, not the word “yes.” This is what solid LLM work looks like in the real world.
Warning: Always set temperature=0.0 when pulling out structured data. Higher values add noise that can shift field values between runs. For data work, you want the same answer every time.
Prompt Templates — Reusable, Testable Prompts
Hard-coded prompts spread across your codebase get messy fast. Once your prompts work, wrap them in templates you can reuse.
A prompt template is a string with blanks you fill in at runtime. Here’s one for pulling data out of text:
def create_extraction_prompt(text, fields):
"""Build a reusable extraction prompt."""
field_list = "\n".join(f"- {field}" for field in fields)
return f"""Extract information from the text below.
Fields to extract:
{field_list}
Rules:
- If a field is not found, use null
- Return valid JSON only
- No explanations or extra text
Text: "{text}"
"""
prompt = create_extraction_prompt(
text="John Smith, age 34, works at Google as a Senior "
"Engineer since 2019.",
fields=["name", "age", "company", "job_title", "start_year"]
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0.0,
)
data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))
{
"name": "John Smith",
"age": 34,
"company": "Google",
"job_title": "Senior Engineer",
"start_year": 2019
}
Swap the text and fields, and you have a new extraction pipeline — no prompt rewriting needed.
Tip: Version your prompt templates. Keep them in a config file or database, not buried in your app code. That way you can A/B test prompts without pushing a new deploy.
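Here's one minimal way to keep templates as versioned data rather than code. The dictionary layout and template names are illustrative, not a prescribed format; in practice this would live in a config file or database:

```python
# Versioned prompt templates kept as data. Bumping a version means
# adding a new entry, so old behavior stays reproducible.
PROMPT_TEMPLATES = {
    "extract_person_v1": "Extract name and age from: {text}",
    "extract_person_v2": (
        "Extract information from the text below.\n"
        "Fields: name, age\n"
        "Return valid JSON only. Use null for missing fields.\n"
        'Text: "{text}"'
    ),
}

def render_prompt(template_id, **kwargs):
    """Look up a template by id and fill in its blanks."""
    return PROMPT_TEMPLATES[template_id].format(**kwargs)

prompt = render_prompt("extract_person_v2", text="John Smith, age 34")
print(prompt)
```

To A/B test, you render v1 for one slice of traffic and v2 for another, and compare parse rates.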
Temperature and Parameters — How They Shape Responses
Temperature controls how much randomness the model uses when it writes. It interacts with your prompt in ways that aren’t always obvious.
At temperature=0.0, the model always picks the most likely next token. The output stays the same each time. At temperature=1.0, it explores more options and you get variety.
Here’s what that looks like in practice:
creative_prompt = (
"Write a one-sentence tagline for a coffee shop "
"called 'Midnight Brew'."
)
for temp in [0.0, 0.5, 1.0]:
result = ask_llm(creative_prompt, temperature=temp)
print(f"Temperature {temp}: {result}")
At 0.0, the tagline is the same every time. At 1.0, each call gives you something new. Neither is “better” — it depends on the job.
Choosing the Right Temperature
| Task Type | Temperature | Why |
|---|---|---|
| Data extraction | 0.0 | You need the same result every time |
| Classification | 0.0–0.3 | A little wiggle room for edge cases |
| Summarization | 0.3–0.5 | Lets the phrasing vary, but keeps facts tight |
| Creative writing | 0.7–1.0 | You want fresh word choices |
| Brainstorming | 0.9–1.2 | Max spread of ideas |
Warning: Don’t change temperature AND top_p at the same time. OpenAI’s docs say to adjust one or the other, not both. They control the same thing from different angles, and stacking them gives odd results.
Other Parameters Worth Knowing
max_tokens caps how long the response can be. Set it too low and the model cuts off mid-sentence. For structured extraction, 500–1000 tokens is plenty. For long-form content, go with 2000–4000.
top_p (nucleus sampling) is another way to dial randomness. It limits the model to the most likely tokens that add up to a target chance. top_p=0.1 means only the top 10% of the odds are on the table.
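To build intuition for what top_p does, here's a toy sketch of nucleus sampling over an invented four-token distribution (real models apply this over the full vocabulary at every generation step):

```python
def nucleus(token_probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p; everything else is excluded from sampling."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(nucleus(probs, 0.1))  # ['the'] -- only the single most likely token survives
print(nucleus(probs, 0.9))  # ['the', 'a', 'an'] -- the long-tail 'zebra' is cut off
```

Low top_p shrinks the pool to near-greedy decoding, which is why it overlaps so much with low temperature.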
Here’s the key link: a clear prompt with good examples gives steady output even at temperature 0.5. A vague prompt needs temperature 0.0 just to avoid wild swings.
Real-World Example: Building a Review Analyzer
Let’s pull it all together into a real example. We’ll build a review analyzer that uses role-based prompting, structured output, and clear rules — all in one function.
The analyze_review function below uses client.beta.chat.completions.parse (from the Structured Output section). It sends a system prompt (the role) and a user prompt (the review), then hands back a typed Python object:
from pydantic import BaseModel
class ReviewAnalysis(BaseModel):
sentiment: str
confidence: float
key_themes: list[str]
pros: list[str]
cons: list[str]
summary: str
action_items: list[str]
def analyze_review(review_text):
"""Analyze a product review with structured output."""
system_prompt = """You are a product analyst at an e-commerce
company specializing in customer feedback analysis.
Rules:
- sentiment: exactly positive, negative, mixed, or neutral
- confidence: float between 0.0 and 1.0
- key_themes: 2-5 recurring themes
- Be specific in pros/cons — quote the review
- summary: one sentence, max 20 words
- action_items: what should the product team do?"""
response = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Analyze:\n\n{review_text}"},
],
response_format=ReviewAnalysis,
temperature=0.0,
)
return response.choices[0].message.parsed
Three methods in one function: role-based prompting (the system prompt), structured output (the Pydantic schema), and clear rules (the constraint list).
Let’s test it:
sample_review = """
I've been using this standing desk for 6 months now. The motor
is whisper-quiet and the height presets are fantastic. Build
quality feels solid — no wobble even at max height.
However, the cable management tray is too small for a full
setup, and the desktop surface scratches easily. Customer
support took 2 weeks to respond to a warranty question.
Overall worth the $599 price tag, but not perfect.
"""
analysis = analyze_review(sample_review)
print(f"Sentiment: {analysis.sentiment} ({analysis.confidence})")
print(f"Themes: {', '.join(analysis.key_themes)}")
print(f"\nPros:")
for pro in analysis.pros:
print(f" + {pro}")
print(f"\nCons:")
for con in analysis.cons:
print(f" - {con}")
print(f"\nSummary: {analysis.summary}")
print(f"\nAction items:")
for item in analysis.action_items:
print(f" -> {item}")
Every response has the same shape. Same fields, same types, same layout. You can pipe this straight into a database or dashboard — no parsing glue needed.
That’s the gap between prompt engineering as a hobby and as a real skill. The hobby version gets answers. The skill version gets answers that are steady, parseable, and testable.
Which Technique Should You Use?
With five techniques on the table, picking the right one matters. Here’s a quick decision guide:
Start here: Can the model already do this well on its own?
– Yes → Zero-shot. Don’t pile on methods you don’t need.
– No → Keep reading.
Does the output format matter?
– Yes, I need the same shape every time → Few-shot (show examples) or Structured Output (lock down the schema)
– No, free text is fine → Skip to the reasoning check
Does the task need multi-step thinking?
– Yes → Chain-of-thought. Add “Think step by step” or list the steps yourself.
– No → Skip to the domain check
Does the task need deep or niche knowledge?
– Yes → Role-based prompting with a tailored system prompt
– No → Zero-shot with clear wording should do it
Will code — not a person — read the output?
– Yes → Structured Output with Pydantic. No debate for production work.
– No → Any text-based method works
Most real prompts mix 2–3 of these. The review analyzer above used role-based + structured output + rules. A tricky analysis might pair role-based + chain-of-thought + few-shot. Start simple, then layer on more when the basic version falls short.
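The guide above can be condensed into a small checklist function. This is just the decision flow as code, with invented names, not a real rule engine:

```python
def choose_techniques(machine_readable=False, fixed_format=False,
                      multi_step=False, niche_domain=False):
    """Map the decision guide's questions to a list of techniques.

    Real prompts usually combine everything this returns.
    """
    techniques = []
    if machine_readable:
        techniques.append("structured output")  # code reads it: no debate
    elif fixed_format:
        techniques.append("few-shot")           # show the shape you want
    if multi_step:
        techniques.append("chain-of-thought")
    if niche_domain:
        techniques.append("role-based")
    return techniques or ["zero-shot"]          # default: keep it simple

print(choose_techniques())                                  # ['zero-shot']
print(choose_techniques(machine_readable=True,
                        niche_domain=True))                 # ['structured output', 'role-based']
```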
Common Prompt Engineering Mistakes
Some mistakes show up again and again in LLM-powered apps. These aren’t style choices — they’re bugs that break things.
Mistake 1: Being Vague About Output Format
Wrong:
bad = "Tell me about the planets in our solar system."
result = ask_llm(bad)
print(result[:150])
You get a wall of text. Maybe numbered, maybe paragraphs, maybe bullets. If your code expects a table, this falls apart.
Correct:
good = """List the 8 planets in order from the Sun.
For each: name, type (rocky/gas/ice giant), moon count.
Format as a markdown table."""
result = ask_llm(good)
print(result)
Say exactly what format you want. The model follows what you tell it — but only what you tell it.
Mistake 2: Stuffing Too Many Tasks into One Prompt
Wrong:
overloaded = """Analyze this review, extract the product name,
determine sentiment, identify issues, suggest a response,
and rate quality 1-10.
Review: 'The headphones sound great but the cushions wore
out after 3 months.'"""
Five tasks at once means weaker results on all of them. The model juggles everything and does nothing well.
Correct: One task per prompt. Chain the results if you need to.
# Step 1: Extract and classify
extract = """Extract from this review as JSON:
product_type, sentiment, issues (list).
Review: 'The headphones sound great but the cushions
wore out after 3 months.'"""
# Step 2 would use Step 1's output for the response
Mistake 3: Not Giving the Model an Out
When the answer isn’t in the text you provide, the model makes one up. That’s called a hallucination — and you can stop it.
Wrong:
no_escape = (
"What is the CEO's favorite color? Context: "
"'Acme Corp, founded in 2015, makes widgets.'"
)
Correct:
with_escape = """Answer ONLY from the context below.
If the answer isn't in the context, say "Not found."
Context: 'Acme Corp, founded in 2015, makes widgets.'
Question: What is the CEO's favorite color?"""
result = ask_llm(with_escape)
print(result)
Not found.
Always let the model say “I don’t know.” It cuts made-up answers way down.
Mistake 4: Ignoring Token Limits and Position
Every model has a context window. GPT-4o and GPT-4o-mini both handle 128,000 tokens. But longer prompts are slower, cost more, and lose focus.
Models tend to skim over content buried in the middle of a long prompt. They focus most on the start and the end.
Key Insight: Put your key rules at the top and bottom of long prompts. The model gives these spots the most weight. Format rules belong up front, not buried in the middle.
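One way to apply this is a sandwich pattern: repeat the rules before and after the long context. A sketch with invented names, assuming a plain-text context:

```python
def build_long_prompt(rules, context):
    """Place the rules at both the start and the end of a long prompt,
    the positions models weight most heavily."""
    rule_block = "\n".join(f"- {rule}" for rule in rules)
    return (
        f"Rules:\n{rule_block}\n\n"
        f"Context:\n{context}\n\n"
        f"Remember the rules:\n{rule_block}"
    )

prompt = build_long_prompt(
    ["Return valid JSON only", "Use null for missing fields"],
    "(imagine thousands of tokens of document text here)",
)
print(prompt)
```

The repetition costs a few dozen tokens; losing your format rules in the middle of a 50,000-token prompt costs a lot more.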
Summary
Prompt engineering boils down to five core techniques:
| Technique | When to Use | Key Benefit |
|---|---|---|
| Zero-shot | Simple, clear tasks | No setup needed |
| Few-shot | You need a fixed output format | Same shape every time |
| Role-based | The task needs domain expertise | Deeper, more targeted answers |
| Chain-of-thought | Multi-step reasoning | Steps you can check |
| Structured output | Output feeds into code | Format is locked down |
Start with zero-shot. If the output drifts, add few-shot examples. If reasoning matters, add chain-of-thought. If code reads the output, use structured output. These methods layer — they work well together.
In the next article in this learning path, we’ll put these prompt skills to work by building our first LLM-powered app with LangChain.
Practice Exercise
Challenge: Build a multi-technique prompt pipeline
Task: Create a function that takes a raw job posting and extracts structured data using at least three techniques.
The function should:
1. Use a role-based system prompt (senior recruiter)
2. Use structured output (Pydantic model)
3. Include chain-of-thought for salary guessing when no salary is listed
Solution:
from pydantic import BaseModel
from typing import Optional
class JobPosting(BaseModel):
title: str
company: str
location: str
salary_range: Optional[str]
required_skills: list[str]
experience_level: str
remote_friendly: bool
def extract_job_info(posting_text):
system = """You are a senior technical recruiter with
15 years of experience.
When salary isn't stated, estimate a range based on:
1. Job title and seniority
2. Location (cost of living)
3. Required skills (specialized = higher pay)
Mark estimates with '(estimated)'.
experience_level: entry, mid, senior, or lead."""
response = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": f"Extract:\n\n{posting_text}"},
],
response_format=JobPosting,
temperature=0.0,
)
return response.choices[0].message.parsed
job = """
Senior ML Engineer at DataCorp (San Francisco, hybrid)
Must have: Python, PyTorch, MLOps, 5+ years experience
Nice to have: Kubernetes, Ray, Spark
"""
info = extract_job_info(job)
print(f"Title: {info.title}")
print(f"Company: {info.company}")
print(f"Salary: {info.salary_range}")
print(f"Skills: {', '.join(info.required_skills)}")
print(f"Level: {info.experience_level}")
This combines role-based prompting, chain-of-thought (the salary logic), and structured output into one reusable function.
Frequently Asked Questions
Does prompt engineering work the same across different LLMs?
The big ideas — be clear, give examples, structure your input — carry across all LLMs. But the details shift. GPT-4o handles long system prompts better than smaller models. Claude tends to follow format rules more closely. Open-source models like Llama need more hand-holding with examples. Always test on the model you plan to ship with.
How do I fix prompts that fail on edge cases?
Add the failing cases as few-shot examples. That’s the fastest fix. If the same type of input keeps tripping the model, add a rule that covers it. For production, always check the output — make sure the JSON parses, the right fields exist, and values make sense. If a check fails, retry.
# Example: Retry logic for structured output
import json
def safe_extract(prompt, max_retries=3):
for attempt in range(max_retries):
result = ask_llm(prompt)
try:
data = json.loads(result)
if "name" in data and "age" in data:
return data
except json.JSONDecodeError:
continue
return None
What’s the difference between JSON mode and Structured Outputs?
JSON mode makes sure the response is valid JSON, but doesn’t check the shape. The model might return {"answer": "yes"} or {"result": true} — both valid JSON, but your code expects one layout. Structured Outputs with Pydantic locks down the JSON and the schema. Every field name and type matches what you defined. For production, always pick Structured Outputs.
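The difference is easy to demonstrate locally. Both payloads below are valid JSON, but only one has the layout your code expects. This stdlib sketch does by hand what Structured Outputs enforces server-side; the helper name and schema are our own:

```python
import json

EXPECTED = {"answer": str}  # field name -> required Python type

def matches_schema(payload, expected):
    """Check that a parsed JSON object has exactly the expected
    field names and types -- the part JSON mode alone never verifies."""
    data = json.loads(payload)
    return (set(data) == set(expected)
            and all(isinstance(data[k], t) for k, t in expected.items()))

print(matches_schema('{"answer": "yes"}', EXPECTED))  # True
print(matches_schema('{"result": true}', EXPECTED))   # False -- valid JSON, wrong shape
```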
Will prompt engineering become obsolete as models get smarter?
Models keep getting better at reading vague prompts, but good prompts solve engineering problems — not just model gaps. Schema rules, role context, and step-by-step logic give you a level of trust that vague prompts never will. Even the best 2026 models do better with a clear, well-built prompt.
References
- OpenAI — Prompt engineering guide. Link
- OpenAI — Structured Outputs documentation. Link
- Kojima, T., et al. — “Large Language Models are Zero-Shot Reasoners.” NeurIPS 2022. Link
- Wei, J., et al. — “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. Link
- Brown, T., et al. — “Language Models are Few-Shot Learners.” NeurIPS 2020. Link
- DAIR.AI — Prompt Engineering Guide. Link
- OpenAI — Chat Completions API reference. Link
- OpenAI — Best practices for prompt engineering. Link
- Anthropic — Prompt engineering documentation. Link
- Google — Gemini API prompting guide. Link
Complete Code
# Complete code from: Prompt Engineering Fundamentals
# Requires: pip install openai pydantic
# Python 3.9+
# Set OPENAI_API_KEY environment variable before running
from openai import OpenAI
import json
import os
from pydantic import BaseModel
from typing import Optional
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# --- Helper Functions ---
def ask_llm(prompt, model="gpt-4o-mini", temperature=0.0):
"""Send a prompt to the LLM and return the response."""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
)
return response.choices[0].message.content
def ask_with_role(system_prompt, user_prompt, temperature=0.0):
"""Send a prompt with a system role."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
temperature=temperature,
)
return response.choices[0].message.content
# --- Zero-Shot ---
result = ask_llm(
"Classify the sentiment as positive, negative, or neutral: "
"'The food was okay but the service was terrible.'"
)
print("Zero-shot:", result)
# --- Few-Shot ---
few_shot_prompt = """Classify the sentiment and confidence.
Use exactly this format:
Sentiment: [positive/negative/neutral]
Confidence: [high/medium/low]
Text: "This is the best phone I've ever owned!"
Sentiment: positive
Confidence: high
Text: "The battery life is decent but nothing special."
Sentiment: neutral
Confidence: medium
Text: "Broke after two days. Complete waste of money."
Sentiment: negative
Confidence: high
Text: "The camera quality is amazing but it overheats."
"""
result = ask_llm(few_shot_prompt)
print("\nFew-shot:", result)
# --- Role-Based ---
expert = ask_with_role(
"You are a senior data scientist. Give practical advice.",
"How should I handle missing data in my dataset?"
)
print("\nExpert:", expert[:200])
# --- Chain-of-Thought ---
cot = """A farmer has 3 fields.
Field A: 120 bushels/acre, 5 acres.
Field B: 95 bushels/acre, 8 acres.
Field C: 110 bushels/acre, 3 acres.
Sells at $4.50/bushel, transport costs $1.20/bushel.
What is the total profit? Think step by step."""
print("\nCoT:", ask_llm(cot))
# --- JSON Mode ---
json_prompt = """Extract product info as JSON:
"I bought the Sony WH-1000XM5 headphones for $348.
Noise cancellation is incredible. Battery lasts 30 hours."
Fields: product_name, brand, price, key_features, sentiment"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": json_prompt}],
response_format={"type": "json_object"},
temperature=0.0,
)
print("\nJSON:", json.dumps(json.loads(
response.choices[0].message.content), indent=2))
# --- Structured Output ---
class ReviewAnalysis(BaseModel):
sentiment: str
confidence: float
key_themes: list[str]
pros: list[str]
cons: list[str]
summary: str
action_items: list[str]
def analyze_review(review_text):
system = """You are a product analyst.
Rules:
- sentiment: positive, negative, mixed, or neutral
- confidence: float 0.0 to 1.0
- summary: one sentence, max 20 words"""
response = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": f"Analyze:\n\n{review_text}"},
],
response_format=ReviewAnalysis,
temperature=0.0,
)
return response.choices[0].message.parsed
review = """Standing desk, 6 months in. Motor is quiet, presets
are great. But cable tray is too small and surface scratches.
Worth $599 but not perfect."""
a = analyze_review(review)
print(f"\nAnalysis: {a.sentiment} ({a.confidence})")
print(f"Summary: {a.summary}")
print("\nScript completed successfully.")