
Zero-Shot vs Few-Shot Prompting: Complete Guide


Written by Selva Prabhakaran | 28 min read

Build a text classifier that goes from ~70% to 90%+ accuracy — with no training data at all.


You need to sort 500 support tickets into groups. The old way? Label hundreds of rows, train a model, tune it, test it, retrain it. That’s two weeks of work before you see one result.

Or you write a prompt. Add three good examples. And the LLM sorts all 500 tickets in minutes — with over 90% accuracy. No labeled data. No model training. Just prompt craft.

That’s the gap between zero-shot and few-shot prompting. In this piece, you’ll build a text classifier step by step. We start with a bare zero-shot prompt (~70% right), add one example (one-shot), then add three examples (few-shot) and watch the score climb past 90%. You’ll also see why the examples you pick — and the order you put them in — matter more than most people think.

What Are Zero-Shot, One-Shot, and Few-Shot Prompting?

These three terms tell you how many examples you put in your prompt before asking the model to do a task.

Think of it like this. You hand a task to a new hire. You could say “sort these emails” and hope for the best (zero-shot). You could show one sorted email first (one-shot). Or you could show three sorted emails from different groups (few-shot). More examples, clearer pattern.

Approach  | Examples in Prompt | Best For
Zero-shot | 0                  | Simple tasks the model knows well
One-shot  | 1                  | Tasks where format matters most
Few-shot  | 2-5                | Sorting, tagging, or any task that needs steady output

Let’s set up our code. We’ll use raw HTTP calls — no SDK needed — so this runs anywhere.

import os
import json

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/chat/completions"
MODEL = "gpt-4o-mini"

This sets up our config. You’ll need an OpenAI API key saved as OPENAI_API_KEY. Make one at platform.openai.com/api-keys if you don’t have one yet.

Next, our helper function. It sends a prompt to the API and gives back the text reply. We use urllib from the standard library — nothing to install.

def ask_llm(prompt, temperature=0.0):
    """Send a prompt to OpenAI and return the text reply."""
    import urllib.request

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {OPENAI_API_KEY}",
    }
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    })

    req = urllib.request.Request(
        API_URL, data=body.encode(), headers=headers
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode())

    return data["choices"][0]["message"]["content"]

We set temperature=0.0 so we get the same answer each time we run the same prompt. You can’t test a classifier if the answers keep changing.

How to Get Your API Key

  • Go to platform.openai.com/api-keys
  • Click “Create new secret key”
  • Copy the key and set it in your shell: export OPENAI_API_KEY="sk-your-key-here"
  • The full tutorial costs under $0.05 with gpt-4o-mini

What You’ll Need

  • Python: 3.9+
  • Libraries: None (uses built-in urllib)
  • API key: OpenAI (see above)
  • Time: 25-30 minutes

Our Test Data — 20 Support Tickets

Before we test any prompt style, we need data with known right answers. That way we can measure accuracy — not just guess.

We’ll make 20 support tickets in 4 groups: billing, technical, account, and general. Each ticket has a text and a label (the true answer).

test_tickets = [
    {"text": "I was charged twice for my plan this month", "label": "billing"},
    {"text": "The app crashes when I try to upload a file", "label": "technical"},
    {"text": "How do I change my email on my profile?", "label": "account"},
    {"text": "What are your work hours?", "label": "general"},
    {"text": "I need a refund for the yearly plan I just bought", "label": "billing"},
    {"text": "The PDF export gives a blank page", "label": "technical"},
    {"text": "Can I move my account to a new team?", "label": "account"},
    {"text": "Do you offer student pricing?", "label": "general"},
    {"text": "My card was declined but I still got charged", "label": "billing"},
    {"text": "Search gives no results for exact matches", "label": "technical"},
    {"text": "I forgot my password and the reset email never came", "label": "account"},
    {"text": "Where is your API docs page?", "label": "general"},
    {"text": "Please cancel my plan right now", "label": "billing"},
    {"text": "Photos don't load on the phone version", "label": "technical"},
    {"text": "How do I turn on two-step login?", "label": "account"},
    {"text": "Can I reach support by phone?", "label": "general"},
    {"text": "I was billed for a tool I never turned on", "label": "billing"},
    {"text": "The main page takes 30 seconds to load", "label": "technical"},
    {"text": "I need to change the billing address on file", "label": "account"},
    {"text": "What tools do you plug into?", "label": "general"},
]

print(f"Test set: {len(test_tickets)} tickets")
print(f"Groups: {sorted(set(t['label'] for t in test_tickets))}")

Output:

Test set: 20 tickets
Groups: ['account', 'billing', 'general', 'technical']

Five tickets per group. Balanced across all four classes. Small enough to run cheaply, big enough to show real gaps between prompt styles.

Key Insight: Always build a labeled test set before you compare prompt styles. Even 20 labeled rows lets you measure accuracy. Without ground truth, you’re just guessing which approach works better.
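A quick balance check on that test set is cheap insurance before you compare prompts. This is a minimal sketch with a stand-in list (placeholder rows, not the article's 20 tickets); run the same two checks against your real labeled data:

```python
from collections import Counter

# Stand-in for a labeled test set
tickets = [
    {"text": "I was charged twice", "label": "billing"},
    {"text": "App crashes on upload", "label": "technical"},
    {"text": "Change my email", "label": "account"},
    {"text": "What are your hours?", "label": "general"},
    {"text": "Refund my yearly plan", "label": "billing"},
]

# Count rows per label
counts = Counter(t["label"] for t in tickets)
print(dict(counts))  # prints: {'billing': 2, 'technical': 1, 'account': 1, 'general': 1}

# Flag a skewed set: largest group more than twice the smallest
if max(counts.values()) > 2 * min(counts.values()):
    print("Warning: test set is imbalanced")
```

If one group dominates, accuracy alone will mislead you: a classifier that always answers the biggest group still scores well.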

Zero-Shot — The Baseline

Zero-shot means you tell the model what to do but give zero examples. It relies on what it learned during training.

Here’s a zero-shot version for our tickets. We list the groups and ask the model to pick one.

def classify_zero_shot(text):
    """Sort a ticket with no examples."""
    prompt = f"""Sort this support ticket into one group.

Groups: billing, technical, account, general

Ticket: {text}

Reply with only the group name. Nothing else."""

    return ask_llm(prompt).strip().lower()

Short and direct. Here are the groups, here’s the ticket, pick one. No hints about what each group means.

Now let’s test it on all 20 tickets. This helper runs the classifier, checks each guess, and prints the score.

def evaluate(classify_fn, tickets, label="Classifier"):
    """Run a classifier on all tickets. Print the score."""
    correct = 0
    results = []

    for ticket in tickets:
        guess = classify_fn(ticket["text"])
        hit = guess == ticket["label"]
        correct += int(hit)
        results.append({
            "text": ticket["text"][:50] + "...",
            "true": ticket["label"],
            "guess": guess,
            "hit": hit,
        })

    score = correct / len(tickets) * 100
    print(f"\n{label}")
    print(f"Score: {correct}/{len(tickets)} ({score:.0f}%)")

    misses = [r for r in results if not r["hit"]]
    if misses:
        print(f"\nWrong ({len(misses)}):")
        for m in misses:
            print(f"  '{m['text']}' -> {m['guess']} (should be {m['true']})")

    return score, results

zero_acc, zero_res = evaluate(
    classify_zero_shot, test_tickets, "Zero-Shot"
)

You’ll likely see a score around 70-85%. Not bad for no effort — but not great, either.

Look at the wrong ones. You’ll spot a trend: the model mixes up general and account on tickets like “Where is your API docs page?” It can’t tell if that’s a broad question or an account access issue. That’s what examples fix.

Tip: Start every sorting task with zero-shot. If the score is already above 95%, don’t bother with examples. Only add more when you need to.

One-Shot — One Example Changes a Lot

One-shot adds a single example to the prompt. That one sample does two things: it shows the model your rules, and it locks the output shape.

Watch how we tweak the prompt. We put one labeled ticket before the real one.

def classify_one_shot(text):
    """Sort a ticket with one example."""
    prompt = f"""Sort this support ticket into one group.

Groups: billing, technical, account, general

Example:
Ticket: "I was overcharged on my last bill"
Group: billing

Now sort this ticket:
Ticket: "{text}"
Group:"""

    return ask_llm(prompt).strip().lower()

One change: we added a sample showing that bill-related gripes go to billing. That one line gives the model a clear feel for the task.

one_acc, one_res = evaluate(
    classify_one_shot, test_tickets, "One-Shot"
)
print(f"\nGain over zero-shot: {one_acc - zero_acc:+.0f}%")

You’ll often see a 5-15% jump. That one sample isn’t just teaching about billing. It’s teaching how to reply. The format is now clear: read the ticket, write one word. No fluff, no hedge.

But one sample has limits. The model still doesn’t know where you draw the line between account and general. It has one data point for billing and zero for the rest.

Few-Shot — Breaking Past 90%

Few-shot uses several examples — usually 2-5 — to show the model your sorting rules through demos. Each example is a mini lesson baked right into the prompt.

I like three examples for most sorting tasks. Two isn’t quite enough to cover tricky cases, and more than five starts eating tokens without much gain.

Here’s the few-shot version. We show one example per group so the model sees each class.

def classify_few_shot(text):
    """Sort a ticket with three examples."""
    prompt = f"""Sort this support ticket into one group.

Groups: billing, technical, account, general

Examples:
Ticket: "I was overcharged on my last bill"
Group: billing

Ticket: "The app freezes when I click export"
Group: technical

Ticket: "How do I reset my password?"
Group: account

Now sort this ticket:
Ticket: "{text}"
Group:"""

    return ask_llm(prompt).strip().lower()

Three samples, one per group (we left out general on purpose — the model handles those fine on its own). Each sample is a clear, clean case for its group.

few_acc, few_res = evaluate(
    classify_few_shot, test_tickets, "Few-Shot (3 examples)"
)
print(f"\nGain over zero-shot: {few_acc - zero_acc:+.0f}%")
print(f"Gain over one-shot:  {few_acc - one_acc:+.0f}%")

Few-shot usually lands at 90-100%. The leap from one-shot is often big because the model now knows what billing looks like, what technical looks like, AND what account looks like. It can cross-check.

Let’s see all three side by side.

print("=" * 45)
print("SCORE CARD")
print("=" * 45)
print(f"Zero-shot:  {zero_acc:.0f}%")
print(f"One-shot:   {one_acc:.0f}%")
print(f"Few-shot:   {few_acc:.0f}%")
print("=" * 45)

Key Insight: Few-shot works because examples show borders between groups. The model doesn’t just learn what each group *is* — it learns what each group *isn’t*. One example per tricky group beats five examples from one group.

Which Examples You Pick Matters More Than You’d Think

Not all examples work the same. Good picks can push you from 85% to 95%. Bad picks can drop you below the zero-shot score.

Three rules for strong examples:

  1. Cover every group — at least one sample per class
  2. Pick border cases — cases near the line between two groups teach the most
  3. Use varied samples — don’t pick three that all sound the same

Let’s test this. We’ll make two sets: one with easy, clear examples and one with tricky, border-line cases.

def classify_easy_examples(text):
    """Few-shot with very clear, easy examples."""
    prompt = f"""Sort this ticket: billing, technical, account, general

Examples:
Ticket: "I want a refund"
Group: billing

Ticket: "The site is down"
Group: technical

Ticket: "Delete my account"
Group: account

Ticket: "{text}"
Group:"""

    return ask_llm(prompt).strip().lower()

Those are too easy. “I want a refund” is clearly billing. The model knew that already. These samples don’t add new info.

Now compare with border-line picks — cases where a person might pause for a beat.

def classify_border_examples(text):
    """Few-shot with tricky, border-line examples."""
    prompt = f"""Sort this ticket: billing, technical, account, general

Examples:
Ticket: "I need to change the card linked to my profile"
Group: billing

Ticket: "The pay page times out when I try to check out"
Group: technical

Ticket: "Can I merge my two accounts into one?"
Group: account

Ticket: "{text}"
Group:"""

    return ask_llm(prompt).strip().lower()

“Change the card linked to my profile” — that sits on the line between billing and account. By tagging it billing, you teach the model: card stuff goes to billing, even when it mentions a profile.

“Pay page times out” — that mixes money (billing?) with a broken page (technical). By tagging it technical, you teach: system failures go to tech, no matter the context.

easy_acc, _ = evaluate(
    classify_easy_examples, test_tickets, "Easy Examples"
)
border_acc, _ = evaluate(
    classify_border_examples, test_tickets, "Border-Line Examples"
)
print(f"\nEasy picks:      {easy_acc:.0f}%")
print(f"Border picks:    {border_acc:.0f}%")
print(f"Gap:             {border_acc - easy_acc:+.0f}%")

Border cases almost always win. Easy ones tell the model what it knows. Border cases teach it where your lines are drawn.

Warning: Bad examples do real harm. If your labels are wrong or clash with each other, the model gets confused and does *worse* than zero-shot. Always double-check your example labels.

Exercise 1: Pick Better Examples (intermediate)

The current few-shot setup below uses weak examples. Swap in better ones that help split account from general (the pair most often confused). Keep one example each for billing, account, and general. Run it on the test ticket to check your picks. The test ticket “Where is your API docs page?” should come back as general, not account.

Starter code:

# Replace these examples with better ones
ex_billing = "I want a refund"  # Keep as billing
ex_account = "___"  # Pick a clear account action
ex_general = "___"  # Pick a clear broad question

def my_sorter(text):
    prompt = f"""Sort this ticket: billing, technical, account, general

Examples:
Ticket: "{ex_billing}"
Group: billing

Ticket: "{ex_account}"
Group: account

Ticket: "{ex_general}"
Group: general

Ticket: "{text}"
Group:"""
    return ask_llm(prompt).strip().lower()

result = my_sorter("Where is your API docs page?")
print(result)

Hints:

  • For account, pick something about changing your own settings — things tied to a logged-in user. For general, pick a broad question any visitor might ask.
  • Try: ex_account = "How do I change my login email?" and ex_general = "What coding languages does your API support?"

Solution:

ex_billing = "I want a refund"
ex_account = "How do I change my login email?"
ex_general = "What coding languages does your API support?"

(Then define my_sorter as in the starter code and run it on the test ticket; it should print general.)

Why it works: Account examples should be about personal settings (email, password, profile) — things you do while logged in. General examples should be broad questions with no tie to a specific user. This line helps the model split “questions about the product” (general) from “actions on my account” (account).

Example Order — Sequence Shifts Results

Here’s a fact that trips up most people: the order of your examples can change the model’s answers. This isn’t a bug. It’s how LLMs read text.

LLMs give more weight to the examples near the end of the prompt. This is called recency bias. The last example in your list has the most pull on the output. The first example has some pull too (primacy). The middle gets the least.

Let’s test this. Same three examples, two different orders.

def classify_billing_last(text):
    """Billing example at the end."""
    prompt = f"""Sort this ticket: billing, technical, account, general

Ticket: "The app freezes when I click export"
Group: technical

Ticket: "How do I reset my password?"
Group: account

Ticket: "I was overcharged on my last bill"
Group: billing

Ticket: "{text}"
Group:"""
    return ask_llm(prompt).strip().lower()


def classify_tech_last(text):
    """Technical example at the end."""
    prompt = f"""Sort this ticket: billing, technical, account, general

Ticket: "I was overcharged on my last bill"
Group: billing

Ticket: "How do I reset my password?"
Group: account

Ticket: "The app freezes when I click export"
Group: technical

Ticket: "{text}"
Group:"""
    return ask_llm(prompt).strip().lower()

Same samples. Just shuffled. Let’s see if it matters.

bill_last_acc, _ = evaluate(
    classify_billing_last, test_tickets, "Billing Last"
)
tech_last_acc, _ = evaluate(
    classify_tech_last, test_tickets, "Technical Last"
)

print(f"\nBilling last:   {bill_last_acc:.0f}%")
print(f"Technical last: {tech_last_acc:.0f}%")
print(f"Gap:            {abs(bill_last_acc - tech_last_acc):.0f}%")

The gap from order alone is often 0-10%. Sounds small, but on edge cases it can flip the call.

Tip: Put your hardest group last. If `billing` and `account` are your trickiest split, put a clear billing sample last. The model leans on recent context.

Practical rules I follow for ordering:

  • Hardest group last — the one most often mixed up with others
  • Easiest group first — it anchors the model’s sense of the task
  • Space out similar groups — don’t put billing and account back to back
  • When in doubt, shuffle and vote — run three times with shuffled order, take the most common answer
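The “shuffle and vote” rule above is easy to wrap in a helper. This is a sketch, not code from the article: `build_and_ask` stands in for any function that builds a prompt from an ordered example list and calls the model, and the stub below fakes that call so the voting logic can be seen working offline:

```python
import random
from collections import Counter

def shuffle_and_vote(text, examples, build_and_ask, runs=3, seed=0):
    """Classify `text` several times with shuffled example order,
    then return the most common answer (majority vote)."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    votes = []
    for _ in range(runs):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        votes.append(build_and_ask(text, shuffled))
    return Counter(votes).most_common(1)[0][0]

# Stub in place of a real LLM call: flips its answer based on which
# example comes last, mimicking recency bias.
def fake_ask(text, examples):
    return "billing" if examples[-1]["group"] == "billing" else "account"

demo_examples = [
    {"text": "I was overcharged", "group": "billing"},
    {"text": "Reset my password", "group": "account"},
    {"text": "App crashes on export", "group": "technical"},
]
print(shuffle_and_vote("Change the card on my profile", demo_examples, fake_ask))
```

With three runs and two possible answers a tie is impossible, so the vote always resolves. Swap `fake_ask` for a small wrapper around `ask_llm` to use it for real.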

A Reusable Few-Shot Classifier

We’ve tested each idea on its own. Now let’s wrap it all into one clean function. It takes any set of groups, any set of examples, and sorts any text.

The key design choice: we pass groups and examples as data, not as hardcoded strings. That means you can reuse this for a whole new task just by swapping the inputs.

def few_shot_classify(text, groups, examples):
    """
    Reusable few-shot sorter.

    Args:
        text: The text to sort
        groups: List of group names
        examples: List of dicts with 'text' and 'group' keys

    Returns:
        The predicted group (string)
    """
    groups_str = ", ".join(groups)

    samples = ""
    for ex in examples:
        samples += f'\nTicket: "{ex["text"]}"\nGroup: {ex["group"]}\n'

    prompt = f"""Sort this text into one group.

Groups: {groups_str}

Examples:{samples}
Now sort:
Ticket: "{text}"
Group:"""

    return ask_llm(prompt, temperature=0.0).strip().lower()

This is the function you’ll want to keep. It splits the prompt logic from the task data. Today it sorts support tickets. Next week it could sort job posts, product reviews, or bug reports — same function, new inputs.

Let’s test it with our support tickets.

my_groups = ["billing", "technical", "account", "general"]

my_examples = [
    {"text": "I need to change the card linked to my profile", "group": "billing"},
    {"text": "The pay page times out when I try to check out", "group": "technical"},
    {"text": "Can I merge my two accounts into one?", "group": "account"},
]

sample_texts = [
    "I was charged twice for my plan this month",
    "Search gives no results for exact matches",
    "Where is your API docs page?",
]

for text in sample_texts:
    result = few_shot_classify(text, my_groups, my_examples)
    print(f"'{text[:50]}...' -> {result}")

Now the full test run.

def classify_reuse(text):
    return few_shot_classify(text, my_groups, my_examples)

reuse_acc, _ = evaluate(
    classify_reuse, test_tickets, "Reusable Few-Shot Sorter"
)

Key Insight: Keep your prompt logic and task data apart. A reusable sorter function lets you test new examples, new orders, and new groups without touching the prompt code.

Exercise 2: Build a Sentiment Classifier (intermediate)

Use the few_shot_classify function to build a sentiment sorter. Set up three groups (positive, negative, neutral) and give one example per group. Then sort the test line: “The product works fine but the box was crushed.” Print just the predicted group.

Starter code:

# Set up groups and examples for sentiment
my_sent_groups = ["positive", "negative", "neutral"]

my_sent_examples = [
    {"text": "___", "group": "positive"},
    {"text": "___", "group": "negative"},
    {"text": "___", "group": "neutral"},
]

test_line = "The product works fine but the box was crushed."
result = few_shot_classify(test_line, my_sent_groups, my_sent_examples)
print(result)

Hint: Pick one strong sample per mood. Positive: “Best thing I ever bought!” Negative: “Broke on day one, total waste.” Neutral: “It came on the date they said.”

Solution:

my_sent_groups = ["positive", "negative", "neutral"]

my_sent_examples = [
    {"text": "Best thing I ever bought!", "group": "positive"},
    {"text": "Broke on day one, total waste", "group": "negative"},
    {"text": "It came on the date they said", "group": "neutral"},
]

test_line = "The product works fine but the box was crushed."
result = few_shot_classify(test_line, my_sent_groups, my_sent_examples)
print(result)

Why it works: One strong sample per mood. The test line mixes good (“works fine”) and bad (“box was crushed”), so it sits on the line between neutral and negative. That makes it a solid test for how well your examples teach the model.

When NOT to Use Few-Shot Prompting

Few-shot isn’t the right tool every time. Here’s when to skip it.

Zero-shot is enough. If the task is simple and the model nails it at 95%+ with no examples, adding them just wastes tokens and money. Things like “Is this English or French?” don’t need demos.

You need near-perfect results. Few-shot tops out around 92-96% on hard tasks. If you need 99%+ — think medical sorting, legal tags, fraud flags — you need a fine-tuned model or classic ML trained on real data.

Your groups keep changing. If new groups pop up each week, keeping example sets fresh becomes a chore. Try zero-shot with a second pass for edge cases.

Case                    | Best Tool               | Why
Simple task, few groups | Zero-shot               | Examples add cost, not accuracy
85-95% accuracy needed  | Few-shot (3-5 examples) | Best bang for the buck
99%+ accuracy needed    | Fine-tuned model        | Few-shot has a ceiling
Groups change often     | Zero-shot + check step  | Example upkeep is too costly
Very long input text    | Zero-shot (save tokens) | Examples eat the context window

Warning: Watch your token bill. Each example adds tokens to *every single* API call. Five examples at 30 tokens each adds 150 tokens per request. At one million calls, that’s 150 million extra tokens. For high-volume work, think about fine-tuning to cut per-call cost.
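That arithmetic generalizes to a one-line estimate. The per-million-token price below is only an illustrative figure, not a real quote; substitute your model’s current input rate:

```python
def extra_example_cost(n_examples, tokens_per_example, n_calls, price_per_million):
    """Estimate the extra input tokens (and dollars) that few-shot
    examples add across all requests."""
    extra_tokens = n_examples * tokens_per_example * n_calls
    cost = extra_tokens / 1_000_000 * price_per_million
    return extra_tokens, cost

# The warning's scenario: 5 examples x 30 tokens x 1M calls,
# at an assumed $0.15 per million input tokens
tokens, cost = extra_example_cost(5, 30, 1_000_000, 0.15)
print(f"{tokens:,} extra tokens -> ${cost:.2f}")  # prints: 150,000,000 extra tokens -> $22.50
```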

Common Mistakes and How to Fix Them

Mistake 1: Using test data as examples

Wrong:

# DON'T use a ticket from your test set as an example
examples = [
    {"text": "I was charged twice for my plan", "group": "billing"},
    # This IS a test ticket — you're leaking answers
]

Why it’s wrong: This is data leakage. Your score looks too high because the model has already seen the answer. On fresh data, the score drops.

Fix: Always pick examples that are NOT in your test set. Keep them apart.
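One way to guarantee the split is to carve prompt examples out of your labeled pool first and test only on the remainder. A minimal sketch (the take-the-first-N rule is arbitrary; random sampling works too):

```python
def split_examples(pool, per_group=1):
    """Take the first `per_group` rows of each label as prompt
    examples; everything else becomes the test set."""
    examples, test_set, taken = [], [], {}
    for item in pool:
        g = item["label"]
        if taken.get(g, 0) < per_group:
            examples.append({"text": item["text"], "group": g})
            taken[g] = taken.get(g, 0) + 1
        else:
            test_set.append(item)
    return examples, test_set

pool = [
    {"text": "I want a refund", "label": "billing"},
    {"text": "Cancel my plan", "label": "billing"},
    {"text": "Site is down", "label": "technical"},
    {"text": "Page won't load", "label": "technical"},
]
examples, test_set = split_examples(pool)
print(len(examples), len(test_set))  # prints: 2 2
```

Because every row lands in exactly one bucket, no example can leak into the scoring set.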

Mistake 2: All examples from one group

Wrong:

# Three billing samples, nothing else
examples = [
    {"text": "I want a refund", "group": "billing"},
    {"text": "I was overcharged", "group": "billing"},
    {"text": "Cancel my plan", "group": "billing"},
]

Why it’s wrong: The model sees billing three times and the rest zero times. It starts to lean toward billing — vague tickets land there because that’s the pattern it saw most.

Fix: Put at least one example per group, especially for groups that get mixed up.

Mistake 3: No format rules

Wrong:

# No format guide — model might return:
# "billing", "Billing", "BILLING", "The group is billing", etc.
prompt = "Sort this ticket: " + text

Why it’s wrong: Without clear rules, the model might wrap the answer in a sentence, use caps, or add notes. Your code that checks guess == label breaks quietly.

Fix:

prompt = f"""Sort this ticket into one group.
Groups: billing, technical, account, general
Reply with ONLY the group name in lowercase. No other text.

Ticket: "{text}"
Group:"""
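Even with that rule in the prompt, a defensive parse of the reply costs nothing. This is a hedged sketch (the `fallback` label is an arbitrary choice, not from the article):

```python
def normalize_guess(reply, groups, fallback="unknown"):
    """Map a raw model reply onto a known group name. Handles caps,
    stray punctuation, and wrapped answers like 'The group is billing.'"""
    cleaned = reply.strip().lower().strip(" .'\"")
    if cleaned in groups:
        return cleaned
    # Fall back to a substring match for wrapped answers
    for g in groups:
        if g in cleaned:
            return g
    return fallback

groups = ["billing", "technical", "account", "general"]
print(normalize_guess("Billing", groups))                # prints: billing
print(normalize_guess("The group is billing.", groups))  # prints: billing
print(normalize_guess("not sure", groups))               # prints: unknown
```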

Summary

You built a text sorter that went from ~70% (zero-shot) to 90%+ (few-shot) — with no trained model at all. Here’s the rundown:

  • Zero-shot gives a quick base score. Fine for easy tasks
  • One-shot locks the output shape and gives one data point
  • Few-shot (3-5 examples) teaches group borders and hits 90%+
  • Example choice matters — border cases beat easy ones
  • Example order matters — put the hardest group last
  • The reusable sorter splits logic from data so you can sort anything

Practice task: Take the few_shot_classify function and build an email priority sorter with three levels: urgent, normal, low. Write 4 examples (one per level, plus one border case). Test it on 5 emails you write yourself.

Sample answer:
pri_groups = ["urgent", "normal", "low"]

pri_examples = [
    {"text": "Server is down, all users hit", "group": "urgent"},
    {"text": "User asking about a tool we already have", "group": "normal"},
    {"text": "Weekly team standup notes", "group": "low"},
    {"text": "Client says data was lost after the update", "group": "urgent"},
]

test_emails = [
    {"text": "Payment system failing for all users", "label": "urgent"},
    {"text": "User wants to know how to export data", "label": "normal"},
    {"text": "Weekly team standup notes", "label": "low"},
    {"text": "Security breach found in prod", "label": "urgent"},
    {"text": "Request to swap the logo on the site", "label": "normal"},
]

for email in test_emails:
    pred = few_shot_classify(email["text"], pri_groups, pri_examples)
    tag = "right" if pred == email["label"] else "WRONG"
    print(f"[{tag}] '{email['text'][:45]}...' -> {pred} (want: {email['label']})")

One example (“Client says data was lost after the update”) sits on the line between urgent and normal. That border case helps the model tell apart big failures from small asks.

Complete Code

Full script (copy and run):
# Full code from: Zero-Shot vs Few-Shot Prompting
# Needs: OPENAI_API_KEY set in your shell
# No pip install needed — uses built-in urllib
# Python 3.9+

import os
import json
import urllib.request

# --- Config ---
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
API_URL = "https://api.openai.com/v1/chat/completions"
MODEL = "gpt-4o-mini"

# --- Helper ---
def ask_llm(prompt, temperature=0.0):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {OPENAI_API_KEY}",
    }
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    })
    req = urllib.request.Request(API_URL, data=body.encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode())
    return data["choices"][0]["message"]["content"]

# --- Test Data ---
test_tickets = [
    {"text": "I was charged twice for my plan this month", "label": "billing"},
    {"text": "The app crashes when I try to upload a file", "label": "technical"},
    {"text": "How do I change my email on my profile?", "label": "account"},
    {"text": "What are your work hours?", "label": "general"},
    {"text": "I need a refund for the yearly plan I just bought", "label": "billing"},
    {"text": "The PDF export gives a blank page", "label": "technical"},
    {"text": "Can I move my account to a new team?", "label": "account"},
    {"text": "Do you offer student pricing?", "label": "general"},
    {"text": "My card was declined but I still got charged", "label": "billing"},
    {"text": "Search gives no results for exact matches", "label": "technical"},
    {"text": "I forgot my password and the reset email never came", "label": "account"},
    {"text": "Where is your API docs page?", "label": "general"},
    {"text": "Please cancel my plan right now", "label": "billing"},
    {"text": "Photos don't load on the phone version", "label": "technical"},
    {"text": "How do I turn on two-step login?", "label": "account"},
    {"text": "Can I reach support by phone?", "label": "general"},
    {"text": "I was billed for a tool I never turned on", "label": "billing"},
    {"text": "The main page takes 30 seconds to load", "label": "technical"},
    {"text": "I need to change the billing address on file", "label": "account"},
    {"text": "What tools do you plug into?", "label": "general"},
]

# --- Scorer ---
def evaluate(classify_fn, tickets, label="Classifier"):
    correct = 0
    results = []
    for ticket in tickets:
        guess = classify_fn(ticket["text"])
        hit = guess == ticket["label"]
        correct += int(hit)
        results.append({
            "text": ticket["text"][:50] + "...",
            "true": ticket["label"],
            "guess": guess,
            "hit": hit,
        })
    score = correct / len(tickets) * 100
    print(f"\n{label}")
    print(f"Score: {correct}/{len(tickets)} ({score:.0f}%)")
    misses = [r for r in results if not r["hit"]]
    if misses:
        print(f"\nWrong ({len(misses)}):")
        for m in misses:
            print(f"  '{m['text']}' -> {m['guess']} (should be {m['true']})")
    return score, results

# --- Zero-Shot ---
def classify_zero_shot(text):
    prompt = f"""Sort this support ticket into one group.
Groups: billing, technical, account, general
Ticket: {text}
Reply with only the group name. Nothing else."""
    return ask_llm(prompt).strip().lower()

# --- One-Shot ---
def classify_one_shot(text):
    prompt = f"""Sort this support ticket into one group.
Groups: billing, technical, account, general

Example:
Ticket: "I was overcharged on my last bill"
Group: billing

Now sort this ticket:
Ticket: "{text}"
Group:"""
    return ask_llm(prompt).strip().lower()

# --- Few-Shot ---
def classify_few_shot(text):
    prompt = f"""Sort this support ticket into one group.
Groups: billing, technical, account, general

Examples:
Ticket: "I was overcharged on my last bill"
Group: billing

Ticket: "The app freezes when I click export"
Group: technical

Ticket: "How do I reset my password?"
Group: account

Now sort this ticket:
Ticket: "{text}"
Group:"""
    return ask_llm(prompt).strip().lower()

# --- Reusable Sorter ---
def few_shot_classify(text, groups, examples):
    groups_str = ", ".join(groups)
    samples = ""
    for ex in examples:
        samples += f'\nTicket: "{ex["text"]}"\nGroup: {ex["group"]}\n'
    prompt = f"""Sort this text into one group.
Groups: {groups_str}

Examples:{samples}
Now sort:
Ticket: "{text}"
Group:"""
    return ask_llm(prompt, temperature=0.0).strip().lower()
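Because the sorter takes `groups` and `examples` as parameters, the same pattern transfers to any labeling task. Here is a sketch reusing it for sentiment. The LLM call is stubbed (`fake_llm`) so the snippet runs offline without an API key, and the sentiment examples are illustrative, not from the ticket dataset:

```python
# Same few-shot pattern, different domain: sentiment labels.
# fake_llm stands in for ask_llm so this runs without an API key;
# a real call would send the prompt to the model.
def fake_llm(prompt, temperature=0.0):
    return "positive"  # stub: fixed answer, keeps the demo deterministic

def demo_classify(text, groups, examples, llm=fake_llm):
    groups_str = ", ".join(groups)
    samples = ""
    for ex in examples:
        samples += f'\nTicket: "{ex["text"]}"\nGroup: {ex["group"]}\n'
    prompt = f"""Sort this text into one group.
Groups: {groups_str}

Examples:{samples}
Now sort:
Ticket: "{text}"
Group:"""
    return llm(prompt).strip().lower()

sentiment_examples = [
    {"text": "Love this product!", "group": "positive"},
    {"text": "Worst purchase ever.", "group": "negative"},
    {"text": "It arrived on Tuesday.", "group": "neutral"},
]
label = demo_classify("Great support team, fast replies",
                      ["positive", "negative", "neutral"],
                      sentiment_examples)
print(label)
```

Swap `fake_llm` for the real `ask_llm` and the same three-example prompt sorts reviews instead of tickets.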

# --- Run All ---
print("=" * 55)
print("ZERO-SHOT vs ONE-SHOT vs FEW-SHOT COMPARISON")
print("=" * 55)

zero_acc, _ = evaluate(classify_zero_shot, test_tickets, "Zero-Shot")
one_acc, _ = evaluate(classify_one_shot, test_tickets, "One-Shot")
few_acc, _ = evaluate(classify_few_shot, test_tickets, "Few-Shot (3 examples)")

print("\n" + "=" * 55)
print("FINAL SCORES")
print("=" * 55)
print(f"Zero-shot:  {zero_acc:.0f}%")
print(f"One-shot:   {one_acc:.0f}%")
print(f"Few-shot:   {few_acc:.0f}%")
print("=" * 55)

print("\nDone.")

Frequently Asked Questions

How many examples do I need for few-shot prompting?

Three to five works for most sorting tasks. Brown et al. (2020), the paper that popularized in-context learning, found diminishing returns as the example count grows. Each example adds tokens to every API call, so there's a real cost trade-off. Start with one per group and add more only if accuracy isn't where you need it.
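You can estimate that cost trade-off before running anything. The sketch below uses the common ~4 characters-per-token heuristic (an approximation; use a real tokenizer like `tiktoken` for exact counts) and an assumed input price, not a current rate:

```python
# Rough cost of carrying few-shot examples in every prompt.
# ~4 chars/token is a heuristic; the $0.15 per 1M input tokens
# price is an illustrative assumption, not current pricing.
def rough_tokens(text):
    return len(text) // 4

example = 'Ticket: "I was overcharged on my last bill"\nGroup: billing\n'
per_example = rough_tokens(example)

for n in (1, 3, 5, 8):
    # Extra input tokens across a 500-ticket batch
    extra = per_example * n * 500
    print(f"{n} examples -> ~{extra:,} extra input tokens "
          f"(~${extra * 0.15 / 1_000_000:.4f})")
```

Even eight examples add only fractions of a cent per 500-ticket batch at these assumed rates, which is why few-shot is usually cheap at low volume.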

Can I use few-shot with open-source models like Llama?

Yes. Few-shot works with any model that follows instructions. The trick lives in the prompt, not the model. Smaller models (7B-13B) may need more examples to match GPT-4o, but the approach is the same. Just swap the API endpoint.
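As a sketch of what "swap the endpoint" means: many local servers (Ollama, llama.cpp) expose an OpenAI-compatible chat endpoint, so only the URL and model name change while the few-shot prompt stays identical. The URL, port, and model tag below are assumptions; check your server's docs:

```python
import json

# Hypothetical local OpenAI-compatible endpoint (Ollama's default
# port is shown); URL and model name are assumptions for your setup.
LOCAL_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "llama3.1:8b",          # any instruction-tuned local model
    "temperature": 0.0,
    "messages": [
        {"role": "user", "content": "Sort this support ticket into one group..."}
    ],
}
body = json.dumps(payload)
print(LOCAL_URL)
print(body[:60])
```

The request body is the same shape as an OpenAI call; only `LOCAL_URL` and `model` differ.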

Does few-shot work for text in other languages?

It works well for major ones — Spanish, French, German, Chinese, Japanese — especially with models trained on many languages. For rare languages, you may need examples written in that language. Performance depends on the model and the language.

What’s the difference between few-shot and fine-tuning?

Few-shot puts examples in the prompt at run time. No training happens. Fine-tuning changes the model’s weights using your data. Few-shot is faster to set up (minutes, not hours), cheaper at low volume, and needs no special tools. Fine-tuning gives better scores on hard tasks and costs less per call at scale since you don’t send examples each time.
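The break-even point is simple arithmetic. The sketch below uses invented numbers (token counts, prices, and training cost are all assumptions, and it ignores any per-token price difference for fine-tuned inference), but the structure of the calculation holds:

```python
# Back-of-envelope break-even between few-shot and fine-tuning.
# Every number here is an illustrative assumption, not a real rate.
FEW_SHOT_EXTRA_TOKENS = 150     # ~3 examples carried in every prompt
PRICE_PER_M_INPUT = 0.15        # assumed $ per 1M input tokens
FINE_TUNE_FIXED_COST = 25.00    # assumed one-off training cost

extra_cost_per_call = FEW_SHOT_EXTRA_TOKENS * PRICE_PER_M_INPUT / 1_000_000
break_even_calls = FINE_TUNE_FIXED_COST / extra_cost_per_call
print(f"Few-shot adds ~${extra_cost_per_call:.8f} per call")
print(f"Fine-tuning pays off after ~{break_even_calls:,.0f} calls")
```

Under these assumed numbers the break-even sits above a million calls, which is why few-shot usually wins at low volume.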

Why does example order change the results?

LLMs read tokens one by one. The attention layer gives different weight based on where a token sits. Studies show that items near the end of the prompt pull the output more — a pattern called recency bias. That’s why putting your hardest group last often helps that group’s score.
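You can exploit recency bias directly when building the examples list. The helper below (an illustrative name, not from the article's code) moves the weakest group's examples to the end so they sit closest to the query:

```python
# Put the hardest group's examples last, nearest the query,
# so recency bias works in that group's favor.
def order_for_recency(examples, hard_group):
    easy = [ex for ex in examples if ex["group"] != hard_group]
    hard = [ex for ex in examples if ex["group"] == hard_group]
    return easy + hard

examples = [
    {"text": "How do I reset my password?", "group": "account"},
    {"text": "I was overcharged on my last bill", "group": "billing"},
    {"text": "The app freezes on export", "group": "technical"},
]
ordered = order_for_recency(examples, "billing")
print([ex["group"] for ex in ordered])
```

Feed `ordered` to your prompt builder; if "billing" is the group your classifier keeps missing, this ordering often nudges its score up.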

References

  1. Brown, T. et al. — “Language Models are Few-Shot Learners.” NeurIPS 2020. arXiv:2005.14165
  2. OpenAI — Chat Completions API Reference. Link
  3. Liu, J. et al. — “What Makes Good In-Context Examples for GPT-3?” DeeLIO Workshop, 2022. arXiv:2101.06804
  4. Lu, Y. et al. — “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity.” ACL 2022. arXiv:2104.08786
  5. Min, S. et al. — “Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?” EMNLP 2022. arXiv:2202.12837
  6. Zhao, Z. et al. — “Calibrate Before Use: Improving Few-Shot Performance of Language Models.” ICML 2021. arXiv:2102.09690
  7. OpenAI — Prompt Engineering Guide. Link
  8. Wei, J. et al. — “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. arXiv:2201.11903

