Build Your First AI App with Python — Step-by-Step Tutorial
You want to build something with AI — not just read about it. By the end of this tutorial, you will have a streaming AI assistant in Python that remembers conversations, handles API failures gracefully, and tracks every cent it spends. All in under 80 lines of core code.
What Is the OpenAI API and Why Use It?
The OpenAI API gives your Python code direct access to models like GPT-4o and GPT-4o-mini. You send a message, you get back an intelligent response — no model training, no GPU, no machine learning expertise required.
Here is the complete pattern in six statements. The OpenAI() client reads your API key from the environment, chat.completions.create() sends a message, and response.choices[0].message.content extracts the text:
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv() # loads OPENAI_API_KEY from .env file
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is Python?"}]
)
print(response.choices[0].message.content)
Python is a high-level, interpreted programming language known for its
readability and versatility. It supports multiple programming paradigms
including procedural, object-oriented, and functional programming, and
is widely used in web development, data science, artificial intelligence,
and automation.
That is the entire pattern. Every AI app — chatbots, summarizers, code generators — is a variation of this single chat.completions.create call.
The 5 steps to build your first AI app
- Get an API key from platform.openai.com/api-keys
- Install the SDK — `pip install openai python-dotenv`
- Store the key safely in a `.env` file (never hardcode it)
- Send your first message using `client.chat.completions.create()`
- Extract the response from `response.choices[0].message.content`
The rest of this tutorial builds on this foundation — adding system messages, conversation memory, streaming, error handling, and cost control.
Prerequisites
- Python version: 3.9+
- Required libraries: openai (1.0+), python-dotenv
- Install: `pip install openai python-dotenv`
- API key: From platform.openai.com/api-keys (requires billing enabled)
- Time to complete: 25-30 minutes
- Cost: Under $0.10 if you follow along with gpt-4o-mini
I prefer teaching with gpt-4o-mini. It costs $0.15 per million input tokens — cheap enough to experiment freely. You can swap to gpt-4o or gpt-4.1 with a one-line change.
Setting up your API key
Never hardcode API keys in your scripts. One accidental git push and your key is compromised.
Create a .env file in your project folder:
# .env file — add this to .gitignore immediately
OPENAI_API_KEY=sk-proj-your-key-here
The load_dotenv() call reads this file and sets the environment variable. The OpenAI() client picks it up automatically. If you get an AuthenticationError, run print(os.environ.get("OPENAI_API_KEY", "NOT SET")) to check the key loaded.
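If you want a reusable diagnostic, a small helper like the sketch below (the `check_api_key` name is ours, not part of the SDK) reports whether the key loaded — and catches the trailing-whitespace trap — without printing the secret:

```python
import os

def check_api_key():
    """Report whether OPENAI_API_KEY is set, without printing the full secret."""
    key = os.environ.get("OPENAI_API_KEY")
    if key is None:
        return "NOT SET — check that .env exists and load_dotenv() ran first"
    if key != key.strip():
        return "SET, but has leading/trailing whitespace — fix your .env"
    # Show only the prefix and the length, never the whole key
    return f"SET ({key[:7]}..., {len(key)} chars)"

print(check_api_key())
```

Run this right after `load_dotenv()` — it distinguishes "the file never loaded" from "the key loaded but is malformed".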
Understanding Chat Completions — Messages, Roles, and Parameters
Every interaction with GPT is a conversation made up of messages. Each message has a role and content. Three roles:
- `system` — Sets the AI’s personality and rules. The model follows these throughout the conversation.
- `user` — Your input (or your app user’s input).
- `assistant` — The model’s previous responses. Used for conversation history.
The system message transforms a generic AI into a specialized tool. Below, we create a code reviewer that gives structured, three-bullet feedback. The system message constrains format and tone, and the user message provides the code to review:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a senior Python code reviewer. Give brief, actionable feedback. Max 3 bullet points."
        },
        {
            "role": "user",
            "content": "Review this code:\ndef calc(x,y): return x+y"
        }
    ]
)
print(response.choices[0].message.content)
- Rename `calc` to something descriptive like `add_numbers` — function names should communicate intent.
- Add type hints: `def add_numbers(x: float, y: float) -> float:` for clarity and IDE support.
- Consider adding a docstring explaining the function's purpose, even for simple utilities.
I recommend writing specific system messages for every use case. “You are a helpful assistant” is too vague. “You are a Python code reviewer. Give exactly 3 bullet points.” — that gives you predictable output you can actually parse.
Controlling output with temperature and max_tokens
Two parameters matter most: temperature controls randomness, max_tokens caps response length.
This call generates a creative coffee shop tagline. Setting temperature=0.9 encourages variation, while max_tokens=50 keeps the response short:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a tagline for a coffee shop."}],
    temperature=0.9,  # higher = more creative (0.0 to 2.0)
    max_tokens=50     # hard cap on response length
)
print(response.choices[0].message.content)
"Where every cup tells a story — brewed with passion, served with soul."
| Parameter | Range | Default | What It Controls |
|---|---|---|---|
| `temperature` | 0.0 – 2.0 | 1.0 | Randomness. Low = deterministic, high = creative |
| `max_tokens` | 1 – model max | Model-dependent | Maximum response length (hard cutoff) |
| `top_p` | 0.0 – 1.0 | 1.0 | Nucleus sampling — alternative to temperature |
Choosing the right model
Which model should you use? Here is the practical breakdown (verify current pricing at platform.openai.com/pricing):
| Model | Cost (per 1M input tokens) | Best For | Context Window |
|---|---|---|---|
| `gpt-4o-mini` | $0.15 | Prototyping, simple tasks, high volume | 128K tokens |
| `gpt-4o` | $2.50 | Complex reasoning, multimodal (images + text) | 128K tokens |
| `gpt-4.1` | $2.00 | Coding tasks, instruction following | 1M tokens |
My rule: start with gpt-4o-mini. Only upgrade if quality is insufficient for your specific task.
Quick check — predict the output:
What happens if you set temperature=0 and send the same prompt twice? Will you get identical responses?
Answer
Almost identical, but not guaranteed. With `temperature=0`, the model picks the highest-probability token at each step, so responses are highly deterministic. But the API does not guarantee bit-for-bit identical output. For stricter determinism, use the `seed` parameter — but even then, OpenAI notes results may vary across infrastructure changes.
Building a Multi-Turn Chatbot
A chatbot needs memory. The API itself is stateless — every call is independent. You create memory by sending the full conversation history with each request.
The pattern: maintain a messages list, append every user message and assistant reply, send the whole list each time. This function creates an interactive loop — the while True runs until the user types “quit”, and each exchange appends both sides to the growing messages list:
def chat_with_memory():
    """Interactive chatbot that remembers the conversation."""
    messages = [
        {"role": "system", "content": "You are a helpful Python tutor. Keep answers under 3 sentences."}
    ]
    print("Python Tutor (type 'quit' to exit)")
    print("-" * 40)
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "quit":
            break
        messages.append({"role": "user", "content": user_input})
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        assistant_reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_reply})
        print(f"\nTutor: {assistant_reply}")

# Note: uses input() — run in a terminal, not a notebook
chat_with_memory()
A sample exchange might look like:
Python Tutor (type 'quit' to exit)
----------------------------------------
You: What is a list comprehension?
Tutor: A list comprehension is a concise way to create a new list by
applying an expression to each item in an existing iterable, optionally
filtering items with a condition. The syntax is
[expression for item in iterable if condition].
You: Give me an example
Tutor: Here's one: [x**2 for x in range(5)] produces [0, 1, 4, 9, 16].
The model remembers you asked about list comprehensions because the full
conversation history is sent with each request.
We append every user message AND every assistant reply to messages. This growing list IS the memory.
This is dead simple, and it works. But it has a cost problem — we will fix that later. For a deeper dive into memory strategies including sliding windows, summarization, and persistent storage, see our Python AI chatbot memory tutorial.
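The sliding-window idea can be sketched right now as a standalone helper — a minimal version assuming `messages[0]` is always the system message (the `sliding_window` name and window size are illustrative):

```python
def sliding_window(messages, keep_last=10):
    """Keep the system message plus only the most recent messages.

    Assumes messages[0] is the system message — drop everything else
    except the last `keep_last` entries.
    """
    if len(messages) <= keep_last + 1:
        return messages  # already small enough
    return [messages[0]] + messages[-keep_last:]

# Example: a long history shrinks, but the system message survives
history = [{"role": "system", "content": "You are a tutor."}]
for i in range(30):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = sliding_window(history, keep_last=10)
print(len(trimmed))        # 11 — system message + last 10
print(trimmed[0]["role"])  # system
```

Call it before each API request and the payload stops growing without bound — at the cost of the model forgetting older turns.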
Streaming Responses in Real Time
Waiting for GPT to generate a long response before showing anything feels slow. Streaming sends tokens as they are generated — the same way ChatGPT shows text appearing word by word.
Set stream=True and iterate over chunks. The critical change: streaming uses .delta.content instead of .message.content, because each chunk is a partial update. The flush=True in the print call forces Python to display each token immediately instead of buffering:
def stream_response(prompt):
    """Stream a response token by token, like ChatGPT does."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    print("AI: ", end="")
    full_response = ""
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
            full_response += content
    print()
    return full_response

result = stream_response("Explain recursion in 3 sentences.")
AI: Recursion is when a function calls itself to solve a smaller version
of the same problem. Each recursive call works on a reduced input until it
reaches a base case that stops the recursion. Think of it like Russian
nesting dolls — you keep opening smaller dolls until you find the smallest
one, then work your way back out.
Two things to notice. First, chunk.choices[0].delta.content — not .message.content. Second, flush=True — without it, Python buffers the output and you lose the streaming effect.
I always use streaming in user-facing apps. The perceived speed difference is dramatic even when total generation time is identical.
Quick check — predict the output:
What happens if you access chunk.choices[0].message.content instead of chunk.choices[0].delta.content in a streaming response?
Answer
You get an `AttributeError`. Streaming chunks have a `delta` attribute, not `message`. The `delta` contains only the NEW tokens in each chunk. This is the #1 streaming mistake.
Building a Document Summarizer — Your First Real App
This is where it gets practical. A document summarizer takes long text and produces a concise summary — one of the most common GenAI use cases in production.
The function below uses temperature=0 because summarization is a factual task, not a creative one. The system message constrains the output to exactly N bullet points with no preamble, so your app can reliably parse the response:
def summarize_document(text, max_sentences=3):
    """Summarize a document into key points."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    f"You are a document summarizer. Produce exactly {max_sentences} "
                    "bullet points capturing the key information. Each bullet is one "
                    "sentence. No preamble, no conclusion — just the bullets."
                )
            },
            {
                "role": "user",
                "content": f"Summarize this document:\n\n{text}"
            }
        ],
        temperature=0
    )
    return response.choices[0].message.content
Testing with a sample document:
sample_doc = """
Machine learning is a subset of artificial intelligence that enables systems to
learn from data without being explicitly programmed. There are three main types:
supervised learning uses labeled data to predict outcomes, unsupervised learning
finds patterns in unlabeled data, and reinforcement learning trains agents through
rewards and penalties. Common applications include spam detection, recommendation
systems, medical diagnosis, and autonomous vehicles. The field has grown rapidly
due to increased computing power, availability of large datasets, and advances
in neural network architectures like transformers.
"""
summary = summarize_document(sample_doc)
print(summary)
- Machine learning is an AI subset that allows systems to learn patterns from data without explicit programming.
- The three main types are supervised learning (labeled data), unsupervised learning (pattern discovery), and reinforcement learning (reward-based training).
- The field's rapid growth is driven by increased computing power, large datasets, and advances in neural network architectures like transformers.
This pattern — specific system message + temperature=0 — is the foundation of reliable GenAI features. The system message does the heavy lifting.
Exercise 1: Build a Tone Converter
Create a convert_tone(text, tone) function that rewrites text in a specified tone (formal, casual, or pirate).
Hint 1: You need a system message with the tone parameter embedded. Use an f-string.
Hint 2: The system message should be something like: f"Rewrite the user's text in a {tone} tone. Keep the same meaning. Return only the rewritten text."
# Starter code
def convert_tone(text, tone="formal"):
    """Rewrite text in the specified tone."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # YOUR CODE: Add a system message using the tone parameter
            {"role": "user", "content": text}
        ],
        temperature=0.7
    )
    return response.choices[0].message.content

# Test it
original = "The quarterly revenue exceeded expectations by 15 percent."
print(convert_tone(original, "pirate"))
Test: Call convert_tone("Hello world", "formal"). If it returns None or errors, check that your system message is inside the messages list.
Click to see solution
def convert_tone(text, tone="formal"):
    """Rewrite text in the specified tone."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Rewrite the user's text in a {tone} tone. Keep the same meaning. Return only the rewritten text."
            },
            {"role": "user", "content": text}
        ],
        temperature=0.7
    )
    return response.choices[0].message.content
The key is embedding `tone` in the system message with an f-string. The `temperature=0.7` adds enough creativity for natural-sounding rewrites.
Error Handling and Retries — Production Essentials
API calls fail. Networks drop, rate limits hit, servers go down. I always recommend building retry logic before your first production deploy — not after the first outage.
The OpenAI SDK provides specific exception classes for each failure mode. The retry wrapper below catches transient errors (rate limits, timeouts, connection drops) and re-raises non-retryable ones. The exponential backoff (2 ** attempt) spaces out retries at 1s, 2s, then 4s:
import time

from openai import (
    APIError,
    APIConnectionError,
    RateLimitError,
    APITimeoutError,
)

def call_with_retry(messages, model="gpt-4o-mini", max_retries=3):
    """Make an API call with automatic retry on transient errors."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        except APIConnectionError:
            wait = 2 ** attempt
            print(f"Connection error. Retrying in {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            wait = 2 ** attempt
            print(f"Timeout. Retrying in {wait}s...")
            time.sleep(wait)
        except APIError as e:
            print(f"API error: {e.message}")
            raise  # non-retryable server error
    raise Exception(f"Failed after {max_retries} retries")
The exponential backoff is critical. Retrying immediately after a rate limit gets you rate-limited again. Waiting 1, 2, then 4 seconds gives the API time to recover.
result = call_with_retry(
messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(result)
Running this produces:
2 + 2 equals 4.
Managing Costs — Tokens, Pricing, and Strategies
Every API call costs money. Understanding the pricing model prevents surprises — especially when a chatbot prototype starts sending the entire conversation history with every call.
Costs are measured in tokens — roughly 1 token per 4 characters of English text. Each call has two components: input tokens (what you send) and output tokens (what the model generates).
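A rough estimator makes this concrete. The helpers below are our own names, and the 4-characters-per-token rule is only an approximation — a real tokenizer (such as tiktoken) gives exact counts — but it is good enough for back-of-the-envelope budgeting before you send a request:

```python
def estimate_tokens(text):
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def estimate_cost(input_text, expected_output_tokens,
                  input_rate=0.15, output_rate=0.60):
    """Estimated cost in dollars; rates are per 1M tokens (gpt-4o-mini defaults)."""
    input_tokens = estimate_tokens(input_text)
    return ((input_tokens / 1_000_000) * input_rate
            + (expected_output_tokens / 1_000_000) * output_rate)

prompt = "Summarize this document:\n\n" + "word " * 500
print(estimate_tokens(prompt))            # rough count; a tokenizer gives the exact number
print(f"${estimate_cost(prompt, 200):.6f}")
```

Even a 2,500-character prompt with a 200-token answer lands well under a tenth of a cent on gpt-4o-mini — the danger is not one call, it is thousands of calls carrying a growing history.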
This wrapper reports cost per call. It extracts the usage field from the response and looks up per-token pricing to calculate the estimated cost. The pricing dictionary maps model names to their input/output rates per million tokens:
def call_with_cost_tracking(messages, model="gpt-4o-mini"):
    """Make an API call and report token usage and estimated cost."""
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    usage = response.usage
    # Pricing per 1M tokens — verify at platform.openai.com/pricing
    pricing = {
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4.1": {"input": 2.00, "output": 8.00},
    }
    rates = pricing.get(model, pricing["gpt-4o-mini"])
    input_cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
    output_cost = (usage.completion_tokens / 1_000_000) * rates["output"]
    print(f"Tokens — Input: {usage.prompt_tokens}, Output: {usage.completion_tokens}")
    print(f"Estimated cost: ${input_cost + output_cost:.6f}")
    return response.choices[0].message.content
result = call_with_cost_tracking(
messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(result)
Tokens — Input: 13, Output: 8
Estimated cost: $0.000007
2 + 2 equals 4.
The cost difference between models is dramatic. A task costing $0.001 with gpt-4o-mini costs $0.017 with gpt-4o — 17x more. I always prototype with the smallest model that gives acceptable quality.
Cost control strategies:
- Set `max_tokens` aggressively. Need a one-sentence answer? Set `max_tokens=100`.
- Trim conversation history. Keep the system message and the last 10-20 exchanges.
- Cache identical requests. Same question from different users? Return the cached response.
- Set billing limits. In the OpenAI dashboard under Settings → Limits.
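The caching strategy can be sketched with a plain dictionary. This is a minimal sketch, not production code: `cached_call` is our own name, and the API call is injected as a function so the example runs without network access — in real use you would pass a wrapper around `client.chat.completions.create`:

```python
import hashlib
import json

_cache = {}

def cached_call(messages, api_fn, model="gpt-4o-mini"):
    """Return a cached response for identical (model, messages) requests.

    api_fn(messages, model) is whatever actually hits the API; identical
    requests after the first are served from memory for free.
    """
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages},
                   sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = api_fn(messages, model)
    return _cache[key]

# Demo with a stand-in for the real API call
calls = []
def fake_api(messages, model):
    calls.append(1)  # count how often the "API" is actually hit
    return "four"

q = [{"role": "user", "content": "What is 2 + 2?"}]
print(cached_call(q, fake_api))  # hits the (fake) API
print(cached_call(q, fake_api))  # served from cache
print(len(calls))                # 1 — only one real call was made
```

For temperature-0 tasks like classification or summarization this is nearly free savings; for creative tasks, caching defeats the point of variation.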
Putting It All Together — A Complete AI Assistant
Everything we have built — system messages, memory, streaming, error handling, cost tracking — combines into a single class. Here is what a deployable assistant looks like.
First, the constructor sets up the client, message history, and pricing lookup. The chat method dispatches to either streaming or standard mode:
class AIAssistant:
    """AI assistant with memory, streaming, and cost tracking."""

    def __init__(self, system_prompt, model="gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]
        self.total_cost = 0.0
        self.pricing = {
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
            "gpt-4o": {"input": 2.50, "output": 10.00},
        }

    def chat(self, user_input, stream=True):
        """Send a message and get a response."""
        self.messages.append({"role": "user", "content": user_input})
        if stream:
            return self._stream_response()
        return self._standard_response()
The streaming method iterates over chunks and accumulates the full response. Notice stream_options={"include_usage": True} — this tells the API to send token counts at the end of the stream so we can track costs:
    def _stream_response(self):
        """Stream the response token by token."""
        stream = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=True,
            stream_options={"include_usage": True}
        )
        full_response = ""
        print("AI: ", end="")
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
            if chunk.usage:
                self._track_cost(chunk.usage)
        print()
        self.messages.append({"role": "assistant", "content": full_response})
        return full_response

    def _standard_response(self):
        """Get the full response at once."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        self._track_cost(response.usage)
        return reply
Finally, helper methods for cost tracking and history management. The trim_history method keeps the system message (always index 0) and only the most recent N messages to control costs:
    def _track_cost(self, usage):
        """Calculate and accumulate cost."""
        rates = self.pricing.get(self.model, self.pricing["gpt-4o-mini"])
        cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
        cost += (usage.completion_tokens / 1_000_000) * rates["output"]
        self.total_cost += cost

    def get_cost(self):
        return f"${self.total_cost:.6f}"

    def trim_history(self, keep_last=20):
        """Keep only the system message and the last N messages."""
        system = self.messages[0]
        self.messages = [system] + self.messages[-keep_last:]
Here is the assistant in action. We create it with a system prompt defining its role, then ask a question. The second question demonstrates memory — the model knows we were talking about lists vs tuples:
assistant = AIAssistant(
system_prompt="You are a Python expert. Give concise, practical answers."
)
assistant.chat("What's the difference between a list and a tuple?")
print(f"\nSession cost: {assistant.get_cost()}")
AI: Lists are mutable — you can add, remove, and change elements after
creation. Tuples are immutable — once created, they cannot be modified.
Use lists when your data needs to change, and tuples when it should stay
fixed (like database records, dictionary keys, or function return values).
Session cost: $0.000042
# The assistant remembers context from the previous exchange
assistant.chat("When would I use a tuple instead?")
AI: Use tuples when you need an immutable sequence: as dictionary keys,
for returning multiple values from a function, for named constants like
RGB colors (255, 128, 0), or when you want to signal that the data
should not be modified. They also use slightly less memory than lists.
That is a complete, working AI assistant. The core structure stays the same across projects — only the system prompt and the UI layer change.
Exercise 2: Add Conversation Persistence
Add two methods to AIAssistant: save_conversation(filepath) saves message history to JSON, load_conversation(filepath) restores it.
Hint 1: self.messages is already a list of dictionaries — native JSON. Use json.dump and json.load.
Hint 2: Open the file with "w" mode for saving and "r" for loading. Add indent=2 to json.dump for readability.
import json

def save_conversation(self, filepath):
    """Save the conversation history to a JSON file."""
    # YOUR CODE HERE
    pass

def load_conversation(self, filepath):
    """Load conversation history from a JSON file."""
    # YOUR CODE HERE
    pass
Test: After assistant.save_conversation("chat.json"), open chat.json — you should see a JSON array of objects with role and content keys.
Click to see solution
import json

def save_conversation(self, filepath):
    """Save the conversation history to a JSON file."""
    with open(filepath, "w") as f:
        json.dump(self.messages, f, indent=2)
    print(f"Conversation saved to {filepath}")

def load_conversation(self, filepath):
    """Load conversation history from a JSON file."""
    with open(filepath, "r") as f:
        self.messages = json.load(f)
    print(f"Loaded {len(self.messages)} messages from {filepath}")
The messages list maps directly to JSON. The `indent=2` makes the file human-readable for debugging.
Common Mistakes and How to Fix Them
Mistake 1: Printing the response object instead of the text
Every beginner does this at least once. The response is a Pydantic object, not a string.
❌ Wrong:
response = client.chat.completions.create(...)
print(response) # prints the entire Response object
✅ Correct:
response = client.chat.completions.create(...)
print(response.choices[0].message.content)
Mistake 2: Forgetting conversation history in chatbots
Each API call is independent. If you do not send previous messages, the model has zero context.
❌ Wrong:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What about tuples?"}]
)
# The model has no idea what "What about" refers to
✅ Correct:
messages.append({"role": "user", "content": "What about tuples?"})
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages # includes all previous exchanges
)
Mistake 3: Using .message instead of .delta in streaming
This gives you an AttributeError that is confusing if you have not seen it before.
❌ Wrong:
for chunk in stream:
    print(chunk.choices[0].message.content)  # AttributeError!
✅ Correct:
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Mistake 4: Building a chatbot without a system message
Without a system message, the model responds generically instead of following your specialized behavior rules. I recommend always starting with a system message, even for quick prototypes.
❌ Wrong:
messages = [] # no system message!
messages.append({"role": "user", "content": "Review my code"})
✅ Correct:
messages = [{"role": "system", "content": "You are a code reviewer. Give 3 bullet points."}]
messages.append({"role": "user", "content": "Review my code"})
Troubleshooting Common Errors
openai.AuthenticationError: Incorrect API key provided
When you see it: Your API key is wrong, expired, or not loaded from the environment.
The fix: This is almost always a .env file issue. Check it step by step:
import os
print(os.environ.get("OPENAI_API_KEY", "NOT SET"))
Verify .env exists, load_dotenv() is called before creating the client, and the key starts with sk-. A common culprit: trailing whitespace in the .env file.
openai.RateLimitError: Rate limit reached
When you see it: Too many requests too quickly, or you hit your spending limit.
The fix: Use the exponential backoff retry wrapper from above. Also check your limits at platform.openai.com/settings/organization/limits.
openai.BadRequestError: maximum context length exceeded
When you see it: Your messages list exceeds the model’s context window.
The fix: Trim conversation history — keep the system message and last 10-20 exchanges:
system_msg = messages[0]
messages = [system_msg] + messages[-20:]
When NOT to Use the OpenAI API
The API is powerful, but it is not always the right tool. Being honest about limitations is what separates a useful guide from marketing copy.
- When you need deterministic output. Even with `temperature=0`, slight variations occur across calls. For a calculator or lookup table, use regular code.
- When latency matters more than quality. API calls take 500ms-3s. For sub-100ms responses, pre-compute answers or use a local model.
- When your data is sensitive. Data is processed on OpenAI’s servers. For healthcare or legal data, consider self-hosted models like Llama.
- When the task is simple pattern matching. Extracting dates from text? Regex is faster, cheaper, and more reliable than a GPT call.
- When cost is a constraint at scale. Processing 1M documents with GPT-4o costs ~$2,500+. A fine-tuned open-source model on your hardware costs a fraction after initial setup.
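To make the pattern-matching point concrete, here is the regex alternative for the date-extraction example — covering only ISO-format dates as an illustration:

```python
import re

text = "Invoices issued on 2024-03-15 and 2024-04-02 are overdue."

# ISO-format dates only (YYYY-MM-DD) — extend the pattern for other formats
iso_date = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
print(iso_date.findall(text))  # ['2024-03-15', '2024-04-02']
```

Zero cost, microsecond latency, fully deterministic — exactly the properties an API call cannot give you for a task this simple.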
What to Build Next
You now have all the building blocks. Here are four projects to try, in increasing difficulty:
- Email classifier — Use a system message to classify emails as “urgent,” “info,” or “spam.” One API call per email, `temperature=0`, simple parsing.
- Meeting notes summarizer — Feed meeting transcripts to the summarizer function. Add action item extraction by modifying the system message.
- Code review bot — Hook the code reviewer into a GitHub webhook. Use the retry wrapper for reliability.
- Multi-tool assistant — Extend the `AIAssistant` class to call external APIs (weather, stock prices) based on user requests. This leads into OpenAI’s function calling feature.
Each project uses the same core pattern. The only things that change are the system message and what you do with the response.
Frequently Asked Questions
How much does it cost to use the OpenAI API?
Pricing is per token (roughly 4 characters). GPT-4o-mini costs $0.15/1M input tokens and $0.60/1M output tokens — check current pricing at platform.openai.com/pricing. A typical exchange costs $0.0001-$0.001. I recommend setting a billing limit of $10-20/month during development.
What is the difference between GPT-4o, GPT-4o-mini, and GPT-4.1?
GPT-4o is the flagship multimodal model. GPT-4o-mini is smaller, faster, cheaper — best for most tasks. GPT-4.1 is optimized for coding and instruction following. Start with mini, upgrade only if quality falls short. See the comparison table in the parameters section above.
How do I handle documents longer than the context window?
Split into chunks (2,000-4,000 tokens each), summarize each chunk, then summarize the summaries. This “map-reduce” approach handles documents of any length. With GPT-4o’s 128K context, most single documents fit in one call anyway.
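The splitting step of that map-reduce approach can be sketched as below — a character-based approximation (~4 characters per token) with a small overlap so sentences cut at a boundary keep some context; `chunk_text` is our own helper name:

```python
def chunk_text(text, max_chars=8000, overlap=200):
    """Split text into overlapping chunks.

    8,000 chars ≈ 2,000 tokens by the rough 4-chars-per-token rule.
    The overlap repeats the tail of each chunk at the head of the next,
    so context is not lost at arbitrary cut points.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

doc = "x" * 20_000
chunks = chunk_text(doc)
print(len(chunks))                # 3
print([len(c) for c in chunks])   # [8000, 8000, 4400]
```

Each chunk then goes through the summarizer, and the concatenated chunk summaries go through one final summarization call.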
Can I use the API for free?
New accounts sometimes get limited free credits. But ongoing usage requires a payment method. Check platform.openai.com/pricing for current offers.
Is my data safe when using the API?
Per OpenAI’s policy, API data is not used to train models by default. You can opt out of data retention entirely. For highly sensitive data, consider self-hosted alternatives — running Llama 3 locally is now straightforward with tools like Ollama.
Complete Code
Click to expand the full script (copy-paste and run)
# Complete code from: Build Your First AI App with Python
# Requires: pip install openai python-dotenv
# Python 3.9+
import os
import time
import json
from openai import OpenAI, APIError, APIConnectionError, RateLimitError, APITimeoutError
from dotenv import load_dotenv
# --- Setup ---
load_dotenv()
client = OpenAI()
# --- Section 1: Basic API Call ---
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is Python?"}]
)
print("Basic call:", response.choices[0].message.content[:100], "...")

# --- Section 2: System Messages ---
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a senior Python code reviewer. Max 3 bullet points."},
        {"role": "user", "content": "Review this code:\ndef calc(x,y): return x+y"}
    ]
)
print("\nCode review:", response.choices[0].message.content)
# --- Section 3: Streaming ---
def stream_response(prompt):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    full = ""
    print("\nStreaming: ", end="")
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
            full += content
    print()
    return full

stream_response("Explain recursion in 2 sentences.")
# --- Section 4: Document Summarizer ---
def summarize_document(text, max_sentences=3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Produce exactly {max_sentences} bullet points. Each is one sentence."},
            {"role": "user", "content": f"Summarize:\n\n{text}"}
        ],
        temperature=0
    )
    return response.choices[0].message.content

sample = "Machine learning is a subset of AI that enables systems to learn from data. There are three types: supervised, unsupervised, and reinforcement learning."
print("\nSummary:", summarize_document(sample))
# --- Section 5: Error Handling ---
def call_with_retry(messages, model="gpt-4o-mini", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except RateLimitError:
            time.sleep(2 ** attempt)
        except (APIConnectionError, APITimeoutError):
            time.sleep(2 ** attempt)
        except APIError as e:
            print(f"API error: {e.message}")
            raise
    raise Exception(f"Failed after {max_retries} retries")
# --- Section 6: Cost Tracking ---
def call_with_cost_tracking(messages, model="gpt-4o-mini"):
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    pricing = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}
    rates = pricing.get(model, pricing["gpt-4o-mini"])
    cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
    cost += (usage.completion_tokens / 1_000_000) * rates["output"]
    print(f"Tokens: {usage.prompt_tokens} in, {usage.completion_tokens} out. Cost: ${cost:.6f}")
    return response.choices[0].message.content

call_with_cost_tracking([{"role": "user", "content": "What is 2+2?"}])
# --- Section 7: AI Assistant Class ---
class AIAssistant:
    def __init__(self, system_prompt, model="gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]
        self.total_cost = 0.0
        self.pricing = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

    def chat(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        response = self.client.chat.completions.create(
            model=self.model, messages=self.messages
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        usage = response.usage
        rates = self.pricing.get(self.model, self.pricing["gpt-4o-mini"])
        cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
        cost += (usage.completion_tokens / 1_000_000) * rates["output"]
        self.total_cost += cost
        return reply

    def get_cost(self):
        return f"${self.total_cost:.6f}"

    def trim_history(self, keep_last=20):
        system = self.messages[0]
        self.messages = [system] + self.messages[-keep_last:]

assistant = AIAssistant("You are a Python expert. Be concise.")
print("\nAssistant:", assistant.chat("What is a decorator?"))
print(f"Session cost: {assistant.get_cost()}")
print("\nScript completed successfully.")