OpenAI API Python Tutorial – A Complete Crash Course
Learn the OpenAI API with Python step by step. Covers chat completions, streaming, function calling, structured outputs, vision, embeddings, and a hands-on mini project.
The OpenAI API lets you add GPT’s language skills to any Python app — this guide walks you through it, from your first API call to a working project.
Want to build a chatbot? A code helper? A tool that reads images or turns speech to text? The OpenAI API makes all of this possible with just a few lines of Python. But the docs can feel dense if you are new. This crash course fixes that. We start from scratch and build up to a real project — step by step.
What Is the OpenAI API?
The OpenAI API is a web service. It lets your Python code talk to GPT models. You send a message, and the model sends back a reply — as clean, typed data.
Think of it as a remote function. You put a prompt in. You get text out. The magic happens on OpenAI’s servers, powered by large language models.
Here is what you can build with it:
- Chat and text — chatbots, writing tools, translators
- Vision — read and describe images
- Image creation — make images from text (DALL-E)
- Speech to text — turn audio into words (Whisper)
- Embeddings — change text into number arrays for search
- Function calling — let the model run your own Python code
- Structured outputs — get replies in the exact JSON shape you need
Several models are on offer. Each one trades off speed, quality, and price in a different way.
| Model | Best For | Context Window | Relative Cost |
|---|---|---|---|
| GPT-4o | High-quality text, vision, audio | 128K tokens | Medium |
| GPT-4o-mini | Fast, cheap daily tasks | 128K tokens | Low |
| o1 | Hard reasoning, math, code | 200K tokens | High |
| o3-mini | Reasoning at lower cost | 200K tokens | Medium |
| GPT-4.1 | Coding, following instructions | 1M tokens | Medium |
| GPT-4.1-mini | Quick coding tasks | 1M tokens | Low |
Key Insight: The OpenAI API is not ChatGPT. ChatGPT is a product built ON TOP of this API. When you use the API, you get full control — the model, the settings, and the output format. ChatGPT hides all of that.
How Do You Set Up the OpenAI Python SDK?
You need three things: the library, an API key, and a way to store the key safely. Let me walk you through each one.
Step 1: Install the SDK.
```bash
pip install openai
```
Step 2: Get your API key.
Head to platform.openai.com/api-keys. Click “Create new secret key.” Copy it right away — you will not see it again.
Step 3: Save the key as a system variable.
The SDK picks up the OPENAI_API_KEY variable on its own. Set it in your shell:
```bash
export OPENAI_API_KEY="sk-your-key-here"
```
For a lasting setup, put it in a .env file in your project root:
```text
OPENAI_API_KEY=sk-your-key-here
```
Then load it in Python:
```python
from dotenv import load_dotenv

load_dotenv()  # Reads .env file and sets environment variables
```
Warning: Never paste your API key into a Python file. Bots scrape GitHub for keys and can rack up charges in minutes. Always use a `.env` file and add it to `.gitignore`.
Step 4: Check that it works. A quick test call does the trick.
```python
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from environment
print(client.models.list().data[0].id)
```
Output:
```text
gpt-4o-2024-08-06
```
If a model ID shows up, you are all set. If you see an AuthenticationError, check your key again.
How Do You Make Your First API Call?
The Chat Completions API is the classic way to talk to GPT. You send a list of messages, each with a role and some content.
Here is how to think about it. You are writing a script for a play. The system role sets the scene (“You are a helpful tutor”). The user role is you. The assistant role is the model’s lines.
Let me show you the simplest call. The code below sends one user message, and we print the reply. Watch how we pass the model name and a list of messages.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is an API in one sentence?"}
    ]
)

print(response.choices[0].message.content)
```

Output:
```text
An API (Application Programming Interface) is a set of rules that lets different software programs communicate with each other.
```
Three lines of setup. One API call. GPT is now part of your Python script.
Understanding the Response Object
What comes back is not just a string. It is a rich object full of useful info. Let me show you.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is an API in one sentence?"}
    ]
)

print(f"Response ID: {response.id}")
print(f"Model used: {response.model}")
print(f"Finish reason: {response.choices[0].finish_reason}")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
```

Output:
```text
Response ID: chatcmpl-abc123xyz
Model used: gpt-4o-mini-2024-07-18
Finish reason: stop
Prompt tokens: 25
Completion tokens: 22
Total tokens: 47
```
The usage field tells you how many tokens the call used. That number maps right to your bill. The finish_reason tells you why the model stopped — stop means it was done, length means it ran out of space.
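Because billing maps directly to those token counts, you can turn any `usage` object into a dollar estimate. Here is a minimal sketch; the per-million-token prices default to the GPT-4o-mini rates from the pricing table later in this guide, and the helper name is our own:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_price: float = 0.15, output_price: float = 0.60) -> float:
    """Estimate the cost of one call. Prices are USD per 1M tokens."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000

# The example call above used 25 prompt tokens and 22 completion tokens
print(f"${estimate_cost_usd(25, 22):.8f}")  # $0.00001695
```

In real code you would feed it `response.usage.prompt_tokens` and `response.usage.completion_tokens` directly.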
Tip: Pick `gpt-4o-mini` while you build and test. It costs about 17x less than `gpt-4o` and runs fast. Switch to a bigger model only when you need better quality in production.
Key Parameters You Should Know
The create() method takes several options that shape the output. Here are the ones you will reach for most.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Give me 3 Python project ideas."}
    ],
    temperature=0.7,        # Creativity: 0 = deterministic, 2 = very random
    max_tokens=200,         # Maximum length of the response
    top_p=1.0,              # Nucleus sampling (usually leave at 1.0)
    frequency_penalty=0.0,  # Penalize repeated tokens
    presence_penalty=0.0,   # Penalize tokens already in the conversation
)

print(response.choices[0].message.content)
```

Output:
```text
1. **Personal Finance Tracker** — Build a CLI app that logs expenses, categorizes them, and shows monthly summaries using pandas.
2. **Web Scraper Dashboard** — Create a scraper that collects job listings from multiple sites and displays them in a Streamlit dashboard.
3. **AI Flashcard Generator** — Feed lecture notes to GPT and auto-generate study flashcards with questions and answers.
```
The star of the show is temperature. Set it to 0 for factual work (pulling data, sorting things into groups). Set it to 0.7–1.0 for creative work (writing, brainstorming, story ideas).
What Is the Responses API and Should You Use It?
In March 2025, OpenAI shipped the Responses API — a cleaner, more capable cousin of Chat Completions. It is now the go-to API for all new projects.
What changed? The Responses API takes instructions in place of system messages, and input in place of the messages list. It also packs in tools like web search and file search right out of the box.
Let me show the same “What is an API?” call using this newer API. Notice how much leaner the code looks.
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are a helpful assistant.",
    input="What is an API in one sentence?"
)

print(response.output_text)
```

Output:
```text
An API is a set of rules and protocols that allows different software applications to communicate and exchange data with each other.
```
Same answer. Less code. You just pass instructions (the system prompt) and input (the user message). No need to build a messages list.
Chat Completions vs Responses API — Which Should You Use?
Let me lay it out in a table so you can choose fast.
| Feature | Chat Completions | Responses API |
|---|---|---|
| Status | Kept alive for good | Advised for new projects |
| Input style | messages list with roles | instructions + input |
| Built-in tools | None | Web search, file search, code runner |
| Multi-turn state | You manage it (append messages) | Auto (previous_response_id) |
| Structured outputs | response_format | text.format |
| Cache savings | Standard | 40-80% better |
| Function calling | Yes | Yes (strict by default) |

Key Insight: Pick the Responses API for new work. Stick with Chat Completions if your code already runs fine. The Responses API is simpler, costs less (thanks to better caching), and has built-in tools. But Chat Completions is not going away — OpenAI promises to keep it.
This guide shows both, so you can use either one. Most examples use Chat Completions (it still has the widest docs), but we show the Responses API version for key features.
How Does Streaming Work?
Normally, the API waits until the full reply is ready, then sends it all at once. With streaming, you get words as they form — token by token. This makes your app feel much snappier.
Streaming shines in chatbots and UIs. It gives users that “text appearing live” feel — just like ChatGPT in the browser.
To turn it on, set stream=True. Then loop through the pieces. Each piece holds a small chunk (called a delta) of the reply.
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain gradient descent in 3 sentences."}
    ],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
print()  # Newline at the end
```
Output:
```text
Gradient descent is an optimization algorithm that finds the minimum of a function by taking small steps in the direction of steepest decrease. At each step, it calculates the gradient (slope) of the loss function and moves the parameters in the opposite direction. The step size is controlled by the learning rate — too large and it overshoots, too small and it converges slowly.
```
Words pop up one by one in your terminal — the same effect you see in ChatGPT.
With the Responses API, streaming is almost the same:
```python
stream = client.responses.create(
    model="gpt-4o-mini",
    input="Explain gradient descent in 3 sentences.",
    stream=True
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()
```
Tip: Always stream in user-facing apps. The first word shows up much sooner than if you wait for the full reply. Users see streaming as faster, even when the total time is the same.
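In practice you usually want both behaviors at once: print deltas as they arrive for the user, and keep the full reply for your history list. A small sketch of that pattern, demonstrated here with stand-in chunk objects shaped like the SDK's so it runs without an API key (the helper name is ours):

```python
from types import SimpleNamespace

def stream_and_collect(stream) -> str:
    """Print each delta as it arrives and return the full reply."""
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end="", flush=True)
            parts.append(content)
    print()
    return "".join(parts)

# Stand-in chunks; the real stream would come from
# client.chat.completions.create(..., stream=True)
def fake_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

fake_stream = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk("world!"), fake_chunk(None)]
full = stream_and_collect(fake_stream)
print(repr(full))  # 'Hello, world!'
```

The returned string is what you would append to `messages` as the assistant turn.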
How Do You Build Multi-Turn Conversations?
Each API call is a blank slate. The model has no memory of what came before. To hold a real conversation, you must send the full chat history with every call.
With Chat Completions, you keep a messages list and grow it after each round.
```python
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a Python tutor. Keep answers short."}
]

# Turn 1
messages.append({"role": "user", "content": "What is a decorator?"})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
assistant_msg = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_msg})
print(f"AI: {assistant_msg}\n")

# Turn 2 — the model remembers Turn 1
messages.append({"role": "user", "content": "Show me an example."})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
assistant_msg = response.choices[0].message.content
print(f"AI: {assistant_msg}")
```
Output:
```text
AI: A decorator is a function that takes another function as input, adds some behavior to it, and returns a modified version — without changing the original function's code.

AI: Here's a simple example:

def log_call(func):
    def wrapper(*args):
        print(f"Calling {func.__name__}")
        return func(*args)
    return wrapper

@log_call
def greet(name):
    return f"Hello, {name}!"

print(greet("Alice"))
# Output: Calling greet
# Hello, Alice!
```
See how Turn 2 builds on Turn 1? That works because we sent the whole chat history both times.
With the Responses API, multi-turn is easier. Just pass the previous_response_id and the API tracks the state for you.
python
# Turn 1
response1 = client.responses.create(
model="gpt-4o-mini",
instructions="You are a Python tutor. Keep answers short.",
input="What is a decorator?"
)
print(f"AI: {response1.output_text}\n")
# Turn 2 — pass previous_response_id
response2 = client.responses.create(
model="gpt-4o-mini",
previous_response_id=response1.id,
input="Show me an example."
)
print(f"AI: {response2.output_text}")
No message juggling. The API holds the state on its end.
Warning: Keep an eye on your token count in long chats. Every round adds tokens. A 128K context window sounds huge, but 50 back-and-forth turns can fill it up. Trim old messages when you get close to the cap.
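A blunt but effective trim is to keep the system message plus only the most recent turns. A minimal sketch (the function name is ours; trimming by exact token count is more precise, but this keeps you under the cap in practice):

```python
def keep_recent_turns(messages: list, max_pairs: int = 10) -> list:
    """Keep the system message plus the most recent user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-(max_pairs * 2):]

# Simulate a long conversation
history = [{"role": "system", "content": "Be brief."}]
for i in range(30):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = keep_recent_turns(history, max_pairs=10)
print(len(history), "->", len(trimmed))  # 61 -> 21
```

You would call this right before each API request so the request never grows past your chosen budget.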
```typescript
{
  type: 'exercise',
  id: 'chatbot-ex1',
  title: 'Exercise 1: Build a Simple Q&A Bot',
  difficulty: 'beginner',
  exerciseType: 'write',
  instructions: 'Complete the function below that takes a user question and a conversation history, calls the OpenAI API, and returns the assistant response as a string. The function should append the user message, make the API call, and return the content.',
  starterCode: 'from openai import OpenAI\n\nclient = OpenAI()\n\ndef ask_bot(question: str, history: list) -> str:\n """Send a question to GPT and return the response."""\n # Step 1: Append the user message to history\n history.append({"role": "user", "content": question})\n \n # Step 2: Call the API (use gpt-4o-mini)\n response = # YOUR CODE HERE\n \n # Step 3: Extract and return the assistant message\n answer = # YOUR CODE HERE\n \n # Step 4: Append assistant message to history\n history.append({"role": "assistant", "content": answer})\n \n return answer\n\n# Test it\nhistory = [{"role": "system", "content": "You answer in one sentence."}]\nprint(ask_bot("What is Python?", history))\nprint("DONE")',
  testCases: [
    { id: 'tc1', input: '', expectedOutput: 'DONE', description: 'Function runs and prints DONE' }
  ],
  hints: [
    'Use client.chat.completions.create(model="gpt-4o-mini", messages=history)',
    'Extract the answer with: response.choices[0].message.content'
  ],
  solution: 'from openai import OpenAI\n\nclient = OpenAI()\n\ndef ask_bot(question: str, history: list) -> str:\n history.append({"role": "user", "content": question})\n response = client.chat.completions.create(model="gpt-4o-mini", messages=history)\n answer = response.choices[0].message.content\n history.append({"role": "assistant", "content": answer})\n return answer\n\nhistory = [{"role": "system", "content": "You answer in one sentence."}]\nprint(ask_bot("What is Python?", history))\nprint("DONE")',
  solutionExplanation: 'The function adds the user message to the history list, calls the Chat Completions API with the full list, pulls out the reply, adds it back for future turns, and returns it.',
  xpReward: 15,
}
```
What Is Function Calling and How Do You Use It?
Function calling gives the model the power to reach for YOUR Python code. Instead of only writing text, it can say: “I need to call get_weather with city='London'.”
But here is the key: the model does not run anything. You describe your functions with a JSON schema. The model picks which one to call and fills in the arguments. Then YOU run it on your side and feed the result back.
Here is the full flow in plain English:
- You list your tools (functions) along with their input shapes
- You send a user message plus those tool specs
- The model decides if it needs a tool
- If so, it hands back the function name and arguments
- You run the function on your machine
- You pass the result back to the model
- The model writes a final, human-friendly reply
Let me show you a hands-on example: a weather lookup. We write a get_weather function and let the model pick when to use it.
```python
from openai import OpenAI
import json

client = OpenAI()

# Step 1: Define the function schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name, e.g. London"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

# Step 2: Our actual function (in real apps, this would call a weather API)
def get_weather(city: str, unit: str = "celsius") -> str:
    # Simulated weather data
    weather_data = {
        "London": {"temp": 15, "condition": "Cloudy"},
        "Tokyo": {"temp": 22, "condition": "Sunny"},
        "New York": {"temp": 18, "condition": "Partly cloudy"},
    }
    data = weather_data.get(city, {"temp": 20, "condition": "Unknown"})
    return json.dumps({"city": city, "temperature": data["temp"], "unit": unit, "condition": data["condition"]})
```
Now we send a message and let the model choose the tool on its own.
```python
# Step 3: Send message with tools
messages = [{"role": "user", "content": "What's the weather like in Tokyo?"}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools
)

# Step 4: Check if the model wants to call a function
tool_call = response.choices[0].message.tool_calls[0]
print(f"Model wants to call: {tool_call.function.name}")
print(f"With arguments: {tool_call.function.arguments}")
```
Output:
```text
Model wants to call: get_weather
With arguments: {"city": "Tokyo", "unit": "celsius"}
```
The model chose to call get_weather with city="Tokyo" — all by itself. Now we run the function and send the output back.
```python
# Step 5: Execute the function
args = json.loads(tool_call.function.arguments)
result = get_weather(**args)

# Step 6: Send the result back to the model
messages.append(response.choices[0].message)  # Add the assistant's tool call
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": result
})

# Step 7: Get the final response
final_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools
)
print(final_response.choices[0].message.content)
```
Output:
```text
The weather in Tokyo is currently sunny with a temperature of 22°C.
```
The model took raw JSON from your function and turned it into a neat sentence.
Key Insight: Function calling is how you link GPT to the real world. Database lookups, API calls, math, file reads — if your Python code can do it, the model can trigger it. This is the base layer of AI agents.
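That execute-and-reply loop generalizes. Here is a sketch of a dispatch helper that maps tool-call names to your Python functions and builds the `tool` messages to send back, demonstrated with a stand-in tool call object so it runs without an API key (the helper and registry names are ours):

```python
import json
from types import SimpleNamespace

def run_tool_calls(tool_calls, registry: dict) -> list:
    """Execute each requested tool and build the 'tool' messages to send back."""
    results = []
    for call in tool_calls:
        func = registry[call.function.name]          # Look up our Python function
        args = json.loads(call.function.arguments)   # Arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**args),
        })
    return results

def get_weather(city, unit="celsius"):
    return json.dumps({"city": city, "temp": 22, "unit": unit})

# Stand-in for response.choices[0].message.tool_calls
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
)
print(run_tool_calls([fake_call], {"get_weather": get_weather}))
```

With a registry of several functions, the same loop handles whichever tools the model picks, including multiple tool calls in one response.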
How Do Structured Outputs Work?
When you ask GPT to return JSON, it works most of the time. But “most of the time” is not enough for real apps. The model might toss in extra fields, skip required ones, or wrap the JSON in backticks.
Structured Outputs fix this for good. You create a Pydantic model (a Python class that spells out your data shape), and the API makes sure the reply fits it — every time. No parse errors. No missing keys.
Let me show you. We will pull structured data out of a product review. First, we set up the schema with Pydantic.
```python
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ReviewAnalysis(BaseModel):
    sentiment: str         # "positive", "negative", or "neutral"
    confidence: float      # 0.0 to 1.0
    key_topics: list[str]  # Main topics mentioned
    summary: str           # One-sentence summary

response = client.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Analyze the product review."},
        {"role": "user", "content": "The laptop is blazing fast and the screen is gorgeous. Battery life could be better though — I get about 6 hours. Overall, great value for the price."}
    ],
    response_format=ReviewAnalysis,
)

review = response.choices[0].message.parsed
print(f"Sentiment: {review.sentiment}")
print(f"Confidence: {review.confidence}")
print(f"Topics: {review.key_topics}")
print(f"Summary: {review.summary}")
```

Output:
```text
Sentiment: positive
Confidence: 0.85
Topics: ['performance', 'display', 'battery life', 'value']
Summary: A fast laptop with an excellent screen and good value, though battery life is average at 6 hours.
```
Note the .parse() call instead of .create(). It tells the SDK to check the reply against your Pydantic model. The review object is a real typed Python object — full dot-notation access with type safety baked in.
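You can tighten the schema further with `typing.Literal`, so `sentiment` can only take one of the allowed values instead of free text. A short sketch of the model definition, with a local Pydantic validation check (the class name is ours):

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class StrictReviewAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]  # enum, not free-form str
    confidence: float
    key_topics: list[str]
    summary: str

# Pydantic rejects values outside the allowed set
try:
    StrictReviewAnalysis(sentiment="mixed", confidence=0.5, key_topics=[], summary="ok")
except ValidationError:
    print("rejected: 'mixed' is not an allowed sentiment")
```

Pass `StrictReviewAnalysis` to `response_format` exactly as before; the generated JSON schema then carries the enum constraint.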
When to Use Structured Outputs vs Function Calling
Both lean on JSON schemas. But they solve different problems.
| Use Case | Reach For |
|---|---|
| Shape the model’s reply to the user | Structured Outputs |
| Hook the model up to tools or APIs | Function Calling |
| Pull structured data from free text | Structured Outputs |
| Let the model fire actions in your system | Function Calling |
| Feed model output into a pipeline | Structured Outputs |
Tip: Pair Pydantic with Structured Outputs for production. You get type safety, auto-checks, and clean Python objects — no more `json.loads()` and manual key hunts.
```typescript
{
  type: 'exercise',
  id: 'structured-ex1',
  title: 'Exercise 2: Extract Movie Information',
  difficulty: 'beginner',
  exerciseType: 'write',
  instructions: 'Define a Pydantic model called MovieInfo with fields: title (str), year (int), genre (str), and rating (float). Then use client.chat.completions.parse() to extract this information from the given movie description. Print the title and year.',
  starterCode: 'from pydantic import BaseModel\nfrom openai import OpenAI\n\nclient = OpenAI()\n\n# Define the Pydantic model\nclass MovieInfo(BaseModel):\n # YOUR CODE HERE: define title, year, genre, rating fields\n pass\n\nresponse = client.chat.completions.parse(\n model="gpt-4o-mini",\n messages=[\n {"role": "user", "content": "The Dark Knight (2008) is a superhero thriller rated 9.0 on IMDb."}\n ],\n response_format=MovieInfo,\n)\n\nmovie = response.choices[0].message.parsed\nprint(f"{movie.title} ({movie.year})")\nprint("DONE")',
  testCases: [
    { id: 'tc1', input: '', expectedOutput: 'The Dark Knight (2008)', description: 'Should extract title and year' },
    { id: 'tc2', input: '', expectedOutput: 'DONE', description: 'Completes successfully' }
  ],
  hints: [
    'Define fields like: title: str, year: int, genre: str, rating: float',
    'Full model: class MovieInfo(BaseModel):\n title: str\n year: int\n genre: str\n rating: float'
  ],
  solution: 'from pydantic import BaseModel\nfrom openai import OpenAI\n\nclient = OpenAI()\n\nclass MovieInfo(BaseModel):\n title: str\n year: int\n genre: str\n rating: float\n\nresponse = client.chat.completions.parse(\n model="gpt-4o-mini",\n messages=[\n {"role": "user", "content": "The Dark Knight (2008) is a superhero thriller rated 9.0 on IMDb."}\n ],\n response_format=MovieInfo,\n)\n\nmovie = response.choices[0].message.parsed\nprint(f"{movie.title} ({movie.year})")\nprint("DONE")',
  solutionExplanation: 'We define a Pydantic model with four typed fields. The .parse() method locks the reply to this shape, giving us a typed Python object with clean dot-notation access.',
  xpReward: 15,
}
```
How Do You Work with Images?
The OpenAI API handles images in two ways: Vision (reading images you already have) and DALL-E (making new ones from words).
Vision — Read and Understand Images
GPT-4o can look at a picture and answer questions about it. You pass the image link (or base64 bytes) right next to your text prompt.
Let me show you. We ask GPT-4o to describe a photo. The message content becomes a list that holds both text and image items.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image? Be brief."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=100
)

print(response.choices[0].message.content)
```
Output:
```text
I see an orange tabby cat sitting upright on a wooden surface. The cat has bright green eyes and distinctive striped markings across its fur.
```
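For a local file instead of a URL, the same `image_url` slot accepts a base64 data URL. A sketch of building that payload without calling the API (the helper name and the example file are ours):

```python
import base64

def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a user message pairing text with a base64-encoded local image."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# With a real file you would read it first:
# with open("photo.jpg", "rb") as f:
#     msg = image_message("Describe this image.", f.read())
msg = image_message("Describe this image.", b"\xff\xd8fake-jpeg-bytes")
print(msg["content"][1]["image_url"]["url"][:30])
```

The resulting dict drops straight into the `messages` list of the vision call above.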
Image Creation with DALL-E
DALL-E draws images from text prompts. You pick a description, a size, and how many images you want.
```python
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A cozy home office with a standing desk, dual monitors showing Python code, and a cat sleeping on the keyboard",
    size="1024x1024",
    n=1
)

image_url = response.data[0].url
print(f"Image URL: {image_url}")
```
Output:
```text
Image URL: https://oaidalleapiprodscus.blob.core.windows.net/private/...
```
The link runs out after about an hour. Save the image if you want to keep it.
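A sketch of grabbing the file before the URL dies, using only the standard library (the output filename is an example):

```python
import urllib.request

def save_image(url: str, path: str) -> str:
    """Download a generated image to disk before its temporary URL expires."""
    urllib.request.urlretrieve(url, path)
    return path

# Usage, with image_url from the DALL-E call above:
# save_image(image_url, "office.png")
```

In production you would also want a timeout and error handling around the download, since the link may already have expired.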
Note: DALL-E 3 costs $0.040 per image at 1024×1024. To save money while testing, use `dall-e-2` at $0.018 per image and 512×512 size.
How Do You Create Embeddings?
Embeddings turn text into number arrays — vectors — that capture meaning. Sentences with close meanings end up with close vectors. This is the engine behind semantic search, recs, and grouping.
Here is how it works. We create vectors for three sentences, then check which ones are close. The text-embedding-3-small model gives us 1536 numbers per sentence.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Create embeddings for three sentences
emb1 = get_embedding("The cat sat on the mat")
emb2 = get_embedding("A kitten rested on the rug")
emb3 = get_embedding("Stock prices rose sharply today")

# Compute cosine similarity
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Cat vs Kitten: {cosine_similarity(emb1, emb2):.4f}")
print(f"Cat vs Stocks: {cosine_similarity(emb1, emb3):.4f}")
```

Output:
```text
Cat vs Kitten: 0.8734
Cat vs Stocks: 0.1205
```
The cat and kitten lines score 0.87 — very close. The cat and stock lines score 0.12 — barely linked. Semantic search works the same way: embed the query, embed every document, then grab the nearest matches.
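The ranking step itself needs no API at all. Given vectors you have already embedded, search is just "sort by cosine similarity". A sketch with toy 3-number vectors standing in for real 1536-dimension embeddings (the document names and vectors are made up):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, docs, top_k=2):
    """Rank documents by cosine similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)[:top_k]

# Toy vectors; in practice each would come from get_embedding()
docs = {
    "cat care tips":  [0.9, 0.1, 0.0],
    "kitten feeding": [0.7, 0.3, 0.1],
    "stock analysis": [0.0, 0.1, 0.9],
}
for score, name in search([0.85, 0.15, 0.05], docs):
    print(f"{score:.4f}  {name}")
```

For more than a few thousand documents you would swap this linear scan for a vector database, but the idea is identical.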
Tip: Start with `text-embedding-3-small`. It is the cheapest option and works well for search and grouping. Only move to `text-embedding-3-large` if you need peak accuracy.
How Do You Transcribe Audio with Whisper?
The Whisper API turns speech into text. It reads mp3, wav, webm, mp4, and more. The file size cap is 25MB.
Here is the basic call. You open the audio file and hand it to the endpoint.
```python
from openai import OpenAI

client = OpenAI()

with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)
```
Output:
```text
Welcome to today's team standup. Let's start with updates from the backend team...
```
Need to go from another language to English? Use the translate endpoint instead:
```python
with open("spanish_podcast.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print(translation.text)  # Output is in English regardless of source language
```
How Do You Handle Errors and Retries?
API calls break. Networks drop. Rate caps kick in. Your code needs a safety net.
The SDK throws typed errors for each problem. Here are the main ones you will see.
| Error | Cause | What to Do |
|---|---|---|
| `AuthenticationError` | Bad API key | Check `OPENAI_API_KEY` |
| `RateLimitError` | Too many calls per minute | Wait, then retry |
| `APIError` | Server hiccup | Retry after a pause |
| `APIConnectionError` | Network issue | Check your connection |
| `BadRequestError` | Wrong parameters | Fix the call (model name, message shape) |
Below is a handy wrapper that retries on its own when the error is short-lived. For lasting errors (bad key, bad request), it lets them through.
```python
from openai import OpenAI, RateLimitError, APIError, APIConnectionError
import time

client = OpenAI()

def call_with_retry(messages, model="gpt-4o-mini", max_retries=3):
    """Make an API call with automatic retry on transient errors."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
        except (APIError, APIConnectionError) as e:
            wait_time = 2 ** attempt
            print(f"API error: {e}. Retrying in {wait_time}s...")
            time.sleep(wait_time)
    raise Exception(f"Failed after {max_retries} retries")

# Usage
result = call_with_retry([{"role": "user", "content": "Hello!"}])
print(result)
```
Output:
```text
Hello! How can I help you today?
```
Warning: Never retry `AuthenticationError` or `BadRequestError`. These are lasting failures — your key is wrong or your request is broken. Trying again will not help.
How Do You Manage Costs and Tokens?
Every API call costs money based on the tokens it uses. A token is about 4 characters or 0.75 words. Knowing this helps you set budgets and stay inside context caps.
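That rule of thumb is easy to encode. `tiktoken`, shown below, gives exact counts, but a rough estimate is often enough for budgeting (the helper name is ours):

```python
def rough_token_estimate(text: str) -> int:
    """Rule of thumb: about 1 token per 4 characters of English text."""
    return max(1, len(text) // 4)

sentence = "The quick brown fox jumps over the lazy dog."
print(rough_token_estimate(sentence))  # 11 (close to tiktoken's exact count of 10)
```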
Here are the going rates for the main models (per 1 million tokens):
| Model | Input Price | Output Price |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| o1 | $15.00 | $60.00 |
| o3-mini | $1.10 | $4.40 |
| GPT-4.1 | $2.00 | $8.00 |
| GPT-4.1-mini | $0.40 | $1.60 |
To count tokens before you send a call, use the tiktoken library. This lets you gauge costs and trim messages to fit.
```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example
text = "The quick brown fox jumps over the lazy dog."
tokens = count_tokens(text)
print(f"Text: '{text}'")
print(f"Tokens: {tokens}")
print(f"Estimated cost (GPT-4o-mini input): ${tokens * 0.15 / 1_000_000:.8f}")
```
Output:
```text
Text: 'The quick brown fox jumps over the lazy dog.'
Tokens: 10
Estimated cost (GPT-4o-mini input): $0.00000150
```
Ways to Cut Your API Bill
Here are down-to-earth tips that work:
- Use GPT-4o-mini for 90% of tasks. It handles most work well and costs 17x less than GPT-4o.
- Count tokens before you send. If the prompt is too long, drop the oldest chat turns.
- Cache answers. If users ask the same thing twice, serve the saved reply.
- Set `max_tokens`. This stops runaway-long (and runaway-costly) replies.
- Stream with `stream_options={"include_usage": True}` to see token counts in real time.
Key Insight: At $0.15 per million input tokens, a typical GPT-4o-mini call costs under $0.001. For hobby work and small apps, the bill is tiny. Build first, trim costs later.
typescript
{
type: 'exercise',
id: 'tokens-ex1',
title: 'Exercise 3: Token-Aware Message Trimmer',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Write a function called `trim_messages` that takes a list of messages and a max_tokens limit. It should remove the OLDEST user/assistant messages (keep the system message) until the total token count is under the limit. Use tiktoken to count tokens. Print the number of messages before and after trimming.',
starterCode: 'import tiktoken\n\ndef count_message_tokens(messages: list, model: str = "gpt-4o-mini") -> int:\n """Count total tokens across all messages."""\n encoding = tiktoken.encoding_for_model(model)\n total = 0\n for msg in messages:\n total += len(encoding.encode(msg["content"])) + 4 # 4 tokens overhead per message\n return total\n\ndef trim_messages(messages: list, max_tokens: int = 100) -> list:\n """Remove oldest non-system messages until under max_tokens."""\n # YOUR CODE HERE\n pass\n\n# Test it\nmessages = [\n {"role": "system", "content": "You are helpful."},\n {"role": "user", "content": "Tell me about Python."},\n {"role": "assistant", "content": "Python is a popular programming language known for its readability."},\n {"role": "user", "content": "What about Java?"},\n {"role": "assistant", "content": "Java is a statically typed language widely used in enterprise applications."},\n {"role": "user", "content": "Compare them."},\n]\n\nbefore = len(messages)\ntrimmed = trim_messages(messages, max_tokens=60)\nafter = len(trimmed)\nprint(f"Before: {before}, After: {after}")\nprint("DONE")',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'DONE', description: 'Function runs successfully' }
],
hints: [
'Separate system messages from the rest. Loop and remove from the start of non-system messages while total tokens exceed max_tokens.',
'system_msgs = [m for m in messages if m["role"] == "system"]\nother_msgs = [m for m in messages if m["role"] != "system"]\nwhile count_message_tokens(system_msgs + other_msgs) > max_tokens and other_msgs:\n other_msgs.pop(0)\nreturn system_msgs + other_msgs'
],
solution: 'import tiktoken\n\ndef count_message_tokens(messages, model="gpt-4o-mini"):\n encoding = tiktoken.encoding_for_model(model)\n total = 0\n for msg in messages:\n total += len(encoding.encode(msg["content"])) + 4\n return total\n\ndef trim_messages(messages, max_tokens=100):\n system_msgs = [m for m in messages if m["role"] == "system"]\n other_msgs = [m for m in messages if m["role"] != "system"]\n while count_message_tokens(system_msgs + other_msgs) > max_tokens and other_msgs:\n other_msgs.pop(0)\n return system_msgs + other_msgs\n\nmessages = [\n {"role": "system", "content": "You are helpful."},\n {"role": "user", "content": "Tell me about Python."},\n {"role": "assistant", "content": "Python is a popular programming language known for its readability."},\n {"role": "user", "content": "What about Java?"},\n {"role": "assistant", "content": "Java is a statically typed language widely used in enterprise applications."},\n {"role": "user", "content": "Compare them."},\n]\n\nbefore = len(messages)\ntrimmed = trim_messages(messages, max_tokens=60)\nafter = len(trimmed)\nprint(f"Before: {before}, After: {after}")\nprint("DONE")',
solutionExplanation: 'We split system messages (always kept) from chat messages. Then we pop the oldest chat messages one at a time until tokens drop below the cap. This keeps the freshest context while staying on budget.',
xpReward: 20,
}
How Do You Use Async for Many API Calls?
When you need to fire off dozens of calls (say, summarizing 50 docs), doing them one by one is slow. Each call waits for the last one to finish.
The AsyncOpenAI client lets you run many calls side by side. This is a must for web servers (FastAPI, Django) and batch jobs.
Here is what it looks like. We summarize 5 texts at the same time instead of one after another.
python
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI()
async def summarize(text: str) -> str:
"""Summarize a text using the async client."""
response = await async_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Summarize in one sentence."},
{"role": "user", "content": text}
]
)
return response.choices[0].message.content
async def main():
texts = [
"Python was created by Guido van Rossum and first released in 1991.",
"JavaScript is the language of the web browser.",
"Rust focuses on memory safety without garbage collection.",
"Go was designed at Google for concurrent systems programming.",
"TypeScript adds static types to JavaScript.",
]
# Process all 5 in parallel
summaries = await asyncio.gather(*[summarize(t) for t in texts])
for text, summary in zip(texts, summaries):
print(f"Original: {text[:50]}...")
print(f"Summary: {summary}\n")
asyncio.run(main())
Output:
python
Original: Python was created by Guido van Rossum and first ...
Summary: Python is a programming language created by Guido van Rossum, first released in 1991.
Original: JavaScript is the language of the web browser....
Summary: JavaScript is the primary programming language used in web browsers.
Original: Rust focuses on memory safety without garbage col...
Summary: Rust is a programming language that ensures memory safety without relying on garbage collection.
Original: Go was designed at Google for concurrent systems p...
Summary: Go is a Google-designed language built for concurrent systems programming.
Original: TypeScript adds static types to JavaScript....
Summary: TypeScript is a superset of JavaScript that introduces static typing.
All 5 calls run at once. A job that takes 10 seconds in sequence finishes in about 2.
Tip: Use `asyncio.gather()` for batch work, but mind the rate limit. If you blast 100 calls at once, you will hit the cap. Use `asyncio.Semaphore` to allow, say, 10 at a time.
Let’s Build a Mini Project — AI Research Helper
Time to bring it all together. We will build a research helper that takes a topic, uses function calling to look up facts, and hands back a tidy report via Structured Outputs.
This mini project taps three skills we covered: chat completions, function calling, and structured outputs.
First, we set up the tools and the report schema.
python
from openai import OpenAI
from pydantic import BaseModel
import json
client = OpenAI()
# Output schema — what our report looks like
class ResearchReport(BaseModel):
topic: str
summary: str
key_findings: list[str]
sources_used: list[str]
difficulty_level: str # "beginner", "intermediate", "advanced"
# Tool definition — our "search" function
tools = [
{
"type": "function",
"function": {
"name": "search_knowledge_base",
"description": "Search for information about a topic in the knowledge base",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
}
},
"required": ["query"]
}
}
}
]
Next, the search function and the main loop.
python
def search_knowledge_base(query: str) -> str:
"""Simulated knowledge base search."""
knowledge = {
"neural networks": "Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes. Key types: CNNs for images, RNNs for sequences, Transformers for attention-based processing.",
"transformers": "Transformers use self-attention to process sequences in parallel. Introduced in 'Attention Is All You Need' (2017). Foundation of GPT, BERT, and all modern LLMs.",
"attention mechanism": "Attention lets the model focus on relevant parts of the input. Self-attention computes relationships between all positions in a sequence. Multi-head attention runs multiple attention operations in parallel.",
}
for key, value in knowledge.items():
if key in query.lower():
return json.dumps({"results": [value], "source": f"knowledge_base/{key}"})
return json.dumps({"results": ["No specific information found."], "source": "knowledge_base/general"})
def run_research(topic: str) -> ResearchReport:
"""Run the research assistant pipeline."""
messages = [
{"role": "system", "content": "You are a research assistant. Search the knowledge base to gather information, then produce a structured report."},
{"role": "user", "content": f"Research this topic: {topic}"}
]
# Let the model search (up to 3 searches)
for _ in range(3):
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
tools=tools
)
if response.choices[0].finish_reason == "stop":
break # Model is done searching
if response.choices[0].message.tool_calls:
messages.append(response.choices[0].message)
for tool_call in response.choices[0].message.tool_calls:
args = json.loads(tool_call.function.arguments)
result = search_knowledge_base(**args)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": result
})
# Now get the structured report
report_response = client.chat.completions.parse(
model="gpt-4o-mini",
messages=messages + [
{"role": "user", "content": "Now produce the structured research report based on what you found."}
],
response_format=ResearchReport,
)
return report_response.choices[0].message.parsed
Let us run it and see the output.
python
report = run_research("How do transformers and attention mechanisms work?")
print(f"Topic: {report.topic}")
print(f"Difficulty: {report.difficulty_level}")
print(f"\nSummary: {report.summary}")
print(f"\nKey Findings:")
for i, finding in enumerate(report.key_findings, 1):
print(f" {i}. {finding}")
print(f"\nSources: {report.sources_used}")
Output:
python
Topic: Transformers and Attention Mechanisms
Difficulty: intermediate
Summary: Transformers are a neural network architecture that uses self-attention mechanisms to process sequences in parallel, enabling powerful language models like GPT and BERT.
Key Findings:
1. Transformers process sequences in parallel using self-attention, unlike RNNs which process sequentially.
2. Self-attention computes relationships between all positions in a sequence simultaneously.
3. Multi-head attention runs multiple attention operations in parallel for richer representations.
4. Introduced in the 2017 paper 'Attention Is All You Need', transformers are the foundation of all modern LLMs.
Sources: ['knowledge_base/transformers', 'knowledge_base/attention mechanism']
This tiny project shows the real power of mixing API features. In a live app, you would swap the fake search for a real web lookup or database query.
Common Mistakes and How to Fix Them
Mistake 1: Pasting API Keys into Source Code
❌ Wrong:
python
client = OpenAI(api_key="sk-abc123...") # Key exposed in source code!
Why it breaks: Push this to GitHub and bots find the key in minutes. You wake up to a big bill.
✅ Right:
python
from openai import OpenAI
client = OpenAI() # Reads OPENAI_API_KEY from environment automatically
Mistake 2: Ignoring Rate Limits
❌ Wrong:
python
# Fires 100 requests with no throttling
for item in items:
response = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
Why it breaks: After a few dozen calls you hit the cap. The API sends back a 429 error and your script dies.
✅ Right:
python
import time
from openai import RateLimitError
for item in items:
    for attempt in range(5):
        try:
            response = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
            break  # Success: move on to the next item
        except RateLimitError:
            time.sleep(2 ** attempt)  # Exponential backoff, then retry
Mistake 3: Letting Chat History Grow Forever
❌ Wrong:
python
# History grows forever — eventually exceeds context window
messages.append({"role": "user", "content": user_input})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
Why it breaks: After enough turns, the list blows past the 128K token window. The API throws an error — or quietly chops your input.
✅ Right:
python
# Trim old messages to stay within budget
if count_message_tokens(messages) > 100_000:
    messages = [messages[0]] + messages[-10:]  # Keep system message + last 10 turns
Mistake 4: Using Old Models and Endpoints
❌ Wrong:
python
# Old completions API — deprecated
response = openai.Completion.create(model="text-davinci-003", prompt="Hello")
Why it breaks: The legacy Completions API is deprecated, and text-davinci-003 was shut down in early 2024. Code that depends on them fails with a model-not-found error.
✅ Right:
python
# Modern SDK with Chat Completions or Responses API
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}]
)
Mistake 5: Parsing JSON by Hand When You Could Use Structured Outputs
❌ Wrong:
python
import json
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Return a JSON with name and age"}]
)
data = json.loads(response.choices[0].message.content) # Fragile! May fail
Why it breaks: The model might wrap JSON in backticks, add stray text, or ship bad JSON. Your json.loads() call crashes.
✅ Right:
python
from pydantic import BaseModel
class Person(BaseModel):
name: str
age: int
response = client.chat.completions.parse(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Return info for John, age 30"}],
response_format=Person
)
person = response.choices[0].message.parsed # Guaranteed to match schema
Frequently Asked Questions
How much does the OpenAI API cost?
It depends on the model and the tokens you use. GPT-4o-mini runs $0.15 per 1M input tokens and $0.60 per 1M output tokens. A normal chat message (50 tokens in, 100 out) costs roughly $0.00007. Most builders spend under $5 a month while developing. See openai.com/pricing for live rates.
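You can verify that figure with quick arithmetic:

```python
# GPT-4o-mini prices: $0.15 per 1M input tokens, $0.60 per 1M output tokens
input_tokens, output_tokens = 50, 100
cost = input_tokens * 0.15 / 1_000_000 + output_tokens * 0.60 / 1_000_000
print(f"${cost:.7f}")  # $0.0000675, about $0.00007 per call
```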
Can I use the OpenAI API for free?
New accounts get a small credit (often $5–$18, varies by region). After that, you add a card. There is no lasting free tier, but costs are low for small projects.
Is the API the same as ChatGPT?
No. ChatGPT is a chat app built on top of the API. The API gives you raw access to the same models — plus full control over prompts, settings, and output shape. You can build your own ChatGPT or something completely different.
How do I pick between GPT-4o and GPT-4o-mini?
Default to GPT-4o-mini. It nails most tasks and costs 17x less. Move to GPT-4o when quality slips — tricky reasoning, subtle writing, or spots where accuracy matters most. For heavy math, look at o1 or o3-mini.
Can I fine-tune OpenAI models?
Yes. Fine-tuning works on GPT-4o-mini, GPT-4o, and GPT-3.5-turbo. You feed in training pairs (input/output in JSONL) and get a custom version of the model. Reach for it when prompt tweaks alone do not hit your quality bar.
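A single training example in that JSONL format looks like the sketch below. The file name `train.jsonl` and the pirate persona are just for illustration; a real fine-tuning job wants at least a few dozen such lines, uploaded via the Files API.

```python
import json

# Each line of the JSONL file is one chat-formatted training example
example = {
    "messages": [
        {"role": "system", "content": "You answer in pirate speak."},
        {"role": "user", "content": "Where is the treasure?"},
        {"role": "assistant", "content": "Arr, 'tis buried on the north beach!"},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

print(sum(1 for _ in open("train.jsonl")))  # one example, one line
```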
What sets Chat Completions apart from the Responses API?
Chat Completions uses a messages list. The Responses API (March 2025) is leaner — it takes instructions + input and bundles tools like web search and file search. OpenAI says both will stay around, but the Responses API is the pick for new work.
References
- OpenAI API Documentation — Quickstart Guide. Link
- OpenAI API Reference — Chat Completions. Link
- OpenAI — Introducing Structured Outputs in the API (August 2024). Link
- OpenAI — Responses API Migration Guide. Link
- OpenAI — Function Calling Guide. Link
- OpenAI Python SDK — GitHub Repository. Link
- OpenAI — Model Pricing. Link
- OpenAI for Developers in 2025 — Blog Post. Link
- tiktoken — OpenAI’s Token Counting Library. Link
- Pydantic Documentation — Data Validation for Python. Link