OpenAI API Python Tutorial – A Complete Crash Course
Learn the OpenAI API with Python step by step. Covers chat completions, streaming, function calling, structured outputs, vision, embeddings, and a hands-on mini project.
The OpenAI API lets you add GPT’s language skills to any Python app — this guide walks you through it, from your first API call to a working project.
Want to build a chatbot? A code helper? A tool that reads images or turns speech to text? The OpenAI API makes all of this possible with just a few lines of Python. But the docs can feel dense if you are new. This crash course fixes that. We start from scratch and build up to a real project — step by step.
What Is the OpenAI API?
The OpenAI API is a web service. It lets your Python code talk to GPT models. You send a message, and the model sends back a reply — as clean, typed data.
Think of it as a remote function. You put a prompt in. You get text out. The magic happens on OpenAI’s servers, powered by large language models.
Here is what you can build with it:
- Chat and text — chatbots, writing tools, translators
- Vision — read and describe images
- Image creation — make images from text (DALL-E)
- Speech to text — turn audio into words (Whisper)
- Embeddings — change text into number arrays for search
- Function calling — let the model run your own Python code
- Structured outputs — get replies in the exact JSON shape you need
Several models are on offer. Each one trades off speed, quality, and price in a different way.
| Model | Best For | Context Window | Relative Cost |
|---|---|---|---|
| GPT-4o | High-quality text, vision, audio | 128K tokens | Medium |
| GPT-4o-mini | Fast, cheap daily tasks | 128K tokens | Low |
| o1 | Hard reasoning, math, code | 200K tokens | High |
| o3-mini | Reasoning at lower cost | 200K tokens | Medium |
| GPT-4.1 | Coding, following instructions | 1M tokens | Medium |
| GPT-4.1-mini | Quick coding tasks | 1M tokens | Low |
Key Insight: The OpenAI API is not ChatGPT. ChatGPT is a product built ON TOP of this API. When you use the API, you get full control — the model, the settings, and the output format. ChatGPT hides all of that.
How Do You Set Up the OpenAI Python SDK?
You need three things: the library, an API key, and a way to store the key safely. Let me walk you through each one.
Step 1: Install the SDK.
```bash
pip install openai
```
Step 2: Get your API key.
Head to platform.openai.com/api-keys. Click “Create new secret key.” Copy it right away — you will not see it again.
Step 3: Save the key as a system variable.
The SDK picks up the OPENAI_API_KEY variable on its own. Set it in your shell:
```bash
export OPENAI_API_KEY="sk-your-key-here"
```
For a lasting setup, put it in a .env file in your project root:
```text
OPENAI_API_KEY=sk-your-key-here
```
Then load it in Python:
```python
from dotenv import load_dotenv

load_dotenv()  # Reads .env file and sets environment variables
```
Warning: Never paste your API key into a Python file. Bots scrape GitHub for keys and can rack up charges in minutes. Always use a `.env` file and add it to `.gitignore`.
Step 4: Check that it works. A quick test call does the trick.
```python
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from environment
print(client.models.list().data[0].id)
```
Output:
```text
gpt-4o-2024-08-06
```
If a model ID shows up, you are all set. If you see an AuthenticationError, check your key again.
How Do You Make Your First API Call?
The Chat Completions API is the classic way to talk to GPT. You send a list of messages, each with a role and some content.
Here is how to think about it. You are writing a script for a play. The system role sets the scene (“You are a helpful tutor”). The user role is you. The assistant role is the model’s lines.
Let me show you the simplest call. The code below sends one user message, and we print the reply. Watch how we pass the model name and a list of messages.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is an API in one sentence?"}
    ]
)

print(response.choices[0].message.content)
```

Output:
```text
An API (Application Programming Interface) is a set of rules that lets different software programs communicate with each other.
```
Three lines of setup. One API call. GPT is now part of your Python script.
Understanding the Response Object
What comes back is not just a string. It is a rich object full of useful info. Let me show you.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is an API in one sentence?"}
    ]
)

print(f"Response ID: {response.id}")
print(f"Model used: {response.model}")
print(f"Finish reason: {response.choices[0].finish_reason}")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
```

Output:
```text
Response ID: chatcmpl-abc123xyz
Model used: gpt-4o-mini-2024-07-18
Finish reason: stop
Prompt tokens: 25
Completion tokens: 22
Total tokens: 47
```
The usage field tells you how many tokens the call used. That number maps right to your bill. The finish_reason tells you why the model stopped — stop means it was done, length means it ran out of space.
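Because billing maps directly to those token counts, you can turn any `usage` object into a dollar estimate. Here is a minimal sketch; the per-million-token prices default to the GPT-4o-mini rates from the pricing table later in this guide, and the helper name is our own:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_price: float = 0.15, output_price: float = 0.60) -> float:
    """Estimate the cost of one call. Prices are USD per 1M tokens."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000

# The example call above used 25 prompt tokens and 22 completion tokens
print(f"${estimate_cost_usd(25, 22):.8f}")  # $0.00001695
```

In real code you would feed it `response.usage.prompt_tokens` and `response.usage.completion_tokens` directly.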
Tip: Pick `gpt-4o-mini` while you build and test. It costs about 17x less than `gpt-4o` and runs fast. Switch to a bigger model only when you need better quality in production.
Key Parameters You Should Know
The create() method takes several options that shape the output. Here are the ones you will reach for most.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Give me 3 Python project ideas."}
    ],
    temperature=0.7,        # Creativity: 0 = deterministic, 2 = very random
    max_tokens=200,         # Maximum length of the response
    top_p=1.0,              # Nucleus sampling (usually leave at 1.0)
    frequency_penalty=0.0,  # Penalize repeated tokens
    presence_penalty=0.0,   # Penalize tokens already in the conversation
)

print(response.choices[0].message.content)
```

Output:
```text
1. **Personal Finance Tracker** — Build a CLI app that logs expenses, categorizes them, and shows monthly summaries using pandas.
2. **Web Scraper Dashboard** — Create a scraper that collects job listings from multiple sites and displays them in a Streamlit dashboard.
3. **AI Flashcard Generator** — Feed lecture notes to GPT and auto-generate study flashcards with questions and answers.
```
The star of the show is temperature. Set it to 0 for factual work (pulling data, sorting things into groups). Set it to 0.7–1.0 for creative work (writing, brainstorming, story ideas).
What Is the Responses API and Should You Use It?
In March 2025, OpenAI shipped the Responses API — a cleaner, more capable cousin of Chat Completions. It is now the go-to API for all new projects.
What changed? The Responses API takes instructions in place of system messages, and input in place of the messages list. It also packs in tools like web search and file search right out of the box.
Let me show the same “What is an API?” call using this newer API. Notice how much leaner the code looks.
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are a helpful assistant.",
    input="What is an API in one sentence?"
)

print(response.output_text)
```

Output:
```text
An API is a set of rules and protocols that allows different software applications to communicate and exchange data with each other.
```
Same answer. Less code. You just pass instructions (the system prompt) and input (the user message). No need to build a messages list.
Chat Completions vs Responses API — Which Should You Use?
Let me lay it out in a table so you can choose fast.
| Feature | Chat Completions | Responses API |
|---|---|---|
| Status | Kept alive for good | Advised for new projects |
| Input style | messages list with roles | instructions + input |
| Built-in tools | None | Web search, file search, code runner |
| Multi-turn state | You manage it (append messages) | Auto (previous_response_id) |
| Structured outputs | response_format | text.format |
| Cache savings | Standard | 40-80% better |
| Function calling | Yes | Yes (strict by default) |

Key Insight: Pick the Responses API for new work. Stick with Chat Completions if your code already runs fine. The Responses API is simpler, costs less (thanks to better caching), and has built-in tools. But Chat Completions is not going away — OpenAI promises to keep it.
This guide shows both, so you can use either one. Most examples use Chat Completions (it still has the widest docs), but we show the Responses API version for key features.
How Does Streaming Work?
Normally, the API waits until the full reply is ready, then sends it all at once. With streaming, you get words as they form — token by token. This makes your app feel much snappier.
Streaming shines in chatbots and UIs. It gives users that “text appearing live” feel — just like ChatGPT in the browser.
To turn it on, set stream=True. Then loop through the pieces. Each piece holds a small chunk (called a delta) of the reply.
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain gradient descent in 3 sentences."}
    ],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
print()  # Newline at the end
```
Output:
```text
Gradient descent is an optimization algorithm that finds the minimum of a function by taking small steps in the direction of steepest decrease. At each step, it calculates the gradient (slope) of the loss function and moves the parameters in the opposite direction. The step size is controlled by the learning rate — too large and it overshoots, too small and it converges slowly.
```
Words pop up one by one in your terminal — the same effect you see in ChatGPT.
With the Responses API, streaming is almost the same:
```python
stream = client.responses.create(
    model="gpt-4o-mini",
    input="Explain gradient descent in 3 sentences.",
    stream=True
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()
```
Tip: Always stream in user-facing apps. The first word shows up much sooner than if you wait for the full reply. Users see streaming as faster, even when the total time is the same.
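In practice you usually want both behaviors at once: print deltas as they arrive for the user, and keep the full reply for your history list. A small sketch of that pattern, demonstrated here with stand-in chunk objects shaped like the SDK's so it runs without an API key (the helper name is ours):

```python
from types import SimpleNamespace

def stream_and_collect(stream) -> str:
    """Print each delta as it arrives and return the full reply."""
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end="", flush=True)
            parts.append(content)
    print()
    return "".join(parts)

# Stand-in chunks; the real stream would come from
# client.chat.completions.create(..., stream=True)
def fake_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

fake_stream = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk("world!"), fake_chunk(None)]
full = stream_and_collect(fake_stream)
print(repr(full))  # 'Hello, world!'
```

The returned string is what you would append to `messages` as the assistant turn.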
How Do You Build Multi-Turn Conversations?
Each API call is a blank slate. The model has no memory of what came before. To hold a real conversation, you must send the full chat history with every call.
With Chat Completions, you keep a messages list and grow it after each round.
```python
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a Python tutor. Keep answers short."}
]

# Turn 1
messages.append({"role": "user", "content": "What is a decorator?"})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
assistant_msg = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_msg})
print(f"AI: {assistant_msg}\n")

# Turn 2 — the model remembers Turn 1
messages.append({"role": "user", "content": "Show me an example."})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
assistant_msg = response.choices[0].message.content
print(f"AI: {assistant_msg}")
```
Output:
```text
AI: A decorator is a function that takes another function as input, adds some behavior to it, and returns a modified version — without changing the original function's code.

AI: Here's a simple example:

def log_call(func):
    def wrapper(*args):
        print(f"Calling {func.__name__}")
        return func(*args)
    return wrapper

@log_call
def greet(name):
    return f"Hello, {name}!"

print(greet("Alice"))
# Output: Calling greet
# Hello, Alice!
```
See how Turn 2 builds on Turn 1? That works because we sent the whole chat history both times.
With the Responses API, multi-turn is easier. Just pass the previous_response_id and the API tracks the state for you.
python
# Turn 1
response1 = client.responses.create(
model="gpt-4o-mini",
instructions="You are a Python tutor. Keep answers short.",
input="What is a decorator?"
)
print(f"AI: {response1.output_text}\n")
# Turn 2 — pass previous_response_id
response2 = client.responses.create(
model="gpt-4o-mini",
previous_response_id=response1.id,
input="Show me an example."
)
print(f"AI: {response2.output_text}")
No message juggling. The API holds the state on its end.
Warning: Keep an eye on your token count in long chats. Every round adds tokens. A 128K context window sounds huge, but 50 back-and-forth turns can fill it up. Trim old messages when you get close to the cap.
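A blunt but effective trim is to keep the system message plus only the most recent turns. A minimal sketch (the function name is ours; trimming by exact token count is more precise, but this keeps you under the cap in practice):

```python
def keep_recent_turns(messages: list, max_pairs: int = 10) -> list:
    """Keep the system message plus the most recent user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-(max_pairs * 2):]

# Simulate a long conversation
history = [{"role": "system", "content": "Be brief."}]
for i in range(30):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = keep_recent_turns(history, max_pairs=10)
print(len(history), "->", len(trimmed))  # 61 -> 21
```

You would call this right before each API request so the request never grows past your chosen budget.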
```typescript
{
  type: 'exercise',
  id: 'chatbot-ex1',
  title: 'Exercise 1: Build a Simple Q&A Bot',
  difficulty: 'beginner',
  exerciseType: 'write',
  instructions: 'Complete the function below that takes a user question and a conversation history, calls the OpenAI API, and returns the assistant response as a string. The function should append the user message, make the API call, and return the content.',
  starterCode: 'from openai import OpenAI\n\nclient = OpenAI()\n\ndef ask_bot(question: str, history: list) -> str:\n """Send a question to GPT and return the response."""\n # Step 1: Append the user message to history\n history.append({"role": "user", "content": question})\n \n # Step 2: Call the API (use gpt-4o-mini)\n response = # YOUR CODE HERE\n \n # Step 3: Extract and return the assistant message\n answer = # YOUR CODE HERE\n \n # Step 4: Append assistant message to history\n history.append({"role": "assistant", "content": answer})\n \n return answer\n\n# Test it\nhistory = [{"role": "system", "content": "You answer in one sentence."}]\nprint(ask_bot("What is Python?", history))\nprint("DONE")',
  testCases: [
    { id: 'tc1', input: '', expectedOutput: 'DONE', description: 'Function runs and prints DONE' }
  ],
  hints: [
    'Use client.chat.completions.create(model="gpt-4o-mini", messages=history)',
    'Extract the answer with: response.choices[0].message.content'
  ],
  solution: 'from openai import OpenAI\n\nclient = OpenAI()\n\ndef ask_bot(question: str, history: list) -> str:\n history.append({"role": "user", "content": question})\n response = client.chat.completions.create(model="gpt-4o-mini", messages=history)\n answer = response.choices[0].message.content\n history.append({"role": "assistant", "content": answer})\n return answer\n\nhistory = [{"role": "system", "content": "You answer in one sentence."}]\nprint(ask_bot("What is Python?", history))\nprint("DONE")',
  solutionExplanation: 'The function adds the user message to the history list, calls the Chat Completions API with the full list, pulls out the reply, adds it back for future turns, and returns it.',
  xpReward: 15,
}
```
What Is Function Calling and How Do You Use It?
Function calling gives the model the power to reach for YOUR Python code. Instead of only writing text, it can say: “I need to call get_weather with city='London'.”
But here is the key: the model does not run anything. You describe your functions with a JSON schema. The model picks which one to call and fills in the arguments. Then YOU run it on your side and feed the result back.
Here is the full flow in plain English:
- You list your tools (functions) along with their input shapes
- You send a user message plus those tool specs
- The model decides if it needs a tool
- If so, it hands back the function name and arguments
- You run the function on your machine
- You pass the result back to the model
- The model writes a final, human-friendly reply
Let me show you a hands-on example: a weather lookup. We write a get_weather function and let the model pick when to use it.
```python
from openai import OpenAI
import json

client = OpenAI()

# Step 1: Define the function schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name, e.g. London"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

# Step 2: Our actual function (in real apps, this would call a weather API)
def get_weather(city: str, unit: str = "celsius") -> str:
    # Simulated weather data
    weather_data = {
        "London": {"temp": 15, "condition": "Cloudy"},
        "Tokyo": {"temp": 22, "condition": "Sunny"},
        "New York": {"temp": 18, "condition": "Partly cloudy"},
    }
    data = weather_data.get(city, {"temp": 20, "condition": "Unknown"})
    return json.dumps({"city": city, "temperature": data["temp"], "unit": unit, "condition": data["condition"]})
```
Now we send a message and let the model choose the tool on its own.
```python
# Step 3: Send message with tools
messages = [{"role": "user", "content": "What's the weather like in Tokyo?"}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools
)

# Step 4: Check if the model wants to call a function
tool_call = response.choices[0].message.tool_calls[0]
print(f"Model wants to call: {tool_call.function.name}")
print(f"With arguments: {tool_call.function.arguments}")
```
Output:
```text
Model wants to call: get_weather
With arguments: {"city": "Tokyo", "unit": "celsius"}
```
The model chose to call get_weather with city="Tokyo" — all by itself. Now we run the function and send the output back.
```python
# Step 5: Execute the function
args = json.loads(tool_call.function.arguments)
result = get_weather(**args)

# Step 6: Send the result back to the model
messages.append(response.choices[0].message)  # Add the assistant's tool call
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": result
})

# Step 7: Get the final response
final_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools
)
print(final_response.choices[0].message.content)
```
Output:
```text
The weather in Tokyo is currently sunny with a temperature of 22°C.
```
The model took raw JSON from your function and turned it into a neat sentence.
Key Insight: Function calling is how you link GPT to the real world. Database lookups, API calls, math, file reads — if your Python code can do it, the model can trigger it. This is the base layer of AI agents.
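That execute-and-reply loop generalizes. Here is a sketch of a dispatch helper that maps tool-call names to your Python functions and builds the `tool` messages to send back, demonstrated with a stand-in tool call object so it runs without an API key (the helper and registry names are ours):

```python
import json
from types import SimpleNamespace

def run_tool_calls(tool_calls, registry: dict) -> list:
    """Execute each requested tool and build the 'tool' messages to send back."""
    results = []
    for call in tool_calls:
        func = registry[call.function.name]          # Look up our Python function
        args = json.loads(call.function.arguments)   # Arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**args),
        })
    return results

def get_weather(city, unit="celsius"):
    return json.dumps({"city": city, "temp": 22, "unit": unit})

# Stand-in for response.choices[0].message.tool_calls
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
)
print(run_tool_calls([fake_call], {"get_weather": get_weather}))
```

With a registry of several functions, the same loop handles whichever tools the model picks, including multiple tool calls in one response.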
How Do Structured Outputs Work?
When you ask GPT to return JSON, it works most of the time. But “most of the time” is not enough for real apps. The model might toss in extra fields, skip required ones, or wrap the JSON in backticks.
Structured Outputs fix this for good. You create a Pydantic model (a Python class that spells out your data shape), and the API makes sure the reply fits it — every time. No parse errors. No missing keys.
Let me show you. We will pull structured data out of a product review. First, we set up the schema with Pydantic.
```python
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ReviewAnalysis(BaseModel):
    sentiment: str         # "positive", "negative", or "neutral"
    confidence: float      # 0.0 to 1.0
    key_topics: list[str]  # Main topics mentioned
    summary: str           # One-sentence summary

response = client.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Analyze the product review."},
        {"role": "user", "content": "The laptop is blazing fast and the screen is gorgeous. Battery life could be better though — I get about 6 hours. Overall, great value for the price."}
    ],
    response_format=ReviewAnalysis,
)

review = response.choices[0].message.parsed
print(f"Sentiment: {review.sentiment}")
print(f"Confidence: {review.confidence}")
print(f"Topics: {review.key_topics}")
print(f"Summary: {review.summary}")
```

Output:
```text
Sentiment: positive
Confidence: 0.85
Topics: ['performance', 'display', 'battery life', 'value']
Summary: A fast laptop with an excellent screen and good value, though battery life is average at 6 hours.
```
Note the .parse() call instead of .create(). It tells the SDK to check the reply against your Pydantic model. The review object is a real typed Python object — full dot-notation access with type safety baked in.
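You can tighten the schema further with `typing.Literal`, so `sentiment` can only take one of the allowed values instead of free text. A short sketch of the model definition, with a local Pydantic validation check (the class name is ours):

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class StrictReviewAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]  # enum, not free-form str
    confidence: float
    key_topics: list[str]
    summary: str

# Pydantic rejects values outside the allowed set
try:
    StrictReviewAnalysis(sentiment="mixed", confidence=0.5, key_topics=[], summary="ok")
except ValidationError:
    print("rejected: 'mixed' is not an allowed sentiment")
```

Pass `StrictReviewAnalysis` to `response_format` exactly as before; the generated JSON schema then carries the enum constraint.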
When to Use Structured Outputs vs Function Calling
Both lean on JSON schemas. But they solve different problems.
| Use Case | Reach For |
|---|---|
| Shape the model’s reply to the user | Structured Outputs |
| Hook the model up to tools or APIs | Function Calling |
| Pull structured data from free text | Structured Outputs |
| Let the model fire actions in your system | Function Calling |
| Feed model output into a pipeline | Structured Outputs |
Tip: Pair Pydantic with Structured Outputs for production. You get type safety, auto-checks, and clean Python objects — no more `json.loads()` and manual key hunts.
```typescript
{
  type: 'exercise',
  id: 'structured-ex1',
  title: 'Exercise 2: Extract Movie Information',
  difficulty: 'beginner',
  exerciseType: 'write',
  instructions: 'Define a Pydantic model called MovieInfo with fields: title (str), year (int), genre (str), and rating (float). Then use client.chat.completions.parse() to extract this information from the given movie description. Print the title and year.',
  starterCode: 'from pydantic import BaseModel\nfrom openai import OpenAI\n\nclient = OpenAI()\n\n# Define the Pydantic model\nclass MovieInfo(BaseModel):\n # YOUR CODE HERE: define title, year, genre, rating fields\n pass\n\nresponse = client.chat.completions.parse(\n model="gpt-4o-mini",\n messages=[\n {"role": "user", "content": "The Dark Knight (2008) is a superhero thriller rated 9.0 on IMDb."}\n ],\n response_format=MovieInfo,\n)\n\nmovie = response.choices[0].message.parsed\nprint(f"{movie.title} ({movie.year})")\nprint("DONE")',
  testCases: [
    { id: 'tc1', input: '', expectedOutput: 'The Dark Knight (2008)', description: 'Should extract title and year' },
    { id: 'tc2', input: '', expectedOutput: 'DONE', description: 'Completes successfully' }
  ],
  hints: [
    'Define fields like: title: str, year: int, genre: str, rating: float',
    'Full model: class MovieInfo(BaseModel):\n title: str\n year: int\n genre: str\n rating: float'
  ],
  solution: 'from pydantic import BaseModel\nfrom openai import OpenAI\n\nclient = OpenAI()\n\nclass MovieInfo(BaseModel):\n title: str\n year: int\n genre: str\n rating: float\n\nresponse = client.chat.completions.parse(\n model="gpt-4o-mini",\n messages=[\n {"role": "user", "content": "The Dark Knight (2008) is a superhero thriller rated 9.0 on IMDb."}\n ],\n response_format=MovieInfo,\n)\n\nmovie = response.choices[0].message.parsed\nprint(f"{movie.title} ({movie.year})")\nprint("DONE")',
  solutionExplanation: 'We define a Pydantic model with four typed fields. The .parse() method locks the reply to this shape, giving us a typed Python object with clean dot-notation access.',
  xpReward: 15,
}
```
How Do You Work with Images?
The OpenAI API handles images in two ways: Vision (reading images you already have) and DALL-E (making new ones from words).
Vision — Read and Understand Images
GPT-4o can look at a picture and answer questions about it. You pass the image link (or base64 bytes) right next to your text prompt.
Let me show you. We ask GPT-4o to describe a photo. The message content becomes a list that holds both text and image items.
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image? Be brief."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=100
)

print(response.choices[0].message.content)
```
Output:
```text
I see an orange tabby cat sitting upright on a wooden surface. The cat has bright green eyes and distinctive striped markings across its fur.
```
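For a local file instead of a URL, the same `image_url` slot accepts a base64 data URL. A sketch of building that payload without calling the API (the helper name and the example file are ours):

```python
import base64

def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a user message pairing text with a base64-encoded local image."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# With a real file you would read it first:
# with open("photo.jpg", "rb") as f:
#     msg = image_message("Describe this image.", f.read())
msg = image_message("Describe this image.", b"\xff\xd8fake-jpeg-bytes")
print(msg["content"][1]["image_url"]["url"][:30])
```

The resulting dict drops straight into the `messages` list of the vision call above.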
Image Creation with DALL-E
DALL-E draws images from text prompts. You pick a description, a size, and how many images you want.
```python
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A cozy home office with a standing desk, dual monitors showing Python code, and a cat sleeping on the keyboard",
    size="1024x1024",
    n=1
)

image_url = response.data[0].url
print(f"Image URL: {image_url}")
```
Output:
```text
Image URL: https://oaidalleapiprodscus.blob.core.windows.net/private/...
```
The link runs out after about an hour. Save the image if you want to keep it.
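A sketch of grabbing the file before the URL dies, using only the standard library (the output filename is an example):

```python
import urllib.request

def save_image(url: str, path: str) -> str:
    """Download a generated image to disk before its temporary URL expires."""
    urllib.request.urlretrieve(url, path)
    return path

# Usage, with image_url from the DALL-E call above:
# save_image(image_url, "office.png")
```

In production you would also want a timeout and error handling around the download, since the link may already have expired.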
Note: DALL-E 3 costs $0.040 per image at 1024×1024. To save money while testing, use `dall-e-2` at $0.018 per image and 512×512 size.
How Do You Create Embeddings?
Embeddings turn text into number arrays — vectors — that capture meaning. Sentences with close meanings end up with close vectors. This is the engine behind semantic search, recs, and grouping.
Here is how it works. We create vectors for three sentences, then check which ones are close. The text-embedding-3-small model gives us 1536 numbers per sentence.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Create embeddings for three sentences
emb1 = get_embedding("The cat sat on the mat")
emb2 = get_embedding("A kitten rested on the rug")
emb3 = get_embedding("Stock prices rose sharply today")

# Compute cosine similarity
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Cat vs Kitten: {cosine_similarity(emb1, emb2):.4f}")
print(f"Cat vs Stocks: {cosine_similarity(emb1, emb3):.4f}")
```

Output:
```text
Cat vs Kitten: 0.8734
Cat vs Stocks: 0.1205
```
The cat and kitten lines score 0.87 — very close. The cat and stock lines score 0.12 — barely linked. Semantic search works the same way: embed the query, embed every document, then grab the nearest matches.
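The ranking step itself needs no API at all. Given vectors you have already embedded, search is just "sort by cosine similarity". A sketch with toy 3-number vectors standing in for real 1536-dimension embeddings (the document names and vectors are made up):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, docs, top_k=2):
    """Rank documents by cosine similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)[:top_k]

# Toy vectors; in practice each would come from get_embedding()
docs = {
    "cat care tips":  [0.9, 0.1, 0.0],
    "kitten feeding": [0.7, 0.3, 0.1],
    "stock analysis": [0.0, 0.1, 0.9],
}
for score, name in search([0.85, 0.15, 0.05], docs):
    print(f"{score:.4f}  {name}")
```

For more than a few thousand documents you would swap this linear scan for a vector database, but the idea is identical.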
Tip: Start with `text-embedding-3-small`. It is the cheapest option and works well for search and grouping. Only move to `text-embedding-3-large` if you need peak accuracy.
How Do You Transcribe Audio with Whisper?
The Whisper API turns speech into text. It reads mp3, wav, webm, mp4, and more. The file size cap is 25MB.
Here is the basic call. You open the audio file and hand it to the endpoint.
```python
from openai import OpenAI

client = OpenAI()

with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)
```
Output:
```text
Welcome to today's team standup. Let's start with updates from the backend team...
```
Need to go from another language to English? Use the translate endpoint instead:
```python
with open("spanish_podcast.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print(translation.text)  # Output is in English regardless of source language
```
How Do You Handle Errors and Retries?
API calls break. Networks drop. Rate caps kick in. Your code needs a safety net.
The SDK throws typed errors for each problem. Here are the main ones you will see.
| Error | Cause | What to Do |
|---|---|---|
| `AuthenticationError` | Bad API key | Check `OPENAI_API_KEY` |
| `RateLimitError` | Too many calls per minute | Wait, then retry |
| `APIError` | Server hiccup | Retry after a pause |
| `APIConnectionError` | Network issue | Check your connection |
| `BadRequestError` | Wrong parameters | Fix the call (model name, message shape) |
Below is a handy wrapper that retries on its own when the error is short-lived. For lasting errors (bad key, bad request), it lets them through.
```python
from openai import OpenAI, RateLimitError, APIError, APIConnectionError
import time

client = OpenAI()

def call_with_retry(messages, model="gpt-4o-mini", max_retries=3):
    """Make an API call with automatic retry on transient errors."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
        except (APIError, APIConnectionError) as e:
            wait_time = 2 ** attempt
            print(f"API error: {e}. Retrying in {wait_time}s...")
            time.sleep(wait_time)
    raise Exception(f"Failed after {max_retries} retries")

# Usage
result = call_with_retry([{"role": "user", "content": "Hello!"}])
print(result)
```
Output:
```text
Hello! How can I help you today?
```
Warning: Never retry `AuthenticationError` or `BadRequestError`. These are lasting failures — your key is wrong or your request is broken. Trying again will not help.
How Do You Manage Costs and Tokens?
Every API call costs money based on the tokens it uses. A token is about 4 characters or 0.75 words. Knowing this helps you set budgets and stay inside context caps.
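That rule of thumb is easy to encode. `tiktoken`, shown below, gives exact counts, but a rough estimate is often enough for budgeting (the helper name is ours):

```python
def rough_token_estimate(text: str) -> int:
    """Rule of thumb: about 1 token per 4 characters of English text."""
    return max(1, len(text) // 4)

sentence = "The quick brown fox jumps over the lazy dog."
print(rough_token_estimate(sentence))  # 11 (close to tiktoken's exact count of 10)
```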
Here are the going rates for the main models (per 1 million tokens):
| Model | Input Price | Output Price |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| o1 | $15.00 | $60.00 |
| o3-mini | $1.10 | $4.40 |
| GPT-4.1 | $2.00 | $8.00 |
| GPT-4.1-mini | $0.40 | $1.60 |
To count tokens before you send a call, use the tiktoken library. This lets you gauge costs and trim messages to fit.
```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example
text = "The quick brown fox jumps over the lazy dog."
tokens = count_tokens(text)
print(f"Text: '{text}'")
print(f"Tokens: {tokens}")
print(f"Estimated cost (GPT-4o-mini input): ${tokens * 0.15 / 1_000_000:.8f}")
```
Output:
```text
Text: 'The quick brown fox jumps over the lazy dog.'
Tokens: 10
Estimated cost (GPT-4o-mini input): $0.00000150
```
Ways to Cut Your API Bill
Here are down-to-earth tips that work:
- Use GPT-4o-mini for 90% of tasks. It handles most work well and costs 17x less than GPT-4o.
- Count tokens before you send. If the prompt is too long, drop the oldest chat turns.
- Cache answers. If users ask the same thing twice, serve the saved reply.
- Set `max_tokens`. This stops runaway-long (and runaway-costly) replies.
- Stream with `stream_options={"include_usage": True}` to see token counts in real time.
Key Insight: At $0.15 per million input tokens, a typical GPT-4o-mini call costs under $0.001. For hobby work and small apps, the bill is tiny. Build first, trim costs later.
typescript
{
type: 'exercise',
id: 'tokens-ex1',
title: 'Exercise 3: Token-Aware Message Trimmer',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Write a function called `trim_messages` that takes a list of messages and a max_tokens limit. It should remove the OLDEST user/assistant messages (keep the system message) until the total token count is under the limit. Use tiktoken to count tokens. Print the number of messages before and after trimming.',
starterCode: 'import tiktoken\n\ndef count_message_tokens(messages: list, model: str = "gpt-4o-mini") -> int:\n """Count total tokens across all messages."""\n encoding = tiktoken.encoding_for_model(model)\n total = 0\n for msg in messages:\n total += len(encoding.encode(msg["content"])) + 4 # 4 tokens overhead per message\n return total\n\ndef trim_messages(messages: list, max_tokens: int = 100) -> list:\n """Remove oldest non-system messages until under max_tokens."""\n # YOUR CODE HERE\n pass\n\n# Test it\nmessages = [\n {"role": "system", "content": "You are helpful."},\n {"role": "user", "content": "Tell me about Python."},\n {"role": "assistant", "content": "Python is a popular programming language known for its readability."},\n {"role": "user", "content": "What about Java?"},\n {"role": "assistant", "content": "Java is a statically typed language widely used in enterprise applications."},\n {"role": "user", "content": "Compare them."},\n]\n\nbefore = len(messages)\ntrimmed = trim_messages(messages, max_tokens=60)\nafter = len(trimmed)\nprint(f"Before: {before}, After: {after}")\nprint("DONE")',
testCases: [
{ id: 'tc1', input: '', expectedOutput: 'DONE', description: 'Function runs successfully' }
],
hints: [
'Separate system messages from the rest. Loop and remove from the start of non-system messages while total tokens exceed max_tokens.',
'system_msgs = [m for m in messages if m["role"] == "system"]\nother_msgs = [m for m in messages if m["role"] != "system"]\nwhile count_message_tokens(system_msgs + other_msgs) > max_tokens and other_msgs:\n other_msgs.pop(0)\nreturn system_msgs + other_msgs'
],
solution: 'import tiktoken\n\ndef count_message_tokens(messages, model="gpt-4o-mini"):\n encoding = tiktoken.encoding_for_model(model)\n total = 0\n for msg in messages:\n total += len(encoding.encode(msg["content"])) + 4\n return total\n\ndef trim_messages(messages, max_tokens=100):\n system_msgs = [m for m in messages if m["role"] == "system"]\n other_msgs = [m for m in messages if m["role"] != "system"]\n while count_message_tokens(system_msgs + other_msgs) > max_tokens and other_msgs:\n other_msgs.pop(0)\n return system_msgs + other_msgs\n\nmessages = [\n {"role": "system", "content": "You are helpful."},\n {"role": "user", "content": "Tell me about Python."},\n {"role": "assistant", "content": "Python is a popular programming language known for its readability."},\n {"role": "user", "content": "What about Java?"},\n {"role": "assistant", "content": "Java is a statically typed language widely used in enterprise applications."},\n {"role": "user", "content": "Compare them."},\n]\n\nbefore = len(messages)\ntrimmed = trim_messages(messages, max_tokens=60)\nafter = len(trimmed)\nprint(f"Before: {before}, After: {after}")\nprint("DONE")',
solutionExplanation: 'We split system messages (always kept) from chat messages. Then we pop the oldest chat messages one at a time until tokens drop below the cap. This keeps the freshest context while staying on budget.',
xpReward: 20,
}
How Do You Use Async for Many API Calls?
When you need to fire off dozens of calls (say, summarizing 50 docs), doing them one by one is slow. Each call waits for the last one to finish.
The AsyncOpenAI client lets you run many calls side by side. This is a must for web servers (FastAPI, Django) and batch jobs.
Here is what it looks like. We summarize 5 texts at the same time instead of one after another.
python
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI()
async def summarize(text: str) -> str:
"""Summarize a text using the async client."""
response = await async_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Summarize in one sentence."},
{"role": "user", "content": text}
]
)
return response.choices[0].message.content
async def main():
texts = [
"Python was created by Guido van Rossum and first released in 1991.",
"JavaScript is the language of the web browser.",
"Rust focuses on memory safety without garbage collection.",
"Go was designed at Google for concurrent systems programming.",
"TypeScript adds static types to JavaScript.",
]
# Process all 5 in parallel
summaries = await asyncio.gather(*[summarize(t) for t in texts])
for text, summary in zip(texts, summaries):
print(f"Original: {text[:50]}...")
print(f"Summary: {summary}\n")
asyncio.run(main())
Output:
python
Original: Python was created by Guido van Rossum and first ...
Summary: Python is a programming language created by Guido van Rossum, first released in 1991.
Original: JavaScript is the language of the web browser....
Summary: JavaScript is the primary programming language used in web browsers.
Original: Rust focuses on memory safety without garbage col...
Summary: Rust is a programming language that ensures memory safety without relying on garbage collection.
Original: Go was designed at Google for concurrent systems p...
Summary: Go is a Google-designed language built for concurrent systems programming.
Original: TypeScript adds static types to JavaScript....
Summary: TypeScript is a superset of JavaScript that introduces static typing.
All 5 calls run at once. A job that takes 10 seconds in sequence finishes in about 2.
Tip: Use `asyncio.gather()` for batch work, but mind the rate limit. If you blast 100 calls at once, you will hit the cap. Use `asyncio.Semaphore` to allow, say, 10 at a time.
Let’s Build a Mini Project — AI Research Helper
Time to bring it all together. We will build a research helper that takes a topic, uses function calling to look up facts, and hands back a tidy report via Structured Outputs.
This mini project taps three skills we covered: chat completions, function calling, and structured outputs.
First, we set up the tools and the report schema.
python
from openai import OpenAI
from pydantic import BaseModel
import json
client = OpenAI()
# Output schema — what our report looks like
class ResearchReport(BaseModel):
topic: str
summary: str
key_findings: list[str]
sources_used: list[str]
difficulty_level: str # "beginner", "intermediate", "advanced"
# Tool definition — our "search" function
tools = [
{
"type": "function",
"function": {
"name": "search_knowledge_base",
"description": "Search for information about a topic in the knowledge base",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
}
},
"required": ["query"]
}
}
}
]
Next, the search function and the main loop.
python
def search_knowledge_base(query: str) -> str:
"""Simulated knowledge base search."""
knowledge = {
"neural networks": "Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes. Key types: CNNs for images, RNNs for sequences, Transformers for attention-based processing.",
"transformers": "Transformers use self-attention to process sequences in parallel. Introduced in 'Attention Is All You Need' (2017). Foundation of GPT, BERT, and all modern LLMs.",
"attention mechanism": "Attention lets the model focus on relevant parts of the input. Self-attention computes relationships between all positions in a sequence. Multi-head attention runs multiple attention operations in parallel.",
}
for key, value in knowledge.items():
if key in query.lower():
return json.dumps({"results": [value], "source": f"knowledge_base/{key}"})
return json.dumps({"results": ["No specific information found."], "source": "knowledge_base/general"})
def run_research(topic: str) -> ResearchReport:
"""Run the research assistant pipeline."""
messages = [
{"role": "system", "content": "You are a research assistant. Search the knowledge base to gather information, then produce a structured report."},
{"role": "user", "content": f"Research this topic: {topic}"}
]
# Let the model search (up to 3 searches)
for _ in range(3):
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
tools=tools
)
if response.choices[0].finish_reason == "stop":
break # Model is done searching
if response.choices[0].message.tool_calls:
messages.append(response.choices[0].message)
for tool_call in response.choices[0].message.tool_calls:
args = json.loads(tool_call.function.arguments)
result = search_knowledge_base(**args)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": result
})
# Now get the structured report
report_response = client.chat.completions.parse(
model="gpt-4o-mini",
messages=messages + [
{"role": "user", "content": "Now produce the structured research report based on what you found."}
],
response_format=ResearchReport,
)
return report_response.choices[0].message.parsed
Let us run it and see the output.
python
report = run_research("How do transformers and attention mechanisms work?")
print(f"Topic: {report.topic}")
print(f"Difficulty: {report.difficulty_level}")
print(f"\nSummary: {report.summary}")
print(f"\nKey Findings:")
for i, finding in enumerate(report.key_findings, 1):
print(f" {i}. {finding}")
print(f"\nSources: {report.sources_used}")
Output:
python
Topic: Transformers and Attention Mechanisms
Difficulty: intermediate
Summary: Transformers are a neural network architecture that uses self-attention mechanisms to process sequences in parallel, enabling powerful language models like GPT and BERT.
Key Findings:
1. Transformers process sequences in parallel using self-attention, unlike RNNs which process sequentially.
2. Self-attention computes relationships between all positions in a sequence simultaneously.
3. Multi-head attention runs multiple attention operations in parallel for richer representations.
4. Introduced in the 2017 paper 'Attention Is All You Need', transformers are the foundation of all modern LLMs.
Sources: ['knowledge_base/transformers', 'knowledge_base/attention mechanism']
This tiny project shows the real power of mixing API features. In a live app, you would swap the fake search for a real web lookup or database query.
Common Mistakes and How to Fix Them
Mistake 1: Pasting API Keys into Source Code
❌ Wrong:
python
client = OpenAI(api_key="sk-abc123...") # Key exposed in source code!
Why it breaks: Push this to GitHub and bots find the key in minutes. You wake up to a big bill.
✅ Right:
python
from openai import OpenAI
client = OpenAI() # Reads OPENAI_API_KEY from environment automatically
Mistake 2: Ignoring Rate Limits
❌ Wrong:
python
# Fires 100 requests with no throttling
for item in items:
response = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
Why it breaks: After a few dozen calls you hit the cap. The API sends back a 429 error and your script dies.
✅ Right:
python
import time
from openai import RateLimitError
for item in items:
    for attempt in range(5):
        try:
            response = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
            break  # Success: move on to the next item
        except RateLimitError:
            time.sleep(2 ** attempt)  # Exponential backoff, then retry
Mistake 3: Letting Chat History Grow Forever
❌ Wrong:
python
# History grows forever — eventually exceeds context window
messages.append({"role": "user", "content": user_input})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
Why it breaks: After enough turns, the list blows past the 128K token window. The API throws an error — or quietly chops your input.
✅ Right:
python
# Trim old messages to stay within budget
if count_message_tokens(messages) > 100_000:
    messages = [messages[0]] + messages[-10:]  # Keep system message + last 10 turns
Mistake 4: Using Old Models and Endpoints
❌ Wrong:
python
# Old completions API — deprecated
response = openai.Completion.create(model="text-davinci-003", prompt="Hello")
Why it breaks: The legacy Completions API is deprecated, and text-davinci-003 was shut down in early 2024. Code that depends on them fails with a model-not-found error.
✅ Right:
python
# Modern SDK with Chat Completions or Responses API
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}]
)
Mistake 5: Parsing JSON by Hand When You Could Use Structured Outputs
❌ Wrong:
python
import json
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Return a JSON with name and age"}]
)
data = json.loads(response.choices[0].message.content) # Fragile! May fail
Why it breaks: The model might wrap JSON in backticks, add stray text, or ship bad JSON. Your json.loads() call crashes.
✅ Right:
python
from pydantic import BaseModel
class Person(BaseModel):
name: str
age: int
response = client.chat.completions.parse(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Return info for John, age 30"}],
response_format=Person
)
person = response.choices[0].message.parsed # Guaranteed to match schema
Frequently Asked Questions
How much does the OpenAI API cost?
It depends on the model and the tokens you use. GPT-4o-mini runs $0.15 per 1M input tokens and $0.60 per 1M output tokens. A normal chat message (50 tokens in, 100 out) costs roughly $0.00007. Most builders spend under $5 a month while developing. See openai.com/pricing for live rates.
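You can verify that figure with quick arithmetic:

```python
# GPT-4o-mini prices: $0.15 per 1M input tokens, $0.60 per 1M output tokens
input_tokens, output_tokens = 50, 100
cost = input_tokens * 0.15 / 1_000_000 + output_tokens * 0.60 / 1_000_000
print(f"${cost:.7f}")  # $0.0000675, about $0.00007 per call
```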
Can I use the OpenAI API for free?
New accounts get a small credit (often $5–$18, varies by region). After that, you add a card. There is no lasting free tier, but costs are low for small projects.
Is the API the same as ChatGPT?
No. ChatGPT is a chat app built on top of the API. The API gives you raw access to the same models — plus full control over prompts, settings, and output shape. You can build your own ChatGPT or something completely different.
How do I pick between GPT-4o and GPT-4o-mini?
Default to GPT-4o-mini. It nails most tasks and costs 17x less. Move to GPT-4o when quality slips — tricky reasoning, subtle writing, or spots where accuracy matters most. For heavy math, look at o1 or o3-mini.
Can I fine-tune OpenAI models?
Yes. Fine-tuning works on GPT-4o-mini, GPT-4o, and GPT-3.5-turbo. You feed in training pairs (input/output in JSONL) and get a custom version of the model. Reach for it when prompt tweaks alone do not hit your quality bar.
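A single training example in that JSONL format looks like the sketch below. The file name `train.jsonl` and the pirate persona are just for illustration; a real fine-tuning job wants at least a few dozen such lines, uploaded via the Files API.

```python
import json

# Each line of the JSONL file is one chat-formatted training example
example = {
    "messages": [
        {"role": "system", "content": "You answer in pirate speak."},
        {"role": "user", "content": "Where is the treasure?"},
        {"role": "assistant", "content": "Arr, 'tis buried on the north beach!"},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

print(sum(1 for _ in open("train.jsonl")))  # one example, one line
```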
What sets Chat Completions apart from the Responses API?
Chat Completions uses a messages list. The Responses API (March 2025) is leaner — it takes instructions + input and bundles tools like web search and file search. OpenAI says both will stay around, but the Responses API is the pick for new work.
References
- OpenAI API Documentation — Quickstart Guide. Link
- OpenAI API Reference — Chat Completions. Link
- OpenAI — Introducing Structured Outputs in the API (August 2024). Link
- OpenAI — Responses API Migration Guide. Link
- OpenAI — Function Calling Guide. Link
- OpenAI Python SDK — GitHub Repository. Link
- OpenAI — Model Pricing. Link
- OpenAI for Developers in 2025 — Blog Post. Link
- tiktoken — OpenAI’s Token Counting Library. Link
- Pydantic Documentation — Data Validation for Python. Link