Build Your First AI App with Python — Step-by-Step Tutorial

Written by Selva Prabhakaran | 25 min read

You want to build something with AI — not just read about it. By the end of this tutorial, you will have a streaming AI assistant in Python that remembers conversations, handles API failures gracefully, and tracks every cent it spends. All in under 80 lines of core code.

What Is the OpenAI API and Why Use It?

The OpenAI API gives your Python code direct access to models like GPT-4o and GPT-4o-mini. You send a message, you get back an intelligent response — no model training, no GPU, no machine learning expertise required.

Here is the complete pattern in six lines. The OpenAI() client reads your API key from the environment, chat.completions.create() sends a message, and response.choices[0].message.content extracts the text:

python
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # loads OPENAI_API_KEY from .env file
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is Python?"}]
)
print(response.choices[0].message.content)
python
Python is a high-level, interpreted programming language known for its
readability and versatility. It supports multiple programming paradigms
including procedural, object-oriented, and functional programming, and
is widely used in web development, data science, artificial intelligence,
and automation.

That is the entire pattern. Every AI app — chatbots, summarizers, code generators — is a variation of this single chat.completions.create call.

Key Insight: **The OpenAI API is just a function call.** You send messages in, you get a response back. Every AI app is built on this one pattern — `client.chat.completions.create()`.

The 5 steps to build your first AI app

  1. Get an API key from platform.openai.com/api-keys
  2. Install the SDK: pip install openai python-dotenv
  3. Store the key safely in a .env file (never hardcode it)
  4. Send your first message using client.chat.completions.create()
  5. Extract the response from response.choices[0].message.content

The rest of this tutorial builds on this foundation — adding system messages, conversation memory, streaming, error handling, and cost control.

Prerequisites

  • Python version: 3.9+
  • Required libraries: openai (1.0+), python-dotenv
  • Install: pip install openai python-dotenv
  • API key: From platform.openai.com/api-keys (requires billing enabled)
  • Time to complete: 25-30 minutes
  • Cost: Under $0.10 if you follow along with gpt-4o-mini

I prefer teaching with gpt-4o-mini. It costs $0.15 per million input tokens — cheap enough to experiment freely. You can swap to gpt-4o or gpt-4.1 with a one-line change.

Setting up your API key

Never hardcode API keys in your scripts. One accidental git push and your key is compromised.

Create a .env file in your project folder:

bash
# .env file — add this to .gitignore immediately
OPENAI_API_KEY=sk-proj-your-key-here
Warning: **Never commit your `.env` file to version control.** Add `.env` to your `.gitignore` immediately. A leaked API key can rack up charges on your account within minutes.

The load_dotenv() call reads this file and sets the environment variable. The OpenAI() client picks it up automatically. If you get an AuthenticationError, run print(os.environ.get("OPENAI_API_KEY", "NOT SET")) (after import os) to check that the key loaded.
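If you want to verify the key loaded without ever printing the full secret, a small helper like this works (check_api_key is an illustrative name, not part of the SDK — it also catches the trailing-whitespace problem mentioned in the troubleshooting section):

```python
import os

def check_api_key():
    """Return a masked preview of the API key, or 'NOT SET' if missing."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        return "NOT SET"
    # show only a short prefix so the full key never lands in logs
    return key[:7] + "..."

print(check_api_key())
```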

Understanding Chat Completions — Messages, Roles, and Parameters

Every interaction with GPT is a conversation made up of messages. Each message has a role and content. Three roles:

  • system — Sets the AI’s personality and rules. The model follows these throughout the conversation.
  • user — Your input (or your app user’s input).
  • assistant — The model’s previous responses. Used for conversation history.

The system message transforms a generic AI into a specialized tool. Below, we create a code reviewer that gives structured, three-bullet feedback. The system message constrains format and tone, and the user message provides the code to review:

python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a senior Python code reviewer. Give brief, actionable feedback. Max 3 bullet points."
        },
        {
            "role": "user",
            "content": "Review this code:\ndef calc(x,y): return x+y"
        }
    ]
)
print(response.choices[0].message.content)
python
- Rename `calc` to something descriptive like `add_numbers` — function names should communicate intent.
- Add type hints: `def add_numbers(x: float, y: float) -> float:` for clarity and IDE support.
- Consider adding a docstring explaining the function's purpose, even for simple utilities.

I recommend writing specific system messages for every use case. “You are a helpful assistant” is too vague. “You are a Python code reviewer. Give exactly 3 bullet points.” — that gives you predictable output you can actually parse.

Tip: **Write system messages like you are briefing a new team member.** Be specific about the format, length, and tone. The more constraints you give, the more consistent the output.

Controlling output with temperature and max_tokens

Two parameters matter most: temperature controls randomness, max_tokens caps response length.

This call generates a creative coffee shop tagline. Setting temperature=0.9 encourages variation, while max_tokens=50 keeps the response short:

python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a tagline for a coffee shop."}],
    temperature=0.9,   # higher = more creative (0.0 to 2.0)
    max_tokens=50       # hard cap on response length
)
print(response.choices[0].message.content)
python
"Where every cup tells a story — brewed with passion, served with soul."

| Parameter | Range | Default | What It Controls |
|---|---|---|---|
| temperature | 0.0 – 2.0 | 1.0 | Randomness. Low = deterministic, high = creative |
| max_tokens | 1 – model max | Model-dependent | Maximum response length (hard cutoff) |
| top_p | 0.0 – 1.0 | 1.0 | Nucleus sampling — alternative to temperature |

Choosing the right model

Which model should you use? Here is the practical breakdown (verify current pricing at platform.openai.com/pricing):

| Model | Cost (per 1M input tokens) | Best For | Context Window |
|---|---|---|---|
| gpt-4o-mini | $0.15 | Prototyping, simple tasks, high volume | 128K tokens |
| gpt-4o | $2.50 | Complex reasoning, multimodal (images + text) | 128K tokens |
| gpt-4.1 | $2.00 | Coding tasks, instruction following | 1M tokens |

My rule: start with gpt-4o-mini. Only upgrade if quality is insufficient for your specific task.

Key Insight: **Use `temperature=0` for factual tasks (code generation, data extraction). Use `temperature=0.7-1.0` for creative tasks (writing, brainstorming).** This single parameter makes the biggest difference in output quality.

Quick check — predict the output:

What happens if you set temperature=0 and send the same prompt twice? Will you get identical responses?

Answer

Almost identical, but not guaranteed. With `temperature=0`, the model picks the highest-probability token at each step, so responses are highly deterministic. But the API does not guarantee bit-for-bit identical output. For stricter determinism, use the `seed` parameter — but even then, OpenAI notes results may vary across infrastructure changes.

Building a Multi-Turn Chatbot

A chatbot needs memory. The API itself is stateless — every call is independent. You create memory by sending the full conversation history with each request.

The pattern: maintain a messages list, append every user message and assistant reply, send the whole list each time. This function creates an interactive loop — the while True runs until the user types “quit”, and each exchange appends both sides to the growing messages list:

python
def chat_with_memory():
    """Interactive chatbot that remembers the conversation."""
    messages = [
        {"role": "system", "content": "You are a helpful Python tutor. Keep answers under 3 sentences."}
    ]

    print("Python Tutor (type 'quit' to exit)")
    print("-" * 40)

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "quit":
            break

        messages.append({"role": "user", "content": user_input})

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )

        assistant_reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_reply})

        print(f"\nTutor: {assistant_reply}")

# Note: uses input() — run in a terminal, not a notebook
chat_with_memory()

A sample exchange might look like:

python
Python Tutor (type 'quit' to exit)
----------------------------------------

You: What is a list comprehension?
Tutor: A list comprehension is a concise way to create a new list by
applying an expression to each item in an existing iterable, optionally
filtering items with a condition. The syntax is
[expression for item in iterable if condition].

You: Give me an example
Tutor: Here's one: [x**2 for x in range(5)] produces [0, 1, 4, 9, 16].
The model remembers you asked about list comprehensions because the full
conversation history is sent with each request.

We append every user message AND every assistant reply to messages. This growing list IS the memory.

This is dead simple, and it works. But it has a cost problem — we will fix that later. For a deeper dive into memory strategies including sliding windows, summarization, and persistent storage, see our Python AI chatbot memory tutorial.

Warning: **Conversation history grows with every exchange.** A long conversation eventually hits the model’s context window limit (128K tokens for GPT-4o). For production chatbots, trim older messages.
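A minimal trimming helper might look like this (a standalone sketch — it assumes, like the chatbot above, that messages[0] is the system message, which should always be preserved):

```python
def trim_history(messages, keep_last=20):
    """Keep the system message plus only the most recent messages."""
    # assumes messages[0] is the system message, as in the chatbot above
    if len(messages) <= keep_last + 1:
        return messages
    return [messages[0]] + messages[-keep_last:]
```

Call it after each exchange, or whenever the list grows past a threshold.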

Streaming Responses in Real Time

Waiting for GPT to generate a long response before showing anything feels slow. Streaming sends tokens as they are generated — the same way ChatGPT shows text appearing word by word.

Set stream=True and iterate over chunks. The critical change: streaming uses .delta.content instead of .message.content, because each chunk is a partial update. The flush=True in the print call forces Python to display each token immediately instead of buffering:

python
def stream_response(prompt):
    """Stream a response token by token, like ChatGPT does."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    print("AI: ", end="")
    full_response = ""
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
            full_response += content
    print()
    return full_response

result = stream_response("Explain recursion in 3 sentences.")
python
AI: Recursion is when a function calls itself to solve a smaller version
of the same problem. Each recursive call works on a reduced input until it
reaches a base case that stops the recursion. Think of it like Russian
nesting dolls — you keep opening smaller dolls until you find the smallest
one, then work your way back out.

Two things to notice. First, chunk.choices[0].delta.content — not .message.content. Second, flush=True — without it, Python buffers the output and you lose the streaming effect.

I always use streaming in user-facing apps. The perceived speed difference is dramatic even when total generation time is identical.


Quick check — predict the output:

What happens if you access chunk.choices[0].message.content instead of chunk.choices[0].delta.content in a streaming response?

Answer

You get an `AttributeError`. Streaming chunks have a `delta` attribute, not `message`. The `delta` contains only the NEW tokens in each chunk. This is the #1 streaming mistake.

Building a Document Summarizer — Your First Real App

This is where it gets practical. A document summarizer takes long text and produces a concise summary — one of the most common GenAI use cases in production.

The function below uses temperature=0 because summarization is a factual task, not a creative one. The system message constrains the output to exactly N bullet points with no preamble, so your app can reliably parse the response:

python
def summarize_document(text, max_sentences=3):
    """Summarize a document into key points."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    f"You are a document summarizer. Produce exactly {max_sentences} "
                    "bullet points capturing the key information. Each bullet is one "
                    "sentence. No preamble, no conclusion — just the bullets."
                )
            },
            {
                "role": "user",
                "content": f"Summarize this document:\n\n{text}"
            }
        ],
        temperature=0
    )
    return response.choices[0].message.content

Testing with a sample document:

python
sample_doc = """
Machine learning is a subset of artificial intelligence that enables systems to
learn from data without being explicitly programmed. There are three main types:
supervised learning uses labeled data to predict outcomes, unsupervised learning
finds patterns in unlabeled data, and reinforcement learning trains agents through
rewards and penalties. Common applications include spam detection, recommendation
systems, medical diagnosis, and autonomous vehicles. The field has grown rapidly
due to increased computing power, availability of large datasets, and advances
in neural network architectures like transformers.
"""

summary = summarize_document(sample_doc)
print(summary)
python
- Machine learning is an AI subset that allows systems to learn patterns from data without explicit programming.
- The three main types are supervised learning (labeled data), unsupervised learning (pattern discovery), and reinforcement learning (reward-based training).
- The field's rapid growth is driven by increased computing power, large datasets, and advances in neural network architectures like transformers.

This pattern — specific system message + temperature=0 — is the foundation of reliable GenAI features. The system message does the heavy lifting.

Try It Yourself

Exercise 1: Build a Tone Converter

Create a convert_tone(text, tone) function that rewrites text in a specified tone (formal, casual, or pirate).

Hint 1: You need a system message with the tone parameter embedded. Use an f-string.

Hint 2: The system message should be something like: f"Rewrite the user's text in a {tone} tone. Keep the same meaning. Return only the rewritten text."

python
# Starter code
def convert_tone(text, tone="formal"):
    """Rewrite text in the specified tone."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # YOUR CODE: Add a system message using the tone parameter
            {"role": "user", "content": text}
        ],
        temperature=0.7
    )
    return response.choices[0].message.content

# Test it
original = "The quarterly revenue exceeded expectations by 15 percent."
print(convert_tone(original, "pirate"))

Test: Call convert_tone("Hello world", "formal"). If it returns None or errors, check that your system message is inside the messages list.

Click to see solution
python
def convert_tone(text, tone="formal"):
    """Rewrite text in the specified tone."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Rewrite the user's text in a {tone} tone. Keep the same meaning. Return only the rewritten text."
            },
            {"role": "user", "content": text}
        ],
        temperature=0.7
    )
    return response.choices[0].message.content

The key is embedding `tone` in the system message with an f-string. The `temperature=0.7` adds enough creativity for natural-sounding rewrites.

Error Handling and Retries — Production Essentials

API calls fail. Networks drop, rate limits hit, servers go down. I always recommend building retry logic before your first production deploy — not after the first outage.

The OpenAI SDK provides specific exception classes for each failure mode. The retry wrapper below catches transient errors (rate limits, timeouts, connection drops) and re-raises non-retryable ones. The exponential backoff (2 ** attempt) spaces out retries at 1s, 2s, then 4s:

python
import time
from openai import (
    APIError,
    APIConnectionError,
    RateLimitError,
    APITimeoutError,
)

def call_with_retry(messages, model="gpt-4o-mini", max_retries=3):
    """Make an API call with automatic retry on transient errors."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages
            )
            return response.choices[0].message.content

        except RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)

        except APIConnectionError:
            wait = 2 ** attempt
            print(f"Connection error. Retrying in {wait}s...")
            time.sleep(wait)

        except APITimeoutError:
            wait = 2 ** attempt
            print(f"Timeout. Retrying in {wait}s...")
            time.sleep(wait)

        except APIError as e:
            print(f"API error: {e.message}")
            raise  # non-retryable server error

    raise Exception(f"Failed after {max_retries} retries")

The exponential backoff is critical. Retrying immediately after a rate limit gets you rate-limited again. Waiting 1, 2, then 4 seconds gives the API time to recover.

python
result = call_with_retry(
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(result)

Running this produces:

python
2 + 2 equals 4.
Tip: **For production apps, use the `tenacity` library instead of hand-rolled retries.** It handles jitter (randomized waits to prevent thundering herd) and more edge cases. Install with `pip install tenacity`.

Managing Costs — Tokens, Pricing, and Strategies

Every API call costs money. Understanding the pricing model prevents surprises — especially when a chatbot prototype starts sending the entire conversation history with every call.

Costs are measured in tokens — roughly 1 token per 4 characters of English text. Each call has two components: input tokens (what you send) and output tokens (what the model generates).
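The 4-characters-per-token rule of thumb lets you estimate cost before you send anything. This is a rough sketch only — for exact counts, use OpenAI's tiktoken library — and the function names here are illustrative:

```python
def estimate_tokens(text):
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_input_cost(text, rate_per_million=0.15):
    """Rough input cost at gpt-4o-mini's $0.15 per 1M input tokens."""
    return estimate_tokens(text) / 1_000_000 * rate_per_million
```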

This wrapper reports cost per call. It extracts the usage field from the response and looks up per-token pricing to calculate the estimated cost. The pricing dictionary maps model names to their input/output rates per million tokens:

python
def call_with_cost_tracking(messages, model="gpt-4o-mini"):
    """Make an API call and report token usage and estimated cost."""
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )

    usage = response.usage
    # Pricing per 1M tokens — verify at platform.openai.com/pricing
    pricing = {
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4.1": {"input": 2.00, "output": 8.00},
    }

    rates = pricing.get(model, pricing["gpt-4o-mini"])
    input_cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
    output_cost = (usage.completion_tokens / 1_000_000) * rates["output"]

    print(f"Tokens — Input: {usage.prompt_tokens}, Output: {usage.completion_tokens}")
    print(f"Estimated cost: ${input_cost + output_cost:.6f}")

    return response.choices[0].message.content
python
result = call_with_cost_tracking(
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(result)
python
Tokens — Input: 13, Output: 8
Estimated cost: $0.000007
2 + 2 equals 4.

The cost difference between models is dramatic. A task costing $0.001 with gpt-4o-mini costs $0.017 with gpt-4o — 17x more. I always prototype with the smallest model that gives acceptable quality.

Warning: **Long conversations are expensive.** Each API call sends the ENTIRE message history. A 50-message conversation with GPT-4o could cost $0.10+ per call. Trim aggressively.

Cost control strategies:

  • Set max_tokens aggressively. Need a one-sentence answer? Set max_tokens=100.
  • Trim conversation history. Keep the system message and last 10-20 exchanges.
  • Cache identical requests. Same question from different users? Return the cached response.
  • Set billing limits. In the OpenAI dashboard under Settings → Limits.
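The caching strategy can be sketched with a small in-memory dictionary keyed on a hash of the message list (cached_call and call_fn are illustrative names; in production you would likely use Redis or similar with an expiry):

```python
import hashlib
import json

_cache = {}

def cached_call(messages, call_fn):
    """Return a cached response for identical message lists."""
    # serialize deterministically so identical requests produce the same key
    key = hashlib.sha256(
        json.dumps(messages, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_fn(messages)
    return _cache[key]
```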

Putting It All Together — A Complete AI Assistant

Everything we have built — system messages, memory, streaming, error handling, cost tracking — combines into a single class. Here is what a deployable assistant looks like.

First, the constructor sets up the client, message history, and pricing lookup. The chat method dispatches to either streaming or standard mode:

python
class AIAssistant:
    """AI assistant with memory, streaming, and cost tracking."""

    def __init__(self, system_prompt, model="gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]
        self.total_cost = 0.0
        self.pricing = {
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
            "gpt-4o": {"input": 2.50, "output": 10.00},
        }

    def chat(self, user_input, stream=True):
        """Send a message and get a response."""
        self.messages.append({"role": "user", "content": user_input})
        if stream:
            return self._stream_response()
        return self._standard_response()

The streaming method iterates over chunks and accumulates the full response. Notice stream_options={"include_usage": True} — this tells the API to send token counts at the end of the stream so we can track costs:

python
    def _stream_response(self):
        """Stream the response token by token."""
        stream = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=True,
            stream_options={"include_usage": True}
        )

        full_response = ""
        print("AI: ", end="")
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
            if chunk.usage:
                self._track_cost(chunk.usage)

        print()
        self.messages.append({"role": "assistant", "content": full_response})
        return full_response

    def _standard_response(self):
        """Get the full response at once."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        self._track_cost(response.usage)
        return reply

Finally, helper methods for cost tracking and history management. The trim_history method keeps the system message (always index 0) and only the most recent N messages to control costs:

python
    def _track_cost(self, usage):
        """Calculate and accumulate cost."""
        rates = self.pricing.get(self.model, self.pricing["gpt-4o-mini"])
        cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
        cost += (usage.completion_tokens / 1_000_000) * rates["output"]
        self.total_cost += cost

    def get_cost(self):
        return f"${self.total_cost:.6f}"

    def trim_history(self, keep_last=20):
        """Keep only the system message and the last N messages."""
        system = self.messages[0]
        self.messages = [system] + self.messages[-keep_last:]

Here is the assistant in action. We create it with a system prompt defining its role, then ask a question. The second question demonstrates memory — the model knows we were talking about lists vs tuples:

python
assistant = AIAssistant(
    system_prompt="You are a Python expert. Give concise, practical answers."
)
assistant.chat("What's the difference between a list and a tuple?")
print(f"\nSession cost: {assistant.get_cost()}")
python
AI: Lists are mutable — you can add, remove, and change elements after
creation. Tuples are immutable — once created, they cannot be modified.
Use lists when your data needs to change, and tuples when it should stay
fixed (like database records, dictionary keys, or function return values).

Session cost: $0.000042
python
# The assistant remembers context from the previous exchange
assistant.chat("When would I use a tuple instead?")
python
AI: Use tuples when you need an immutable sequence: as dictionary keys,
for returning multiple values from a function, for named constants like
RGB colors (255, 128, 0), or when you want to signal that the data
should not be modified. They also use slightly less memory than lists.

That is a complete, working AI assistant. The core structure stays the same across projects — only the system prompt and the UI layer change.

Try It Yourself

Exercise 2: Add Conversation Persistence

Add two methods to AIAssistant: save_conversation(filepath) saves message history to JSON, load_conversation(filepath) restores it.

Hint 1: self.messages is already a list of dictionaries — native JSON. Use json.dump and json.load.

Hint 2: Open the file with "w" mode for saving and "r" for loading. Add indent=2 to json.dump for readability.

python
import json

def save_conversation(self, filepath):
    """Save the conversation history to a JSON file."""
    # YOUR CODE HERE
    pass

def load_conversation(self, filepath):
    """Load conversation history from a JSON file."""
    # YOUR CODE HERE
    pass

Test: After assistant.save_conversation("chat.json"), open chat.json — you should see a JSON array of objects with role and content keys.

Click to see solution
python
import json

def save_conversation(self, filepath):
    """Save the conversation history to a JSON file."""
    with open(filepath, "w") as f:
        json.dump(self.messages, f, indent=2)
    print(f"Conversation saved to {filepath}")

def load_conversation(self, filepath):
    """Load conversation history from a JSON file."""
    with open(filepath, "r") as f:
        self.messages = json.load(f)
    print(f"Loaded {len(self.messages)} messages from {filepath}")

The messages list maps directly to JSON. The `indent=2` makes the file human-readable for debugging.

Common Mistakes and How to Fix Them

Mistake 1: Printing the response object instead of the text

Every beginner does this at least once. The response is a Pydantic object, not a string.

Wrong:

python
response = client.chat.completions.create(...)
print(response)  # prints the entire Response object

Correct:

python
response = client.chat.completions.create(...)
print(response.choices[0].message.content)

Mistake 2: Forgetting conversation history in chatbots

Each API call is independent. If you do not send previous messages, the model has zero context.

Wrong:

python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What about tuples?"}]
)
# The model has no idea what "What about" refers to

Correct:

python
messages.append({"role": "user", "content": "What about tuples?"})
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages  # includes all previous exchanges
)

Mistake 3: Using .message instead of .delta in streaming

This gives you an AttributeError that is confusing if you have not seen it before.

Wrong:

python
for chunk in stream:
    print(chunk.choices[0].message.content)  # AttributeError!

Correct:

python
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Mistake 4: Building a chatbot without a system message

Without a system message, the model responds generically instead of following your specialized behavior rules. I recommend always starting with a system message, even for quick prototypes.

Wrong:

python
messages = []  # no system message!
messages.append({"role": "user", "content": "Review my code"})

Correct:

python
messages = [{"role": "system", "content": "You are a code reviewer. Give 3 bullet points."}]
messages.append({"role": "user", "content": "Review my code"})

Troubleshooting Common Errors

openai.AuthenticationError: Incorrect API key provided

When you see it: Your API key is wrong, expired, or not loaded from the environment.

The fix: This is almost always a .env file issue. Check it step by step:

python
import os
print(os.environ.get("OPENAI_API_KEY", "NOT SET"))

Verify .env exists, load_dotenv() is called before creating the client, and the key starts with sk-. A common culprit: trailing whitespace in the .env file.

openai.RateLimitError: Rate limit reached

When you see it: Too many requests too quickly, or you hit your spending limit.

The fix: Use the exponential backoff retry wrapper from above. Also check your limits at platform.openai.com/settings/organization/limits.

openai.BadRequestError: maximum context length exceeded

When you see it: Your messages list exceeds the model’s context window.

The fix: Trim conversation history — keep the system message and last 10-20 exchanges:

python
system_msg = messages[0]
messages = [system_msg] + messages[-20:]

When NOT to Use the OpenAI API

The API is powerful, but it is not always the right tool. Being honest about limitations is what separates a useful guide from marketing copy.

  • When you need deterministic output. Even with temperature=0, slight variations occur across calls. For a calculator or lookup table, use regular code.
  • When latency matters more than quality. API calls take 500ms-3s. For sub-100ms responses, pre-compute answers or use a local model.
  • When your data is sensitive. Data is processed on OpenAI’s servers. For healthcare or legal data, consider self-hosted models like Llama.
  • When the task is simple pattern matching. Extracting dates from text? Regex is faster, cheaper, and more reliable than a GPT call.
  • When cost is a constraint at scale. Processing 1M documents with GPT-4o costs ~$2,500+. A fine-tuned open-source model on your hardware costs a fraction after initial setup.

What to Build Next

You now have all the building blocks. Here are four projects to try, in increasing difficulty:

  1. Email classifier — Use a system message to classify emails as “urgent,” “info,” or “spam.” One API call per email, temperature=0, simple parsing.
  2. Meeting notes summarizer — Feed meeting transcripts to the summarizer function. Add action item extraction by modifying the system message.
  3. Code review bot — Hook the code reviewer into a GitHub webhook. Use the retry wrapper for reliability.
  4. Multi-tool assistant — Extend the AIAssistant class to call external APIs (weather, stock prices) based on user requests. This leads into OpenAI’s function calling feature.

Each project uses the same core pattern. The only things that change are the system message and what you do with the response.

Frequently Asked Questions

How much does it cost to use the OpenAI API?

Pricing is per token (roughly 4 characters). GPT-4o-mini costs $0.15/1M input tokens and $0.60/1M output tokens — check current pricing at platform.openai.com/pricing. A typical exchange costs $0.0001-$0.001. I recommend setting a billing limit of $10-20/month during development.

What is the difference between GPT-4o, GPT-4o-mini, and GPT-4.1?

GPT-4o is the flagship multimodal model. GPT-4o-mini is smaller, faster, cheaper — best for most tasks. GPT-4.1 is optimized for coding and instruction following. Start with mini, upgrade only if quality falls short. See the comparison table in the parameters section above.

How do I handle documents longer than the context window?

Split into chunks (2,000-4,000 tokens each), summarize each chunk, then summarize the summaries. This “map-reduce” approach handles documents of any length. With GPT-4o’s 128K context, most single documents fit in one call anyway.
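A sketch of that map-reduce flow, using the rough 4-characters-per-token rule of thumb. `chunk_text` and `summarize_long_document` are illustrative names, and `summarize` stands in for any single-call summarizer such as the `summarize_document` function in the complete code below:

```python
def chunk_text(text, max_chars=8000):
    # ~4 characters per token, so 8,000 chars is roughly 2,000 tokens per chunk
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks

def summarize_long_document(text, summarize):
    # "map": summarize each chunk; "reduce": summarize the combined summaries
    partials = [summarize(chunk) for chunk in chunk_text(text)]
    return summarize("\n".join(partials))
```

Splitting on paragraph boundaries keeps sentences intact, which helps the per-chunk summaries stay coherent.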

Can I use the API for free?

New accounts sometimes get limited free credits. But ongoing usage requires a payment method. Check platform.openai.com/pricing for current offers.

Is my data safe when using the API?

Per OpenAI’s policy, API data is not used to train models by default. You can opt out of data retention entirely. For highly sensitive data, consider self-hosted alternatives — running Llama 3 locally is now straightforward with tools like Ollama.

Complete Code

The full script below combines every section. Copy, paste, and run.
python
# Complete code from: Build Your First AI App with Python
# Requires: pip install openai python-dotenv
# Python 3.9+

import os
import time
import json
from openai import OpenAI, APIError, APIConnectionError, RateLimitError, APITimeoutError
from dotenv import load_dotenv

# --- Setup ---
load_dotenv()
client = OpenAI()

# --- Section 1: Basic API Call ---
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is Python?"}]
)
print("Basic call:", response.choices[0].message.content[:100], "...")

# --- Section 2: System Messages ---
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a senior Python code reviewer. Max 3 bullet points."},
        {"role": "user", "content": "Review this code:\ndef calc(x,y): return x+y"}
    ]
)
print("\nCode review:", response.choices[0].message.content)

# --- Section 3: Streaming ---
def stream_response(prompt):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    full = ""
    print("\nStreaming: ", end="")
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
            full += content
    print()
    return full

stream_response("Explain recursion in 2 sentences.")

# --- Section 4: Document Summarizer ---
def summarize_document(text, max_sentences=3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Produce exactly {max_sentences} bullet points. Each is one sentence."},
            {"role": "user", "content": f"Summarize:\n\n{text}"}
        ],
        temperature=0
    )
    return response.choices[0].message.content

sample = "Machine learning is a subset of AI that enables systems to learn from data. There are three types: supervised, unsupervised, and reinforcement learning."
print("\nSummary:", summarize_document(sample))

# --- Section 5: Error Handling ---
def call_with_retry(messages, model="gpt-4o-mini", max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except RateLimitError:
            time.sleep(2 ** attempt)
        except (APIConnectionError, APITimeoutError):
            time.sleep(2 ** attempt)
        except APIError as e:
            print(f"API error: {e.message}")
            raise
    raise RuntimeError(f"Failed after {max_retries} retries")

# --- Section 6: Cost Tracking ---
def call_with_cost_tracking(messages, model="gpt-4o-mini"):
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    pricing = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}
    rates = pricing.get(model, pricing["gpt-4o-mini"])
    cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
    cost += (usage.completion_tokens / 1_000_000) * rates["output"]
    print(f"Tokens: {usage.prompt_tokens} in, {usage.completion_tokens} out. Cost: ${cost:.6f}")
    return response.choices[0].message.content

call_with_cost_tracking([{"role": "user", "content": "What is 2+2?"}])

# --- Section 7: AI Assistant Class ---
class AIAssistant:
    def __init__(self, system_prompt, model="gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]
        self.total_cost = 0.0
        self.pricing = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

    def chat(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        response = self.client.chat.completions.create(
            model=self.model, messages=self.messages
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        usage = response.usage
        rates = self.pricing.get(self.model, self.pricing["gpt-4o-mini"])
        cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
        cost += (usage.completion_tokens / 1_000_000) * rates["output"]
        self.total_cost += cost
        return reply

    def get_cost(self):
        return f"${self.total_cost:.6f}"

    def trim_history(self, keep_last=20):
        system = self.messages[0]
        self.messages = [system] + self.messages[-keep_last:]

assistant = AIAssistant("You are a Python expert. Be concise.")
print("\nAssistant:", assistant.chat("What is a decorator?"))
print(f"Session cost: {assistant.get_cost()}")

print("\nScript completed successfully.")
