Build a Python AI Chatbot with Memory — OpenAI Tutorial

Written by Selva Prabhakaran | 28 min read

Your chatbot answers one question perfectly. Then you ask a follow-up, and it has no idea what you just said.

Every OpenAI API call is stateless — the model forgets everything between requests unless you explicitly pass the conversation history. This is the single biggest surprise for developers building their first chatbot.

This tutorial builds a chatbot that remembers your name 50 turns later, stays under a token budget, and streams responses word by word — all in under 120 lines of Python. You will go from a bare API call to a production-ready class with memory management, error handling, and streaming.

What Is Conversation Memory and Why Does the API Need It?

The OpenAI Chat Completions API is stateless. Each call to client.chat.completions.create() is independent — the model receives only what you send in the messages parameter and returns a response. It does not remember previous calls.

Memory, in this context, means maintaining a growing list of messages that you send back to the API with every new request. The model reads the full list and generates a response that accounts for everything said before.

To see the problem firsthand, make a single API call that introduces a name. The code below creates an OpenAI client, sends one user message with a system prompt, and prints the model’s response. Because this is a standalone call, the model will greet us — but it has no way to remember this exchange later.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My name is Priya."},
    ],
)
print(response.choices[0].message.content)
Output:
Hello Priya! Nice to meet you. How can I help you today?

That works for one exchange. But if you send a second request asking “What is my name?”, the model has no idea — because you did not include the first exchange in the new request.

This is the moment it clicks for most people: the API has no hidden session. If you want continuity, you build it yourself.

Prerequisites

  • Python version: 3.9+
  • Required library: openai (1.0+)
  • Install: pip install openai tiktoken
  • API key: An OpenAI API key stored in the OPENAI_API_KEY environment variable
  • Time to complete: 25-30 minutes

Tip: **Store your API key in an environment variable, never in code.** On macOS/Linux: `export OPENAI_API_KEY="sk-…"`. On Windows: `set OPENAI_API_KEY=sk-…`. The OpenAI client reads it automatically when you call `OpenAI()` with no arguments, but passing it explicitly via `os.environ.get()` makes the dependency visible in your code.

The 5 Steps to Build a Chatbot with Memory

  1. Make a single API call and understand the response structure
  2. Store messages in a list and send the full history with each request
  3. Add a system prompt to control personality and behavior
  4. Implement token counting and memory management to control costs
  5. Wrap everything in a reusable class with streaming support

Understanding Messages, Roles, and the Conversation Loop

The messages parameter is a list of dictionaries. Each dictionary has a role and content. Three roles matter:

  • system — Sets the assistant’s behavior. Sent once at the start.
  • user — The human’s input.
  • assistant — The model’s previous responses. You include these so the model knows what it already said.

Here is the core pattern that gives a chatbot memory. The chat() function appends the user’s message to a shared messages list, sends the entire list to the API, and then appends the assistant’s reply. Each subsequent call includes every prior exchange.

python
messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
]

def chat(user_input):
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

Each call sends the entire conversation. The model sees every prior exchange and responds in context.

python
print(chat("My name is Priya."))
print(chat("What is my name?"))
Output:
Hi Priya! Nice to meet you. What would you like to learn about Python?
Your name is Priya! What Python topic can I help you with?

The second call works because the messages list now contains the first exchange. The model reads “My name is Priya” in the history and responds correctly.

Key Insight: **The messages list IS the chatbot’s memory.** There is no hidden state, no session ID, no server-side storage. If you lose the list, the chatbot forgets everything. If you save it, the chatbot remembers forever.

Designing System Prompts That Shape Behavior

The system message is the most underused tool in chatbot development. It controls tone, format, boundaries, and expertise — all in a single string that the model treats as background instruction.

python
system_prompt = """You are a senior Python developer acting as a code reviewer.
Rules:
- Always suggest improvements, never just say "looks good"
- Point out potential bugs, not just style issues
- Keep responses under 100 words unless the code is complex
- If the user asks something unrelated to code, redirect politely"""

messages = [{"role": "system", "content": system_prompt}]

I prefer keeping system prompts in separate text files during development. That way you can iterate on the prompt without touching Python code — just change the file, restart, and test.
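To illustrate that workflow, here is one way to wire it up. The helper name `load_system_prompt()` and the filename `system_prompt.txt` are illustrative, not part of the openai library; the function reads the prompt from a text file and falls back to a default when the file is missing or empty.

```python
from pathlib import Path

def load_system_prompt(path, fallback="You are a helpful assistant."):
    """Read a system prompt from a text file; fall back to a default."""
    try:
        text = Path(path).read_text().strip()
        return text if text else fallback
    except FileNotFoundError:
        return fallback

# Write a prompt file, then load it into the messages list
Path("system_prompt.txt").write_text("You are a senior Python code reviewer.")
messages = [{"role": "system", "content": load_system_prompt("system_prompt.txt")}]
print(messages[0]["content"])
```

Now you can edit `system_prompt.txt`, restart the script, and test a new prompt without touching the Python code.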

A well-designed system prompt does three things: defines the role, sets constraints, and establishes response format. Here is a comparison of weak versus strong prompts.

The unfocused way:

python
# Vague — model decides everything
system = "You are a helpful assistant."

The precise way:

python
# Clear role, constraints, and format
system = """You are a data science tutor for beginners.
- Explain concepts before showing code
- Use pandas and scikit-learn only
- Never suggest deprecated APIs
- Format code examples with comments"""

The precise prompt produces dramatically more consistent responses. The model follows explicit constraints much better than vague suggestions.

Warning: **System prompts are not security boundaries.** A determined user can override system instructions with carefully crafted prompts. Never rely on the system message to enforce access control or hide sensitive information. Use server-side validation for anything security-critical.

What happens if we change the temperature parameter? Temperature controls randomness — at 0, the model gives nearly identical answers every time; at 1, it gets creative. The loop below calls the API three times with different temperatures on the same prompt, so you can see how randomness changes the output.

python
# Same question, different temperatures
for temp in [0, 0.5, 1.0]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a creative writing assistant."},
            {"role": "user", "content": "Write one sentence about Python."},
        ],
        temperature=temp,
    )
    print(f"temp={temp}: {response.choices[0].message.content}")
Output:
temp=0: Python is a versatile programming language known for its readability and wide range of applications.
temp=0.5: Python slithers through complex problems with elegant, readable syntax that makes programming feel almost conversational.
temp=1.0: Like a patient craftsman shaping raw data into gleaming insight, Python turns the chaos of information into something you can actually understand.

My rule of thumb: use temperature=0 for factual tasks (code generation, data extraction) and temperature=0.7 for conversational chatbots. Start low and increase only if responses feel too repetitive.

Choosing the Right Model

The model you choose affects cost, speed, and quality. Here is a quick comparison for chatbot use cases.

| Model | Cost (input / output per 1M tokens) | Best for | Context window |
|---|---|---|---|
| GPT-4o-mini | ~$0.15 / $0.60 | Prototyping, simple chatbots, high volume | 128K |
| GPT-4o | ~$2.50 / $10.00 | Complex reasoning, nuanced conversations | 128K |
| GPT-4.1 | ~$2.00 / $8.00 | Instruction following, very long conversations | 1M |

I always start with GPT-4o-mini for new projects. It handles most conversational tasks well, and the cost difference is dramatic — you can run roughly 15x more conversations for the same budget. Upgrade to GPT-4o only when you hit quality gaps in reasoning or complex instructions.
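To make the comparison concrete, here is a back-of-the-envelope cost estimator. The `estimate_cost()` helper and the `PRICES` dict are illustrative, using the approximate prices from the table above; check current pricing before relying on the numbers.

```python
# Approximate prices per 1M tokens (input, output), from the table above
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

def estimate_cost(input_tokens, output_tokens, model="gpt-4o-mini"):
    """Rough dollar cost for one request at the listed prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 10-turn chat that resends ~3,000 input tokens and generates ~300 output tokens per turn
cost = sum(estimate_cost(3000, 300, "gpt-4o-mini") for _ in range(10))
print(f"~${cost:.4f}")  # ~$0.0063
```

Even a fairly chatty session on GPT-4o-mini stays under a cent, which is why it is a sensible default for prototyping.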

Building a Multi-Turn Chatbot with a Conversation Loop

We have the chat() function that handles memory. The missing piece is a loop that keeps the conversation going — read input, call chat(), print the result, repeat. This is the step where your code goes from “demo script” to “something you can actually talk to.”

The run_chatbot() function below wraps the conversation in a while True loop. It reads user input with input(), checks for exit commands (quit, exit, q), and calls the API with the full message history. The function is terminal-based — it will keep prompting until the user types an exit keyword.

python
def run_chatbot():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
    ]
    print("Chatbot ready. Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() in ("quit", "exit", "q"):
            print("Goodbye!")
            break

        messages.append({"role": "user", "content": user_input})
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        print(f"Assistant: {reply}\n")

This function runs in the terminal. Type a message, get a response, and the conversation continues with full context.

python
# Run in terminal — requires interactive input
# run_chatbot()

Note: **This code requires a terminal with interactive input.** It will not work in Jupyter notebooks without modifications. For notebook use, replace `input()` with a predefined list of messages or use `ipywidgets` for an input box.

Quick Check — predict what happens: If you chat for 50 turns, how many messages does the list contain (including the system message)?

Answer: 101. One system message + 50 user messages + 50 assistant messages. Every turn adds two messages. This growth is exactly why we need token management — the topic of the next section.
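That arithmetic is worth encoding once so you can sanity-check it for any conversation length (`message_count()` is just an illustrative helper, not a library function):

```python
def message_count(turns):
    """History size after `turns` full exchanges: 1 system + 2 messages per turn."""
    return 1 + 2 * turns

for turns in (1, 10, 50):
    print(f"{turns} turns -> {message_count(turns)} messages")
```

The list grows linearly, but as the next section shows, the *token cost* of resending it grows much faster.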

Token Counting and Memory Management — Controlling Costs

Every message in the conversation costs tokens. As the chat grows, each API call gets more expensive because you send the entire history. After 20 turns of detailed conversation, you could easily hit 4,000+ tokens per request.

The tiktoken library tokenizes text the same way the model does, so you can count tokens locally before sending a request. The count_tokens() function below takes a messages list and a model name, then iterates through each message. It adds 4 overhead tokens per message (for the internal delimiters the API uses) and 2 tokens for reply priming at the end. The function returns an estimate of the token count the API will charge for this input.

python
import tiktoken

def count_tokens(messages, model="gpt-4o-mini"):
    encoding = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # message overhead
        total += len(encoding.encode(msg["content"]))
        total += len(encoding.encode(msg["role"]))
    total += 2  # reply priming
    return total

The overhead tokens (4 per message, 2 for reply priming) account for the special tokens the API adds internally. These are approximate values from OpenAI’s token counting cookbook — the exact count can vary slightly by model.

Let’s verify it works on a simple two-message conversation. We expect a small number — the system prompt and one short user message.

python
test_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
print(f"Token count: {count_tokens(test_messages)}")
Output:
Token count: 24

I always check token counts during development. It is surprisingly easy to write a system prompt that eats 800 tokens before a single user message — and you will not notice until you check.
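One way to spot a bloated prompt early is a per-message breakdown. The sketch below is a hypothetical helper that avoids a tiktoken dependency by assuming roughly 4 characters per token plus the per-message overhead described above; it is a rough diagnostic, not an exact count.

```python
def rough_token_breakdown(messages):
    """Approximate per-message token usage (~4 characters per token heuristic)."""
    rows = []
    for i, msg in enumerate(messages):
        approx = max(1, len(msg["content"]) // 4) + 4  # + per-message overhead
        rows.append((i, msg["role"], approx))
    return rows

msgs = [
    {"role": "system", "content": "You are a helpful assistant." * 20},  # bloated prompt
    {"role": "user", "content": "What is Python?"},
]
for i, role, approx in rough_token_breakdown(msgs):
    print(f"{i}: {role:9s} ~{approx} tokens")
```

Running this makes the imbalance obvious at a glance: the system prompt dwarfs the actual question.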

Four Memory Strategies

Not every chatbot needs the same memory approach. Here are four strategies, ranked by complexity.

| Strategy | How it works | Best for | Drawback |
|---|---|---|---|
| Full history | Send everything | Short conversations (<20 turns) | Costs grow linearly |
| Sliding window | Keep last N exchanges | Customer support, quick Q&A | Loses early context |
| Summary memory | Summarize old messages, keep recent ones | Long conversations | Summary costs tokens too |
| Persistent storage | Save to database, load on reconnect | Multi-session apps | Requires infrastructure |
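The simplest possible sliding window counts messages rather than tokens. This illustrative sketch (the helper name `keep_last_n_exchanges()` is my own) keeps the system message plus the last n user/assistant pairs; the token-budget version later in this section is more precise, but this shows the core idea in a few lines.

```python
def keep_last_n_exchanges(messages, n=3):
    """Sliding window: system message plus the last n user/assistant pairs."""
    system_msg, conversation = messages[0], messages[1:]
    return [system_msg] + conversation[-2 * n:]

history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

window = keep_last_n_exchanges(history, n=3)
print(len(window))           # 1 system + 6 recent messages = 7
print(window[1]["content"])  # oldest surviving message: "question 7"
```

Counting messages is cheap but imprecise, since one long reply can weigh as much as ten short ones; that is why the token-based trim is the better production choice.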

The sliding window is the most practical starting point. The trim_to_token_budget() function below preserves the system message and works backward from the newest messages, adding each one to the trimmed list until the token budget is full. This ensures the most recent context is always kept.

python
def trim_to_token_budget(messages, max_tokens=2000, model="gpt-4o-mini"):
    """Keep the system message and trim oldest exchanges to fit budget."""
    system_msg = messages[0]  # always preserve
    trimmed = [system_msg]
    conversation = messages[1:]  # user/assistant pairs

    # Add messages from newest to oldest
    for msg in reversed(conversation):
        candidate = [system_msg] + [msg] + trimmed[1:]
        if count_tokens(candidate, model) <= max_tokens:
            trimmed.insert(1, msg)
        else:
            break

    return trimmed

This function works from the newest messages backward. It keeps adding messages until the token budget is exhausted, then stops. The system message always stays.

To see how much trimming saves, let’s simulate a 20-turn conversation in which each assistant reply repeats a short sentence ten times. The trim_to_token_budget() call with a 500-token budget cuts the conversation down to only the most recent exchanges.

python
# Simulate a long conversation
long_conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
]
for i in range(20):
    long_conversation.append({"role": "user", "content": f"Tell me about topic {i}"})
    long_conversation.append({"role": "assistant", "content": f"Here is info about topic {i}. " * 10})

print(f"Full history: {count_tokens(long_conversation)} tokens")
trimmed = trim_to_token_budget(long_conversation, max_tokens=500)
print(f"After trimming: {count_tokens(trimmed)} tokens")
print(f"Messages kept: {len(trimmed)} of {len(long_conversation)}")
Output:
Full history: 1742 tokens
After trimming: 486 tokens
Messages kept: 9 of 41

Summary Memory — Compressing Old Context

The sliding window drops old messages entirely. Summary memory is smarter — it asks the model to compress old messages into a short summary, then keeps that summary plus the recent messages.

The summarize_old_messages() function splits the conversation into old and recent halves. It sends the old messages to the API with a summarization instruction, gets back a 2-3 sentence summary, and returns a new messages list with the system prompt, the summary as a second system message, and the recent messages intact.

python
def summarize_old_messages(messages, keep_recent=6):
    """Summarize old messages and keep recent ones."""
    if len(messages) <= keep_recent + 1:
        return messages  # nothing to summarize

    system_msg = messages[0]
    old_messages = messages[1:-keep_recent]
    recent_messages = messages[-keep_recent:]

    # Ask the model to summarize the old conversation
    summary_request = [
        {"role": "system", "content": "Summarize this conversation in 2-3 sentences. Keep key facts (names, preferences, decisions)."},
    ] + old_messages

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=summary_request,
        max_tokens=150,
    )
    summary = response.choices[0].message.content

    # Replace old messages with a single summary message
    summary_msg = {"role": "system", "content": f"Previous conversation summary: {summary}"}
    return [system_msg, summary_msg] + recent_messages

This approach costs one extra API call for the summary, but it preserves key facts that the sliding window would lose. I prefer this strategy when building chatbots where users reference things from much earlier — “remember that bug we discussed at the start?”

Key Insight: **Token cost grows with every turn because you resend the entire history.** A 10-turn conversation does not cost 10x a single call — it costs roughly 1+2+3+…+10 = 55x, because each call includes all previous messages. This is why memory management is not optional for any chatbot that runs more than a few turns.
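The growth is easy to verify with a toy model. If each turn adds roughly one "unit" of tokens, turn k resends k units, so the totals below compare untrimmed growth against a fixed 4-unit window (the numbers are a simplification, not real token counts):

```python
# Untrimmed: turn k resends everything, so total work is 1 + 2 + ... + 10
total_units = sum(range(1, 11))
print(total_units)  # 55

# Trimmed to a fixed window of 4 units: growth becomes linear
trimmed_units = sum(min(k, 4) for k in range(1, 11))
print(trimmed_units)  # 34
```

The gap widens fast: at 100 turns the untrimmed sum is 5,050 units while the windowed version is only 394.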

Try It Yourself

Exercise 1: Build a Token-Aware Chat Function

Write a function chat_with_budget() that sends a message to the API but first trims the conversation to stay under a token budget. The function should return both the reply and the current token count.

Hint 1

Call `trim_to_token_budget()` on the messages list before passing it to the API.

Hint 2

After getting the response, append the assistant’s reply to the original (untrimmed) messages list. Use `count_tokens()` to report the current usage.

Starter code:

python
def chat_with_budget(messages, user_input, max_tokens=2000):
    messages.append({"role": "user", "content": user_input})
    # TODO: Trim messages to fit budget
    # TODO: Make API call with trimmed messages
    # TODO: Append assistant reply to original messages
    # TODO: Return (reply, token_count)
    pass

Solution
python
def chat_with_budget(messages, user_input, max_tokens=2000):
    messages.append({"role": "user", "content": user_input})
    trimmed = trim_to_token_budget(messages, max_tokens=max_tokens)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=trimmed,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    token_count = count_tokens(messages)
    return reply, token_count

**Explanation:** We trim the messages before the API call so we stay within budget, but append the reply to the full `messages` list so no context is permanently lost from our local history. The `count_tokens()` call after appending tells us the current total — useful for monitoring cost growth over a session.

Streaming Responses — Making Your Chatbot Feel Fast

Waiting 3-5 seconds for a response with no feedback feels broken. Streaming sends tokens as they are generated, so the user sees words appearing immediately — the same experience as ChatGPT.

The chat_stream() function below works like chat() but passes stream=True to the API. Instead of returning a single response object, the API returns an iterator of chunks. Each chunk contains a delta with a partial token. We print each token as it arrives and accumulate them into full_reply for the message history.

python
def chat_stream(messages, user_input):
    messages.append({"role": "user", "content": user_input})
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )

    full_reply = ""
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
            full_reply += delta.content

    print()  # newline after streaming completes
    messages.append({"role": "assistant", "content": full_reply})
    return full_reply

The key difference from non-streaming: stream=True returns an iterator of chunks instead of a single response. We accumulate these into full_reply for the message history.

I use streaming in every user-facing chatbot. The perceived latency drops from seconds to milliseconds. Even if the total generation time is the same, the user experience is dramatically different.

Quick Check — what is delta.content on the very first chunk? It is usually None or an empty string. The first chunk often contains only the role metadata. Always check if delta.content: before printing.
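You can verify this handling logic without touching the API by faking the chunk shape. The sketch below builds objects with the same `chunk.choices[0].delta.content` structure (the `make_chunk()` helper is purely for this simulation) and runs the same accumulation loop as `chat_stream()`:

```python
from types import SimpleNamespace

def make_chunk(content):
    """Build an object shaped like a streaming chunk: chunk.choices[0].delta.content."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

# First chunk typically carries only role metadata, so content is None
fake_stream = [make_chunk(None), make_chunk("Hello"), make_chunk(" world"), make_chunk(None)]

full_reply = ""
for chunk in fake_stream:
    delta = chunk.choices[0].delta
    if delta.content:  # safely skips None and empty chunks
        full_reply += delta.content

print(full_reply)  # Hello world
```

Without the `if delta.content:` guard, the first chunk would raise a TypeError when concatenated, which is exactly the bug the quick check warns about.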

Error Handling and Retries — Production Essentials

API calls fail. Rate limits hit. Networks drop. A chatbot without error handling just silently stops responding, leaving the user staring at a frozen screen.

The chat_with_retries() function below wraps the API call in a retry loop with exponential backoff. It catches three specific OpenAI exceptions — RateLimitError, APIConnectionError, and APITimeoutError — and waits progressively longer between attempts (1s, 2s, 4s). If the error is a non-retryable APIError, it removes the user message from history and re-raises immediately. If all retries fail, it also cleans up the message history.

python
import time
from openai import (
    APIError,
    APIConnectionError,
    RateLimitError,
    APITimeoutError,
)

def chat_with_retries(messages, user_input, max_retries=3):
    messages.append({"role": "user", "content": user_input})

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
            )
            reply = response.choices[0].message.content
            messages.append({"role": "assistant", "content": reply})
            return reply

        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)
        except APIConnectionError:
            wait = 2 ** attempt
            print(f"Connection error. Retrying in {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            wait = 2 ** attempt
            print(f"Timeout. Retrying in {wait}s...")
            time.sleep(wait)
        except APIError as e:
            print(f"API error: {e}")
            messages.pop()  # remove the failed user message
            raise

    messages.pop()  # remove user message if all retries failed
    raise RuntimeError("Max retries exceeded")

The exponential backoff (2 ** attempt) waits 1, 2, 4 seconds between retries. This gives the API time to recover without hammering it with requests.

Notice the messages.pop() on failure. If the API call fails, we remove the user message we appended — otherwise the conversation history would contain a question with no answer, which confuses subsequent calls.
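The retry-with-backoff pattern is worth having as a standalone, testable helper. This sketch generalizes it to any callable and any tuple of retryable exceptions (the name `with_retries()` and the demo `flaky()` function are illustrative); it can be exercised with ordinary built-in exceptions instead of live API errors.

```python
import time

def with_retries(fn, max_retries=3, base_delay=1.0,
                 retryable=(ConnectionError, TimeoutError)):
    """Call fn(); on a retryable error, back off exponentially and try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulate a call that fails twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0))  # ok (after 2 retries)
```

The same shape drops into `chat_with_retries()` by passing the OpenAI exception classes as `retryable` and wrapping the `create()` call in a lambda.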

Saving and Loading Conversations — Persistence Between Sessions

So far, all our conversation history lives in memory — close the script and it is gone. For anything beyond a quick prototype, you want to save conversations so users can resume where they left off.

The simplest approach: serialize the messages list to JSON. This works for single-user scripts and prototyping.

python
import json

def save_conversation(messages, filepath="chat_history.json"):
    with open(filepath, "w") as f:
        json.dump(messages, f, indent=2)

def load_conversation(filepath="chat_history.json"):
    try:
        with open(filepath) as f:
            return json.load(f)
    except FileNotFoundError:
        return [{"role": "system", "content": "You are a helpful assistant."}]
python
# Save at end of session
save_conversation(messages)

# Load at start of next session
messages = load_conversation()
print(f"Loaded {len(messages)} messages from previous session")
Output:
Loaded 15 messages from previous session

For production apps with multiple users, use a database instead — SQLite for local apps, PostgreSQL or Redis for web services. The pattern is the same: serialize messages to JSON and store them keyed by a session or user ID.
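That pattern can be sketched with the standard-library sqlite3 module. The schema and helper names below (`init_db()`, `save_session()`, `load_session()`) are illustrative, but the core idea matches the JSON approach above: serialize the messages list and key it by session ID.

```python
import json
import sqlite3

def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conversations "
        "(session_id TEXT PRIMARY KEY, messages TEXT)"
    )

def save_session(conn, session_id, messages):
    # INSERT OR REPLACE makes repeated saves idempotent per session
    conn.execute(
        "INSERT OR REPLACE INTO conversations VALUES (?, ?)",
        (session_id, json.dumps(messages)),
    )

def load_session(conn, session_id, default_system="You are a helpful assistant."):
    row = conn.execute(
        "SELECT messages FROM conversations WHERE session_id = ?", (session_id,)
    ).fetchone()
    if row is None:  # new session: start with just the system prompt
        return [{"role": "system", "content": default_system}]
    return json.loads(row[0])

# In-memory database for the demo; a real app would pass a file path
conn = sqlite3.connect(":memory:")
init_db(conn)
save_session(conn, "priya-1", [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Priya."},
])
print(len(load_session(conn, "priya-1")))  # 2
```

Swapping SQLite for PostgreSQL or Redis changes only the storage calls; the serialize-by-session-ID pattern stays the same.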

Putting It All Together — A Complete Chatbot Class

Everything we have built — memory, token management, streaming, error handling — combines into a single reusable class. The Chatbot class has three responsibilities: manage the conversation history, enforce token budgets, and handle API communication.

The constructor takes a system prompt, model name, and max token budget. It initializes the message history with the system prompt and sets up a counter for total tokens used across the session.

python
class Chatbot:
    def __init__(self, system_prompt, model="gpt-4o-mini", max_tokens=3000):
        self.model = model
        self.max_tokens = max_tokens
        self.messages = [{"role": "system", "content": system_prompt}]
        self.total_tokens_used = 0

The _trim_history() method removes the oldest non-system messages one at a time until the token count is within budget. The chat() method is the main entry point — it appends the user’s message, trims, calls the API, and returns the reply. Pass stream=True to get streaming output.

python
    def _trim_history(self):
        """Trim conversation to fit within token budget."""
        while (count_tokens(self.messages, self.model) > self.max_tokens
               and len(self.messages) > 2):
            self.messages.pop(1)  # remove oldest non-system message

    def chat(self, user_input, stream=False):
        """Send a message and return the response."""
        self.messages.append({"role": "user", "content": user_input})
        self._trim_history()

        response = client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=stream,
        )

        if stream:
            return self._handle_stream(response)
        else:
            reply = response.choices[0].message.content
            self.messages.append({"role": "assistant", "content": reply})
            self.total_tokens_used += response.usage.total_tokens
            return reply

The remaining methods handle streaming accumulation, history export, and usage statistics.

python
    def _handle_stream(self, stream):
        """Process streaming response and accumulate tokens."""
        full_reply = ""
        for chunk in stream:
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="", flush=True)
                full_reply += delta.content
        print()
        self.messages.append({"role": "assistant", "content": full_reply})
        return full_reply

    def get_history(self):
        """Return conversation history without system message."""
        return self.messages[1:]

    def get_stats(self):
        """Return usage statistics."""
        return {
            "messages": len(self.messages),
            "current_tokens": count_tokens(self.messages, self.model),
            "total_tokens_used": self.total_tokens_used,
        }

Here is the class in action. We create a tutor chatbot with a 2,000-token budget, ask two questions, and then check the usage stats. Expect a short Python explanation, then a follow-up that references the first answer.

python
bot = Chatbot(
    system_prompt="You are a concise Python tutor.",
    max_tokens=2000,
)

print(bot.chat("What are list comprehensions?"))
print(bot.chat("Give me an example with filtering."))
print(bot.get_stats())
Output:
List comprehensions are a concise way to create lists in Python. Instead of writing a for loop, you write the expression and loop in a single line: [expression for item in iterable].
Sure! Here's an example that filters even numbers: [x for x in range(10) if x % 2 == 0] — this gives [0, 2, 4, 6, 8].
{'messages': 5, 'current_tokens': 142, 'total_tokens_used': 198}

Try It Yourself

Exercise 2: Add Conversation Export

Add a method export_conversation() to the Chatbot class that returns the conversation as a formatted string. Each message should show the role and content, like: [user] What are list comprehensions?

Hint 1

Loop through `self.messages[1:]` (skip the system message) and format each entry.

Hint 2

Use an f-string: `f"[{msg['role']}] {msg['content']}"` and join with newlines.

Starter code:

python
def export_conversation(self):
    """Return formatted conversation string."""
    # TODO: Loop through messages (skip system)
    # TODO: Format as [role] content
    # TODO: Return joined string
    pass

Solution
python
def export_conversation(self):
    """Return formatted conversation string."""
    lines = []
    for msg in self.messages[1:]:
        lines.append(f"[{msg['role']}] {msg['content']}")
    return "\n".join(lines)

**Explanation:** We skip `self.messages[0]` because that is the system prompt — internal configuration, not conversation. The `f"[{msg['role']}]"` prefix makes it clear who said what. You could extend this to save to a file with `with open('chat.txt', 'w') as f: f.write(bot.export_conversation())`.

Common Mistakes and How to Fix Them

These are easy to make and surprisingly hard to debug because the chatbot still “works” — it just works badly.

Mistake 1: Not including previous messages

python
# Wrong — each call is isolated, no memory
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": user_input}],
)

Why it is wrong: The model receives only the current message. It cannot reference anything said before. This is the most common chatbot bug.

python
# Correct — include full conversation history
messages.append({"role": "user", "content": user_input})
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)

Mistake 2: Forgetting to append the assistant’s response

python
# Wrong — response not saved to history
messages.append({"role": "user", "content": user_input})
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
# Missing: messages.append({"role": "assistant", ...})

Why it is wrong: The model sees user messages without corresponding answers. On the next turn, it may re-answer old questions or produce confused responses.

Mistake 3: No token budget management

python
# Wrong — messages grow forever
while True:
    user_input = input("You: ")
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    # After 50+ turns, you hit context limits and costs explode

Why it is wrong: GPT-4o-mini has a 128K token context window, but each token costs money. A long conversation without trimming can easily cost 10-50x more than necessary.

Mistake 4: Hardcoding the API key

python
# Wrong — key exposed in code
client = OpenAI(api_key="sk-abc123...")

Why it is wrong: Anyone who sees your code — version control, screenshots, shared notebooks — gets your API key. Use environment variables instead.

python
# Correct
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

Troubleshooting Common Errors

openai.AuthenticationError: Incorrect API key provided

When you see it: Your API key is wrong, expired, or not set.

The fix: Check that OPENAI_API_KEY is set correctly in your environment. Print os.environ.get("OPENAI_API_KEY", "")[:8] to verify the first few characters match your key (an empty result means the variable is not set at all). Regenerate the key in the OpenAI dashboard if needed. Watch out for trailing whitespace in .env files — it is a surprisingly common culprit.

openai.RateLimitError: Rate limit reached

When you see it: You are sending too many requests too quickly, or your account spending limit is reached.

The fix: Implement exponential backoff (as shown in the error handling section). Also check your account billing settings at platform.openai.com. Free tier accounts have strict rate limits — upgrading to a paid plan increases them significantly.

openai.BadRequestError: maximum context length exceeded

When you see it: Your messages list has grown beyond the model’s context window.

The fix: Implement token counting and trimming (the trim_to_token_budget() function from earlier). Set your budget well below the model’s maximum to leave room for the response.

When NOT to Build a Custom Chatbot

Not every problem needs a hand-rolled conversation loop. Here is when to reach for a different tool.

  • Simple Q&A over documents — Use a RAG (Retrieval Augmented Generation) pipeline instead. Libraries like LangChain or LlamaIndex handle document chunking, embedding, and retrieval. A chatbot loop on raw documents will hit context limits quickly.
  • Customer-facing support with compliance requirements — Use a managed platform (Intercom, Zendesk AI, or OpenAI Assistants API) that provides audit trails, content filtering, and user management out of the box.
  • Tasks requiring real-time data — If your chatbot needs to check live prices, weather, or database records, you need function calling (tool use), not just a conversation loop. That is a separate architecture pattern.
  • High-throughput production systems — If you need to serve hundreds of concurrent users, consider the OpenAI Assistants API or Responses API, which handle conversation state server-side and reduce your infrastructure burden.

Frequently Asked Questions

How much does it cost to run a chatbot with the OpenAI API?

GPT-4o-mini costs approximately $0.15 per million input tokens and $0.60 per million output tokens. A typical 10-turn conversation uses roughly 2,000-4,000 tokens total, costing well under $0.01. Costs grow with conversation length because you resend the full history each turn. I recommend setting a spending limit on your API account during development — $10/month is plenty for prototyping.

Can I save conversation history between sessions?

Yes. Serialize the messages list to JSON and save it to a file or database. On the next session, load the JSON back into the list before starting. See the “Saving and Loading Conversations” section above for the full implementation. For production apps, use SQLite or PostgreSQL keyed by user/session ID.

What is the difference between the Chat Completions API and the Assistants API?

The Chat Completions API (used in this tutorial) is stateless — you manage conversation history yourself. The Assistants API is stateful — OpenAI stores the conversation thread server-side. The Assistants API also supports built-in tools (code interpreter, file search) and persistent threads. Use Chat Completions when you want full control. Use Assistants when you want OpenAI to manage state and tools for you.

How do I make my chatbot remember things across different topics?

The sliding window approach drops old messages, which means early context is lost. For long conversations where early context matters, use the summary memory strategy: periodically summarize old messages into a single “memory” message and prepend it to the conversation. This preserves key facts while keeping token counts low.
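The compaction step can be sketched in a few lines. Here `summarize` is a placeholder for one extra API call that asks the model to condense the old transcript into key facts:

```python
def compact_history(messages, keep_recent=6, summarize=None):
    # Fold everything except the last `keep_recent` messages into one memory message.
    system, rest = messages[0], messages[1:]
    if len(rest) <= keep_recent:
        return messages  # nothing old enough to summarize yet
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    memory = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(transcript),
    }
    return [system, memory] + recent
```

Run this whenever the history crosses a length threshold; the model still sees facts like the user's name via the summary message, while the token count stays roughly constant.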

What model should I use for a chatbot?

GPT-4o-mini is the best starting point — fast, cheap, and handles most conversational tasks well. See the model comparison table in the “Choosing the Right Model” section above for detailed cost and capability comparisons. Start with mini, upgrade to GPT-4o only when you notice quality gaps in reasoning or complex instructions.

How do I control the length of the model’s responses?

Use the max_tokens parameter in your API call. For example, max_tokens=150 limits the response to roughly 150 tokens (~100 words). This does not guarantee exact length — the model may stop earlier if it finishes its thought. Combine with a system prompt instruction like “Keep responses under 100 words” for more consistent results.

What to Build Next

You now have a production-ready chatbot pattern. Here are four projects that build directly on what you learned, each adding one new capability.

  1. FAQ bot for your documentation — Take a product FAQ, split it into chunks, and inject the relevant chunk into the system prompt based on the user’s question. This is RAG in its simplest form — no vector database needed. Use the Chatbot class with a modified system prompt per question.
  2. Meeting summarizer with structured output — Feed meeting transcripts into the chatbot and ask it to return JSON with action items, decisions, and owners. Practice long-context handling and the response_format parameter for structured output.
  3. Code review assistant — Build a chatbot that reviews Python code, finds bugs, and suggests improvements. The system prompt from this tutorial is exactly the right foundation. Add temperature=0 for consistency.
  4. Multi-tool assistant with function calling — Add function calling so the chatbot can check live data, query a database, or run calculations mid-conversation. This is the next major capability after conversation memory — and it uses the same messages list pattern you already know.

Complete Code

The full script below is self-contained: copy, paste, and run.
python
# Complete code from: Build a Python AI Chatbot with Memory
# Requires: pip install openai tiktoken
# Python 3.9+

import os
import json
import time
import tiktoken
from openai import (
    OpenAI,
    APIError,
    APIConnectionError,
    RateLimitError,
    APITimeoutError,
)

# --- Setup ---
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# --- Token Counting ---
def count_tokens(messages, model="gpt-4o-mini"):
    # Approximate the chat-format token count: ~4 tokens of message framing,
    # plus the encoded role and content, plus 2 tokens of reply priming.
    encoding = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4
        total += len(encoding.encode(msg["content"]))
        total += len(encoding.encode(msg["role"]))
    total += 2
    return total

# --- Memory Management ---
def trim_to_token_budget(messages, max_tokens=2000, model="gpt-4o-mini"):
    # Always keep the system prompt, then add messages newest-first
    # until the budget fills, preserving chronological order.
    system_msg = messages[0]
    trimmed = [system_msg]
    conversation = messages[1:]
    for msg in reversed(conversation):
        candidate = [system_msg] + [msg] + trimmed[1:]
        if count_tokens(candidate, model) <= max_tokens:
            trimmed.insert(1, msg)
        else:
            break
    return trimmed

# --- Persistence ---
def save_conversation(messages, filepath="chat_history.json"):
    with open(filepath, "w") as f:
        json.dump(messages, f, indent=2)

def load_conversation(filepath="chat_history.json"):
    try:
        with open(filepath) as f:
            return json.load(f)
    except FileNotFoundError:
        return [{"role": "system", "content": "You are a helpful assistant."}]

# --- Chatbot Class ---
class Chatbot:
    def __init__(self, system_prompt, model="gpt-4o-mini", max_tokens=3000):
        self.model = model
        self.max_tokens = max_tokens
        self.messages = [{"role": "system", "content": system_prompt}]
        self.total_tokens_used = 0

    def _trim_history(self):
        while (count_tokens(self.messages, self.model) > self.max_tokens
               and len(self.messages) > 2):
            self.messages.pop(1)

    def chat(self, user_input, stream=False):
        self.messages.append({"role": "user", "content": user_input})
        self._trim_history()
        response = client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=stream,
        )
        if stream:
            return self._handle_stream(response)
        else:
            reply = response.choices[0].message.content
            self.messages.append({"role": "assistant", "content": reply})
            self.total_tokens_used += response.usage.total_tokens
            return reply

    def _handle_stream(self, stream):
        full_reply = ""
        for chunk in stream:
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="", flush=True)
                full_reply += delta.content
        print()
        self.messages.append({"role": "assistant", "content": full_reply})
        return full_reply

    def get_history(self):
        return self.messages[1:]

    def get_stats(self):
        return {
            "messages": len(self.messages),
            "current_tokens": count_tokens(self.messages, self.model),
            "total_tokens_used": self.total_tokens_used,
        }

    def export_conversation(self):
        lines = []
        for msg in self.messages[1:]:
            lines.append(f"[{msg['role']}] {msg['content']}")
        return "\n".join(lines)

# --- Demo ---
bot = Chatbot(
    system_prompt="You are a concise Python tutor.",
    max_tokens=2000,
)

print(bot.chat("What are list comprehensions?"))
print(bot.chat("Give me an example with filtering."))
print(f"\nStats: {bot.get_stats()}")
print("\nScript completed successfully.")
