Build an AI Chatbot with Memory in Python (2026)
Build a Python AI chatbot with conversation memory that actually remembers. Raw HTTP tutorial with streaming, 3 hands-on exercises, and complete code you can run today.
Send messages to an LLM, keep the chat alive, and stream replies token by token — using only raw HTTP requests.
You type “What’s the capital of France?” The chatbot says “Paris.” You follow up with “What’s its population?” — and the chatbot has no clue what “its” refers to. It forgot everything.
Every API call to an LLM starts fresh. The model doesn’t remember what you said ten seconds ago. That’s the problem this article solves.
You’ll build a chatbot that remembers the full chat, streams replies in real time, and runs on raw Python. No SDKs needed.
Why LLM Chatbot Conversations Are Stateless
Here’s what surprises most beginners. When you call the ChatGPT API, the model doesn’t keep a chat going in the background. Each call is on its own. The model gets your message, replies, and forgets.
So how does ChatGPT seem to remember? The client sends the entire chat history with every request. Every user message and every assistant reply goes back to the API each time.
# What actually happens behind the scenes
# Request 1: You send one message
messages = [
{"role": "user", "content": "What's the capital of France?"}
]
# Response: "The capital of France is Paris."
# Request 2: You send BOTH messages
messages = [
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What's its population?"}
]
# Now the model sees "its" refers to Paris
The model reads the full list top to bottom. It sees the context. It connects “its” to “Paris.” No magic memory — just a growing list.
Key Insight: LLMs have zero memory between API calls. The illusion of memory comes from resending the full chat with every request. Your code manages the memory, not the model.
This has a practical cost. The message list grows with each exchange. Every API call uses more tokens as the chat gets longer. You’ll need a trimming strategy — we’ll cover that soon.
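You can see this growth with the rough 4-characters-per-token rule of thumb this article uses later. This is a sketch with dummy messages, not real tokenizer counts:

```python
# Sketch: how the token cost of each request grows as history accumulates.
# Uses the rough "1 token is about 4 characters" heuristic, not a real tokenizer.

def estimate_tokens(messages):
    return sum(len(m["content"]) // 4 for m in messages)

history = [{"role": "system", "content": "You are a helpful assistant."}]

# Simulate five exchanges: ~200-character questions, ~400-character answers
for turn in range(1, 6):
    history.append({"role": "user", "content": "q" * 200})
    history.append({"role": "assistant", "content": "a" * 400})
    print(f"Turn {turn}: {len(history)} messages, ~{estimate_tokens(history)} tokens sent")
```

Every request resends everything before it, so the estimated token count climbs by a fixed amount per turn even though each new exchange is the same size.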
Prerequisites
- Python version: 3.9+
- Required library: requests
- Install: pip install requests
- API key: An OpenAI API key (platform.openai.com)
- Time to complete: 20-25 minutes
Note: You need an OpenAI API key. Sign up at platform.openai.com, go to API Keys, and create a new secret key. Keep it safe — treat it like a password. The examples here use GPT-3.5-turbo, which costs fractions of a cent per request.
Your First Chatbot API Call with Raw HTTP
Most tutorials reach for the openai Python SDK. We won’t. We’ll use the requests library instead.
Why? First, you’ll see exactly what happens at the network level. Second, requests works in Pyodide (browser Python), but the OpenAI SDK doesn’t.
The endpoint lives at https://api.openai.com/v1/chat/completions. You send a POST request with your API key in the header and messages in the JSON body. The model field picks the model. The messages field holds the chat.
import requests
import json
import os
API_KEY = os.environ.get("OPENAI_API_KEY", "your-api-key-here")
API_URL = "https://api.openai.com/v1/chat/completions"
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
payload = {
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "What is Python?"}
]
}
response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()
print(data["choices"][0]["message"]["content"])
Result:
Python is a high-level, interpreted programming language known
for its simplicity and readability. It supports multiple
programming paradigms and has a large standard library.
One POST request. That’s all it takes. The response JSON has a choices array. Each choice contains a message with role and content. We grab choices[0] because we asked for one response.
Tip: Always check `response.status_code` before parsing. A 401 means a bad API key. A 429 means rate limits. A 500 means their servers broke. Handle these in production.
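The check itself is only a few lines. Here's one hedged sketch; the helper name and messages are my own, not from the OpenAI docs:

```python
# Sketch: translate common HTTP status codes into actionable hints
# before attempting to parse the response body as JSON.

def describe_api_error(status_code):
    """Return a human-readable hint for common OpenAI API status codes."""
    hints = {
        401: "Unauthorized - check that your API key is valid.",
        429: "Rate limited - slow down or check your usage quota.",
        500: "Server error - retry after a short delay.",
    }
    return hints.get(status_code, f"Unexpected status {status_code}.")

# In real code you'd call this whenever the request doesn't succeed:
# if response.status_code != 200:
#     print(describe_api_error(response.status_code))
print(describe_api_error(401))
```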
Controlling the Chatbot’s Output
Two settings shape every reply: temperature and max_tokens. The temperature controls how random the output is. Set it to 0 for steady, same-every-time answers. Set it to 1.0 for creative, varied replies. The default is 1.0.
max_tokens caps how long the response can be. If you don’t set it, the model uses whatever tokens remain in its context window.
payload = {
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Write a haiku about Python."}
],
"temperature": 0.7,
"max_tokens": 100
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
Output:
Indented with care,
Loops and lists in harmony,
Code that reads like prose.
For a chatbot tutor, I’d use temperature=0.3. Low enough for accurate answers. High enough to avoid robotic repetition.
Tip: Use `temperature=0` when you need repeatable output. This is great for testing and debugging. Bump it to 0.5-0.7 for more natural chat.
The Message Format — Roles Explained
Every message needs two fields: role and content. There are three roles. Each one shapes how the model acts.
| Role | Purpose | When to Use |
|---|---|---|
system | Sets personality and rules | Once, at chat start |
user | The human’s input | Every time the user types |
assistant | The model’s past responses | So the model “remembers” |
The system message is your control lever. Want a chatbot that speaks like a pirate? System message. Want one that only answers Python questions? System message.
Here’s how a system message shapes a response. We’ll set the model to be a Python tutor that gives short answers with code.
messages = [
{
"role": "system",
"content": "You are a helpful Python tutor. "
"Keep answers short. Include code examples."
},
{
"role": "user",
"content": "How do I reverse a list?"
}
]
payload = {"model": "gpt-3.5-turbo", "messages": messages}
response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()
print(data["choices"][0]["message"]["content"])
The model responds briefly with code:
You can reverse a list using reverse() or slicing:
my_list = [1, 2, 3, 4, 5]
my_list.reverse()
print(my_list) # [5, 4, 3, 2, 1]
# Or use slicing (creates a new list)
reversed_list = my_list[::-1]
Without that system message, the answer would be longer and less focused. The system role is powerful — use it.
Building Conversation Memory in Python
Here’s where the real work starts. You need a place to store every message. Then you send that full history with each API call.
The simplest approach? A plain Python list. Each element is a dictionary with role and content. Append the user’s message. Append the model’s reply. The list grows — that’s your memory.
The chat() function below does the work. It takes user text, adds it to history, calls the API with the full history, saves the reply, and returns it.
chat_history = [
{
"role": "system",
"content": "You are a friendly Python tutor. Be concise."
}
]
def chat(user_message):
chat_history.append(
{"role": "user", "content": user_message}
)
payload = {
"model": "gpt-3.5-turbo",
"messages": chat_history
}
response = requests.post(
API_URL, headers=headers, json=payload
)
data = response.json()
assistant_message = data["choices"][0]["message"]["content"]
chat_history.append(
{"role": "assistant", "content": assistant_message}
)
return assistant_message
No output from this block — it just defines the function.
print(chat("What are Python decorators?"))
Response:
Decorators are functions that modify the behavior of other
functions. You write them with the @ symbol above a function
definition.
That’s turn one. The history now has three items: system message, user question, assistant reply.
What happens on a follow-up? Let’s find out.
print(chat("Can you show me a simple example?"))
Response:
Sure! Here's a basic decorator:
def my_decorator(func):
def wrapper():
print("Before the function")
func()
print("After the function")
return wrapper
@my_decorator
def say_hello():
print("Hello!")
say_hello()
We said “a simple example” — not “a decorator example.” The model connected the dots because it received the full history. It saw our earlier question about decorators.
Quick check: What would happen if we didn’t append assistant messages to the history? Try to predict before reading on.
The answer: the model would lose context. It wouldn’t know what it had already told you. Follow-up questions would fail.
Let’s verify the history is growing as expected.
for msg in chat_history:
role = msg["role"].upper()
content = msg["content"][:60]
print(f"[{role}] {content}...")
[SYSTEM] You are a friendly Python tutor. Be concise....
[USER] What are Python decorators?...
[ASSISTANT] Decorators are functions that modify the behavior o...
[USER] Can you show me a simple example?...
[ASSISTANT] Sure! Here's a basic decorator:
def my_decorator(...
Five messages. The list grows by two per exchange. One user message. One assistant reply.
Key Insight: Conversation memory is just a Python list. Append each message, send the full list every time, and the model acts like it remembers. Simple code. Powerful concept.
[TRY IT YOURSELF] Exercise 1: Build a Personality Chatbot
You’ve seen how system messages shape behavior. Time to build your own.
Task: Create a chat_history list with a system message that makes the chatbot act as a sarcastic movie critic. Write a chat() function with memory. Send two messages: ask for a movie review, then ask a follow-up.
Streaming Chatbot Responses Token by Token
When you use ChatGPT’s web interface, text appears word by word. That’s streaming. Without it, you stare at a blank screen for seconds while the model generates the full response.
How does it work? The API uses Server-Sent Events (SSE). Think of SSE as a one-way data pipe. The server sends small chunks to you as they’re ready. You don’t wait for the full reply.
To enable streaming, add "stream": True to your payload. But you can’t use response.json() anymore. Instead, the API sends a series of text lines. Each starts with data: followed by JSON.
The final line says data: [DONE]. That’s the stop signal. Inside each chunk, the new text lives at choices[0]["delta"]["content"].
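You can test the parsing logic in isolation before wiring it to a live request. The sample lines below are hand-written stand-ins for real API chunks (the first delta carries only a role, mimicking the API's opening chunk):

```python
import json

# Hand-written sample SSE lines mimicking the API's streaming format.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    'data: [DONE]',
]

reply = ""
for line_text in sample_lines:
    if line_text == "data: [DONE]":
        break                              # stop signal - not JSON, don't parse it
    if line_text.startswith("data: "):
        chunk = json.loads(line_text[6:])  # strip the "data: " prefix
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:             # the opening chunk has no text
            reply += delta["content"]

print(reply)  # Hello world
```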
def chat_stream(user_message):
chat_history.append(
{"role": "user", "content": user_message}
)
payload = {
"model": "gpt-3.5-turbo",
"messages": chat_history,
"stream": True
}
response = requests.post(
API_URL, headers=headers, json=payload,
stream=True
)
full_reply = ""
for line in response.iter_lines():
if not line:
continue
line_text = line.decode("utf-8")
if line_text == "data: [DONE]":
break
if line_text.startswith("data: "):
chunk = json.loads(line_text[6:])
delta = chunk["choices"][0]["delta"]
if "content" in delta:
token = delta["content"]
print(token, end="", flush=True)
full_reply += token
print()
chat_history.append(
{"role": "assistant", "content": full_reply}
)
return full_reply
Four things happen here. The user message goes into history. The request fires with streaming on in both the payload and the requests.post() call. Each chunk prints as it shows up. The full reply gets saved to memory.
chat_stream("Explain list comprehensions in 3 sentences.")
Streamed output:
A list comprehension creates a new list by applying an expression
to each item in an iterable. The syntax is [expression for item
in iterable if condition]. It's a concise alternative to a for
loop with append.
The first word shows up in under 200 milliseconds. Without streaming, you’d wait 2-3 seconds in silence.
Warning: You need `stream=True` in TWO places. In the JSON payload (tells the API to stream). And in `requests.post(…, stream=True)` (tells `requests` to read chunks, not buffer everything).
[TRY IT YOURSELF] Exercise 2: Stream with a Word Counter
Streaming is working. Let’s extend it to track output length.
Task: Modify chat_stream() to count words in the response. Print the total word count after streaming finishes.
Managing Long Conversations
Every model has a context window — a token limit per request. GPT-3.5-turbo handles 4,096 tokens (about 3,000 words). GPT-4-turbo goes up to 128,000 tokens.
What if your history exceeds the limit? The API returns an error. Your chatbot crashes.
You have three strategies. Each has tradeoffs.
Strategy 1: Sliding Window
Keep only the last N messages. Simple and predictable. The downside: the chatbot forgets early context.
def trim_history(history, max_messages=20):
system_msg = history[0]
if len(history) > max_messages:
return [system_msg] + history[-(max_messages - 1):]
return history
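A quick sanity check with a fake history confirms the window behaves as intended. The dummy messages are placeholders, and trim_history is repeated so the block runs on its own:

```python
# Sanity check of the sliding-window trim with a fake 30-message history.
# trim_history is the same function defined above, repeated for a standalone run.

def trim_history(history, max_messages=20):
    system_msg = history[0]
    if len(history) > max_messages:
        return [system_msg] + history[-(max_messages - 1):]
    return history

fake = [{"role": "system", "content": "Be concise."}]
for i in range(30):
    role = "user" if i % 2 == 0 else "assistant"
    fake.append({"role": role, "content": f"message {i}"})

trimmed = trim_history(fake, max_messages=20)
print(len(trimmed))            # 20 - capped at the window size
print(trimmed[0]["role"])      # system - personality survives the trim
print(trimmed[-1]["content"])  # message 29 - newest messages kept
```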
Strategy 2: Summarize Old Messages
Ask the model to shrink the chat so far. Swap old messages with that summary. You keep key facts and cut tokens.
def summarize_history(history):
summary_prompt = (
"Summarize this chat in 2-3 sentences, "
"keeping key facts and context:\n\n"
)
for msg in history[1:]:
summary_prompt += f"{msg['role']}: {msg['content']}\n"
payload = {
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": summary_prompt}
]
}
resp = requests.post(API_URL, headers=headers, json=payload)
summary = resp.json()["choices"][0]["message"]["content"]
return [
history[0],
{"role": "assistant",
"content": f"[Summary: {summary}]"}
]
Strategy 3: Token Counting
Count tokens and trim to fit. The most exact option. Rough rule: 1 token is about 4 English characters.
def estimate_tokens(text):
return len(text) // 4
def trim_to_token_limit(history, max_tokens=3000):
system_msg = history[0]
total = estimate_tokens(system_msg["content"])
trimmed = [system_msg]
for msg in reversed(history[1:]):
msg_tokens = estimate_tokens(msg["content"])
if total + msg_tokens > max_tokens:
break
trimmed.insert(1, msg)
total += msg_tokens
return trimmed
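The same kind of sanity check works for the token-based trim. Both functions are repeated here so the block runs standalone; the 400-character filler messages are dummies worth ~100 estimated tokens each:

```python
# Sanity check: the token-based trim keeps the newest messages that fit
# under the budget, plus the system message.

def estimate_tokens(text):
    return len(text) // 4

def trim_to_token_limit(history, max_tokens=3000):
    system_msg = history[0]
    total = estimate_tokens(system_msg["content"])
    trimmed = [system_msg]
    for msg in reversed(history[1:]):
        msg_tokens = estimate_tokens(msg["content"])
        if total + msg_tokens > max_tokens:
            break
        trimmed.insert(1, msg)
        total += msg_tokens
    return trimmed

# Ten dummy messages of ~100 estimated tokens (400 characters) each.
history = [{"role": "system", "content": "Be brief."}]
history += [{"role": "user", "content": "x" * 400} for _ in range(10)]

trimmed = trim_to_token_limit(history, max_tokens=500)
print(len(trimmed))  # the system message plus the newest messages under budget
```

With a 500-token budget, only the four newest ~100-token messages fit alongside the system message.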
I’d start with Strategy 1 for simple chatbots. It’s easy to debug. Move to Strategy 2 when you need long-term context.
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Sliding window | Simple, predictable | Loses old context | Short chats |
| Summarize | Preserves key facts | Extra API call | Long chats |
| Token counting | Precise control | More code | Production |
Tracking Token Usage and Costs
Every non-streaming API reply includes a usage field. It tells you how many tokens you spent. This matters — you pay per token.
payload = {
"model": "gpt-3.5-turbo",
"messages": chat_history
}
response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()
usage = data["usage"]
print(f"Prompt tokens: {usage['prompt_tokens']}")
print(f"Completion tokens: {usage['completion_tokens']}")
print(f"Total tokens: {usage['total_tokens']}")
Example output:
Prompt tokens: 45
Completion tokens: 82
Total tokens: 127
Prompt tokens = what you send. Completion tokens = what the model generates. As the chat grows, prompt tokens climb fast. That’s exactly why trimming matters.
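Turning a usage dict into dollars is simple arithmetic. This sketch uses the GPT-3.5-turbo prices quoted in the FAQ at the end of this article; check current pricing before relying on the defaults:

```python
# Sketch: estimate the dollar cost of one exchange from the API's usage dict.
# Default rates are $0.0005/1K input and $0.0015/1K output tokens, as quoted
# later in this article - verify against current pricing before using.

def estimate_cost(usage, input_per_1k=0.0005, output_per_1k=0.0015):
    return (usage["prompt_tokens"] / 1000 * input_per_1k
            + usage["completion_tokens"] / 1000 * output_per_1k)

usage = {"prompt_tokens": 45, "completion_tokens": 82, "total_tokens": 127}
print(f"${estimate_cost(usage):.6f}")
```

Logging this per exchange makes it obvious when a growing history starts inflating your prompt-token bill.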
Tip: For streaming mode, add `"stream_options": {"include_usage": true}` to your payload. The usage data arrives in the final chunk. OpenAI added this feature in 2024.
Now you have all the building blocks. Let’s put them together.
The Complete Python AI Chatbot — All Pieces Combined
Time to put it all in one clean class. This version handles system prompts, memory, streaming, trimming, and errors.
The Chatbot class has four methods. __init__ sets up the API key and history. _trim enforces the sliding window. send handles one exchange. run starts the chat loop.
class Chatbot:
def __init__(self, api_key, system_prompt, max_msgs=20):
self.api_url = (
"https://api.openai.com/v1/chat/completions"
)
self.headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
}
self.max_msgs = max_msgs
self.history = [
{"role": "system", "content": system_prompt}
]
def _trim(self):
if len(self.history) > self.max_msgs:
sys_msg = self.history[0]
recent = self.history[-(self.max_msgs - 1):]
self.history = [sys_msg] + recent
The send method does the hard work. It trims first, adds the user message, fires the stream request, and saves the full reply.
def send(self, user_message):
self._trim()
self.history.append(
{"role": "user", "content": user_message}
)
payload = {
"model": "gpt-3.5-turbo",
"messages": self.history,
"stream": True
}
resp = requests.post(
self.api_url, headers=self.headers,
json=payload, stream=True
)
if resp.status_code != 200:
print(f"\nError {resp.status_code}: {resp.text[:100]}")
self.history.pop()
return None
full_reply = ""
for line in resp.iter_lines():
if not line:
continue
text = line.decode("utf-8")
if text == "data: [DONE]":
break
if text.startswith("data: "):
chunk = json.loads(text[6:])
delta = chunk["choices"][0]["delta"]
if "content" in delta:
token = delta["content"]
print(token, end="", flush=True)
full_reply += token
print()
self.history.append(
{"role": "assistant", "content": full_reply}
)
return full_reply
The run method creates the chat loop. It reads input until the user types “quit.”
def run(self):
print("Chatbot ready! Type 'quit' to exit.\n")
while True:
user_input = input("You: ")
if user_input.lower() in ("quit", "exit"):
print("Goodbye!")
break
print("Bot: ", end="")
self.send(user_input)
print()
Start it up:
bot = Chatbot(
api_key="your-api-key-here",
system_prompt="You are a helpful Python tutor. "
"Give concise answers with code examples.",
max_msgs=20
)
bot.run()
Sample session:
Chatbot ready! Type 'quit' to exit.
You: What's a dictionary in Python?
Bot: A dictionary stores key-value pairs. Create one with curly
braces: my_dict = {"name": "Alice", "age": 30}. Access values
by key: my_dict["name"] returns "Alice".
You: How do I add a new key?
Bot: Use assignment: my_dict["email"] = "alice@example.com".
If the key doesn't exist, it gets created.
You: quit
Goodbye!
About 60 lines of real logic. No frameworks. No SDKs. Just requests, json, and a Python list.
[TRY IT YOURSELF] Exercise 3: Add Error Handling
Real API calls fail. Networks drop. Keys expire. Rate limits hit.
Task: Modify the basic chat() function from earlier to handle errors. If response.status_code isn't 200, print a friendly error with the status code. Don't add anything to history. Return None. (The Chatbot class above already does this in send(); peek at it if you get stuck.)
Common Mistakes and How to Fix Them
Mistake 1: Forgetting to save the assistant’s reply
The most common bug. You append the user message and call the API. But you never add the reply to history. Next turn, the model can’t see what it said before.
# Wrong — reply never saved
def chat(msg):
chat_history.append({"role": "user", "content": msg})
resp = requests.post(API_URL, headers=headers,
json={"model": "gpt-3.5-turbo",
"messages": chat_history})
return resp.json()["choices"][0]["message"]["content"]
# Correct — save both sides
def chat(msg):
chat_history.append({"role": "user", "content": msg})
resp = requests.post(API_URL, headers=headers,
json={"model": "gpt-3.5-turbo",
"messages": chat_history})
reply = resp.json()["choices"][0]["message"]["content"]
chat_history.append({"role": "assistant", "content": reply})
return reply
Mistake 2: Missing stream=True on the requests call
You set "stream": True in the payload but forget it in requests.post(). The API streams chunks, but requests waits for them all. No real-time output.
# Wrong — buffers everything
response = requests.post(API_URL, headers=headers, json=payload)
# Correct — reads piece by piece
response = requests.post(API_URL, headers=headers,
json=payload, stream=True)
Mistake 3: Losing the system message when trimming
You slice from the end and accidentally chop the system message. The chatbot loses its personality.
# Wrong — system message gone
history = history[-10:]
# Correct — always keep system message first
system = history[0]
history = [system] + history[-9:]
Mistake 4: Not handling the [DONE] signal
Skip this check and json.loads() tries to parse [DONE]. It crashes right away.
# Wrong — crashes on [DONE]
for line in response.iter_lines():
chunk = json.loads(line.decode("utf-8")[6:])
# Correct — check for stop signal
for line in response.iter_lines():
if not line:
continue
text = line.decode("utf-8")
if text == "data: [DONE]":
break
if text.startswith("data: "):
chunk = json.loads(text[6:])
Switching LLM Providers
Your chatbot code works with any provider that uses the OpenAI format. Many do. Just change the URL, key, and model name.
| Provider | Endpoint | Model Example |
|---|---|---|
| OpenAI | api.openai.com/v1/chat/completions | gpt-3.5-turbo |
| Groq | api.groq.com/openai/v1/chat/completions | llama3-8b-8192 |
| Together AI | api.together.xyz/v1/chat/completions | meta-llama/Llama-3-8b-chat-hf |
| Ollama (local) | localhost:11434/v1/chat/completions | llama3 |
# Switch to Groq (free tier available)
API_URL = "https://api.groq.com/openai/v1/chat/completions"
API_KEY = "your-groq-key-here"
payload = {
"model": "llama3-8b-8192",
"messages": chat_history,
"stream": True
}
The OpenAI message format is now a de facto standard. That's great news — your code works across providers with only these small configuration changes.
Quick Reference
| Component | What It Does | Key Code |
|---|---|---|
| API call | Send messages, get reply | requests.post(URL, headers=h, json=p) |
| Message format | Structure each message | {"role": "user", "content": "text"} |
| System prompt | Set chatbot personality | First message with role: "system" |
| Memory | Remember past exchanges | Append user + assistant to a list |
| Streaming | Show reply in real time | "stream": True + iter_lines() |
| Trimming | Stay in token limits | Keep system msg + last N messages |
| Error handling | Handle API failures | Check status_code before parsing |
| Token tracking | Monitor costs | Read data["usage"] from response |
Summary
You started with a single API call. You ended with a fully interactive chatbot.
Here’s what you built. LLM APIs are stateless — each call starts fresh. Chat memory is a Python list that grows with each turn. Three roles (system, user, assistant) shape how the model acts. Streaming uses Server-Sent Events for live output. Long chats need trimming to stay under token limits.
Practice exercise: Build a quiz bot. It asks Python questions, evaluates your answers, and keeps score. Use the system message for behavior, memory for state, and streaming for flow.
Frequently Asked Questions
Can I use this with local models like Ollama?
Yes. Ollama runs an OpenAI-compatible endpoint at http://localhost:11434/v1/chat/completions. Change API_URL and the model name. No API key needed. The message format and streaming work the same way.
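Concretely, the switch is just configuration. A sketch, assuming a local Ollama instance where you've already run `ollama pull llama3`:

```python
# Configuration sketch for pointing the same chatbot at a local Ollama server.
# Assumes Ollama is running locally and the llama3 model has been pulled.
API_URL = "http://localhost:11434/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
    # Ollama doesn't check the key, but some client code expects the header:
    "Authorization": "Bearer ollama",
}

payload = {
    "model": "llama3",  # any model you've pulled locally
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}
print(API_URL)
```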
How do I save and load a chat?
The history is a list of dictionaries. JSON handles it directly:
# Save
with open("chat_history.json", "w") as f:
json.dump(chat_history, f)
# Load
with open("chat_history.json", "r") as f:
chat_history = json.load(f)
Why does my chatbot give shorter answers over time?
The context window is shared between input and output. As history grows, fewer tokens remain for the response. Trim more aggressively or use a model with a bigger window (GPT-4-turbo has 128K tokens).
Is the requests library good enough for production?
For one user, yes. For a web app with many users, switch to an async client like httpx or aiohttp. The requests library blocks while it waits. Async tools handle many users at once much better.
How much does each chat cost?
GPT-3.5-turbo runs about $0.0005 per 1K input tokens and $0.0015 per 1K output tokens (as of early 2025). A 10-turn chat costs roughly $0.002 to $0.005. Less than a cent.