
Build an AI Chatbot with Memory in Python (2026)

Build a Python AI chatbot with conversation memory that actually remembers. Raw HTTP tutorial with streaming, 3 hands-on exercises, and complete code you can run today.

Written by Selva Prabhakaran | 24 min read

Send messages to an LLM, keep the chat alive, and stream replies token by token — using only raw HTTP requests.


You type “What’s the capital of France?” The chatbot says “Paris.” You follow up with “What’s its population?” — and the chatbot has no clue what “its” refers to. It forgot everything.

Every API call to an LLM starts fresh. The model doesn’t remember what you said ten seconds ago. That’s the problem this article solves.

You’ll build a chatbot that remembers the full chat, streams replies in real time, and runs on raw Python. No SDKs needed.

Why LLM Chatbot Conversations Are Stateless

Here’s what surprises most beginners. When you call the ChatGPT API, the model doesn’t keep a chat going in the background. Each call is on its own. The model gets your message, replies, and forgets.

So how does ChatGPT seem to remember? The client sends the entire chat history with every request. Every user message and every assistant reply goes back to the API each time.


# What actually happens behind the scenes
# Request 1: You send one message
messages = [
    {"role": "user", "content": "What's the capital of France?"}
]
# Response: "The capital of France is Paris."

# Request 2: You send BOTH messages
messages = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"}
]
# Now the model sees "its" refers to Paris

The model reads the full list top to bottom. It sees the context. It connects “its” to “Paris.” No magic memory — just a growing list.

Key Insight: LLMs have zero memory between API calls. The illusion of memory comes from resending the full chat with every request. Your code manages the memory, not the model.

This has a practical cost. The message list grows with each exchange. Every API call uses more tokens as the chat gets longer. You’ll need a trimming strategy — we’ll cover that soon.
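To see that cost concretely, here is a small offline simulation (no API calls; the message contents are made up) using the rough rule of thumb of about 4 English characters per token:

```python
# Offline illustration of how the message list and token cost grow per turn.
# The 4-characters-per-token rule is an approximation, not an exact count.

def estimate_tokens(messages):
    # ~4 English characters per token is a common rule of thumb.
    return sum(len(m["content"]) for m in messages) // 4

history = [{"role": "system", "content": "You are a helpful tutor."}]
for turn in range(1, 4):
    history.append({"role": "user", "content": f"Question {turn}: tell me more?"})
    history.append({"role": "assistant", "content": "A detailed answer. " * 10})
    print(f"Turn {turn}: {len(history)} messages, ~{estimate_tokens(history)} tokens per request")
```

Every turn adds two messages, and every request resends all of them, so the token count per request only goes up.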

Prerequisites

  • Python version: 3.9+
  • Required library: requests
  • Install: pip install requests
  • API key: An OpenAI API key (platform.openai.com)
  • Time to complete: 20-25 minutes
Note: You need an OpenAI API key. Sign up at platform.openai.com, go to API Keys, and create a new secret key. Keep it safe — treat it like a password. The examples here use GPT-3.5-turbo, which costs fractions of a cent per request.

Your First Chatbot API Call with Raw HTTP

Most tutorials reach for the openai Python SDK. We won’t. We’ll use the requests library instead.

Why? First, you’ll see exactly what happens at the network level. Second, requests works in Pyodide (browser Python), but the OpenAI SDK doesn’t.

The endpoint lives at https://api.openai.com/v1/chat/completions. You send a POST request with your API key in the header and messages in the JSON body. The model field picks the model. The messages field holds the chat.

import requests
import json
import os

API_KEY = os.environ.get("OPENAI_API_KEY", "your-api-key-here")
API_URL = "https://api.openai.com/v1/chat/completions"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "What is Python?"}
    ]
}

response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()

print(data["choices"][0]["message"]["content"])

Result:

Python is a high-level, interpreted programming language known
for its simplicity and readability. It supports multiple
programming paradigms and has a large standard library.

One POST request. That’s all it takes. The response JSON has a choices array. Each choice contains a message with role and content. We grab choices[0] because we asked for one response.

Tip: Always check `response.status_code` before parsing. A 401 means a bad API key. A 429 means rate limits. A 500 means their servers broke. Handle these in production.
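As a sketch, a small helper (hypothetical, not part of any SDK) can translate those common failure codes into actionable messages before you try to parse the body:

```python
# Hypothetical helper: map common HTTP status codes to actionable messages.
# Returns None on success so callers can branch on the result.

def check_status(status_code):
    errors = {
        401: "Unauthorized - check your API key in the Authorization header.",
        429: "Rate limited - wait and retry, ideally with exponential backoff.",
        500: "Server error - retry later; the failure is on the provider's side.",
    }
    if status_code == 200:
        return None
    return errors.get(status_code, f"Unexpected status {status_code}")

print(check_status(429))
print(check_status(404))
```

In a real client you would call this with `response.status_code` and bail out (or retry) before touching `response.json()`.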

Controlling the Chatbot’s Output

Two settings shape every reply: temperature and max_tokens. The temperature controls how random the output is. Set it to 0 for steady, same-every-time answers. Set it to 1.0 for creative, varied replies. The default is 1.0.

max_tokens caps how long the response can be. If you don’t set it, the model uses whatever tokens remain in its context window.

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "Write a haiku about Python."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])

Output:

Indented with care,
Loops and lists in harmony,
Code that reads like prose.

For a chatbot tutor, I’d use temperature=0.3. Low enough for accurate answers. High enough to avoid robotic repetition.

Tip: Use `temperature=0` when you need repeatable output. This is great for testing and debugging. Bump it to 0.5-0.7 for more natural chat.

The Message Format — Roles Explained

Every message needs two fields: role and content. There are three roles. Each one shapes how the model acts.

| Role | Purpose | When to Use |
|---|---|---|
| system | Sets personality and rules | Once, at chat start |
| user | The human's input | Every time the user types |
| assistant | The model's past responses | So the model "remembers" |

The system message is your control lever. Want a chatbot that speaks like a pirate? System message. Want one that only answers Python questions? System message.

Here’s how a system message shapes a response. We’ll set the model to be a Python tutor that gives short answers with code.

messages = [
    {
        "role": "system",
        "content": "You are a helpful Python tutor. "
                   "Keep answers short. Include code examples."
    },
    {
        "role": "user",
        "content": "How do I reverse a list?"
    }
]

payload = {"model": "gpt-3.5-turbo", "messages": messages}
response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()

print(data["choices"][0]["message"]["content"])

The model responds briefly with code:

You can reverse a list using reverse() or slicing:

my_list = [1, 2, 3, 4, 5]
my_list.reverse()
print(my_list)  # [5, 4, 3, 2, 1]

# Or use slicing (creates a new list)
reversed_list = my_list[::-1]

Without that system message, the answer would be longer and less focused. The system role is powerful — use it.

Building Conversation Memory in Python

Here’s where the real work starts. You need a place to store every message. Then you send that full history with each API call.

The simplest approach? A plain Python list. Each element is a dictionary with role and content. Append the user’s message. Append the model’s reply. The list grows — that’s your memory.

The chat() function below does the work. It takes user text, adds it to history, calls the API with the full history, saves the reply, and returns it.

chat_history = [
    {
        "role": "system",
        "content": "You are a friendly Python tutor. Be concise."
    }
]

def chat(user_message):
    chat_history.append(
        {"role": "user", "content": user_message}
    )

    payload = {
        "model": "gpt-3.5-turbo",
        "messages": chat_history
    }
    response = requests.post(
        API_URL, headers=headers, json=payload
    )
    data = response.json()

    assistant_message = data["choices"][0]["message"]["content"]
    chat_history.append(
        {"role": "assistant", "content": assistant_message}
    )
    return assistant_message

No output from this block — it just defines the function.

print(chat("What are Python decorators?"))

Response:

Decorators are functions that modify the behavior of other
functions. You write them with the @ symbol above a function
definition.

That’s turn one. The history now has three items: system message, user question, assistant reply.

What happens on a follow-up? Let’s find out.

print(chat("Can you show me a simple example?"))

Response:

Sure! Here's a basic decorator:

def my_decorator(func):
    def wrapper():
        print("Before the function")
        func()
        print("After the function")
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()

We said “a simple example” — not “a decorator example.” The model connected the dots because it received the full history. It saw our earlier question about decorators.

Quick check: What would happen if we didn’t append assistant messages to the history? Try to predict before reading on.

The answer: the model would lose context. It wouldn’t know what it had already told you. Follow-up questions would fail.

Let’s verify the history is growing as expected.

for msg in chat_history:
    role = msg["role"].upper()
    content = msg["content"][:60]
    print(f"[{role}] {content}...")

Output:

[SYSTEM] You are a friendly Python tutor. Be concise....
[USER] What are Python decorators?...
[ASSISTANT] Decorators are functions that modify the behavior o...
[USER] Can you show me a simple example?...
[ASSISTANT] Sure! Here's a basic decorator:

def my_decorator(...

Five messages. The list grows by two per exchange. One user message. One assistant reply.

Key Insight: Conversation memory is just a Python list. Append each message, send the full list every time, and the model acts like it remembers. Simple code. Powerful concept.

[TRY IT YOURSELF] Exercise 1: Build a Personality Chatbot

You’ve seen how system messages shape behavior. Time to build your own.

Task: Create a chat_history list with a system message that makes the chatbot act as a sarcastic movie critic. Write a chat() function with memory. Send two messages: ask for a movie review, then ask a follow-up.

Hint 1

Start your `chat_history` with a system message like: `"You are a sarcastic movie critic who rates everything on a scale of 1-10 and always finds something to complain about."`

Hint 2

Your `chat()` function should: (1) append the user message, (2) send the full history to the API, (3) append and return the assistant’s response. Same pattern as above.

Solution
chat_history = [
    {
        "role": "system",
        "content": "You are a sarcastic movie critic who rates "
                   "everything on a scale of 1-10 and always "
                   "finds something to complain about."
    }
]

def chat(user_message):
    chat_history.append(
        {"role": "user", "content": user_message}
    )
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": chat_history
    }
    response = requests.post(
        API_URL, headers=headers, json=payload
    )
    data = response.json()
    reply = data["choices"][0]["message"]["content"]
    chat_history.append(
        {"role": "assistant", "content": reply}
    )
    return reply

# Turn 1
print(chat("What did you think of Inception?"))
# Turn 2 — model remembers Inception from Turn 1
print(chat("Would you recommend it to a sci-fi hater?"))

**Why it works:** The system message sets the personality. The `chat()` function appends both sides of the chat. The second question uses “it” — and the model knows that refers to Inception because it sees the full thread.

Streaming Chatbot Responses Token by Token

When you use ChatGPT’s web interface, text appears word by word. That’s streaming. Without it, you stare at a blank screen for seconds while the model generates the full response.

How does it work? The API uses Server-Sent Events (SSE). Think of SSE as a one-way data pipe. The server sends small chunks to you as they’re ready. You don’t wait for the full reply.

To enable streaming, add "stream": True to your payload. But you can’t use response.json() anymore. Instead, the API sends a series of text lines. Each starts with data: followed by JSON.

The final line says data: [DONE]. That’s the stop signal. Inside each chunk, the new text lives at choices[0]["delta"]["content"].
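You can see that parsing logic in isolation with a couple of hand-made sample lines (the JSON payloads below are illustrative, not captured API output):

```python
import json

# Fabricated SSE lines in the "data: {json}" format, plus the stop signal.
sample_lines = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'',
    b'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    b'data: [DONE]',
]

full_reply = ""
for line in sample_lines:
    if not line:
        continue                      # skip SSE keep-alive blank lines
    text = line.decode("utf-8")
    if text == "data: [DONE]":
        break                         # stop signal: no JSON to parse here
    if text.startswith("data: "):
        delta = json.loads(text[6:])["choices"][0]["delta"]
        full_reply += delta.get("content", "")

print(full_reply)  # Hello!
```

The real streaming function below applies exactly this loop to `response.iter_lines()`.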

def chat_stream(user_message):
    chat_history.append(
        {"role": "user", "content": user_message}
    )

    payload = {
        "model": "gpt-3.5-turbo",
        "messages": chat_history,
        "stream": True
    }
    response = requests.post(
        API_URL, headers=headers, json=payload,
        stream=True
    )

    full_reply = ""
    for line in response.iter_lines():
        if not line:
            continue
        line_text = line.decode("utf-8")
        if line_text == "data: [DONE]":
            break
        if line_text.startswith("data: "):
            chunk = json.loads(line_text[6:])
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                token = delta["content"]
                print(token, end="", flush=True)
                full_reply += token

    print()
    chat_history.append(
        {"role": "assistant", "content": full_reply}
    )
    return full_reply

Four things happen here. The user message goes into history. The request fires with streaming on in both the payload and the requests.post() call. Each chunk prints as it shows up. The full reply gets saved to memory.

chat_stream("Explain list comprehensions in 3 sentences.")

Streamed output:

A list comprehension creates a new list by applying an expression
to each item in an iterable. The syntax is [expression for item
in iterable if condition]. It's a concise alternative to a for
loop with append.

The first word shows up in under 200 milliseconds. Without streaming, you’d wait 2-3 seconds in silence.

Warning: You need `stream=True` in TWO places. In the JSON payload (tells the API to stream). And in `requests.post(…, stream=True)` (tells `requests` to read chunks, not buffer everything).

[TRY IT YOURSELF] Exercise 2: Stream with a Word Counter

Streaming is working. Let’s extend it to track output length.

Task: Modify chat_stream() to count words in the response. Print the total word count after streaming finishes.

Hint 1

You already collect the full reply in `full_reply`. After the loop, count words with `len(full_reply.split())`.

Hint 2

After the `print()` newline, add: `word_count = len(full_reply.split())` and `print(f"\n[Words: {word_count}]")`.

Solution
def chat_stream_counted(user_message):
    chat_history.append(
        {"role": "user", "content": user_message}
    )
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": chat_history,
        "stream": True
    }
    response = requests.post(
        API_URL, headers=headers, json=payload,
        stream=True
    )

    full_reply = ""
    for line in response.iter_lines():
        if not line:
            continue
        line_text = line.decode("utf-8")
        if line_text == "data: [DONE]":
            break
        if line_text.startswith("data: "):
            chunk = json.loads(line_text[6:])
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                token = delta["content"]
                print(token, end="", flush=True)
                full_reply += token

    print()
    word_count = len(full_reply.split())
    print(f"\n[Words: {word_count}]")

    chat_history.append(
        {"role": "assistant", "content": full_reply}
    )
    return full_reply

chat_stream_counted("What are the benefits of type hints?")

**Why it works:** `full_reply` captures every token. After streaming ends, `split()` breaks the text into words. This helps you monitor response length and estimate costs.

Managing Long Conversations

Every model has a context window — a token limit per request. GPT-3.5-turbo handles 4,096 tokens (about 3,000 words). GPT-4-turbo goes up to 128,000 tokens.

What if your history exceeds the limit? The API returns an error. Your chatbot crashes.

You have three strategies. Each has tradeoffs.


Strategy 1: Sliding Window

Keep only the last N messages. Simple and predictable. The downside: the chatbot forgets early context.

def trim_history(history, max_messages=20):
    system_msg = history[0]
    if len(history) > max_messages:
        return [system_msg] + history[-(max_messages - 1):]
    return history

Strategy 2: Summarize Old Messages

Ask the model to shrink the chat so far. Swap old messages with that summary. You keep key facts and cut tokens.

def summarize_history(history):
    summary_prompt = (
        "Summarize this chat in 2-3 sentences, "
        "keeping key facts and context:\n\n"
    )
    for msg in history[1:]:
        summary_prompt += f"{msg['role']}: {msg['content']}\n"

    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "user", "content": summary_prompt}
        ]
    }
    resp = requests.post(API_URL, headers=headers, json=payload)
    summary = resp.json()["choices"][0]["message"]["content"]

    return [
        history[0],
        {"role": "assistant",
         "content": f"[Summary: {summary}]"}
    ]

Strategy 3: Token Counting

Count tokens and trim to fit. The most exact option. Rough rule: 1 token is about 4 English characters.

def estimate_tokens(text):
    return len(text) // 4

def trim_to_token_limit(history, max_tokens=3000):
    system_msg = history[0]
    total = estimate_tokens(system_msg["content"])
    trimmed = [system_msg]

    for msg in reversed(history[1:]):
        msg_tokens = estimate_tokens(msg["content"])
        if total + msg_tokens > max_tokens:
            break
        trimmed.insert(1, msg)
        total += msg_tokens

    return trimmed
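Here is the token-budget trim running on fabricated messages of a known size, so the arithmetic is easy to follow:

```python
def estimate_tokens(text):
    # Rough rule: ~4 English characters per token.
    return len(text) // 4

def trim_to_token_limit(history, max_tokens=3000):
    system_msg = history[0]
    total = estimate_tokens(system_msg["content"])
    trimmed = [system_msg]
    # Walk backwards so the newest messages are kept first.
    for msg in reversed(history[1:]):
        msg_tokens = estimate_tokens(msg["content"])
        if total + msg_tokens > max_tokens:
            break
        trimmed.insert(1, msg)
        total += msg_tokens
    return trimmed

# System prompt is ~6 tokens; each fake message is 400 chars, ~100 tokens.
history = [{"role": "system", "content": "Tutor." * 4}]
history += [{"role": "user", "content": "x" * 400} for _ in range(10)]

trimmed = trim_to_token_limit(history, max_tokens=350)
print(len(trimmed))  # 4: the system message plus the 3 newest messages fit
```

With a 350-token budget, only three 100-token messages fit after the system prompt; the fourth-newest would push the total over the limit, so the loop stops there.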

I’d start with Strategy 1 for simple chatbots. It’s easy to debug. Move to Strategy 2 when you need long-term context.

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Sliding window | Simple, predictable | Loses old context | Short chats |
| Summarize | Preserves key facts | Extra API call | Long chats |
| Token counting | Precise control | More code | Production |

Tracking Token Usage and Costs

Every non-streaming API reply includes a usage field. It tells you how many tokens you spent. This matters — you pay per token.

payload = {
    "model": "gpt-3.5-turbo",
    "messages": chat_history
}
response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()

usage = data["usage"]
print(f"Prompt tokens:     {usage['prompt_tokens']}")
print(f"Completion tokens: {usage['completion_tokens']}")
print(f"Total tokens:      {usage['total_tokens']}")

Example output:

Prompt tokens:     45
Completion tokens: 82
Total tokens:      127

Prompt tokens = what you send. Completion tokens = what the model generates. As the chat grows, prompt tokens climb fast. That’s exactly why trimming matters.
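A back-of-the-envelope cost helper makes that climb visible. The rates below are assumptions (the GPT-3.5-turbo prices quoted in this article); check current pricing before relying on them:

```python
# Hypothetical cost estimator. Assumed rates: $0.0005 per 1K prompt tokens,
# $0.0015 per 1K completion tokens - verify against current pricing.

def estimate_cost(prompt_tokens, completion_tokens):
    return prompt_tokens / 1000 * 0.0005 + completion_tokens / 1000 * 0.0015

# An early turn vs. a later turn where history has inflated the prompt.
early = estimate_cost(45, 82)
late = estimate_cost(2000, 82)
print(f"Early turn: ${early:.6f}")
print(f"Late turn:  ${late:.6f}")
```

The completion size stays roughly constant; it's the resent history that drives the prompt-token bill, which is why trimming pays for itself.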

Tip: For streaming mode, add `"stream_options": {"include_usage": true}` to your payload. The usage data arrives in the final chunk. OpenAI added this feature in 2024.

Now you have all the building blocks. Let’s put them together.

The Complete Python AI Chatbot — All Pieces Combined

Time to put it all in one clean class. This version handles system prompts, memory, streaming, trimming, and errors.

The Chatbot class has three methods. __init__ sets up the API key and history. send handles one exchange. run starts the chat loop.

class Chatbot:
    def __init__(self, api_key, system_prompt, max_msgs=20):
        self.api_url = (
            "https://api.openai.com/v1/chat/completions"
        )
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        self.max_msgs = max_msgs
        self.history = [
            {"role": "system", "content": system_prompt}
        ]

    def _trim(self):
        if len(self.history) > self.max_msgs:
            sys_msg = self.history[0]
            recent = self.history[-(self.max_msgs - 1):]
            self.history = [sys_msg] + recent

The send method does the hard work. It trims first, adds the user message, fires the stream request, and saves the full reply.

    def send(self, user_message):
        self._trim()
        self.history.append(
            {"role": "user", "content": user_message}
        )

        payload = {
            "model": "gpt-3.5-turbo",
            "messages": self.history,
            "stream": True
        }
        resp = requests.post(
            self.api_url, headers=self.headers,
            json=payload, stream=True
        )

        if resp.status_code != 200:
            print(f"\nError {resp.status_code}: {resp.text[:100]}")
            self.history.pop()
            return None

        full_reply = ""
        for line in resp.iter_lines():
            if not line:
                continue
            text = line.decode("utf-8")
            if text == "data: [DONE]":
                break
            if text.startswith("data: "):
                chunk = json.loads(text[6:])
                delta = chunk["choices"][0]["delta"]
                if "content" in delta:
                    token = delta["content"]
                    print(token, end="", flush=True)
                    full_reply += token

        print()
        self.history.append(
            {"role": "assistant", "content": full_reply}
        )
        return full_reply

The run method creates the chat loop. It reads input until the user types “quit.”

    def run(self):
        print("Chatbot ready! Type 'quit' to exit.\n")
        while True:
            user_input = input("You: ")
            if user_input.lower() in ("quit", "exit"):
                print("Goodbye!")
                break
            print("Bot: ", end="")
            self.send(user_input)
            print()

Start it up:

bot = Chatbot(
    api_key="your-api-key-here",
    system_prompt="You are a helpful Python tutor. "
                  "Give concise answers with code examples.",
    max_msgs=20
)
bot.run()

Sample session:

Chatbot ready! Type 'quit' to exit.

You: What's a dictionary in Python?
Bot: A dictionary stores key-value pairs. Create one with curly
braces: my_dict = {"name": "Alice", "age": 30}. Access values
by key: my_dict["name"] returns "Alice".

You: How do I add a new key?
Bot: Use assignment: my_dict["email"] = "alice@example.com".
If the key doesn't exist, it gets created.

You: quit
Goodbye!

About 60 lines of real logic. No frameworks. No SDKs. Just requests, json, and a Python list.

[TRY IT YOURSELF] Exercise 3: Add Error Handling

Real API calls fail. Networks drop. Keys expire. Rate limits hit.

Task: Modify send() to handle errors. If response.status_code isn’t 200, print a friendly error with the status code. Don’t add anything to history. Return None.

Hint 1

After `requests.post(…)`, check `resp.status_code != 200`. If it fails, print the code and response text. Remove the user message with `self.history.pop()`. Return `None`.

Hint 2
if resp.status_code != 200:
    print(f"\nAPI error {resp.status_code}: {resp.text[:100]}")
    self.history.pop()
    return None

Place this right after `requests.post()`, before the streaming loop.

Solution
def send(self, user_message):
    self._trim()
    self.history.append(
        {"role": "user", "content": user_message}
    )

    payload = {
        "model": "gpt-3.5-turbo",
        "messages": self.history,
        "stream": True
    }
    resp = requests.post(
        self.api_url, headers=self.headers,
        json=payload, stream=True
    )

    if resp.status_code != 200:
        error_info = resp.text[:200]
        print(f"\nAPI error {resp.status_code}: {error_info}")
        self.history.pop()
        return None

    full_reply = ""
    for line in resp.iter_lines():
        if not line:
            continue
        text = line.decode("utf-8")
        if text == "data: [DONE]":
            break
        if text.startswith("data: "):
            chunk = json.loads(text[6:])
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                token = delta["content"]
                print(token, end="", flush=True)
                full_reply += token

    print()
    self.history.append(
        {"role": "assistant", "content": full_reply}
    )
    return full_reply

**Why it works:** Checking status right away avoids parsing a failed stream. Popping the user message keeps history clean. Returning `None` lets the caller detect failure.

Common Mistakes and How to Fix Them

Mistake 1: Forgetting to save the assistant’s reply

The most common bug. You append the user message and call the API. But you never add the reply to history. Next turn, the model can’t see what it said before.

# Wrong — reply never saved
def chat(msg):
    chat_history.append({"role": "user", "content": msg})
    resp = requests.post(API_URL, headers=headers,
        json={"model": "gpt-3.5-turbo",
              "messages": chat_history})
    return resp.json()["choices"][0]["message"]["content"]
# Correct — save both sides
def chat(msg):
    chat_history.append({"role": "user", "content": msg})
    resp = requests.post(API_URL, headers=headers,
        json={"model": "gpt-3.5-turbo",
              "messages": chat_history})
    reply = resp.json()["choices"][0]["message"]["content"]
    chat_history.append({"role": "assistant", "content": reply})
    return reply

Mistake 2: Missing stream=True on the requests call

You set "stream": True in the payload but forget it in requests.post(). The API streams chunks, but requests waits for them all. No real-time output.

# Wrong — buffers everything
response = requests.post(API_URL, headers=headers, json=payload)

# Correct — reads piece by piece
response = requests.post(API_URL, headers=headers,
                         json=payload, stream=True)

Mistake 3: Losing the system message when trimming

You slice from the end and accidentally chop the system message. The chatbot loses its personality.

# Wrong — system message gone
history = history[-10:]

# Correct — always keep system message first
system = history[0]
history = [system] + history[-9:]

Mistake 4: Not handling the [DONE] signal

Skip this check and json.loads() tries to parse [DONE]. It crashes right away.

# Wrong — crashes on [DONE]
for line in response.iter_lines():
    chunk = json.loads(line.decode("utf-8")[6:])

# Correct — check for stop signal
for line in response.iter_lines():
    if not line:
        continue
    text = line.decode("utf-8")
    if text == "data: [DONE]":
        break
    if text.startswith("data: "):
        chunk = json.loads(text[6:])

Switching LLM Providers

Your chatbot code works with any provider that uses the OpenAI format. Many do. Just change the URL, key, and model name.

| Provider | Endpoint | Model Example |
|---|---|---|
| OpenAI | api.openai.com/v1/chat/completions | gpt-3.5-turbo |
| Groq | api.groq.com/openai/v1/chat/completions | llama3-8b-8192 |
| Together AI | api.together.xyz/v1/chat/completions | meta-llama/Llama-3-8b-chat-hf |
| Ollama (local) | localhost:11434/v1/chat/completions | llama3 |

# Switch to Groq (free tier available)
API_URL = "https://api.groq.com/openai/v1/chat/completions"
API_KEY = "your-groq-key-here"

payload = {
    "model": "llama3-8b-8192",
    "messages": chat_history,
    "stream": True
}

The OpenAI message format is now a standard. That’s great news — your code works across providers with no changes.
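One convenient pattern (a sketch with hypothetical names, not a prescribed API) is a small registry keyed by provider:

```python
# Hypothetical provider registry: endpoint + default model per provider.
# Entries mirror the table above; add your own as needed.
PROVIDERS = {
    "openai": ("https://api.openai.com/v1/chat/completions", "gpt-3.5-turbo"),
    "groq": ("https://api.groq.com/openai/v1/chat/completions", "llama3-8b-8192"),
    "ollama": ("http://localhost:11434/v1/chat/completions", "llama3"),
}

def build_request(provider, messages, stream=True):
    # Returns the URL and payload; the caller supplies headers and API key.
    url, model = PROVIDERS[provider]
    return url, {"model": model, "messages": messages, "stream": stream}

url, payload = build_request("groq", [{"role": "user", "content": "Hi"}])
print(url)
print(payload["model"])  # llama3-8b-8192
```

Switching providers then becomes a one-word change at the call site instead of edits scattered through your code.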

Quick Reference

| Component | What It Does | Key Code |
|---|---|---|
| API call | Send messages, get reply | `requests.post(URL, headers=h, json=p)` |
| Message format | Structure each message | `{"role": "user", "content": "text"}` |
| System prompt | Set chatbot personality | First message with `role: "system"` |
| Memory | Remember past exchanges | Append user + assistant to a list |
| Streaming | Show reply in real time | `"stream": True` + `iter_lines()` |
| Trimming | Stay in token limits | Keep system msg + last N messages |
| Error handling | Handle API failures | Check `status_code` before parsing |
| Token tracking | Monitor costs | Read `data["usage"]` from response |

Summary

You started with a single API call. You ended with a fully interactive chatbot.

Here’s what you built. LLM APIs are stateless — each call starts fresh. Chat memory is a Python list that grows with each turn. Three roles (system, user, assistant) shape how the model acts. Streaming uses Server-Sent Events for live output. Long chats need trimming to stay under token limits.

Practice exercise: Build a quiz bot. It asks Python questions, evaluates your answers, and keeps score. Use the system message for behavior, memory for state, and streaming for flow.

Solution outline
class QuizBot(Chatbot):
    def __init__(self, api_key):
        prompt = (
            "You are a Python quiz master. Ask one question "
            "at a time. Say if the answer is right or wrong. "
            "Keep a running score. On 'stop', give final score."
        )
        super().__init__(api_key, prompt)

    def start(self):
        print("Python Quiz! Type 'stop' to end.\n")
        self.send("Start with an easy Python question.")
        print()
        while True:
            answer = input("Your answer: ")
            if answer.lower() == "stop":
                self.send("User stopped. Give final score.")
                break
            print("Quiz Master: ", end="")
            self.send(answer)
            print()

quiz = QuizBot("your-api-key-here")
quiz.start()

**How it works:** The system message defines quiz behavior. Each `send()` passes the user’s answer. The full history lets the model track questions, correctness, and score — all through the growing message list.

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code: Build a Python AI Chatbot with Conversation Memory
# Requires: pip install requests
# Python 3.9+

import requests
import json
import os

API_KEY = os.environ.get("OPENAI_API_KEY", "your-api-key-here")
API_URL = "https://api.openai.com/v1/chat/completions"

class Chatbot:
    def __init__(self, api_key, system_prompt, max_msgs=20):
        self.api_url = (
            "https://api.openai.com/v1/chat/completions"
        )
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        self.max_msgs = max_msgs
        self.history = [
            {"role": "system", "content": system_prompt}
        ]

    def _trim(self):
        if len(self.history) > self.max_msgs:
            sys_msg = self.history[0]
            recent = self.history[-(self.max_msgs - 1):]
            self.history = [sys_msg] + recent

    def send(self, user_message):
        self._trim()
        self.history.append(
            {"role": "user", "content": user_message}
        )
        payload = {
            "model": "gpt-3.5-turbo",
            "messages": self.history,
            "stream": True
        }
        resp = requests.post(
            self.api_url, headers=self.headers,
            json=payload, stream=True
        )
        if resp.status_code != 200:
            print(f"\nError {resp.status_code}: {resp.text[:100]}")
            self.history.pop()
            return None

        full_reply = ""
        for line in resp.iter_lines():
            if not line:
                continue
            text = line.decode("utf-8")
            if text == "data: [DONE]":
                break
            if text.startswith("data: "):
                chunk = json.loads(text[6:])
                delta = chunk["choices"][0]["delta"]
                if "content" in delta:
                    token = delta["content"]
                    print(token, end="", flush=True)
                    full_reply += token

        print()
        self.history.append(
            {"role": "assistant", "content": full_reply}
        )
        return full_reply

    def run(self):
        print("Chatbot ready! Type 'quit' to exit.\n")
        while True:
            user_input = input("You: ")
            if user_input.lower() in ("quit", "exit"):
                print("Goodbye!")
                break
            print("Bot: ", end="")
            self.send(user_input)
            print()

if __name__ == "__main__":
    bot = Chatbot(
        api_key=API_KEY,
        system_prompt="You are a helpful Python tutor. "
                      "Give concise answers with code.",
        max_msgs=20
    )
    bot.run()

Frequently Asked Questions

Can I use this with local models like Ollama?

Yes. Ollama runs an OpenAI-compatible endpoint at http://localhost:11434/v1/chat/completions. Change API_URL and the model name. No API key needed. The message format and streaming work the same way.
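Here is a minimal sketch of the configuration changes, assuming Ollama is running locally and you have pulled a model named llama3 (swap in whatever model you have):

# Point the same request shape at a local Ollama server.
# No Authorization header is required for a local instance.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

headers = {"Content-Type": "application/json"}
payload = {
    "model": "llama3",  # any model pulled with `ollama pull`
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}
# requests.post(OLLAMA_URL, headers=headers, json=payload, stream=True)

Everything else — the message list, the streaming loop, the history handling — stays identical.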

How do I save and load a chat?

The history is a list of dictionaries. JSON handles it directly:

# Save
with open("chat_history.json", "w") as f:
    json.dump(bot.history, f)

# Load
with open("chat_history.json", "r") as f:
    bot.history = json.load(f)

Why does my chatbot give shorter answers over time?

The context window is shared between input and output. As history grows, fewer tokens remain for the response. Trim history more aggressively, or use a model with a bigger window (GPT-4-turbo has 128K tokens).
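If you want to trim by token budget rather than message count, here is a rough sketch. It uses the common rule of thumb of ~4 characters per token — an approximation, not an exact count (a real implementation would use a tokenizer like tiktoken):

# Trim history to an approximate token budget, always keeping the
# system prompt and the most recent messages.
def trim_to_budget(history, max_tokens=3000):
    def est_tokens(msg):
        # ~4 chars per token, plus a small per-message overhead
        return len(msg["content"]) // 4 + 4

    sys_msg, rest = history[0], history[1:]
    budget = max_tokens - est_tokens(sys_msg)
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        budget -= est_tokens(msg)
        if budget < 0:
            break
        kept.append(msg)
    return [sys_msg] + list(reversed(kept))

This leaves the response more room than a fixed message count does, because one long message no longer counts the same as ten short ones.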

Is the requests library good enough for production?

For one user, yes. For a web app with many users, switch to an async client like httpx or aiohttp. The requests library blocks the whole thread while it waits on the network; async clients can serve many conversations concurrently in one process.

How much does each chat cost?

GPT-3.5-turbo runs about $0.0005 per 1K input tokens and $0.0015 per 1K output tokens (as of early 2025). A 10-turn chat costs roughly $0.002 to $0.005. Less than a cent.
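A quick back-of-envelope estimator using the prices quoted above (early-2025 figures — check current pricing before relying on them):

# Estimate chat cost from token counts, at $ per 1K tokens.
def chat_cost(input_tokens, output_tokens,
              in_price=0.0005, out_price=0.0015):
    return (input_tokens / 1000 * in_price
            + output_tokens / 1000 * out_price)

# A 10-turn chat resends the growing history every turn,
# so input tokens dominate.
print(f"${chat_cost(4000, 2000):.4f}")  # prints $0.0050

Note that input tokens grow quadratically with turn count when you resend full history — another reason trimming matters.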
