Build an AI Chatbot with Memory in Python (2026)
Build a Python AI chatbot with conversation memory that actually remembers. Raw HTTP tutorial with streaming, 3 hands-on exercises, and complete code you can run today.
Send messages to an LLM, keep the chat alive, and stream replies token by token — using only raw HTTP requests.
You type “What’s the capital of France?” The chatbot says “Paris.” You follow up with “What’s its population?” — and the chatbot has no clue what “its” refers to. It forgot everything.
Every API call to an LLM starts fresh. The model doesn’t remember what you said ten seconds ago. That’s the problem this article solves.
You’ll build a chatbot that remembers the full chat, streams replies in real time, and runs on raw Python. No SDKs needed.
Why LLM Chatbot Conversations Are Stateless
Here’s what surprises most beginners. When you call the ChatGPT API, the model doesn’t keep a chat going in the background. Each call is on its own. The model gets your message, replies, and forgets.
So how does ChatGPT seem to remember? The client sends the entire chat history with every request. Every user message and every assistant reply goes back to the API each time.
# What actually happens behind the scenes
# Request 1: You send one message
messages = [
{"role": "user", "content": "What's the capital of France?"}
]
# Response: "The capital of France is Paris."
# Request 2: You send BOTH messages
messages = [
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What's its population?"}
]
# Now the model sees "its" refers to Paris
The model reads the full list top to bottom. It sees the context. It connects “its” to “Paris.” No magic memory — just a growing list.
Key Insight: LLMs have zero memory between API calls. The illusion of memory comes from resending the full chat with every request. Your code manages the memory, not the model.
This has a practical cost. The message list grows with each exchange. Every API call uses more tokens as the chat gets longer. You’ll need a trimming strategy — we’ll cover that soon.
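You can see this growth with the rough 4-characters-per-token rule of thumb this article uses later. This is a sketch with dummy messages, not real tokenizer counts:

```python
# Sketch: how the token cost of each request grows as history accumulates.
# Uses the rough "1 token is about 4 characters" heuristic, not a real tokenizer.

def estimate_tokens(messages):
    return sum(len(m["content"]) // 4 for m in messages)

history = [{"role": "system", "content": "You are a helpful assistant."}]

# Simulate five exchanges: ~200-character questions, ~400-character answers
for turn in range(1, 6):
    history.append({"role": "user", "content": "q" * 200})
    history.append({"role": "assistant", "content": "a" * 400})
    print(f"Turn {turn}: {len(history)} messages, ~{estimate_tokens(history)} tokens sent")
```

Every request resends everything before it, so the estimated token count climbs by a fixed amount per turn even though each new exchange is the same size.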
Prerequisites
- Python version: 3.9+
- Required library: requests
- Install: pip install requests
- API key: An OpenAI API key (platform.openai.com)
- Time to complete: 20-25 minutes
Note: You need an OpenAI API key. Sign up at platform.openai.com, go to API Keys, and create a new secret key. Keep it safe — treat it like a password. The examples here use GPT-3.5-turbo, which costs fractions of a cent per request.
Your First Chatbot API Call with Raw HTTP
Most tutorials reach for the openai Python SDK. We won’t. We’ll use the requests library instead.
Why? First, you’ll see exactly what happens at the network level. Second, requests works in Pyodide (browser Python), but the OpenAI SDK doesn’t.
The endpoint lives at https://api.openai.com/v1/chat/completions. You send a POST request with your API key in the header and messages in the JSON body. The model field picks the model. The messages field holds the chat.
import requests
import json
import os
API_KEY = os.environ.get("OPENAI_API_KEY", "your-api-key-here")
API_URL = "https://api.openai.com/v1/chat/completions"
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
payload = {
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "What is Python?"}
]
}
response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()
print(data["choices"][0]["message"]["content"])
Result:
Python is a high-level, interpreted programming language known
for its simplicity and readability. It supports multiple
programming paradigms and has a large standard library.
One POST request. That’s all it takes. The response JSON has a choices array. Each choice contains a message with role and content. We grab choices[0] because we asked for one response.
Tip: Always check `response.status_code` before parsing. A 401 means a bad API key. A 429 means rate limits. A 500 means their servers broke. Handle these in production.
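The check itself is only a few lines. Here's one hedged sketch; the helper name and messages are my own, not from the OpenAI docs:

```python
# Sketch: translate common HTTP status codes into actionable hints
# before attempting to parse the response body as JSON.

def describe_api_error(status_code):
    """Return a human-readable hint for common OpenAI API status codes."""
    hints = {
        401: "Unauthorized - check that your API key is valid.",
        429: "Rate limited - slow down or check your usage quota.",
        500: "Server error - retry after a short delay.",
    }
    return hints.get(status_code, f"Unexpected status {status_code}.")

# In real code you'd call this whenever the request doesn't succeed:
# if response.status_code != 200:
#     print(describe_api_error(response.status_code))
print(describe_api_error(401))
```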
Controlling the Chatbot’s Output
Two settings shape every reply: temperature and max_tokens. The temperature controls how random the output is. Set it to 0 for steady, same-every-time answers. Set it to 1.0 for creative, varied replies. The default is 1.0.
max_tokens caps how long the response can be. If you don’t set it, the model uses whatever tokens remain in its context window.
payload = {
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Write a haiku about Python."}
],
"temperature": 0.7,
"max_tokens": 100
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
Output:
Indented with care,
Loops and lists in harmony,
Code that reads like prose.
For a chatbot tutor, I’d use temperature=0.3. Low enough for accurate answers. High enough to avoid robotic repetition.
Tip: Use `temperature=0` when you need repeatable output. This is great for testing and debugging. Bump it to 0.5-0.7 for more natural chat.
The Message Format — Roles Explained
Every message needs two fields: role and content. There are three roles. Each one shapes how the model acts.
| Role | Purpose | When to Use |
|---|---|---|
system | Sets personality and rules | Once, at chat start |
user | The human’s input | Every time the user types |
assistant | The model’s past responses | So the model “remembers” |
The system message is your control lever. Want a chatbot that speaks like a pirate? System message. Want one that only answers Python questions? System message.
Here’s how a system message shapes a response. We’ll set the model to be a Python tutor that gives short answers with code.
messages = [
{
"role": "system",
"content": "You are a helpful Python tutor. "
"Keep answers short. Include code examples."
},
{
"role": "user",
"content": "How do I reverse a list?"
}
]
payload = {"model": "gpt-3.5-turbo", "messages": messages}
response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()
print(data["choices"][0]["message"]["content"])
The model responds briefly with code:
You can reverse a list using reverse() or slicing:
my_list = [1, 2, 3, 4, 5]
my_list.reverse()
print(my_list) # [5, 4, 3, 2, 1]
# Or use slicing (creates a new list)
reversed_list = my_list[::-1]
Without that system message, the answer would be longer and less focused. The system role is powerful — use it.
Building Conversation Memory in Python
Here’s where the real work starts. You need a place to store every message. Then you send that full history with each API call.
The simplest approach? A plain Python list. Each element is a dictionary with role and content. Append the user’s message. Append the model’s reply. The list grows — that’s your memory.
The chat() function below does the work. It takes user text, adds it to history, calls the API with the full history, saves the reply, and returns it.
chat_history = [
{
"role": "system",
"content": "You are a friendly Python tutor. Be concise."
}
]
def chat(user_message):
chat_history.append(
{"role": "user", "content": user_message}
)
payload = {
"model": "gpt-3.5-turbo",
"messages": chat_history
}
response = requests.post(
API_URL, headers=headers, json=payload
)
data = response.json()
assistant_message = data["choices"][0]["message"]["content"]
chat_history.append(
{"role": "assistant", "content": assistant_message}
)
return assistant_message
No output from this block — it just defines the function.
print(chat("What are Python decorators?"))
Response:
Decorators are functions that modify the behavior of other
functions. You write them with the @ symbol above a function
definition.
That’s turn one. The history now has three items: system message, user question, assistant reply.
What happens on a follow-up? Let’s find out.
print(chat("Can you show me a simple example?"))
Response:
Sure! Here's a basic decorator:
def my_decorator(func):
def wrapper():
print("Before the function")
func()
print("After the function")
return wrapper
@my_decorator
def say_hello():
print("Hello!")
say_hello()
We said “a simple example” — not “a decorator example.” The model connected the dots because it received the full history. It saw our earlier question about decorators.
Quick check: What would happen if we didn’t append assistant messages to the history? Try to predict before reading on.
The answer: the model would lose context. It wouldn’t know what it had already told you. Follow-up questions would fail.
Let’s verify the history is growing as expected.
for msg in chat_history:
role = msg["role"].upper()
content = msg["content"][:60]
print(f"[{role}] {content}...")
[SYSTEM] You are a friendly Python tutor. Be concise....
[USER] What are Python decorators?...
[ASSISTANT] Decorators are functions that modify the behavior o...
[USER] Can you show me a simple example?...
[ASSISTANT] Sure! Here's a basic decorator:
def my_decorator(...
Five messages. The list grows by two per exchange. One user message. One assistant reply.
Key Insight: Conversation memory is just a Python list. Append each message, send the full list every time, and the model acts like it remembers. Simple code. Powerful concept.
[TRY IT YOURSELF] Exercise 1: Build a Personality Chatbot
You’ve seen how system messages shape behavior. Time to build your own.
Task: Create a chat_history list with a system message that makes the chatbot act as a sarcastic movie critic. Write a chat() function with memory. Send two messages: ask for a movie review, then ask a follow-up.
Streaming Chatbot Responses Token by Token
When you use ChatGPT’s web interface, text appears word by word. That’s streaming. Without it, you stare at a blank screen for seconds while the model generates the full response.
How does it work? The API uses Server-Sent Events (SSE). Think of SSE as a one-way data pipe. The server sends small chunks to you as they’re ready. You don’t wait for the full reply.
To enable streaming, add "stream": True to your payload. But you can’t use response.json() anymore. Instead, the API sends a series of text lines. Each starts with data: followed by JSON.
The final line says data: [DONE]. That’s the stop signal. Inside each chunk, the new text lives at choices[0]["delta"]["content"].
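You can test the parsing logic in isolation before wiring it to a live request. The sample lines below are hand-written stand-ins for real API chunks (the first delta carries only a role, mimicking the API's opening chunk):

```python
import json

# Hand-written sample SSE lines mimicking the API's streaming format.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    'data: [DONE]',
]

reply = ""
for line_text in sample_lines:
    if line_text == "data: [DONE]":
        break                              # stop signal - not JSON, don't parse it
    if line_text.startswith("data: "):
        chunk = json.loads(line_text[6:])  # strip the "data: " prefix
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:             # the opening chunk has no text
            reply += delta["content"]

print(reply)  # Hello world
```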
def chat_stream(user_message):
chat_history.append(
{"role": "user", "content": user_message}
)
payload = {
"model": "gpt-3.5-turbo",
"messages": chat_history,
"stream": True
}
response = requests.post(
API_URL, headers=headers, json=payload,
stream=True
)
full_reply = ""
for line in response.iter_lines():
if not line:
continue
line_text = line.decode("utf-8")
if line_text == "data: [DONE]":
break
if line_text.startswith("data: "):
chunk = json.loads(line_text[6:])
delta = chunk["choices"][0]["delta"]
if "content" in delta:
token = delta["content"]
print(token, end="", flush=True)
full_reply += token
print()
chat_history.append(
{"role": "assistant", "content": full_reply}
)
return full_reply
Four things happen here. The user message goes into history. The request fires with streaming on in both the payload and the requests.post() call. Each chunk prints as it shows up. The full reply gets saved to memory.
chat_stream("Explain list comprehensions in 3 sentences.")
Streamed output:
A list comprehension creates a new list by applying an expression
to each item in an iterable. The syntax is [expression for item
in iterable if condition]. It's a concise alternative to a for
loop with append.
The first word shows up in under 200 milliseconds. Without streaming, you’d wait 2-3 seconds in silence.
Warning: You need `stream=True` in TWO places. In the JSON payload (tells the API to stream). And in `requests.post(…, stream=True)` (tells `requests` to read chunks, not buffer everything).
[TRY IT YOURSELF] Exercise 2: Stream with a Word Counter
Streaming is working. Let’s extend it to track output length.
Task: Modify chat_stream() to count words in the response. Print the total word count after streaming finishes.
Managing Long Conversations
Every model has a context window — a token limit per request. GPT-3.5-turbo handles 4,096 tokens (about 3,000 words). GPT-4-turbo goes up to 128,000 tokens.
What if your history exceeds the limit? The API returns an error. Your chatbot crashes.
You have three strategies. Each has tradeoffs.
Strategy 1: Sliding Window
Keep only the last N messages. Simple and predictable. The downside: the chatbot forgets early context.
def trim_history(history, max_messages=20):
system_msg = history[0]
if len(history) > max_messages:
return [system_msg] + history[-(max_messages - 1):]
return history
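A quick sanity check with a fake history confirms the window behaves as intended. The dummy messages are placeholders, and trim_history is repeated so the block runs on its own:

```python
# Sanity check of the sliding-window trim with a fake 30-message history.
# trim_history is the same function defined above, repeated for a standalone run.

def trim_history(history, max_messages=20):
    system_msg = history[0]
    if len(history) > max_messages:
        return [system_msg] + history[-(max_messages - 1):]
    return history

fake = [{"role": "system", "content": "Be concise."}]
for i in range(30):
    role = "user" if i % 2 == 0 else "assistant"
    fake.append({"role": role, "content": f"message {i}"})

trimmed = trim_history(fake, max_messages=20)
print(len(trimmed))            # 20 - capped at the window size
print(trimmed[0]["role"])      # system - personality survives the trim
print(trimmed[-1]["content"])  # message 29 - newest messages kept
```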
Strategy 2: Summarize Old Messages
Ask the model to shrink the chat so far. Swap old messages with that summary. You keep key facts and cut tokens.
def summarize_history(history):
summary_prompt = (
"Summarize this chat in 2-3 sentences, "
"keeping key facts and context:\n\n"
)
for msg in history[1:]:
summary_prompt += f"{msg['role']}: {msg['content']}\n"
payload = {
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": summary_prompt}
]
}
resp = requests.post(API_URL, headers=headers, json=payload)
summary = resp.json()["choices"][0]["message"]["content"]
return [
history[0],
{"role": "assistant",
"content": f"[Summary: {summary}]"}
]
Strategy 3: Token Counting
Count tokens and trim to fit. The most exact option. Rough rule: 1 token is about 4 English characters.
def estimate_tokens(text):
return len(text) // 4
def trim_to_token_limit(history, max_tokens=3000):
system_msg = history[0]
total = estimate_tokens(system_msg["content"])
trimmed = [system_msg]
for msg in reversed(history[1:]):
msg_tokens = estimate_tokens(msg["content"])
if total + msg_tokens > max_tokens:
break
trimmed.insert(1, msg)
total += msg_tokens
return trimmed
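The same kind of sanity check works for the token-based trim. Both functions are repeated here so the block runs standalone; the 400-character filler messages are dummies worth ~100 estimated tokens each:

```python
# Sanity check: the token-based trim keeps the newest messages that fit
# under the budget, plus the system message.

def estimate_tokens(text):
    return len(text) // 4

def trim_to_token_limit(history, max_tokens=3000):
    system_msg = history[0]
    total = estimate_tokens(system_msg["content"])
    trimmed = [system_msg]
    for msg in reversed(history[1:]):
        msg_tokens = estimate_tokens(msg["content"])
        if total + msg_tokens > max_tokens:
            break
        trimmed.insert(1, msg)
        total += msg_tokens
    return trimmed

# Ten dummy messages of ~100 estimated tokens (400 characters) each.
history = [{"role": "system", "content": "Be brief."}]
history += [{"role": "user", "content": "x" * 400} for _ in range(10)]

trimmed = trim_to_token_limit(history, max_tokens=500)
print(len(trimmed))  # the system message plus the newest messages under budget
```

With a 500-token budget, only the four newest ~100-token messages fit alongside the system message.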
I’d start with Strategy 1 for simple chatbots. It’s easy to debug. Move to Strategy 2 when you need long-term context.
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Sliding window | Simple, predictable | Loses old context | Short chats |
| Summarize | Preserves key facts | Extra API call | Long chats |
| Token counting | Precise control | More code | Production |
Tracking Token Usage and Costs
Every non-streaming API reply includes a usage field. It tells you how many tokens you spent. This matters — you pay per token.
payload = {
"model": "gpt-3.5-turbo",
"messages": chat_history
}
response = requests.post(API_URL, headers=headers, json=payload)
data = response.json()
usage = data["usage"]
print(f"Prompt tokens: {usage['prompt_tokens']}")
print(f"Completion tokens: {usage['completion_tokens']}")
print(f"Total tokens: {usage['total_tokens']}")
Example output:
Prompt tokens: 45
Completion tokens: 82
Total tokens: 127
Prompt tokens = what you send. Completion tokens = what the model generates. As the chat grows, prompt tokens climb fast. That’s exactly why trimming matters.
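Turning a usage dict into dollars is simple arithmetic. This sketch uses the GPT-3.5-turbo prices quoted in the FAQ at the end of this article; check current pricing before relying on the defaults:

```python
# Sketch: estimate the dollar cost of one exchange from the API's usage dict.
# Default rates are $0.0005/1K input and $0.0015/1K output tokens, as quoted
# later in this article - verify against current pricing before using.

def estimate_cost(usage, input_per_1k=0.0005, output_per_1k=0.0015):
    return (usage["prompt_tokens"] / 1000 * input_per_1k
            + usage["completion_tokens"] / 1000 * output_per_1k)

usage = {"prompt_tokens": 45, "completion_tokens": 82, "total_tokens": 127}
print(f"${estimate_cost(usage):.6f}")
```

Logging this per exchange makes it obvious when a growing history starts inflating your prompt-token bill.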
Tip: For streaming mode, add `"stream_options": {"include_usage": true}` to your payload. The usage data arrives in the final chunk. OpenAI added this feature in 2024.
Now you have all the building blocks. Let’s put them together.
The Complete Python AI Chatbot — All Pieces Combined
Time to put it all in one clean class. This version handles system prompts, memory, streaming, trimming, and errors.
The Chatbot class has four methods. __init__ sets up the API key and history. _trim enforces the sliding window. send handles one exchange. run starts the chat loop.
class Chatbot:
def __init__(self, api_key, system_prompt, max_msgs=20):
self.api_url = (
"https://api.openai.com/v1/chat/completions"
)
self.headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
}
self.max_msgs = max_msgs
self.history = [
{"role": "system", "content": system_prompt}
]
def _trim(self):
if len(self.history) > self.max_msgs:
sys_msg = self.history[0]
recent = self.history[-(self.max_msgs - 1):]
self.history = [sys_msg] + recent
The send method does the hard work. It trims first, adds the user message, fires the stream request, and saves the full reply.
def send(self, user_message):
self._trim()
self.history.append(
{"role": "user", "content": user_message}
)
payload = {
"model": "gpt-3.5-turbo",
"messages": self.history,
"stream": True
}
resp = requests.post(
self.api_url, headers=self.headers,
json=payload, stream=True
)
if resp.status_code != 200:
print(f"\nError {resp.status_code}: {resp.text[:100]}")
self.history.pop()
return None
full_reply = ""
for line in resp.iter_lines():
if not line:
continue
text = line.decode("utf-8")
if text == "data: [DONE]":
break
if text.startswith("data: "):
chunk = json.loads(text[6:])
delta = chunk["choices"][0]["delta"]
if "content" in delta:
token = delta["content"]
print(token, end="", flush=True)
full_reply += token
print()
self.history.append(
{"role": "assistant", "content": full_reply}
)
return full_reply
The run method creates the chat loop. It reads input until the user types “quit.”
def run(self):
print("Chatbot ready! Type 'quit' to exit.\n")
while True:
user_input = input("You: ")
if user_input.lower() in ("quit", "exit"):
print("Goodbye!")
break
print("Bot: ", end="")
self.send(user_input)
print()
Start it up:
bot = Chatbot(
api_key="your-api-key-here",
system_prompt="You are a helpful Python tutor. "
"Give concise answers with code examples.",
max_msgs=20
)
bot.run()
Sample session:
Chatbot ready! Type 'quit' to exit.
You: What's a dictionary in Python?
Bot: A dictionary stores key-value pairs. Create one with curly
braces: my_dict = {"name": "Alice", "age": 30}. Access values
by key: my_dict["name"] returns "Alice".
You: How do I add a new key?
Bot: Use assignment: my_dict["email"] = "alice@example.com".
If the key doesn't exist, it gets created.
You: quit
Goodbye!
About 60 lines of real logic. No frameworks. No SDKs. Just requests, json, and a Python list.
[TRY IT YOURSELF] Exercise 3: Add Error Handling
Real API calls fail. Networks drop. Keys expire. Rate limits hit.
Task: Modify the basic chat() function from earlier to handle errors. If response.status_code isn't 200, print a friendly error with the status code. Don't add anything to history. Return None. (The Chatbot class above already does this in send(); peek at it if you get stuck.)
Common Mistakes and How to Fix Them
Mistake 1: Forgetting to save the assistant’s reply
The most common bug. You append the user message and call the API. But you never add the reply to history. Next turn, the model can’t see what it said before.
# Wrong — reply never saved
def chat(msg):
chat_history.append({"role": "user", "content": msg})
resp = requests.post(API_URL, headers=headers,
json={"model": "gpt-3.5-turbo",
"messages": chat_history})
return resp.json()["choices"][0]["message"]["content"]
# Correct — save both sides
def chat(msg):
chat_history.append({"role": "user", "content": msg})
resp = requests.post(API_URL, headers=headers,
json={"model": "gpt-3.5-turbo",
"messages": chat_history})
reply = resp.json()["choices"][0]["message"]["content"]
chat_history.append({"role": "assistant", "content": reply})
return reply
Mistake 2: Missing stream=True on the requests call
You set "stream": True in the payload but forget it in requests.post(). The API streams chunks, but requests waits for them all. No real-time output.
# Wrong — buffers everything
response = requests.post(API_URL, headers=headers, json=payload)
# Correct — reads piece by piece
response = requests.post(API_URL, headers=headers,
json=payload, stream=True)
Mistake 3: Losing the system message when trimming
You slice from the end and accidentally chop the system message. The chatbot loses its personality.
# Wrong — system message gone
history = history[-10:]
# Correct — always keep system message first
system = history[0]
history = [system] + history[-9:]
Mistake 4: Not handling the [DONE] signal
Skip this check and json.loads() tries to parse [DONE]. It crashes right away.
# Wrong — crashes on [DONE]
for line in response.iter_lines():
chunk = json.loads(line.decode("utf-8")[6:])
# Correct — check for stop signal
for line in response.iter_lines():
if not line:
continue
text = line.decode("utf-8")
if text == "data: [DONE]":
break
if text.startswith("data: "):
chunk = json.loads(text[6:])
Switching LLM Providers
Your chatbot code works with any provider that uses the OpenAI format. Many do. Just change the URL, key, and model name.
| Provider | Endpoint | Model Example |
|---|---|---|
| OpenAI | api.openai.com/v1/chat/completions | gpt-3.5-turbo |
| Groq | api.groq.com/openai/v1/chat/completions | llama3-8b-8192 |
| Together AI | api.together.xyz/v1/chat/completions | meta-llama/Llama-3-8b-chat-hf |
| Ollama (local) | localhost:11434/v1/chat/completions | llama3 |
# Switch to Groq (free tier available)
API_URL = "https://api.groq.com/openai/v1/chat/completions"
API_KEY = "your-groq-key-here"
payload = {
"model": "llama3-8b-8192",
"messages": chat_history,
"stream": True
}
The OpenAI message format is now a de facto standard. That's great news — your code works across providers with only these small configuration changes.
Quick Reference
| Component | What It Does | Key Code |
|---|---|---|
| API call | Send messages, get reply | requests.post(URL, headers=h, json=p) |
| Message format | Structure each message | {"role": "user", "content": "text"} |
| System prompt | Set chatbot personality | First message with role: "system" |
| Memory | Remember past exchanges | Append user + assistant to a list |
| Streaming | Show reply in real time | "stream": True + iter_lines() |
| Trimming | Stay in token limits | Keep system msg + last N messages |
| Error handling | Handle API failures | Check status_code before parsing |
| Token tracking | Monitor costs | Read data["usage"] from response |
Summary
You started with a single API call. You ended with a fully interactive chatbot.
Here’s what you built. LLM APIs are stateless — each call starts fresh. Chat memory is a Python list that grows with each turn. Three roles (system, user, assistant) shape how the model acts. Streaming uses Server-Sent Events for live output. Long chats need trimming to stay under token limits.
Practice exercise: Build a quiz bot. It asks Python questions, evaluates your answers, and keeps score. Use the system message for behavior, memory for state, and streaming for flow.
Frequently Asked Questions
Can I use this with local models like Ollama?
Yes. Ollama runs an OpenAI-compatible endpoint at http://localhost:11434/v1/chat/completions. Change API_URL and the model name. No API key needed. The message format and streaming work the same way.
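Concretely, the switch is just configuration. A sketch, assuming a local Ollama instance where you've already run `ollama pull llama3`:

```python
# Configuration sketch for pointing the same chatbot at a local Ollama server.
# Assumes Ollama is running locally and the llama3 model has been pulled.
API_URL = "http://localhost:11434/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
    # Ollama doesn't check the key, but some client code expects the header:
    "Authorization": "Bearer ollama",
}

payload = {
    "model": "llama3",  # any model you've pulled locally
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}
print(API_URL)
```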
How do I save and load a chat?
The history is a list of dictionaries. JSON handles it directly:
# Save
with open("chat_history.json", "w") as f:
json.dump(chat_history, f)
# Load
with open("chat_history.json", "r") as f:
chat_history = json.load(f)
Why does my chatbot give shorter answers over time?
The context window is shared between input and output. As history grows, fewer tokens remain for the response. Trim more aggressively or use a model with a bigger window (GPT-4-turbo has 128K tokens).
Is the requests library good enough for production?
For one user, yes. For a web app with many users, switch to an async client like httpx or aiohttp. The requests library blocks while it waits. Async tools handle many users at once much better.
How much does each chat cost?
GPT-3.5-turbo runs about $0.0005 per 1K input tokens and $0.0015 per 1K output tokens (as of early 2025). A 10-turn chat costs roughly $0.002 to $0.005. Less than a cent.