LangGraph + FastAPI: Build a Full-Stack AI App
Build a full-stack AI app from scratch with FastAPI and LangGraph — streaming responses, saved chats, API key auth, and a live chat frontend in one project.
You have built LangGraph agents in Jupyter notebooks. They work great — until someone asks you to put one in front of real users. All of a sudden you need a web API, chat memory, streaming replies, and login checks. A notebook cannot do any of that.
This project bridges that gap. You will build a real AI app end to end: a FastAPI backend that serves a LangGraph agent over HTTP, streams tokens in real time, saves chats to a database, and locks things down with API key auth.
Before we write a single line of code, let me show you how the whole system fits together.
A user opens a chat page and types a message. That message hits a FastAPI endpoint as a POST request. The server loads the user’s chat history from SQLite, feeds it into a LangGraph agent, and starts streaming. The agent works through the message — calling tools if it needs more data — and sends each token back through Server-Sent Events. When the agent wraps up, the reply gets saved to the database. The user sees tokens pop up one at a time, just like ChatGPT.
Six pieces to build: the LangGraph agent graph, the storage layer, the FastAPI routes, the streaming setup, the auth layer, and the chat frontend. Each piece links to the next through clean hooks. We will build them one at a time and wire them up at the end.
What Does the Finished App Look Like?
Here is what makes this more than a notebook demo. Four traits that real apps demand:
Streaming replies — tokens show up one by one through SSE. Nobody stares at a blank screen for 10 seconds.
Chat memory — close the tab, come back tomorrow, and your full chat is still there. Every message lives in SQLite.
API key auth — each request must carry a valid X-API-Key header. Skip it and the server sends a 401 right back.
Tool-calling agent — the LangGraph agent can reach out to tools (web search, math) and keep looping until it has enough info to reply.
The layout at a glance:
text
User (Browser / curl)
│
▼
┌─────────────────────┐
│ FastAPI Backend │
│ ┌───────────────┐ │
│ │ Auth Layer │ │
│ │ (API Key) │ │
│ └──────┬────────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │ /chat/stream │ │
│ │ (SSE route) │ │
│ └──────┬────────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │ LangGraph │ │
│ │ Agent Graph │ │
│ └──────┬────────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │ SQLite DB │ │
│ │ (messages) │ │
│ └───────────────┘ │
└─────────────────────┘
Prerequisites
- Python version: 3.10+
- Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+), fastapi (0.115+), uvicorn (0.34+), sse-starlette (2.0+), aiosqlite (0.21+)
- Install: pip install langgraph langchain-openai langchain-core fastapi uvicorn sse-starlette aiosqlite python-dotenv
- API key: An OpenAI API key set as OPENAI_API_KEY. See OpenAI’s docs to create one.
- How long it takes: ~45 minutes
- What you should know: LangGraph basics (nodes, edges, state). If LangGraph is new to you, start with our LangGraph setup guide.
Step 1 — How Do You Build the LangGraph Agent?
The agent is the brain of the app. We are making a ReAct-style agent with two nodes: an agent node (the LLM) and a tools node (runs tool calls). A conditional edge loops between them until the model stops asking for tools.
This first block pulls in everything and sets up the stage. We grab the LLM wrapper, message types, the @tool tag, and LangGraph’s graph helpers.
python
import os
import json
from typing import Annotated, Any
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import (
HumanMessage,
AIMessage,
SystemMessage,
ToolMessage,
)
from langchain_core.tools import tool
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from typing_extensions import TypedDict
load_dotenv()
Next up, we define two tools. The @tool tag lets the LLM know what each function does and what args it expects. web_search returns a dummy result for now — in a real app you would wire it to Tavily or SerpAPI. calculator runs safe math with a locked-down eval.
python
@tool
def web_search(query: str) -> str:
"""Search the web for current information."""
# Stub — replace with Tavily, SerpAPI, or Brave Search
return f"Search results for '{query}': No live results (stub). Replace with a real search API."
@tool
def calculator(expression: str) -> str:
"""Evaluate a mathematical expression. Example: '2 + 2' returns '4'."""
try:
result = eval(expression, {"__builtins__": {}}, {})
return str(result)
except Exception as e:
return f"Error evaluating '{expression}': {e}"
tools = [web_search, calculator]
Warning: The `calculator` tool uses `eval()` with locked-down builtins. That is fine for a tutorial. In a real app, use a proper math parser like `numexpr` or `sympy`. Never feed raw user input into an open `eval()`.
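If you want something safer than `eval()` without pulling in `numexpr` or `sympy`, walking the expression's AST works for basic arithmetic. This is a minimal sketch using only the standard library; it supports the four operators plus powers and unary minus, and rejects everything else:

```python
import ast
import operator

# Supported operators; any other node type raises ValueError.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def safe_eval(expression: str) -> float:
    """Evaluate basic arithmetic by walking the AST instead of calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval").body)
```

Function calls, attribute access, and names never match a branch, so `safe_eval("__import__('os')")` raises instead of executing anything.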
The state for this graph is about as lean as it gets — nothing but a list of messages. The agent_node function sends the message list to the LLM (with tools attached). Then should_continue peeks at the last message. If the model asked for a tool, we route there. If not, the graph is done.
python
class AgentState(TypedDict):
messages: Annotated[list, add_messages]
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_with_tools = llm.bind_tools(tools)
def agent_node(state: AgentState) -> dict:
"""Call the LLM with the current conversation."""
system = SystemMessage(
content="You are a helpful assistant. Use tools when needed."
)
response = llm_with_tools.invoke([system] + state["messages"])
return {"messages": [response]}
def should_continue(state: AgentState) -> str:
"""Route to tools if the LLM requested a tool call."""
last_message = state["messages"][-1]
if hasattr(last_message, "tool_calls") and last_message.tool_calls:
return "tools"
return END
You might wonder: why tack the system prompt on every single time? The reason is that the message list from the database only holds user and assistant turns. The system prompt never gets saved — we inject it fresh each time the agent runs.
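The pattern is easy to see with plain role dicts standing in for LangChain message objects (a hypothetical sketch, not the app's actual types):

```python
# Plain-dict stand-ins for SystemMessage / HumanMessage / AIMessage.
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant. Use tools when needed."}

def build_prompt(history: list[dict]) -> list[dict]:
    # The system prompt is prepended fresh on every run; only user and
    # assistant turns are ever persisted to the database.
    return [SYSTEM_PROMPT] + history

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
prompt = build_prompt(history)
```

Because the system prompt lives in code, not in the database, you can change it later and every existing chat picks up the new behavior on its next turn.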
Time to assemble the graph. The ToolNode from LangGraph’s prebuilt kit takes care of running tool calls and piping results back. Two nodes, three edges, a single conditional — and that wraps up the entire agent.
python
def build_agent_graph():
"""Construct and compile the agent graph."""
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_continue, ["tools", END])
graph.add_edge("tools", "agent")
return graph.compile()
agent = build_agent_graph()
Let’s do a quick sanity check. Fire off a question that forces the math tool:
python
result = agent.invoke(
{"messages": [HumanMessage(content="What is 42 * 17?")]}
)
print(result["messages"][-1].content)
The agent fires off the math tool with 42 * 17, gets back 714, and writes its reply. If a number shows up on your screen, the agent is working.
Key Insight: The agent graph lives on its own, with no ties to the web layer. Messages go in, messages come out. You can run it in a plain script or a notebook without spinning up any server.
Step 2 — How Do You Save Chats with SQLite?
As things stand, each call to agent.invoke() starts with a blank slate. The agent cannot recall past chats at all. We need a storage layer that files messages under a chat ID and pulls them back when someone returns to a thread.
Why SQLite? No server to run, no config to write, and it ships right inside Python. Perfect for a tutorial. We pair it with aiosqlite because FastAPI runs async — a normal blocking DB call would lock up the event loop.
The ChatStore class covers three jobs: create the table, save a message, and load a chat’s history.
python
import aiosqlite
import uuid
from datetime import datetime
class ChatStore:
"""Async SQLite store for conversation history."""
def __init__(self, db_path: str = "chat_history.db"):
self.db_path = db_path
async def initialize(self):
"""Create the messages table if it doesn't exist."""
async with aiosqlite.connect(self.db_path) as db:
await db.execute("""
CREATE TABLE IF NOT EXISTS messages (
id TEXT PRIMARY KEY,
conversation_id TEXT NOT NULL,
role TEXT NOT NULL,
content TEXT NOT NULL,
created_at TEXT NOT NULL
)
""")
await db.commit()
Just one table with five columns. Every message gets its own UUID, a chat ID that ties it to a thread, a role (user or assistant), the actual content, and a timestamp.
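You can sanity-check the schema with the standard library's synchronous `sqlite3` against an in-memory database. This is a throwaway check, not part of the app:

```python
import sqlite3
import uuid
from datetime import datetime

# In-memory stand-in for the aiosqlite store, same schema as ChatStore.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        id TEXT PRIMARY KEY,
        conversation_id TEXT NOT NULL,
        role TEXT NOT NULL,
        content TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
""")

conv_id = str(uuid.uuid4())
db.execute(
    "INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
    (str(uuid.uuid4()), conv_id, "user", "What is 42 * 17?", datetime.now().isoformat()),
)

rows = db.execute(
    "SELECT role, content FROM messages WHERE conversation_id = ? ORDER BY created_at",
    (conv_id,),
).fetchall()
```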
The save and load methods are no surprise. save_message drops a new row in. get_messages pulls back every message for a given chat, sorted by time, and wraps each one as a LangChain message object so the agent can use it.
python
async def save_message(
self, conversation_id: str, role: str, content: str
):
"""Save a single message to the database."""
async with aiosqlite.connect(self.db_path) as db:
await db.execute(
"INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
(
str(uuid.uuid4()),
conversation_id,
role,
content,
datetime.now().isoformat(),
),
)
await db.commit()
async def get_messages(self, conversation_id: str) -> list:
"""Load all messages for a conversation."""
async with aiosqlite.connect(self.db_path) as db:
cursor = await db.execute(
"SELECT role, content FROM messages "
"WHERE conversation_id = ? ORDER BY created_at",
(conversation_id,),
)
rows = await cursor.fetchall()
messages = []
for role, content in rows:
if role == "user":
messages.append(HumanMessage(content=content))
else:
messages.append(AIMessage(content=content))
return messages
Tip: There is no need for a “create chat” endpoint. Generate a UUID on the client side. The moment you send the first message with that ID, the chat exists.
We keep this design bare on purpose. Each method opens and closes its own database link. Good enough for learning. In a live app, you would bring in a connection pool or lean on LangGraph’s built-in SqliteSaver checkpointer.
Step 3 — How Do You Stream Replies with FastAPI and SSE?
Now we tie the pieces together. The FastAPI app has one main route: POST /chat/stream. It accepts a message and a chat ID, loads past messages, runs the agent, and sends tokens back through Server-Sent Events.
Why go with SSE over WebSockets? SSE travels over regular HTTP, reconnects after dropped links by itself, and is about as simple as it gets for one-way data flow. The server pushes tokens out; the client sits and listens. That is exactly the pattern we need here.
The app warms up the chat store on launch using FastAPI’s lifespan hook.
python
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException, Depends
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from sse_starlette.sse import EventSourceResponse
chat_store = ChatStore()
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Initialize database on startup."""
await chat_store.initialize()
yield
app = FastAPI(title="LangGraph AI Agent", lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_methods=["*"],
allow_headers=["*"],
)
Warning: Do not use `allow_origins=["*"]` in a real app. Swap it for your actual frontend domain. The wildcard lets any website talk to your API.
The request shape is as small as it gets — a message string and a chat ID that defaults to None.
python
class ChatRequest(BaseModel):
message: str
conversation_id: str | None = None
Now for the main endpoint. It fetches chat history, kicks off the agent through astream_events, and yields each token as an SSE event. astream_events is how LangGraph delivers token-level streaming — every single token, tool call, and state update shows up as its own event object.
python
@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
"""Stream agent responses via Server-Sent Events."""
conv_id = request.conversation_id or str(uuid.uuid4())
# Load conversation history
history = await chat_store.get_messages(conv_id)
# Save the user message
await chat_store.save_message(conv_id, "user", request.message)
# Add the new message to history
history.append(HumanMessage(content=request.message))
async def event_generator():
full_response = ""
async for event in agent.astream_events(
{"messages": history}, version="v2"
):
kind = event["event"]
if kind == "on_chat_model_stream":
token = event["data"]["chunk"].content
if token:
full_response += token
yield {
"event": "token",
"data": json.dumps({"token": token}),
}
# Save the complete response
await chat_store.save_message(
conv_id, "assistant", full_response
)
yield {
"event": "done",
"data": json.dumps({
"conversation_id": conv_id,
"full_response": full_response,
}),
}
return EventSourceResponse(event_generator())
So what happens inside event_generator? The astream_events call pumps out event dicts one after another. We only care about on_chat_model_stream — that is where the actual text tokens hide. Each token gets packed into JSON and pushed as an SSE event. Once the stream dries up, we write the full reply to the database and send a final done event that carries the chat ID.
Key Insight: Call `astream_events` with `version="v2"` to get fine-grained streaming. You get every token, every tool call (start and end), and every state update. For a chat UI, filter on `on_chat_model_stream` and ignore the rest — that is all the frontend needs.
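The filtering logic is easy to isolate and test without a model. In this sketch, plain strings stand in for the real chunks (which are message objects with a `.content` attribute), but the event shapes mirror what the handler above receives:

```python
import json

def sse_token_events(events):
    """Filter astream_events-style dicts down to SSE token payloads.

    Sketch only: real chunks are AIMessageChunk objects and you would
    read chunk.content; plain strings stand in for them here.
    """
    for ev in events:
        if ev.get("event") != "on_chat_model_stream":
            continue  # skip tool calls, chain starts, state updates
        token = ev["data"]["chunk"]
        if token:  # empty chunks are dropped, same as the endpoint
            yield {"event": "token", "data": json.dumps({"token": token})}

sample = [
    {"event": "on_chain_start", "data": {}},
    {"event": "on_chat_model_stream", "data": {"chunk": "Hel"}},
    {"event": "on_tool_end", "data": {}},
    {"event": "on_chat_model_stream", "data": {"chunk": "lo"}},
]
out = list(sse_token_events(sample))
```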
Step 4 — How Do You Lock It Down with API Key Auth?
As it stands, anyone can call your API. That is a problem because every request burns real money on LLM calls. So let’s add API key checks with FastAPI’s dependency injection.
The concept is short: the client attaches an X-API-Key header to every request. The server looks it up in a set of approved keys. Key missing or wrong? Back comes a 401.
python
from fastapi.security import APIKeyHeader
API_KEYS = {
os.getenv("API_KEY", "dev-key-change-me-in-production"),
}
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)
async def verify_api_key(
api_key: str = Depends(api_key_header),
) -> str:
"""Validate the API key from the request header."""
if not api_key or api_key not in API_KEYS:
raise HTTPException(
status_code=401,
detail="Invalid or missing API key",
)
return api_key
Wire the check into the streaming endpoint by tweaking its function signature:
python
@app.post("/chat/stream")
async def chat_stream(
request: ChatRequest,
api_key: str = Depends(verify_api_key),
):
# ... same code as before
Done. Any request that shows up without a valid key gets bounced before the agent even wakes up. Zero LLM tokens wasted on rogue calls.
Warning: Never hard-code API keys in source code for a real app. Store them in env vars or a secrets manager. The default `dev-key-change-me-in-production` is on purpose — it screams “swap me out.”
Below are two handy utility routes — a health probe (open to all) and a chat history lookup (auth required).
python
@app.get("/health")
async def health():
"""Health check — no auth required."""
return {"status": "healthy"}
@app.get("/conversations/{conversation_id}")
async def get_conversation(
conversation_id: str,
api_key: str = Depends(verify_api_key),
):
"""Retrieve full conversation history."""
messages = await chat_store.get_messages(conversation_id)
return {
"conversation_id": conversation_id,
"messages": [
{"role": "user" if isinstance(m, HumanMessage) else "assistant",
"content": m.content}
for m in messages
],
}
Step 5 — How Do You Build a Chat Frontend?
No React. No build step. Just plain HTML, CSS, and JavaScript packed into a single file. It hooks into the SSE endpoint and paints tokens onto the screen as they land.
The JavaScript sends a fetch POST to /chat/stream and reads the response body as a stream. Every token event gets glued onto the current chat bubble. The done event stashes the chat ID so the next message can keep the same thread.
html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>AI Chat</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: system-ui, sans-serif; background: #f5f5f5; }
.chat-container { max-width: 700px; margin: 2rem auto; }
.messages { background: white; border-radius: 12px;
padding: 1.5rem; min-height: 400px;
max-height: 600px; overflow-y: auto; }
.message { margin: 0.75rem 0; padding: 0.75rem 1rem;
border-radius: 8px; max-width: 80%; }
.user { background: #007bff; color: white;
margin-left: auto; text-align: right; }
.assistant { background: #e9ecef; }
.input-area { display: flex; gap: 0.5rem; margin-top: 1rem; }
input { flex: 1; padding: 0.75rem; border-radius: 8px;
border: 1px solid #ddd; font-size: 1rem; }
button { padding: 0.75rem 1.5rem; border-radius: 8px;
background: #007bff; color: white; border: none;
cursor: pointer; font-size: 1rem; }
</style>
</head>
<body>
<div class="chat-container">
<h2>AI Chat Assistant</h2>
<div class="messages" id="messages"></div>
<div class="input-area">
<input type="text" id="userInput"
placeholder="Type a message..." />
<button onclick="sendMessage()">Send</button>
</div>
</div>
<!-- JavaScript follows in the next block -->
</body>
</html>
The HTML is stripped down — a message panel and a text input. CSS handles the chat bubble look. All the heavy lifting sits in the script below.
The sendMessage function drives the full cycle: POST the user’s text, create a blank assistant bubble, pull the stream chunk by chunk, parse out SSE events, and paste each token into the bubble. The effect is text that grows in front of your eyes, just like ChatGPT.
javascript
<script>
const API_KEY = "dev-key-change-me-in-production";
const API_URL = "http://localhost:8000";
let conversationId = null;
async function sendMessage() {
const input = document.getElementById("userInput");
const message = input.value.trim();
if (!message) return;
appendMessage("user", message);
input.value = "";
const assistantDiv = appendMessage("assistant", "");
try {
const response = await fetch(`${API_URL}/chat/stream`, {
method: "POST",
headers: {
"Content-Type": "application/json",
"X-API-Key": API_KEY,
},
body: JSON.stringify({
message: message,
conversation_id: conversationId,
}),
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop();
for (const line of lines) {
if (line.startsWith("data:")) {
const data = JSON.parse(line.slice(5).trim());
if (data.token) {
assistantDiv.textContent += data.token;
}
if (data.conversation_id) {
conversationId = data.conversation_id;
}
}
}
}
} catch (error) {
assistantDiv.textContent = "Connection failed. Please try again.";
}
scrollToBottom();
}
function appendMessage(role, text) {
const div = document.createElement("div");
div.className = `message ${role}`;
div.textContent = text;
document.getElementById("messages").appendChild(div);
scrollToBottom();
return div;
}
function scrollToBottom() {
const el = document.getElementById("messages");
el.scrollTop = el.scrollHeight;
}
document.getElementById("userInput")
.addEventListener("keypress", (e) => {
if (e.key === "Enter") sendMessage();
});
</script>
Tip: To serve the frontend from FastAPI itself, save the HTML file as `static/index.html`, import `StaticFiles` from `fastapi.staticfiles`, and add `app.mount("/", StaticFiles(directory="static", html=True))` to your app. Open `http://localhost:8000` and it all runs from one server.
Step 6 — How Do You Run and Test the Full App?
Let’s fire it up and see all the parts working together. The folder layout looks like this:
text
fullstack-ai-app/
├── main.py # FastAPI server + LangGraph agent
├── chat_store.py # SQLite persistence
├── agent.py # Agent graph definition
├── static/
│ └── index.html # Chat frontend
├── .env # API keys (OPENAI_API_KEY, API_KEY)
└── requirements.txt # Dependencies
For this walk-through, all code lives in main.py. In a real project, you would break it into separate files.
Drop the entry point at the bottom of main.py:
python
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Start the server:
bash
python main.py
You should see:
text
INFO: Started server process
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
Give it a spin with curl. The -N flag shuts off output buffering so SSE events print as they arrive:
bash
curl -N -X POST http://localhost:8000/chat/stream \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-key-change-me-in-production" \
-d '{"message": "What is 25 * 4?"}'
You will see events stream back:
text
event: token
data: {"token": "25"}
event: token
data: {"token": " multiplied"}
event: token
data: {"token": " by"}
event: token
data: {"token": " 4"}
event: token
data: {"token": " equals"}
event: token
data: {"token": " 100"}
event: done
data: {"conversation_id": "abc-123-...", "full_response": "25 multiplied by 4 equals 100."}
Every token event carries one chunk of the answer. The closing done event hands back the chat ID so your next message can land in the same thread.
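If you want to consume this stream from Python instead of curl, the parsing side is small. A minimal sketch that pulls `data:` payloads out of a raw response body (it ignores event names, ids, and multi-line data fields, which this API never sends):

```python
import json

def parse_sse(raw: str) -> list[dict]:
    """Extract JSON payloads from 'data:' lines in a raw SSE body (minimal sketch)."""
    events = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events

# Sample shaped like the curl output above
raw = (
    "event: token\n"
    'data: {"token": "25"}\n'
    "\n"
    "event: done\n"
    'data: {"conversation_id": "abc-123", "full_response": "25..."}\n'
)
events = parse_sse(raw)
```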
What Are the Most Common Mistakes?
Mistake 1: Not awaiting async database calls
python
# Wrong — gives you a coroutine object, not data
messages = chat_store.get_messages(conv_id)
# Right — await the async call
messages = await chat_store.get_messages(conv_id)
Why it breaks: Skip the await and you end up with a coroutine object, not actual data. The agent chokes on that garbage input and the whole request fails.
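You can see the failure mode in a few lines of plain asyncio, with a stub standing in for the real `ChatStore` method:

```python
import asyncio

async def get_messages(conv_id: str) -> list[str]:
    # Stand-in for ChatStore.get_messages
    return ["hello"]

async def main():
    wrong = get_messages("abc")     # no await: a coroutine object, not data
    assert asyncio.iscoroutine(wrong)

    right = await wrong             # awaiting it yields the actual list
    assert right == ["hello"]

asyncio.run(main())
```

Passing that coroutine object into the agent's message list is what produces the garbage-input failure described above.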
Mistake 2: No error handling on the SSE stream
javascript
// Wrong — dropped links fail silently
const response = await fetch(url, options);
// read stream, no try/catch
// Right — catch failures cleanly
try {
const response = await fetch(url, options);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
// read stream
} catch (error) {
appendMessage("assistant", "Connection lost. Please retry.");
}
Why it matters: Networks drop all the time. If you skip the catch block, the UI hangs mid-reply with no way to recover.
Mistake 3: Saving messages without chat grouping
python
# Wrong — all messages in one big pile
await db.execute(
"INSERT INTO messages VALUES (?, ?, ?, ?)",
(msg_id, role, content, timestamp),
)
# Right — always tag with conversation_id
await db.execute(
"INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
(msg_id, conversation_id, role, content, timestamp),
)
Why it fails: With no chat IDs, pulling history dumps messages from every user into one pile. The agent gets a jumbled context and its replies turn into nonsense.
Key Insight: Most full-stack AI bugs hide in the plumbing, not the model. A forgotten `await`, a missing error handler, or a schema mismatch will take your app down faster than any LLM quirk.
Exercise: Add a Delete Chat Endpoint
You built storage that keeps messages by chat ID. But there is no way to delete a chat. Add a DELETE /conversations/{conversation_id} endpoint that wipes all messages for that chat and returns the count of rows removed.
Your task: Add a delete_conversation method to ChatStore and hook it to a DELETE endpoint with API key auth.
What Would Change in a Real App?
This tutorial gives you a working app. But a handful of areas need toughening before real users hit it.
Database — replace SQLite with PostgreSQL. SQLite only lets one writer through at a time. Under load, writes stack up and time out. PostgreSQL manages thousands of open links with ease.
Auth — move from fixed API keys to JWT tokens or OAuth2. Static keys cannot be scoped, cannot be rotated on a schedule, and cannot be linked to a user.
Streaming — bolt on timeouts and token caps per request. A runaway agent can keep a link open for minutes and burn through your LLM budget while you sleep.
Hosting — put the app behind nginx or Caddy with HTTPS. Set rate limits. Run gunicorn with several uvicorn workers so you can scale sideways.
Tracking — wire in LangSmith tracing to keep an eye on agent behavior. When the agent loops five times on an easy question, you want to find out fast.
Tip: LangGraph ships with its own storage through checkpointers. For live apps, look at `SqliteSaver` or `PostgresSaver` from `langgraph.checkpoint`. They take care of state saving, thread tracking, and replay so you do not have to.
When Should You NOT Use This Setup?
This stack does not suit every AI project. Be upfront about when a simpler path wins.
Plain Q&A with no tools — if your app just pings an LLM and returns text, LangGraph is overkill. A single FastAPI route with the OpenAI SDK’s stream=True flag is lighter and quicker.
High-traffic, tight-deadline APIs — LangGraph tacks on overhead for graph hops and state bookkeeping. When you need sub-100ms replies, talk to the model API directly.
Multi-tenant SaaS — this tutorial stores everything in one SQLite file. A multi-tenant product needs per-tenant walls, connection pooling, and a proper ORM. Take a look at LangGraph Cloud for managed hosting.
Summary
You now have a full-stack AI app with five layers working together: a LangGraph agent that calls tools and thinks step by step, a SQLite layer that keeps chats alive across sessions, a FastAPI server that pushes replies through SSE, an API key check that blocks bad callers, and a vanilla HTML/JS frontend that paints tokens as they arrive.
The pattern you picked up — a graph-based agent sitting behind an async API with streaming — is the same blueprint that real AI products follow at scale. The exact tech may shift (PostgreSQL, Redis, React), but the flow does not change: message in, agent work, token stream out, chat stored.
Where to go from here? Plug in a real search tool with Tavily or Brave Search. Trade SQLite for PostgreSQL. Ship it to Cloud Run inside a Docker image. Layer in LangSmith tracing to keep tabs on what the agent does. Each of those is a natural next move from the base you just built.
Frequently Asked Questions
Can I use a different LLM provider instead of OpenAI?
Yes. Swap ChatOpenAI for any LangChain-friendly model. For Anthropic, use ChatAnthropic from langchain-anthropic. For local models, try ChatOllama from langchain-ollama. The agent graph stays the same — only the LLM setup line changes.
How do I deploy this to a cloud provider?
Wrap the app in a Docker image with python:3.11-slim as the base. Set CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]. Deploy to AWS ECS, Google Cloud Run, or Railway. Put API keys in the platform’s secrets manager.
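A minimal Dockerfile along those lines might look like this — a sketch, assuming your dependencies live in `requirements.txt` and the app's entry module is `main.py`:

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first so this layer caches between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Pass `OPENAI_API_KEY` and `API_KEY` in at runtime via the platform's secrets manager rather than baking them into the image.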
Why SSE instead of WebSockets for streaming?
SSE works over plain HTTP, picks up dropped links on its own, and is simpler for server-to-client data flow. WebSockets shine when you need two-way talk — typing hints, presence, real-time teamwork. For a chat where the server pushes tokens, SSE is the easier choice.
How do I add rate limiting?
Use slowapi — install with pip install slowapi, create a Limiter, and add @limiter.limit("20/minute") to endpoints. That stops abuse without adding much code.
How long can a chat get before things slow down?
The bottleneck is the LLM’s context window, not the database. GPT-4o-mini supports 128K tokens — roughly 96,000 words. Before you hit that, set up a sliding window: load only the last N messages instead of the full history.
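The sliding window itself is a one-liner. A sketch — a production version might also summarize the dropped prefix, or count actual tokens rather than messages:

```python
def sliding_window(messages: list, max_messages: int = 20) -> list:
    """Keep only the most recent messages before handing history to the agent."""
    return messages[-max_messages:]
```

Apply it to the result of `chat_store.get_messages()` before appending the new user message, and the full history still lives in SQLite even though the agent only sees the tail.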
References
- LangGraph documentation — Streaming events from within tools.
- LangGraph documentation — Persistence and checkpointing.
- FastAPI documentation — Advanced: Server-Sent Events.
- LangChain documentation — ChatOpenAI integration.
- MDN Web Docs — Using Server-Sent Events.
- FastAPI documentation — Security tutorial.
- aiosqlite — Async interface for SQLite.
- Yao, S. et al. — ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 (2022).