LangGraph Debugging with LangSmith Tracing Guide
Master LangGraph debugging with LangSmith tracing — follow step-by-step examples to inspect agent runs, tag traces, and monitor production performance.
When your LangGraph agent breaks in production, LangSmith tracing lets you replay every step it took and pinpoint exactly what went wrong.
Your agent passed every test you threw at it. Then it went live. Now a user says it “just hangs” on some inputs. Another reports wrong answers about pricing. You check the logs and find nothing helpful — just a request going in and a reply coming out, with no clue about what happened in between.
That gap between input and output is the whole problem. A LangGraph agent runs through nodes, calls tools, talks to an LLM, and shuffles state around. Without a window into that chain, you’re stuck guessing. Which node ran? What did the LLM say? Did the right tool fire?
LangSmith gives you that window. It captures every step your graph takes and lays it out in a clean visual timeline. By the end of this guide, you’ll know how to trace, tag, filter, and debug your agents with full clarity.
What Is LangSmith, and Why Do Your Agents Need It?
LangSmith is a tracing tool made by the LangChain team. It records traces — logs of every action your LangGraph agent takes.
Think of it as a black box for your AI. Pilots don’t fly blind. The recorder logs altitude, speed, and every input. When things go wrong, you replay the tape. LangSmith works the same way for graph runs.
Each trace holds runs. A run is one chunk of work — one LLM call, one tool call, or one node step. Runs nest inside each other: a graph run holds node runs, and each node run holds LLM or tool runs. This lets you zoom from “the whole request failed” all the way down to “this one LLM call gave bad output.”
Key Insight: Every trace maps to your graph. Each node shows up as a run, every edge is visible, and you can view the exact state that moved between nodes.
Prerequisites
- Python version: 3.9+
- Required libraries: langchain-openai (0.2+), langgraph (0.2+), langsmith (0.1+), python-dotenv
- Install: `pip install langchain-openai langgraph langsmith python-dotenv`
- API keys: an OpenAI API key and a LangSmith API key (sign up free at smith.langchain.com)
- Time to complete: ~30 minutes
- Previous knowledge: comfortable with LangGraph nodes, edges, and MessagesState. This builds on the LangGraph series.
How Do You Turn On LangSmith Tracing?
Three steps and you’re done: make an account, copy your API key, and add four lines to a .env file. No code changes. No setup calls. No config files.
It all works through env vars. When LANGCHAIN_TRACING_V2 is set to "true", LangChain and LangGraph start sending trace data on their own. You don’t touch your graph code at all.
Here’s the .env file you need in your project root:
```
# .env file — LangSmith configuration
# Get your API key from https://smith.langchain.com
OPENAI_API_KEY=your-openai-key-here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-langsmith-key-here
LANGCHAIN_PROJECT=langgraph-observability-demo
```
A note on LANGCHAIN_PROJECT: it sorts your traces into named groups. I like one group per agent per stage — like weather-bot-dev or support-agent-prod. Leave it blank and everything piles into “default,” which gets messy fast.
Now let’s build the graph we’ll trace in this guide. It’s a chatbot with a search tool — just enough parts to make useful trace data.
```python
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode

load_dotenv()

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
```
Below is a mock search tool. The LLM decides whether to call it based on the user’s input.
```python
@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    results = {
        "weather": "Current weather: 22C, partly cloudy",
        "news": "Top story: AI advances in 2026",
        "python": "Python 3.13 released with faster interpreter",
    }
    for key, value in results.items():
        if key in query.lower():
            return value
    return f"Search results for: {query}"
```
The graph links two nodes: an LLM node that can ask for tools, and a tool runner that does the work. The routing function should_use_tool checks if the LLM’s latest reply has any tool calls.
```python
tools = [search_web]
llm_with_tools = llm.bind_tools(tools)

def agent_node(state: MessagesState):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

def should_use_tool(state: MessagesState):
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"
    return END

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_use_tool, ["tools", END])
graph.add_edge("tools", "agent")

app = graph.compile()
```
From this point on, every call to app ships a trace to LangSmith with zero extra code.
What Does a Trace Look Like in Practice?
Let’s run a query and see.
```python
result = app.invoke(
    {"messages": [HumanMessage(content="What's the weather?")]}
)
print(result["messages"][-1].content)
```
Head over to smith.langchain.com and open your project. A trace tree like this will greet you:
```
RunnableSequence (root trace)
|-- agent (node run)
|   |-- ChatOpenAI (LLM run) -> tool_calls: [search_web]
|-- tools (node run)
|   |-- search_web (tool run)
|-- agent (node run)
    |-- ChatOpenAI (LLM run) -> final response
```
Read it top to bottom and the full story unfolds. The agent node went first. The LLM asked for search_web. The tools node ran it and got a result. Then the agent node fired again, and the LLM used that result to craft the final answer.
Click any run to see its details: inputs, outputs, token counts, timing, and the full prompt. No more guessing — you’re watching a replay.
Tip: Open the LLM run to read the exact prompt that went to OpenAI. You’ll see system messages, the full chat history, and tool definitions. When your agent misbehaves, the prompt is the first place to look.
How Do You Make Traces Easy to Find Later?
A raw list of traces works when you have ten. When you have ten thousand, you need filters. LangSmith gives you two: metadata (key-value pairs you attach) and tags (short labels). Both go in the config dict.
```python
result = app.invoke(
    {"messages": [HumanMessage(content="Latest Python news?")]},
    config={
        "metadata": {
            "user_id": "user_42",
            "session_id": "abc-123",
            "environment": "production",
        },
        "tags": ["production", "search-query"],
    },
)
print(result["messages"][-1].content)
```
Now in the LangSmith UI you can filter by user_id=user_42 or click the production tag. Next time a user files a bug, pull up their traces by user ID and walk through every step.
My advice: always add user_id and session_id in any live system. The cost is a few bytes per trace. The debug value is huge.
Warning: Don’t put private data in tags or metadata. No emails, no API keys, no passwords. Use opaque IDs instead. LangSmith stores traces, and anyone with project access can read them.
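One way to keep identifiers opaque is to hash them before attaching them as metadata. Here's a minimal sketch — the salt value and the 16-character truncation are illustrative choices, not a LangSmith requirement:

```python
import hashlib

def opaque_id(raw: str, salt: str = "trace-salt") -> str:
    """Derive a stable, non-reversible identifier from a raw value.

    The salt here is a placeholder -- in a real system, load it from a
    secret store so known inputs can't be brute-forced into matches.
    """
    return hashlib.sha256((salt + raw).encode()).hexdigest()[:16]

# The same input always maps to the same opaque ID, so a user's
# traces stay linkable without storing the email itself.
uid = opaque_id("alice@example.com")
print(uid)
```

You'd then pass `{"user_id": opaque_id(email)}` in the metadata dict instead of the raw address.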
How Do You Control Which Requests Get Traced?
Tracing every call in a busy service creates a lot of data. At thousands of calls per hour, costs rise and noise drowns out the signal.
LangSmith gives you tracing_context — a block that flips tracing on or off for any chunk of code.
```python
from langsmith import tracing_context

# Trace this invocation
with tracing_context(enabled=True):
    result = app.invoke(
        {"messages": [HumanMessage(content="AI news")]}
    )

# Skip tracing for this one
with tracing_context(enabled=False):
    result = app.invoke(
        {"messages": [HumanMessage(content="Hello!")]}
    )
```
When would you use this? Three cases. First, sample 10% of traffic to cut costs. Second, only trace calls from beta users. Third, skip health checks and pings that just add noise.
Here’s a reusable sampling wrapper:
```python
import random
from langsmith import tracing_context

def invoke_with_sampling(app, messages, sample_rate=0.1):
    """Trace only a fraction of requests."""
    should_trace = random.random() < sample_rate
    with tracing_context(enabled=should_trace):
        return app.invoke({"messages": messages})
```
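The other two cases — beta users and health checks — follow the same pattern. Here's a hedged sketch; `BETA_USERS` and `NOISE_PATHS` are placeholder sets you'd swap for your own user store and route list, and the import falls back to a no-op so the sketch runs even without langsmith installed:

```python
from contextlib import nullcontext

try:
    from langsmith import tracing_context
except ImportError:
    # Fallback no-op so this sketch runs standalone without langsmith
    def tracing_context(enabled=True):
        return nullcontext()

# Placeholder sets -- swap in your own user store and route list
BETA_USERS = {"user_42", "user_77"}
NOISE_PATHS = {"/healthz", "/ping"}

def invoke_for_request(app, messages, user_id, path):
    """Trace beta users; never trace health checks or pings."""
    if path in NOISE_PATHS:
        enabled = False                   # noise endpoints: never trace
    else:
        enabled = user_id in BETA_USERS   # beta users only
    with tracing_context(enabled=enabled):
        return app.invoke({"messages": messages})
```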
How Do You Track Down a Bug Using Traces?
This is where the setup pays off. Let me walk through a real bug hunt you’ll likely face yourself.
Users report the agent sometimes says “I can’t search for that” when asked about news. It should call search_web, but it doesn’t. How do you find the cause?
Step 1: Run the failing case and tag it.
```python
failing_result = app.invoke(
    {"messages": [HumanMessage(
        content="What happened in tech today?"
    )]},
    config={
        "metadata": {"issue": "no-tool-call"},
        "tags": ["debug", "tool-call-issue"],
    },
)
print(failing_result["messages"][-1].content)
```
Step 2: Pull up the trace in LangSmith.
Filter by the tool-call-issue tag. Open the trace. If the agent node ran but there’s no tool run nested under it, that tells you the LLM chose not to invoke the tool.
Step 3: Dig into the LLM run.
Click the ChatOpenAI run inside the agent node. Check three things:
- Input: Are the tool specs in the messages?
- Output: Does the reply have a `tool_calls` field?
- Token count: Is the context near its limit? Cut-off tokens can make tools vanish with no warning.
Step 4: Read the tool description.
Open the tools section of the LLM input. If the description says something vague like “search stuff,” the LLM has a weak cue. A sharper description fixes most cases like this.
Key Insight: Most “agent bugs” are really prompt bugs. If the tool description says “Search for current info” but the user asks “What happened today?”, the LLM might not link “today” with “current.” A clearer description — “Search for live data: news, weather, sports, events” — closes that gap.
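In LangChain's `@tool` decorator, the docstring is what becomes the tool description the model sees, so the fix is usually just a richer docstring. A before/after sketch — plain functions here so it runs standalone; in the article's graph you'd keep the decorator on the sharper version:

```python
# Before: a vague description -- the model may not link "today" to "current"
def search_web_vague(query: str) -> str:
    """Search for current info."""
    return f"Search results for: {query}"

# After: names the categories and trigger words users actually use
def search_web_sharp(query: str) -> str:
    """Search the web for live data: news, weather, sports, and events.

    Use for anything happening now, today, recently, or 'the latest'.
    """
    return f"Search results for: {query}"
```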
How Do You Pull Traces with Code?
The web UI is great for one-off bug hunts. But when you need trends across hundreds of traces, you need code.
The langsmith package gives you full access to your trace data. Here’s how to grab recent traces and print a quick summary:
```python
from langsmith import Client

client = Client()

traces = list(client.list_runs(
    project_name="langgraph-observability-demo",
    execution_order=1,
    limit=20,
))

for trace in traces[:5]:
    print(f"Run: {trace.name}")
    print(f"  Status: {trace.status}")
    print(f"  Latency: {trace.latency_ms}ms")
    print(f"  Tokens: {trace.total_tokens}")
    print()
```
The execution_order=1 filter gives you only top-level runs — full graph passes, not single nodes or LLM calls.
Need to find failed traces for a specific user? Filter by metadata:
```python
error_traces = list(client.list_runs(
    project_name="langgraph-observability-demo",
    filter='eq(metadata_key, "user_id")',
    is_error=True,
    limit=10,
))

print(f"Found {len(error_traces)} error traces")
for t in error_traces:
    print(f"  {t.name}: {t.error[:80]}")
```
How Do You Monitor Agents Once They’re Live?
Looking at one trace at a time is debugging. Watching trends across all traces is monitoring. They serve different jobs, and a live system needs both.
LangSmith has custom dashboards that roll up stats across your whole project. Here are the numbers worth tracking:
| Metric | What It Reveals | Red Flag |
|---|---|---|
| P50 / P99 latency | Normal and worst-case speed | P99 over 10 seconds |
| Error rate | How many traces end in failure | Stays above 2% |
| Token burn | Cost per request | Sudden jumps (stuck loops) |
| Tool call rate | How often the agent reaches for tools | Drops after a prompt change |
| User ratings | Whether people like the answers | Steady decline |
A sharp rise in P99 latency could mean the LLM host is slow. A sudden dip in tool use after a prompt change? The new wording likely confused the model. These trends stay hidden without a dashboard.
Tip: Set up token alerts from day one. The worst cost trap is an agent loop — the LLM calls a tool, reads the result, then calls the same tool again, on and on. One stuck loop burns through thousands of tokens in seconds. LangSmith spots these fast.
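You can also compute these numbers yourself from runs pulled via `client.list_runs`. Here's a sketch of the aggregation; the fetch is mocked as a plain list of dicts so the math is the focus, and the nearest-rank percentile is one simple choice among several:

```python
def summarize_runs(runs):
    """Compute P50/P99 latency and error rate from run records.

    `runs` is a list of dicts with 'latency_ms' and 'error' keys --
    in practice you'd build it from client.list_runs(...) results.
    """
    latencies = sorted(r["latency_ms"] for r in runs)

    def pct(p):
        # Nearest-rank percentile on the sorted list
        idx = min(len(latencies) - 1, int(p / 100 * len(latencies)))
        return latencies[idx]

    errors = sum(1 for r in runs if r["error"])
    return {
        "p50_ms": pct(50),
        "p99_ms": pct(99),
        "error_rate": errors / len(runs),
    }

# Mock data: 100 runs, latencies 100-199ms, every 20th run an error
sample = [{"latency_ms": 100 + i, "error": i % 20 == 0} for i in range(100)]
print(summarize_runs(sample))  # {'p50_ms': 150, 'p99_ms': 199, 'error_rate': 0.05}
```

Run a summary like this on a schedule and compare against the red-flag thresholds in the table above.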
How Do You Link User Feedback to Traces?
Traces show what took place. Feedback tells you if the result was good. The real power comes from linking the two.
LangSmith lets you pin a score to any trace. Pass a run_id when you call the graph, then use that ID to log feedback later.
```python
import uuid
from langsmith import Client

run_id = str(uuid.uuid4())

result = app.invoke(
    {"messages": [HumanMessage(content="What's the weather?")]},
    config={"run_id": run_id},
)
print(result["messages"][-1].content)
```
When the user rates the answer, tie that rating back to the trace:
```python
client = Client()
client.create_feedback(
    run_id=run_id,
    key="user_rating",
    score=0.0,
    comment="Wrong location — showed NYC instead of London",
)
```
Now sort traces by low scores and look for clusters. Do bad ratings bunch around one tool? One type of question? A certain time of day? This turns lone gripes into a clear fix list.
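Once you've pulled feedback and run metadata into Python (the fetch itself, e.g. via the LangSmith client, is assumed here), the clustering step is a simple group-and-count. A sketch over plain records:

```python
from collections import Counter

def low_score_clusters(feedback, runs_by_id, field="tool", threshold=0.5):
    """Count which value of `field` the low-rated traces cluster around.

    `feedback` items carry a run_id and score; `runs_by_id` maps run_id
    to that trace's metadata. Both would come from the LangSmith client
    in a real setup -- here they're plain dicts for illustration.
    """
    counts = Counter()
    for fb in feedback:
        if fb["score"] < threshold:
            meta = runs_by_id.get(fb["run_id"], {})
            counts[meta.get(field, "unknown")] += 1
    return counts

feedback = [
    {"run_id": "r1", "score": 0.0},
    {"run_id": "r2", "score": 1.0},
    {"run_id": "r3", "score": 0.0},
]
runs_by_id = {
    "r1": {"tool": "search_web"},
    "r2": {"tool": "search_web"},
    "r3": {"tool": "search_web"},
}
print(low_score_clusters(feedback, runs_by_id))
```

If one tool or question type dominates the low-score counts, that's your fix list.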
When Is LangSmith NOT the Right Choice?
LangSmith is strong, but it’s not the right fit for every case. Here are three where you should look at other tools.
Your rules ban sharing data with third parties. LangSmith is a hosted service. Your traces — with user chats and LLM replies — go to their servers. If that’s off limits, try Langfuse (self-hosted) or roll your own with OpenTelemetry.
You don’t use LangChain or LangGraph. LangSmith works with any Python function via @traceable, but the “set env vars and done” flow only works with LangChain/LangGraph. Other tools need manual wiring. Langfuse and Arize Phoenix have wider native support.
You need deep ML model stats. LangSmith tracks the app layer — prompts, outputs, tool calls. For embedding drift, feature weights, or model curves, tools like Arize Phoenix or Weights & Biases are a better fit.
How Does LangSmith Stack Up Against Alternatives?
| Feature | LangSmith | Langfuse | Phoenix (Arize) |
|---|---|---|---|
| LangGraph integration | Native, zero-config | Callback-based | Callback-based |
| Setup effort | 3 env variables | SDK init + callbacks | SDK init + config |
| Self-hosted option | No | Yes (open source) | Yes (open source) |
| Evaluation framework | Built-in | Built-in | Limited |
| Pricing | Free tier + paid | Open source + cloud | Open source + cloud |
| Best for | LangGraph users | Framework-agnostic teams | ML-heavy teams |
Bottom line: if you build with LangGraph, LangSmith gives the most bang for the least work. If you need self-hosting or use many frameworks, Langfuse is the best open-source pick.
Common Mistakes and How to Fix Them
Mistake 1: Leaving Out the Tracing Flag
Wrong:
```
# .env missing the tracing flag
LANGCHAIN_API_KEY=your-key-here
LANGCHAIN_PROJECT=my-project
```
Why nothing shows up: The API key and project name alone don’t do it. Without LANGCHAIN_TRACING_V2=true, tracing stays off and no data gets sent.
Right:
```
# .env — all required variables
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-key-here
LANGCHAIN_PROJECT=my-project
OPENAI_API_KEY=your-openai-key
```
Mistake 2: Dumping all traces into one bucket
Wrong: Letting every trace land in the “default” project.
What goes wrong: Within a week you have a mess of traces from different agents, tests, and stages. Finding anything takes forever.
Right: Split by agent and stage:
```
LANGCHAIN_PROJECT=my-agent-dev   # Development
LANGCHAIN_PROJECT=my-agent-prod  # Production
```
Mistake 3: Tracing everything at scale
Wrong: LANGCHAIN_TRACING_V2=true with no sampling and 10,000 requests flowing through each hour.
What goes wrong: Each trace holds the full text of every message and tool reply. At scale, costs pile up and the view gets noisy.
Right: Sample a small slice of traffic:
```python
import random
from langsmith import tracing_context

sample_rate = 0.05  # 5% of requests

with tracing_context(enabled=random.random() < sample_rate):
    result = app.invoke({"messages": messages})
```
Exercise 1: Add Tracing Metadata
Your turn. Run the graph with proper metadata so you can search for the trace afterward.
```json
{
  "type": "exercise",
  "id": "trace-debug-ex1",
  "title": "Exercise 1: Add Tracing Metadata to a Graph Invocation",
  "difficulty": "advanced",
  "exerciseType": "write",
  "instructions": "Invoke the `app` graph with the message 'Search for Python news'. Include metadata with user_id='test_user' and session_id='session_001', plus a tag called 'exercise'. Print the final response content.",
  "starterCode": "# Invoke the graph with metadata and tags\nresult = app.invoke(\n    {\"messages\": [HumanMessage(content=\"Search for Python news\")]},\n    config={\n        # Add metadata and tags here\n    },\n)\n# Print the final message content\nprint(result[\"messages\"][-1].content)",
  "testCases": [
    {"id": "tc1", "input": "print('DONE')", "expectedOutput": "DONE", "description": "Code executes without errors"},
    {"id": "tc2", "input": "print(type(result).__name__)", "expectedOutput": "dict", "description": "Result is a dict"}
  ],
  "hints": [
    "The config dict takes 'metadata' (a dict of key-value pairs) and 'tags' (a list of strings)",
    "Full config: config={'metadata': {'user_id': 'test_user', 'session_id': 'session_001'}, 'tags': ['exercise']}"
  ],
  "solution": "result = app.invoke(\n    {\"messages\": [HumanMessage(content=\"Search for Python news\")]},\n    config={\n        \"metadata\": {\"user_id\": \"test_user\", \"session_id\": \"session_001\"},\n        \"tags\": [\"exercise\"],\n    },\n)\nprint(result[\"messages\"][-1].content)",
  "solutionExplanation": "The config dictionary accepts 'metadata' (key-value pairs for filtering) and 'tags' (labels for categorization). Both attach to the trace in LangSmith, making it searchable.",
  "xpReward": 15
}
```
Exercise 2: Query Traces Programmatically
```json
{
  "type": "exercise",
  "id": "trace-query-ex2",
  "title": "Exercise 2: List and Summarize Recent Traces",
  "difficulty": "advanced",
  "exerciseType": "write",
  "instructions": "Using the LangSmith Client, list the 5 most recent top-level traces from the project 'langgraph-observability-demo'. Print each trace's name, status, and latency.",
  "starterCode": "from langsmith import Client\n\nclient = Client()\n\n# List recent top-level traces\ntraces = list(client.list_runs(\n    # Add parameters here\n))\n\nfor trace in traces:\n    # Print name, status, latency\n    pass\n\nprint('DONE')",
  "testCases": [
    {"id": "tc1", "input": "print('DONE')", "expectedOutput": "DONE", "description": "Code runs without errors"}
  ],
  "hints": [
    "Use execution_order=1 for top-level runs and limit=5",
    "Full call: client.list_runs(project_name='langgraph-observability-demo', execution_order=1, limit=5)"
  ],
  "solution": "from langsmith import Client\n\nclient = Client()\n\ntraces = list(client.list_runs(\n    project_name='langgraph-observability-demo',\n    execution_order=1,\n    limit=5,\n))\n\nfor trace in traces:\n    print(f\"{trace.name} | {trace.status} | {trace.latency_ms}ms\")\n\nprint('DONE')",
  "solutionExplanation": "Client.list_runs() queries LangSmith's API. execution_order=1 returns top-level runs only. Each run has name, status, latency_ms, total_tokens, and error attributes.",
  "xpReward": 20
}
```
Summary
You started this post flying blind. Now you have full sight into every step your LangGraph agent takes. Here’s what you’ve learned:
- Flip on tracing with four environment variables — your graph code doesn’t change at all
- Walk through traces in the LangSmith UI, drilling from graph runs down to individual LLM calls
- Tag traces with metadata so you can find any request in seconds
- Sample traffic to keep storage costs under control in busy systems
- Debug failures by reading the trace tree from top to bottom
- Query traces with Python to spot trends across hundreds of runs
- Watch dashboards that track latency, error rates, and token spend
- Tie user feedback to specific traces so bad outcomes point you to the root cause
You can’t fix what you can’t see. Turn on full tracing while you build. Switch to sampled tracing when you go live.
Frequently Asked Questions
Is LangSmith free to use?
LangSmith has a free tier with a solid monthly trace limit. For most dev work and small live systems, it’s plenty. Paid plans add more volume, longer storage, and team features. Check smith.langchain.com for current pricing.
Can I use LangSmith with frameworks other than LangChain?
Yes. The @traceable decorator from the langsmith package wraps any Python function. But the “just set env vars” flow only works with LangChain and LangGraph. Other tools need hands-on setup.
```python
from langsmith import traceable

@traceable
def my_custom_function(input_text: str) -> str:
    return f"Processed: {input_text}"
```
How do I trace async LangGraph calls?
Async tracing works the same as sync. The env vars turn on tracing for both app.invoke() and app.ainvoke(). No extra setup needed.
```python
import asyncio

async def main():
    result = await app.ainvoke(
        {"messages": [HumanMessage(content="Hello!")]},
        config={"tags": ["async-test"]},
    )
    print(result["messages"][-1].content)

asyncio.run(main())
```
Does tracing slow down my agent?
LangSmith sends trace data in the background on its own thread. The added time is under 1ms per run. Data gets batched and sent after each call ends, so it never holds up the agent’s reply.