LangGraph Observability and Debugging — LangSmith Tracing in Practice
Your LangGraph agent worked fine in testing. Then you deployed it. A user reports the agent “just hangs” on certain queries. Another says it gives wrong answers about pricing. You stare at the logs — nothing useful. Just “input in, output out.”
This is the observability problem. Without tracing, a LangGraph agent is a black box. You see what goes in and what comes out. But the middle — which nodes ran, what the LLM said, which tools were called — stays invisible.
LangSmith fixes this. It records every step your graph takes and lets you inspect it visually. By the end of this article, you’ll have full visibility into your agent’s behavior.
What Is LangSmith and Why Does Your Agent Need It?
LangSmith is an observability platform built by the LangChain team. It captures traces — structured records of every operation your LangGraph agent performs.
Think of it like a flight recorder for your agent. A pilot doesn’t fly blind and hope for the best. The black box records altitude, speed, and every control input. When something goes wrong, you replay the recording. LangSmith does the same thing for graph execution.
A trace contains runs. Each run is one unit of work: an LLM call, a tool invocation, or a node execution. Runs are nested — a graph run contains node runs, which contain LLM runs. This hierarchy lets you zoom from “the whole request failed” to “this specific LLM call returned nonsense.”
Prerequisites
- Python version: 3.9+
- Required libraries: langchain-openai (0.2+), langgraph (0.2+), langsmith (0.1+), python-dotenv
- Install:
pip install langchain-openai langgraph langsmith python-dotenv
- API Keys: an OpenAI API key and a LangSmith API key (sign up free at smith.langchain.com)
- Time to complete: ~30 minutes
- Previous knowledge: Comfortable with LangGraph nodes, edges, and MessagesState. This builds on the LangGraph series.
Setting Up LangSmith Tracing
Getting traces flowing takes three steps: create an account, grab an API key, and set environment variables. No SDK initialization, no decorators, no config files.
LangSmith tracing activates through environment variables. When LANGCHAIN_TRACING_V2 is "true", LangChain and LangGraph automatically send trace data. You don’t touch your graph code at all.
Create a .env file in your project root with these four variables:
# .env file — LangSmith configuration
# Get your API key from https://smith.langchain.com
OPENAI_API_KEY=your-openai-key-here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-langsmith-key-here
LANGCHAIN_PROJECT=langgraph-observability-demo
The LANGCHAIN_PROJECT variable organizes traces into named projects. I’d recommend one project per agent or per environment (dev, staging, prod). Without it, traces go to “default” — which gets cluttered fast.
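One lightweight way to enforce that convention is to derive the project name from a deployment environment variable before anything else runs. A minimal sketch — the APP_ENV variable name is an assumption here; substitute whatever your deployment already sets:

```python
import os

def project_name(agent: str) -> str:
    """Build a per-agent, per-environment LangSmith project name."""
    # APP_ENV is a hypothetical variable (could be ENV, DEPLOY_ENV, ...).
    # Defaults to "dev" when unset.
    env = os.environ.get("APP_ENV", "dev")
    return f"{agent}-{env}"

# Must run before the first traced invocation
os.environ["LANGCHAIN_PROJECT"] = project_name("my-agent")
print(os.environ["LANGCHAIN_PROJECT"])
```

With APP_ENV=prod the same code writes traces to my-agent-prod, so dev experiments never pollute production data.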
Here’s the base graph we’ll trace throughout this article. A simple chatbot with a search tool — enough to generate interesting trace data.
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
load_dotenv()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
We define a simple search tool. The LLM decides whether to call it based on the user’s question.
@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    results = {
        "weather": "Current weather: 22C, partly cloudy",
        "news": "Top story: AI advances in 2026",
        "python": "Python 3.13 released with faster interpreter",
    }
    for key, value in results.items():
        if key in query.lower():
            return value
    return f"Search results for: {query}"
The graph wires two nodes together: an LLM node (which can call tools) and a tool executor. The should_use_tool function routes based on whether the LLM’s response contains tool calls.
tools = [search_web]
llm_with_tools = llm.bind_tools(tools)
def agent_node(state: MessagesState):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

def should_use_tool(state: MessagesState):
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"
    return END
graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges(
    "agent", should_use_tool, ["tools", END]
)
graph.add_edge("tools", "agent")
app = graph.compile()
Every invocation of app now sends a trace to LangSmith automatically. Zero extra code.
Your First Trace — Running and Inspecting
What does a trace actually look like? Run a simple query and find out.
result = app.invoke(
    {"messages": [HumanMessage(content="What's the weather?")]}
)
print(result["messages"][-1].content)
Open smith.langchain.com and navigate to your project. You’ll see a trace hierarchy like this:
RunnableSequence (root trace)
|-- agent (node run)
| |-- ChatOpenAI (LLM run) -> tool_calls: [search_web]
|-- tools (node run)
| |-- search_web (tool run)
|-- agent (node run)
|-- ChatOpenAI (LLM run) -> final response
This hierarchy tells you exactly what happened. The agent ran first. The LLM decided to call search_web. The tool node executed the search. Then the agent ran again to produce the final answer.
Click any run to inspect its details: input messages, output, token counts, latency, and the full prompt. You’re no longer guessing what your agent did. You’re seeing it.
Adding Metadata and Tags to Traces
Raw traces are useful. Tagged traces are searchable. When you’re debugging an issue from a specific user, you don’t want to scroll through thousands of traces.
LangSmith supports two filtering mechanisms: metadata (key-value pairs) and tags (labels). You add them through the config parameter.
result = app.invoke(
    {"messages": [HumanMessage(content="Latest Python news?")]},
    config={
        "metadata": {
            "user_id": "user_42",
            "session_id": "abc-123",
            "environment": "production",
        },
        "tags": ["production", "search-query"],
    },
)
print(result["messages"][-1].content)
In the LangSmith UI, you can now filter by user_id=user_42 or by the production tag. When a user reports a problem, pull up their traces by ID and see exactly what went wrong.
I’d suggest always including user_id and session_id in production. The cost is a few extra bytes per trace. The debugging value is enormous.
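A small helper keeps that discipline from depending on memory. This is a hypothetical wrapper, not a LangSmith API — it just builds the same config dict shown above:

```python
def trace_config(user_id: str, session_id: str, *tags: str) -> dict:
    """Build a LangGraph config that stamps every trace with the
    identifiers you'll filter on later in LangSmith."""
    return {
        "metadata": {"user_id": user_id, "session_id": session_id},
        "tags": list(tags),
    }

# Usage: app.invoke({"messages": [...]}, config=trace_config("user_42", "abc-123", "production"))
cfg = trace_config("user_42", "abc-123", "production")
print(cfg)
```

Route every production invocation through a helper like this and no trace ships without its identifiers.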
Selective Tracing — Control What Gets Recorded
Tracing every invocation in production generates a lot of data. For an agent handling thousands of requests per hour, that’s expensive and noisy.
LangSmith provides tracing_context — a context manager that enables or disables tracing for specific blocks.
from langsmith import tracing_context
# Trace this invocation
with tracing_context(enabled=True):
    result = app.invoke(
        {"messages": [HumanMessage(content="AI news")]}
    )

# Skip tracing for this one
with tracing_context(enabled=False):
    result = app.invoke(
        {"messages": [HumanMessage(content="Hello!")]}
    )
Three scenarios where you’d use this. First, sample only 10% of requests to control costs. Second, trace only requests from beta users. Third, skip health checks and internal pings that add noise.
Here’s a practical sampling function:
import random
from langsmith import tracing_context
def invoke_with_sampling(app, messages, sample_rate=0.1):
    """Trace only a percentage of requests."""
    should_trace = random.random() < sample_rate
    with tracing_context(enabled=should_trace):
        return app.invoke({"messages": messages})
Debugging a Failing Agent — A Walkthrough
Here’s where tracing pays for itself. Let me walk you through a debugging scenario you’ll hit in practice.
Users report the agent sometimes says “I can’t search for that” when asked about current events. The agent should call search_web, but it doesn’t. How do you find the problem?
Step 1: Reproduce and tag the failing request.
failing_result = app.invoke(
    {"messages": [HumanMessage(
        content="What happened in tech today?"
    )]},
    config={
        "metadata": {"issue": "no-tool-call"},
        "tags": ["debug", "tool-call-issue"],
    },
)
print(failing_result["messages"][-1].content)
Step 2: Find the trace in LangSmith.
Filter by the tag tool-call-issue. Click the trace. If the agent node ran but no tool run appears underneath, the LLM chose not to call the tool.
Step 3: Inspect the LLM run.
Click the ChatOpenAI run inside the agent node. Check three things:
- Input: Are the tool definitions present in the messages?
- Output: Does the response include a tool_calls field?
- Token count: Is the context near its limit? Truncated tool definitions cause silent failures.
Step 4: Check the tool description.
Expand the tools parameter in the LLM input. A vague description like “search stuff” gives the LLM weak signal. The fix is usually better tool descriptions.
Programmatic Trace Access with the LangSmith SDK
The UI is great for one-off debugging. But when you need patterns across hundreds of traces, you need the SDK.
The langsmith package lets you query traces programmatically. Here’s how to pull recent traces and summarize them.
from langsmith import Client
client = Client()
traces = list(client.list_runs(
    project_name="langgraph-observability-demo",
    execution_order=1,
    limit=20,
))

for trace in traces[:5]:
    print(f"Run: {trace.name}")
    print(f"  Status: {trace.status}")
    print(f"  Latency: {trace.latency_ms}ms")
    print(f"  Tokens: {trace.total_tokens}")
    print()
The execution_order=1 filter returns only top-level runs — full graph executions, not individual nodes.
Want to find error traces for a specific user? Filter by metadata:
error_traces = list(client.list_runs(
    project_name="langgraph-observability-demo",
    filter='and(eq(metadata_key, "user_id"), eq(metadata_value, "user_42"))',
    error=True,
    limit=10,
))

print(f"Found {len(error_traces)} error traces")
for t in error_traces:
    print(f"  {t.name}: {t.error[:80]}")
Monitoring in Production — Dashboards and Alerts
Tracing individual requests is debugging. Monitoring aggregate patterns is observability. These are different activities, and you need both.
LangSmith provides custom dashboards that track metrics across all traces. Here are the key numbers to watch:
| Metric | What It Tells You | When to Worry |
|---|---|---|
| P50 / P99 latency | Typical and worst-case response time | P99 above 10 seconds |
| Error rate | Share of traces that fail | Above 2% sustained |
| Token usage | Cost per request | Sudden spikes (agent loops) |
| Tool call rate | How often tools are used | Drops suggest prompt drift |
| Feedback score | User satisfaction | Trending downward |
A sudden spike in P99 latency? Could be the LLM provider. A drop in tool call rate? Your latest prompt change probably broke tool selection. These patterns are invisible without dashboards.
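If you prefer to compute these numbers yourself, the latency figures fall straight out of the runs the SDK returns. A minimal sketch using a hard-coded list so it runs standalone — in practice latencies_ms would be collected from client.list_runs():

```python
from statistics import quantiles

def latency_percentiles(latencies_ms: list) -> dict:
    """Compute the P50/P99 numbers from per-run latencies in ms."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p99": cuts[98]}

# Stand-in data; one 12-second outlier dominates the tail
sample = [120, 140, 150, 160, 180, 200, 250, 300, 900, 12000]
stats = latency_percentiles(sample)
print(f"P50={stats['p50']:.0f}ms  P99={stats['p99']:.0f}ms")
```

Note how one slow request barely moves P50 but blows up P99 — which is exactly why the table tracks both.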
Collecting User Feedback on Traces
Tracing shows you what happened. Feedback tells you whether the result was good. Connecting the two is where real insight lives.
LangSmith lets you attach feedback scores to any trace. You pass a run_id when invoking the graph, then use that ID to record feedback later.
import uuid
from langsmith import Client
run_id = str(uuid.uuid4())
result = app.invoke(
    {"messages": [HumanMessage(content="What's the weather?")]},
    config={"run_id": run_id},
)
print(result["messages"][-1].content)
After the user rates the response, attach that rating to the trace:
client = Client()
client.create_feedback(
run_id=run_id,
key="user_rating",
score=0.0,
comment="Wrong location — showed NYC instead of London",
)
Filter traces by low feedback scores and look for patterns. Do failures cluster around specific tools? Specific query types? This turns individual complaints into actionable improvements.
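That pattern search is easy to script. A sketch over hand-written rows — in practice you would pull feedback from the LangSmith SDK, and the "tool" field here is a hypothetical attribute you would extract from each associated trace:

```python
from collections import defaultdict

def low_score_patterns(feedback_rows, threshold=0.5):
    """Count low-scoring feedback per tool to surface failure clusters."""
    clusters = defaultdict(int)
    for row in feedback_rows:
        if row["score"] < threshold:
            clusters[row["tool"]] += 1
    return dict(clusters)

rows = [
    {"score": 0.0, "tool": "search_web"},
    {"score": 1.0, "tool": "search_web"},
    {"score": 0.2, "tool": "calculator"},
    {"score": 0.1, "tool": "search_web"},
]
print(low_score_patterns(rows))  # {'search_web': 2, 'calculator': 1}
```

A cluster like two low scores on search_web points you at one tool's description or output format instead of a vague "users are unhappy".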
When NOT to Use LangSmith
LangSmith isn’t the right choice for every situation. Here are three cases where you should consider alternatives.
You need full data sovereignty. LangSmith is a hosted service. Your traces — including user messages and LLM responses — go to LangSmith’s servers. If compliance rules prohibit third-party data sharing, consider Langfuse (self-hosted) or a custom OpenTelemetry pipeline.
You’re not using LangChain or LangGraph. LangSmith works with any framework via @traceable, but the zero-config experience only applies to LangChain/LangGraph. With other frameworks, manual instrumentation is required. Langfuse or Arize Phoenix have broader native integrations.
You need deep ML model observability. LangSmith focuses on application-level tracing — prompts, completions, tool calls. For embedding drift, feature importance, or model performance metrics, tools like Arize Phoenix or Weights & Biases are better suited.
LangSmith vs. Alternatives — Quick Comparison
| Feature | LangSmith | Langfuse | Phoenix (Arize) |
|---|---|---|---|
| LangGraph integration | Native, zero-config | Callback-based | Callback-based |
| Setup effort | 3 env variables | SDK init + callbacks | SDK init + config |
| Self-hosted option | Enterprise plans only | Yes (open source) | Yes (open source) |
| Evaluation framework | Built-in | Built-in | Limited |
| Pricing | Free tier + paid | Open source + cloud | Open source + cloud |
| Best for | LangGraph users | Framework-agnostic teams | ML-heavy teams |
If you’re building with LangGraph, LangSmith gives you the most with the least effort. If you need self-hosting or work across multiple frameworks, Langfuse is the strongest open-source alternative.
Common Mistakes and How to Fix Them
Mistake 1: Forgetting LANGCHAIN_TRACING_V2
❌ Wrong:
# .env missing the tracing flag
LANGCHAIN_API_KEY=your-key-here
LANGCHAIN_PROJECT=my-project
Why it breaks: Without LANGCHAIN_TRACING_V2=true, no traces are sent. The API key and project name alone aren’t enough.
✅ Correct:
# .env — all required variables
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-key-here
LANGCHAIN_PROJECT=my-project
OPENAI_API_KEY=your-openai-key
Mistake 2: Dumping everything into one project
❌ Wrong: Letting all traces go to the “default” project.
Why it’s a problem: After a week, you have thousands of mixed traces from different agents and experiments. Finding anything becomes impossible.
✅ Correct: Separate projects by agent and environment:
LANGCHAIN_PROJECT=my-agent-dev # Development
LANGCHAIN_PROJECT=my-agent-prod # Production
Mistake 3: Full tracing in high-traffic production
❌ Wrong: LANGCHAIN_TRACING_V2=true with no sampling at 10,000 requests per hour.
Why it hurts: Each trace includes full message content and tool I/O. At that volume, costs add up fast.
✅ Correct: Sample a percentage of requests:
import random
from langsmith import tracing_context
sample_rate = 0.05 # 5% of requests
with tracing_context(
    enabled=random.random() < sample_rate
):
    result = app.invoke({"messages": messages})
Exercise 1: Add Tracing Metadata
Your turn. Invoke the graph with proper metadata so you can find the trace later.
{
"type": "exercise",
"id": "trace-debug-ex1",
"title": "Exercise 1: Add Tracing Metadata to a Graph Invocation",
"difficulty": "advanced",
"exerciseType": "write",
"instructions": "Invoke the `app` graph with the message 'Search for Python news'. Include metadata with user_id='test_user' and session_id='session_001', plus a tag called 'exercise'. Print the final response content.",
"starterCode": "# Invoke the graph with metadata and tags\nresult = app.invoke(\n {\"messages\": [HumanMessage(content=\"Search for Python news\")]},\n config={\n # Add metadata and tags here\n },\n)\n# Print the final message content\nprint(result[\"messages\"][-1].content)",
"testCases": [
{"id": "tc1", "input": "print('DONE')", "expectedOutput": "DONE", "description": "Code executes without errors"},
{"id": "tc2", "input": "print(type(result).__name__)", "expectedOutput": "dict", "description": "Result is a dict"}
],
"hints": [
"The config dict takes 'metadata' (a dict of key-value pairs) and 'tags' (a list of strings)",
"Full config: config={'metadata': {'user_id': 'test_user', 'session_id': 'session_001'}, 'tags': ['exercise']}"
],
"solution": "result = app.invoke(\n {\"messages\": [HumanMessage(content=\"Search for Python news\")]},\n config={\n \"metadata\": {\"user_id\": \"test_user\", \"session_id\": \"session_001\"},\n \"tags\": [\"exercise\"],\n },\n)\nprint(result[\"messages\"][-1].content)",
"solutionExplanation": "The config dictionary accepts 'metadata' (key-value pairs for filtering) and 'tags' (labels for categorization). Both attach to the trace in LangSmith, making it searchable.",
"xpReward": 15
}
Exercise 2: Query Traces Programmatically
{
"type": "exercise",
"id": "trace-query-ex2",
"title": "Exercise 2: List and Summarize Recent Traces",
"difficulty": "advanced",
"exerciseType": "write",
"instructions": "Using the LangSmith Client, list the 5 most recent top-level traces from the project 'langgraph-observability-demo'. Print each trace's name, status, and latency.",
"starterCode": "from langsmith import Client\n\nclient = Client()\n\n# List recent top-level traces\ntraces = list(client.list_runs(\n # Add parameters here\n))\n\nfor trace in traces:\n # Print name, status, latency\n pass\n\nprint('DONE')",
"testCases": [
{"id": "tc1", "input": "print('DONE')", "expectedOutput": "DONE", "description": "Code runs without errors"}
],
"hints": [
"Use execution_order=1 for top-level runs and limit=5",
"Full call: client.list_runs(project_name='langgraph-observability-demo', execution_order=1, limit=5)"
],
"solution": "from langsmith import Client\n\nclient = Client()\n\ntraces = list(client.list_runs(\n project_name='langgraph-observability-demo',\n execution_order=1,\n limit=5,\n))\n\nfor trace in traces:\n print(f\"{trace.name} | {trace.status} | {trace.latency_ms}ms\")\n\nprint('DONE')",
"solutionExplanation": "Client.list_runs() queries LangSmith's API. execution_order=1 returns top-level runs only. Each run has name, status, latency_ms, total_tokens, and error attributes.",
"xpReward": 20
}
Summary
You’ve gone from flying blind to full visibility into your LangGraph agents. Here’s what you can do now:
- Set up tracing with four environment variables — no code changes
- Inspect traces in the LangSmith UI, drilling into LLM calls and tool runs
- Tag and filter traces using metadata for fast debugging
- Control costs with selective tracing and sampling
- Debug failures by walking the trace hierarchy
- Query traces programmatically for pattern analysis
- Monitor production with dashboards tracking latency, errors, and tokens
- Collect feedback and connect user satisfaction to specific traces
Observability isn’t optional for production agents. You can’t fix what you can’t see. Start with full tracing in development. Switch to sampled tracing when you deploy.
Practice Exercise
Build a trace analysis script that flags anomalous runs
**Task:** Query your LangSmith project for traces from the last hour. Flag any trace with latency above 5 seconds or token count above 2000. Print a summary.
from langsmith import Client
from datetime import datetime, timedelta, timezone
client = Client()
one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
traces = list(client.list_runs(
    project_name="langgraph-observability-demo",
    execution_order=1,
    start_time=one_hour_ago,
))

flagged = []
for trace in traces:
    reasons = []
    if trace.latency_ms and trace.latency_ms > 5000:
        reasons.append(f"High latency: {trace.latency_ms}ms")
    if trace.total_tokens and trace.total_tokens > 2000:
        reasons.append(f"High tokens: {trace.total_tokens}")
    if reasons:
        flagged.append((trace, reasons))

print(f"Checked {len(traces)} traces. Flagged {len(flagged)}.")
for trace, reasons in flagged:
    print(f"  {trace.id}: {', '.join(reasons)}")
In production, you’d send flagged traces to Slack or PagerDuty instead of printing them.
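Posting to Slack is a one-liner once the payload exists. This sketch only builds the incoming-webhook JSON; the webhook URL and the actual HTTP POST (via urllib.request or requests) are left out so it stays side-effect free:

```python
import json

def slack_alert_payload(flagged) -> str:
    """Format flagged (trace_id, reasons) pairs as a Slack
    incoming-webhook message body."""
    lines = [f"- {trace_id}: {', '.join(reasons)}" for trace_id, reasons in flagged]
    return json.dumps({"text": "Anomalous traces:\n" + "\n".join(lines)})

# Stand-in data in the shape the practice script produces
payload = slack_alert_payload([
    ("run-123", ["High latency: 8200ms"]),
    ("run-456", ["High tokens: 3100"]),
])
print(payload)
```

Wiring this to the flagged list from the script above turns the anomaly check into a real alert.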
Complete Code
Click to expand the full script (copy-paste and run)
# Complete code from: LangGraph Observability and Debugging
# Requires: pip install langchain-openai langgraph langsmith python-dotenv
# Python 3.9+
# Needs: OPENAI_API_KEY, LANGCHAIN_API_KEY in .env
import os
import uuid
import random
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
from langsmith import Client, tracing_context
load_dotenv()
# --- Define Tools ---
@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    results = {
        "weather": "Current weather: 22C, partly cloudy",
        "news": "Top story: AI advances in 2026",
        "python": "Python 3.13 released with faster interpreter",
    }
    for key, value in results.items():
        if key in query.lower():
            return value
    return f"Search results for: {query}"
# --- Build the Graph ---
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
tools = [search_web]
llm_with_tools = llm.bind_tools(tools)
def agent_node(state: MessagesState):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

def should_use_tool(state: MessagesState):
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"
    return END
graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges(
    "agent", should_use_tool, ["tools", END]
)
graph.add_edge("tools", "agent")
app = graph.compile()
# --- Basic Trace ---
result = app.invoke(
    {"messages": [HumanMessage(content="What's the weather?")]}
)
print("Basic:", result["messages"][-1].content)
# --- Trace with Metadata ---
result = app.invoke(
    {"messages": [HumanMessage(content="Python news?")]},
    config={
        "metadata": {"user_id": "user_42", "session_id": "abc-123"},
        "tags": ["production", "search-query"],
    },
)
print("Tagged:", result["messages"][-1].content)
# --- Selective Tracing ---
with tracing_context(enabled=True):
    result = app.invoke(
        {"messages": [HumanMessage(content="AI news")]}
    )
print("Selective:", result["messages"][-1].content)
# --- Feedback Collection ---
run_id = str(uuid.uuid4())
result = app.invoke(
    {"messages": [HumanMessage(content="Weather?")]},
    config={"run_id": run_id},
)
client = Client()
client.create_feedback(
    run_id=run_id, key="user_rating",
    score=1.0, comment="Accurate response",
)
print("Feedback attached to:", run_id)
# --- Programmatic Query ---
traces = list(client.list_runs(
    project_name="langgraph-observability-demo",
    execution_order=1, limit=5,
))
print(f"\nRecent traces ({len(traces)}):")
for t in traces:
    print(f"  {t.name} | {t.status} | {t.latency_ms}ms")
print("\nScript completed successfully.")
Frequently Asked Questions
Is LangSmith free to use?
LangSmith offers a free tier with a generous trace allowance per month. For most development and small-scale production, it’s enough. Paid plans add higher volumes, longer retention, and team features. Check smith.langchain.com for current pricing.
Can I use LangSmith with frameworks other than LangChain?
Yes. The @traceable decorator from the langsmith package instruments any Python function. The zero-config experience (just environment variables) only works with LangChain and LangGraph. Other frameworks need manual instrumentation.
from langsmith import traceable

@traceable
def my_custom_function(input_text: str) -> str:
    return f"Processed: {input_text}"
How do I trace async LangGraph invocations?
Async tracing works the same as sync. The environment variables activate tracing for both app.invoke() and app.ainvoke(). No extra configuration needed.
import asyncio

async def main():
    result = await app.ainvoke(
        {"messages": [HumanMessage(content="Hello!")]},
        config={"tags": ["async-test"]},
    )
    print(result["messages"][-1].content)

asyncio.run(main())
Does tracing add latency to my agent?
LangSmith sends trace data asynchronously in the background. The overhead is typically under 1 millisecond per run. Trace data is batched and sent after invocation completes, so it doesn’t block your agent’s response.