Error Handling, Retries, and Fallback Strategies in LangGraph
Your LangGraph agent works perfectly in development. You demo it to your team. Everyone’s impressed. Then you deploy it, and within an hour, an API times out, the LLM returns malformed JSON, and a tool throws an exception nobody anticipated. The whole graph crashes.
Sound familiar? Production agents fail. The question isn’t whether they’ll fail — it’s whether they’ll recover. This article shows you how to build LangGraph agents that handle errors gracefully, retry intelligently, and fall back to safer paths when things go wrong.
Why Error Handling Matters in Agent Systems
A traditional Python script fails at one point. You get a traceback, fix the bug, and move on. Agent systems are different — they chain multiple LLM calls, invoke external tools, and pass state between nodes.
Three things make agent error handling harder than regular error handling.
First, external dependencies are unpredictable. Your agent calls OpenAI’s API, a web search tool, and a database. Any of these can fail with rate limits, timeouts, or unexpected responses.
Second, errors compound across nodes. If node A produces bad output, node B consumes it and produces worse output. By the time you see the error, the root cause is three nodes back.
Third, some failures are recoverable. A rate limit error goes away if you wait 30 seconds. Crashing immediately wastes an opportunity to recover.
Let’s set up our environment and build error handling from the ground up.
import os
import time
import random
from typing import TypedDict, Annotated
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.types import RetryPolicy
load_dotenv()
llm = ChatOpenAI(model="gpt-4o-mini")
print("Environment ready")
Environment ready
Try/Except in Node Functions
What’s the simplest way to keep your graph alive when a node fails? Wrap the risky operation in try/except and return a meaningful error message instead of crashing.
Here’s a node that calls an external API. We’ll simulate an unreliable service that sometimes times out or refuses connections. The safe_fetch_node function catches both error types and returns a useful message the LLM can reason about.
def call_external_api(query: str) -> str:
"""Simulates an unreliable API."""
roll = random.random()
if roll < 0.3:
raise TimeoutError("API request timed out after 30s")
if roll < 0.5:
raise ConnectionError("Could not reach api.example.com")
return f"Results for '{query}': 42 matching records found"
def safe_fetch_node(state: MessagesState):
"""Fetch node with structured error handling."""
query = state["messages"][-1].content
try:
result = call_external_api(query)
return {"messages": [AIMessage(content=result)]}
except (TimeoutError, ConnectionError) as e:
error_type = type(e).__name__
return {"messages": [AIMessage(
content=f"[{error_type}] {str(e)}. Continuing with available info."
)]}
random.seed(42)
test_state = {"messages": [HumanMessage(content="test query")]}
result = safe_fetch_node(test_state)
print(result["messages"][0].content)
Results for 'test query': 42 matching records found
Notice the pattern. Each except block returns a valid state update — not a crash. The graph continues to the next node with an error message.
Quick check: what would happen if call_external_api raised a ValueError instead? That exception isn’t caught, so the node would crash. Be specific about which errors you expect.
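If you also want a safety net for exceptions you didn't anticipate, keep the specific handlers first and add a deliberate catch-all last. This is a stdlib-only sketch, with a stub standing in for call_external_api:

```python
def fetch(query: str) -> str:
    # Stub for call_external_api; raises an error type we didn't anticipate
    raise ValueError(f"Malformed response for {query!r}")

def safe_fetch(query: str) -> str:
    try:
        return fetch(query)
    except (TimeoutError, ConnectionError) as e:
        # Expected, recoverable: report and continue
        return f"[{type(e).__name__}] {e}. Continuing with available info."
    except Exception as e:
        # Unexpected: still return a usable message, but flag it for review
        return f"[Unexpected {type(e).__name__}] {e}. Flagging for review."

print(safe_fetch("test query"))
```

The ordering matters: Python tries except clauses top to bottom, so the specific handlers keep their tailored messages and only truly surprising errors hit the catch-all.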
Retry Patterns with State Counters
Sometimes the right response to a failure isn’t to give up — it’s to try again. You can implement retries manually by tracking attempt counts in your graph state.
The approach: add retry_count and max_retries to your state. The unreliable node increments the counter on each failure. A conditional edge function checks the counter and routes back to the same node (retry) or forward to a fallback (give up).
class RetryState(TypedDict):
messages: Annotated[list, lambda a, b: a + b]
retry_count: int
max_retries: int
last_error: str
def unreliable_node(state: RetryState):
"""A node that fails sometimes and tracks retries."""
try:
if random.random() < 0.6:
raise ConnectionError("Service temporarily unavailable")
return {
"messages": [AIMessage(content="Operation succeeded!")],
"retry_count": 0,
"last_error": "",
}
except ConnectionError as e:
new_count = state.get("retry_count", 0) + 1
return {
"messages": [],
"retry_count": new_count,
"last_error": str(e),
}
The conditional edge decides what happens next. If retry_count exceeds max_retries, we route to a give_up node. If last_error is set but retries remain, we loop back. Otherwise, we continue normally.
def should_retry(state: RetryState) -> str:
"""Route to retry or give up based on attempt count."""
max_retries = state.get("max_retries", 3)
if state.get("retry_count", 0) >= max_retries:
return "give_up"
if state.get("last_error"):
return "retry"
return "continue"
def give_up_node(state: RetryState):
"""Fallback when retries are exhausted."""
count = state.get("retry_count", 0)
error = state.get("last_error", "Unknown error")
return {"messages": [AIMessage(
content=f"Failed after {count} attempts. Last error: {error}"
)]}
builder = StateGraph(RetryState)
builder.add_node("attempt", unreliable_node)
builder.add_node("give_up", give_up_node)
builder.add_edge(START, "attempt")
builder.add_conditional_edges("attempt", should_retry, {
"retry": "attempt",
"give_up": "give_up",
"continue": END,
})
builder.add_edge("give_up", END)
graph = builder.compile()
print("Retry graph compiled")
Retry graph compiled
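Before running the graph, you can sanity-check the routing on hand-built states, since should_retry is a pure function of the state dict. A stdlib-only sketch that re-declares the function so it runs standalone:

```python
def should_retry(state: dict) -> str:
    # Same routing logic as the conditional edge above
    max_retries = state.get("max_retries", 3)
    if state.get("retry_count", 0) >= max_retries:
        return "give_up"
    if state.get("last_error"):
        return "retry"
    return "continue"

# The three states the conditional edge can encounter
print(should_retry({"retry_count": 1, "max_retries": 3, "last_error": "unavailable"}))  # retry
print(should_retry({"retry_count": 3, "max_retries": 3, "last_error": "unavailable"}))  # give_up
print(should_retry({"retry_count": 0, "max_retries": 3, "last_error": ""}))             # continue
```

Checking the order of the conditions matters: the give-up test must come first, or a lingering last_error would loop forever.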
LangGraph’s Built-In RetryPolicy
Manual retry logic gives you control, but LangGraph provides RetryPolicy for common cases. You attach it to any node and LangGraph handles retries automatically with exponential backoff.
RetryPolicy lives in langgraph.types. Here’s what each parameter controls:
| Parameter | Default | Purpose |
|---|---|---|
| initial_interval | 0.5s | Wait before first retry |
| backoff_factor | 2.0 | Multiplier after each retry |
| max_attempts | 3 | Total attempts allowed |
| max_interval | 128s | Backoff ceiling |
| jitter | True | Randomize intervals |
| retry_on | default_retry_on | Which exceptions to retry |
Attaching a policy is one line. Pass retry=RetryPolicy(...) when calling add_node().
def flaky_api_node(state: MessagesState):
"""Node that calls an unreliable API."""
if random.random() < 0.5:
raise ConnectionError("Service unavailable")
return {"messages": [AIMessage(content="API call succeeded")]}
builder = StateGraph(MessagesState)
builder.add_node(
"api_call",
flaky_api_node,
retry=RetryPolicy(max_attempts=5, initial_interval=1.0)
)
builder.add_edge(START, "api_call")
builder.add_edge("api_call", END)
retry_graph = builder.compile()
print("Graph with RetryPolicy compiled")
Graph with RetryPolicy compiled
LangGraph retries flaky_api_node up to 5 times. It waits 1 second before the first retry, then 2 seconds, then 4, and so on. Your node function stays clean — it just raises exceptions when things go wrong.
Predict the output: if initial_interval=1.0 and backoff_factor=2.0, what’s the delay before the 4th retry (ignoring jitter)?
Answer: 1.0 * 2^3 = 8.0 seconds. Each retry doubles the previous delay.
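To see the whole schedule, here is a stdlib sketch of the same exponential formula (jitter omitted; the cap mirrors the max_interval default from the table above):

```python
def backoff(attempt: int, initial_interval: float = 1.0,
            backoff_factor: float = 2.0, max_interval: float = 128.0) -> float:
    # Delay before retry number `attempt` (1-based), capped at max_interval
    return min(initial_interval * backoff_factor ** (attempt - 1), max_interval)

for n in range(1, 6):
    print(f"retry {n}: wait {backoff(n):.1f}s")
# retry 1: wait 1.0s ... retry 4: wait 8.0s ... retry 5: wait 16.0s
```

The cap is why long retry chains don't spiral: past retry 8 or so with these parameters, every wait is pinned at max_interval.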
The retry_on parameter deserves special attention. By default, LangGraph retries most exceptions. For HTTP errors (from requests or httpx), it only retries 5xx status codes. You can customize this with a callable.
def should_retry_error(error: Exception) -> bool:
"""Only retry transient network errors."""
return isinstance(error, (ConnectionError, TimeoutError))
builder = StateGraph(MessagesState)
builder.add_node(
"selective_retry",
flaky_api_node,
retry=RetryPolicy(max_attempts=3, retry_on=should_retry_error)
)
builder.add_edge(START, "selective_retry")
builder.add_edge("selective_retry", END)
selective_graph = builder.compile()
print("Selective retry graph compiled")
Selective retry graph compiled
Fallback Nodes and Edges
Retrying doesn’t help when the service is completely down. That’s where fallback nodes come in — they provide an alternative path through your graph.
The pattern: a primary node tries the preferred approach. If it fails, a conditional edge routes to a fallback node that uses a simpler or cached approach. Both paths converge at the same downstream point.
class FallbackState(TypedDict):
messages: Annotated[list, lambda a, b: a + b]
primary_failed: bool
def primary_node(state: FallbackState):
"""Try the preferred approach first."""
try:
response = llm.invoke(state["messages"])
return {"messages": [response], "primary_failed": False}
except Exception as e:
return {
"messages": [AIMessage(content=f"Primary failed: {e}")],
"primary_failed": True,
}
def fallback_node(state: FallbackState):
"""Simpler approach when primary fails."""
query = state["messages"][0].content
return {"messages": [AIMessage(
content=f"I couldn't process your request fully, "
f"but here's a basic response to: '{query}'"
)]}
def route_after_primary(state: FallbackState) -> str:
if state.get("primary_failed", False):
return "fallback"
return "done"
builder = StateGraph(FallbackState)
builder.add_node("primary", primary_node)
builder.add_node("fallback", fallback_node)
builder.add_edge(START, "primary")
builder.add_conditional_edges("primary", route_after_primary, {
"fallback": "fallback",
"done": END,
})
builder.add_edge("fallback", END)
fallback_graph = builder.compile()
print("Fallback graph compiled")
Fallback graph compiled
This pattern is especially useful for LLM model fallbacks. Your primary node calls GPT-4o. If that fails, the fallback calls GPT-4o-mini. LangChain makes this even easier with .with_fallbacks().
def create_resilient_llm():
"""Create an LLM with automatic fallback chain."""
primary = ChatOpenAI(
model="gpt-4o",
timeout=30,
max_retries=3,
)
fallback = ChatOpenAI(
model="gpt-4o-mini",
timeout=30,
max_retries=3,
)
return primary.with_fallbacks([fallback])
resilient_llm = create_resilient_llm()
print("Resilient LLM with fallback chain ready")
Resilient LLM with fallback chain ready
I use this pattern in every production agent. GPT-4o-mini costs a fraction of GPT-4o, so when the primary model is down or rate limited, the fallback keeps responses flowing at lower cost instead of failing outright.
ToolNode Error Handling with handle_tool_errors
LangGraph’s ToolNode has built-in error handling through handle_tool_errors. When a tool throws an exception, ToolNode catches it and returns the error as a ToolMessage. The LLM sees the error and can self-correct.
Here are two tools that can fail. divide raises on division by zero. fetch_price raises on unknown tickers.
@tool
def divide(a: float, b: float) -> float:
"""Divide a by b."""
if b == 0:
raise ValueError("Cannot divide by zero")
return a / b
@tool
def fetch_price(ticker: str) -> str:
"""Fetch stock price for a ticker symbol."""
valid_tickers = {"AAPL": 185.50, "GOOGL": 142.30}
if ticker not in valid_tickers:
raise ValueError(f"Unknown ticker: {ticker}")
return f"${valid_tickers[ticker]}"
tools = [divide, fetch_price]
You can configure handle_tool_errors in four ways. Here’s each option side by side.
# Option 1: Catch all errors, return error text to LLM
tool_node_basic = ToolNode(tools, handle_tool_errors=True)
# Option 2: Custom error message for all errors
tool_node_custom = ToolNode(
tools,
handle_tool_errors="Tool failed. Try different parameters."
)
# Option 3: Custom handler function
def handle_tool_error(error: Exception) -> str:
if isinstance(error, ValueError):
return f"Invalid input: {error}. Check your parameters."
return f"Tool error: {error}. Try a different approach."
tool_node_handler = ToolNode(tools, handle_tool_errors=handle_tool_error)
# Option 4: Catch only specific exception types
tool_node_selective = ToolNode(
tools, handle_tool_errors=(ValueError,)
)
print("Four ToolNode error handling configurations ready")
Four ToolNode error handling configurations ready
When handle_tool_errors=True, the LLM receives the actual error text. If it tried to divide by zero, it gets “Cannot divide by zero” and can reformulate. This self-correction loop is one of the most powerful patterns in agent design.
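To preview exactly what text the LLM would receive from Option 3, you can call the custom handler directly with the errors our tools raise, no graph required:

```python
def handle_tool_error(error: Exception) -> str:
    # Same handler as Option 3 above
    if isinstance(error, ValueError):
        return f"Invalid input: {error}. Check your parameters."
    return f"Tool error: {error}. Try a different approach."

# What the LLM sees after a bad divide call vs. a transient failure
print(handle_tool_error(ValueError("Cannot divide by zero")))
print(handle_tool_error(TimeoutError("request timed out")))
```

Writing the handler as a plain function like this also makes it trivial to unit test, independent of any LLM or graph machinery.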
Graceful Degradation Strategies
What happens when retries fail and fallbacks fail too? Graceful degradation means your agent still provides something useful, even if it can’t deliver the full answer.
I think about degradation in three levels:
Level 1 — Retry and succeed. The operation fails, retries, and eventually works. The user never notices.
Level 2 — Use an alternative. The primary tool or model fails. The agent switches to a backup. The response is less detailed but still useful.
Level 3 — Acknowledge and guide. Nothing works. The agent explains what happened and suggests next steps.
class DegradationState(TypedDict):
messages: Annotated[list, lambda a, b: a + b]
degradation_level: int
def smart_response_node(state: DegradationState):
"""Generate response with graceful degradation."""
level = state.get("degradation_level", 0)
if level == 0:
return {"messages": [AIMessage(
content="Full response with all data sources."
)]}
elif level == 1:
return {"messages": [AIMessage(
content="Partial response. Some data sources were unavailable."
)]}
else:
return {"messages": [AIMessage(
content="I'm having trouble accessing my tools. "
"Here's what I know from training data. "
"Please verify this information independently."
)]}
for level in [0, 1, 2]:
test_state = {"messages": [], "degradation_level": level}
result = smart_response_node(test_state)
print(f"Level {level}: {result['messages'][0].content[:55]}...")
Level 0: Full response with all data sources....
Level 1: Partial response. Some data sources were unavailable...
Level 2: I'm having trouble accessing my tools. Here's what I...
The key: degradation should be transparent. Don’t pretend everything is fine when it isn’t. Tell the user what you couldn’t do and why.
Error State Propagation and Monitoring
When errors happen deep in your graph, downstream nodes need to know. Propagating error information through state lets every node make informed decisions.
Add an errors list to your state that accumulates error information. Each entry includes the node name, error type, message, and timestamp. Here’s a reusable wrapper that adds error tracking to any node function.
from datetime import datetime
class ErrorTrackingState(TypedDict):
messages: Annotated[list, lambda a, b: a + b]
errors: Annotated[list[dict], lambda a, b: a + b]
def node_with_error_tracking(node_name: str, operation):
"""Wrap any node function with error tracking."""
def wrapped(state: ErrorTrackingState):
try:
return operation(state)
except Exception as e:
error_info = {
"node": node_name,
"error_type": type(e).__name__,
"message": str(e),
"timestamp": datetime.now().isoformat(),
}
return {"messages": [], "errors": [error_info]}
return wrapped
def step_one(state):
return {"messages": [AIMessage(content="Step 1 done")], "errors": []}
def step_two(state):
raise ValueError("Database connection refused")
builder = StateGraph(ErrorTrackingState)
builder.add_node("step1", node_with_error_tracking("step1", step_one))
builder.add_node("step2", node_with_error_tracking("step2", step_two))
builder.add_edge(START, "step1")
builder.add_edge("step1", "step2")
builder.add_edge("step2", END)
error_graph = builder.compile()
result = error_graph.invoke({"messages": [], "errors": []})
print(f"Errors collected: {len(result['errors'])}")
for err in result["errors"]:
print(f" {err['node']}: [{err['error_type']}] {err['message']}")
Errors collected: 1
step2: [ValueError] Database connection refused
For production monitoring, pair error tracking with Python’s logging module. Log at INFO for successes, WARNING for retries, and ERROR for failures. LangSmith captures these traces automatically if you set LANGCHAIN_TRACING_V2=true.
Building a Resilient Agent End-to-End
Let’s combine everything into a production-ready agent. This research agent has three tools, three layers of error protection, and graceful fallback paths.
We’ll create tools with different failure modes. web_search simulates rate limiting. calculator fails on invalid expressions. summarize_text rejects short input. The graph handles all of them.
@tool
def web_search(query: str) -> str:
"""Search the web for information."""
if random.random() < 0.3:
raise ConnectionError("Search API rate limited")
return f"Search results for '{query}': Found 5 relevant articles."
@tool
def calculator(expression: str) -> str:
"""Evaluate a mathematical expression safely."""
allowed = set("0123456789+-*/.() ")
if not all(c in allowed for c in expression):
raise ValueError(f"Invalid characters in: '{expression}'")
result = eval(expression)
return str(result)
@tool
def summarize_text(text: str) -> str:
"""Summarize a block of text."""
if len(text) < 10:
raise ValueError("Text too short to summarize")
return f"Summary: {text[:100]}..."
research_tools = [web_search, calculator, summarize_text]
research_llm = ChatOpenAI(model="gpt-4o-mini").bind_tools(research_tools)
The agent node wraps the LLM call in try/except. The ToolNode uses handle_tool_errors=True. And we attach RetryPolicy to the tools node for transient failures.
def agent_node(state: MessagesState):
"""Agent with error handling on LLM calls."""
try:
response = research_llm.invoke(state["messages"])
return {"messages": [response]}
except Exception as e:
return {"messages": [AIMessage(
content="I'm having trouble processing your request. "
"Could you rephrase your question?"
)]}
tool_node = ToolNode(research_tools, handle_tool_errors=True)
builder = StateGraph(MessagesState)
builder.add_node("agent", agent_node)
builder.add_node(
"tools", tool_node,
retry=RetryPolicy(max_attempts=3, initial_interval=1.0)
)
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", tools_condition)
builder.add_edge("tools", "agent")
resilient_agent = builder.compile()
print("Resilient research agent compiled")
Resilient research agent compiled
Three layers of protection work together here:
Layer 1 (tool-level): handle_tool_errors=True catches tool exceptions and sends error text back to the LLM. The LLM can self-correct by trying different parameters.
Layer 2 (node-level): RetryPolicy on the tools node retries the node on transient errors so rate limits and timeouts can resolve themselves. One caveat: exceptions that handle_tool_errors has already converted to ToolMessages never reach the retry policy, so it mainly covers failures raised outside the tools' own bodies.
Layer 3 (agent-level): The agent_node try/except catches LLM failures and returns a graceful message.
Common Mistakes and How to Fix Them
Mistake 1: Catching exceptions inside tools instead of using ToolNode
❌ Wrong:
@tool
def my_tool(query: str) -> str:
"""Do something."""
try:
return do_work(query)
except Exception:
return "Something went wrong"
Why it’s wrong: The LLM gets a useless message. It can’t tell an error from a real result. You’ve also hidden the error from your logs.
✅ Correct:
@tool
def my_tool(query: str) -> str:
"""Do something."""
return do_work(query) # Let ToolNode handle errors
Let ToolNode with handle_tool_errors=True catch the exception. It returns the actual error message as a ToolMessage, which the LLM uses to self-correct.
Mistake 2: Retrying errors that will never resolve
❌ Wrong:
builder.add_node("parse", parse_node, retry=RetryPolicy(max_attempts=5))
Why it’s wrong: If the input is malformed, retrying produces the same error five times. You waste time and API credits.
✅ Correct:
def should_retry(error: Exception) -> bool:
return isinstance(error, (ConnectionError, TimeoutError))
builder.add_node("parse", parse_node, retry=RetryPolicy(
max_attempts=5, retry_on=should_retry
))
Use retry_on to filter which exceptions deserve a retry. Network errors are worth retrying. Input validation errors are not.
Mistake 3: Silent failures that crash downstream nodes
❌ Wrong:
def node_a(state):
try:
return {"result": risky_operation()}
except Exception:
return {"result": None} # Silent failure
def node_b(state):
processed = state["result"].upper() # Crashes: NoneType has no .upper()
Why it’s wrong: Node B doesn’t know node A failed. It crashes with an AttributeError that hides the real problem.
✅ Correct:
def node_a(state):
try:
return {"result": risky_operation(), "errors": []}
except Exception as e:
return {"result": None, "errors": [{"node": "a", "error": str(e)}]}
def node_b(state):
if state.get("errors"):
return {"result": "Skipped: upstream error"}
return {"result": state["result"].upper(), "errors": []}
Propagate errors through state. Downstream nodes check for errors before processing.
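Tracing the corrected pair by hand shows the failure surfacing as data rather than a crash. A stdlib-only sketch with risky_operation stubbed to always fail:

```python
def risky_operation():
    # Stub: always fails, to exercise the error path
    raise RuntimeError("upstream boom")

def node_a(state: dict) -> dict:
    try:
        return {"result": risky_operation(), "errors": []}
    except Exception as e:
        return {"result": None, "errors": [{"node": "a", "error": str(e)}]}

def node_b(state: dict) -> dict:
    # Check for upstream errors before touching the result
    if state.get("errors"):
        return {"result": "Skipped: upstream error"}
    return {"result": state["result"].upper(), "errors": []}

state = node_a({})
print(node_b(state)["result"])  # Skipped: upstream error
```

Node B never dereferences the None result, and the original RuntimeError is preserved in the errors list for logging instead of being masked by an AttributeError.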
Error Troubleshooting Guide
GraphRecursionError: Recursion limit of 25 reached
This happens when your retry loop creates a cycle that exceeds LangGraph’s recursion limit. Fix: pass a higher recursion_limit in the config when invoking the graph, or reduce max_retries in your state-based retry logic.
graph = builder.compile()
result = graph.invoke(state, {"recursion_limit": 50})
ToolException: Tool 'my_tool' not found
Your LLM requested a tool that isn’t registered in the ToolNode. This typically happens when the LLM hallucinates a tool name. With handle_tool_errors=True, this error goes back to the LLM for correction.
openai.RateLimitError: Rate limit reached
You’ve exceeded your API quota. The ChatOpenAI class retries automatically via max_retries. If it persists, add .with_fallbacks() to switch to a cheaper model.
When NOT to Use These Patterns
Not every graph needs retry logic and fallback paths. Here’s when to keep things simple.
Prototyping and exploration. During development, let errors crash loudly. You want to see what fails and why. Adding error handling too early hides bugs.
Deterministic pipelines. If your graph doesn’t call external APIs or LLMs, failures are bugs in your code, not transient errors. Fix the bug instead of retrying around it.
Cost-sensitive applications. Every retry costs money (LLM calls, API usage). If your budget is tight, fail fast and surface the error to the user rather than burning credits on retries.
Exercises
{
type: 'exercise',
id: 'retry-node-exercise',
title: 'Exercise 1: Build a Retry-Aware Node',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Complete the retry_node function so it retries flaky_operation() up to 3 times with exponential backoff (base delay of 1 second). On success, return the result as an AIMessage. After all attempts fail, return a fallback message.',
starterCode: 'import time\nimport random\nfrom langchain_core.messages import AIMessage\n\ndef flaky_operation():\n    """Succeeds only 30% of the time."""\n    if random.random() > 0.3:\n        raise ConnectionError("Service unavailable")\n    return "Operation completed successfully"\n\ndef retry_node(state):\n    max_attempts = 3\n    # YOUR CODE HERE\n    # Try flaky_operation() with exponential backoff\n    # Return AIMessage with result on success\n    # Return AIMessage with fallback on failure\n    pass',
testCases: [
{ id: 'tc1', input: 'random.seed(42)\nresult = retry_node({"messages": []})\nprint(result["messages"][0].content)', expectedOutput: 'Operation completed successfully', description: 'Should succeed on the second attempt with seed 42' },
{ id: 'tc2', input: 'random.seed(10)\nresult = retry_node({"messages": []})\nprint("fallback" in result["messages"][0].content.lower())', expectedOutput: 'True', description: 'Should return fallback message when all retries fail', hidden: true },
],
hints: [
'Use a for loop with range(max_attempts). Inside the loop, wrap flaky_operation() in try/except. Calculate delay as 1.0 * (2 ** attempt) and call time.sleep(delay) before the next attempt.',
'After the loop completes without returning, all attempts failed. Return {"messages": [AIMessage(content="All retries exhausted. Using fallback response.")]}',
],
solution: 'def retry_node(state):\n    max_attempts = 3\n    for attempt in range(max_attempts):\n        try:\n            result = flaky_operation()\n            return {"messages": [AIMessage(content=result)]}\n        except ConnectionError:\n            if attempt < max_attempts - 1:\n                delay = 1.0 * (2 ** attempt)\n                time.sleep(delay)\n    return {"messages": [AIMessage(content="All retries exhausted. Using fallback response.")]}',
solutionExplanation: 'The loop tries the operation up to 3 times. On each failure, it waits 1s, then 2s before the next attempt. After the final failure, it returns a fallback message instead of crashing.',
xpReward: 15,
}
{
type: 'exercise',
id: 'error-tracking-exercise',
title: 'Exercise 2: Add Error Tracking to a Graph',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Create a safe_wrap(name, fn) function that wraps any node function with error tracking. When the wrapped function catches an exception, it should append an error dict with "node" and "error" keys to the errors list in state. Then use it to build a 3-node graph where step_b always fails.',
starterCode: 'from typing import TypedDict, Annotated\nfrom langchain_core.messages import AIMessage\nfrom langgraph.graph import StateGraph, START, END\n\nclass TrackedState(TypedDict):\n    messages: Annotated[list, lambda a, b: a + b]\n    errors: Annotated[list[dict], lambda a, b: a + b]\n\ndef step_a(state):\n    return {"messages": [AIMessage(content="A done")], "errors": []}\n\ndef step_b(state):\n    raise ValueError("Database connection failed")\n\ndef step_c(state):\n    count = len(state["errors"])\n    return {"messages": [AIMessage(content=f"Done with {count} error(s)")], "errors": []}\n\ndef safe_wrap(name, fn):\n    # YOUR CODE HERE\n    pass\n\n# YOUR CODE HERE: Build graph with wrapped nodes',
testCases: [
{ id: 'tc1', input: 'result = graph.invoke({"messages": [], "errors": []})\nprint(result["messages"][-1].content)', expectedOutput: 'Done with 1 error(s)', description: 'Should collect 1 error from step_b' },
{ id: 'tc2', input: 'result = graph.invoke({"messages": [], "errors": []})\nprint(result["errors"][0]["node"])', expectedOutput: 'b', description: 'Error should identify the failing node' },
],
hints: [
'safe_wrap should return a new function that calls fn(state) inside try/except. On exception, return {"messages": [], "errors": [{"node": name, "error": str(e)}]}.',
'Build the graph: add_node("a", safe_wrap("a", step_a)), add_node("b", safe_wrap("b", step_b)), add_node("c", step_c). Chain edges: START -> a -> b -> c -> END.',
],
solution: 'def safe_wrap(name, fn):\n    def wrapped(state):\n        try:\n            return fn(state)\n        except Exception as e:\n            return {"messages": [], "errors": [{"node": name, "error": str(e)}]}\n    return wrapped\n\nbuilder = StateGraph(TrackedState)\nbuilder.add_node("a", safe_wrap("a", step_a))\nbuilder.add_node("b", safe_wrap("b", step_b))\nbuilder.add_node("c", step_c)\nbuilder.add_edge(START, "a")\nbuilder.add_edge("a", "b")\nbuilder.add_edge("b", "c")\nbuilder.add_edge("c", END)\ngraph = builder.compile()',
solutionExplanation: 'safe_wrap catches exceptions from any node and stores error info in state. step_b fails with ValueError, but the wrapper catches it. step_c sees 1 error in state and reports it.',
xpReward: 20,
}
Complete Code
Click to expand the full script (copy-paste and run)
# Complete code from: Error Handling, Retries, and Fallback Strategies in LangGraph
# Requires: pip install langchain-openai langgraph python-dotenv
# Python 3.10+
# Set OPENAI_API_KEY in your .env file
import os
import time
import random
from typing import TypedDict, Annotated
from datetime import datetime
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.types import RetryPolicy
load_dotenv()
# --- Setup ---
llm = ChatOpenAI(model="gpt-4o-mini")
# --- Try/Except in Node Functions ---
def call_external_api(query: str) -> str:
roll = random.random()
if roll < 0.3:
raise TimeoutError("API request timed out after 30s")
if roll < 0.5:
raise ConnectionError("Could not reach api.example.com")
return f"Results for '{query}': 42 matching records found"
def safe_fetch_node(state: MessagesState):
query = state["messages"][-1].content
try:
result = call_external_api(query)
return {"messages": [AIMessage(content=result)]}
except (TimeoutError, ConnectionError) as e:
error_type = type(e).__name__
return {"messages": [AIMessage(
content=f"[{error_type}] {str(e)}. Continuing with available info."
)]}
# --- Retry with State Counters ---
class RetryState(TypedDict):
messages: Annotated[list, lambda a, b: a + b]
retry_count: int
max_retries: int
last_error: str
def unreliable_node(state: RetryState):
try:
if random.random() < 0.6:
raise ConnectionError("Service temporarily unavailable")
return {
"messages": [AIMessage(content="Operation succeeded!")],
"retry_count": 0, "last_error": "",
}
except ConnectionError as e:
new_count = state.get("retry_count", 0) + 1
return {"messages": [], "retry_count": new_count, "last_error": str(e)}
def should_retry(state: RetryState) -> str:
if state.get("retry_count", 0) >= state.get("max_retries", 3):
return "give_up"
if state.get("last_error"):
return "retry"
return "continue"
def give_up_node(state: RetryState):
return {"messages": [AIMessage(
content=f"Failed after {state.get('retry_count', 0)} attempts."
)]}
# --- Exponential Backoff ---
def backoff_delay(attempt: int, base: float = 1.0, jitter: bool = True) -> float:
delay = base * (2 ** attempt)
if jitter:
delay *= (0.5 + random.random())
return min(delay, 60.0)
# --- Fallback Pattern ---
class FallbackState(TypedDict):
messages: Annotated[list, lambda a, b: a + b]
primary_failed: bool
def primary_node(state: FallbackState):
try:
response = llm.invoke(state["messages"])
return {"messages": [response], "primary_failed": False}
except Exception as e:
return {"messages": [AIMessage(content=f"Primary failed: {e}")],
"primary_failed": True}
def fallback_node(state: FallbackState):
query = state["messages"][0].content
return {"messages": [AIMessage(
content=f"Basic response to: '{query}'"
)]}
# --- Error Tracking ---
class ErrorTrackingState(TypedDict):
messages: Annotated[list, lambda a, b: a + b]
errors: Annotated[list[dict], lambda a, b: a + b]
def node_with_error_tracking(node_name: str, operation):
def wrapped(state: ErrorTrackingState):
try:
return operation(state)
except Exception as e:
return {"messages": [], "errors": [{
"node": node_name,
"error_type": type(e).__name__,
"message": str(e),
"timestamp": datetime.now().isoformat(),
}]}
return wrapped
# --- Resilient Agent ---
@tool
def web_search(query: str) -> str:
"""Search the web for information."""
if random.random() < 0.3:
raise ConnectionError("Search API rate limited")
return f"Search results for '{query}': Found 5 relevant articles."
@tool
def calculator(expression: str) -> str:
"""Evaluate a mathematical expression safely."""
allowed = set("0123456789+-*/.() ")
if not all(c in allowed for c in expression):
raise ValueError(f"Invalid characters in: '{expression}'")
return str(eval(expression))
@tool
def summarize_text(text: str) -> str:
"""Summarize a block of text."""
if len(text) < 10:
raise ValueError("Text too short to summarize")
return f"Summary: {text[:100]}..."
research_tools = [web_search, calculator, summarize_text]
research_llm = ChatOpenAI(model="gpt-4o-mini").bind_tools(research_tools)
def agent_node(state: MessagesState):
try:
response = research_llm.invoke(state["messages"])
return {"messages": [response]}
except Exception as e:
return {"messages": [AIMessage(
content="I'm having trouble. Could you rephrase?"
)]}
tool_node = ToolNode(research_tools, handle_tool_errors=True)
builder = StateGraph(MessagesState)
builder.add_node("agent", agent_node)
builder.add_node("tools", tool_node, retry=RetryPolicy(max_attempts=3))
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", tools_condition)
builder.add_edge("tools", "agent")
resilient_agent = builder.compile()
# Run the agent
result = resilient_agent.invoke({
"messages": [HumanMessage(content="What is 25 * 4?")]
})
for msg in result["messages"]:
if isinstance(msg, HumanMessage):
print(f"User: {msg.content}")
elif isinstance(msg, ToolMessage):
print(f"Tool: {msg.content[:80]}")
else:
print(f"Agent: {msg.content[:120]}")
print("\nScript completed successfully.")
Summary
Error handling in LangGraph isn’t a nice-to-have — it’s essential for production agents. Here’s what you’ve learned.
Try/except in nodes keeps your graph alive when operations fail. Catch specific exceptions and return meaningful messages.
State-based retries give you full control. Track counts in state and use conditional edges to decide whether to retry.
RetryPolicy handles retries automatically. Attach it to any node with retry=RetryPolicy(...) for exponential backoff out of the box.
Fallback nodes provide alternative paths. Chain LLM models with .with_fallbacks() or build conditional fallback edges.
ToolNode’s handle_tool_errors catches tool exceptions and feeds them back to the LLM for self-correction.
Error propagation lets downstream nodes react to upstream failures. Add an errors list to your state and check it before processing.
Practice exercise: build an agent with two tools where one is unreliable. Add RetryPolicy to the tools node, handle_tool_errors=True to ToolNode, and a fallback agent node that responds without tools when all else fails.
Solution outline
# 1. Define reliable_tool and unreliable_tool
# 2. Create ToolNode with handle_tool_errors=True
# 3. Add tools node with RetryPolicy(max_attempts=3)
# 4. Add fallback_agent node that responds without tools
# 5. Conditional edge: route to fallback after repeated failures
# 6. Test with a query that triggers the unreliable tool
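For step 5 of the outline, the routing function can count how many tool results came back as errors. A sketch, assuming error results carry `status="error"` on the message (as `ToolMessage` supports); a plain stand-in class keeps the demo self-contained:

```python
class FakeToolMessage:
    """Stand-in for ToolMessage; real ones carry a status field."""
    def __init__(self, content, status="success"):
        self.content = content
        self.status = status

def route_after_tools(state, max_failures=3):
    """Conditional-edge function: divert to the fallback agent once
    too many tool calls have come back as errors."""
    failures = sum(1 for m in state["messages"]
                   if getattr(m, "status", None) == "error")
    return "fallback_agent" if failures >= max_failures else "agent"

msgs = [FakeToolMessage("ok"),
        FakeToolMessage("rate limited", status="error"),
        FakeToolMessage("rate limited", status="error"),
        FakeToolMessage("rate limited", status="error")]
print(route_after_tools({"messages": msgs}))  # fallback_agent
```

Wire this in with `builder.add_conditional_edges("tools", route_after_tools)` and add a `fallback_agent` node that answers without binding any tools.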
Frequently Asked Questions
Does RetryPolicy retry the entire graph or just the failed node?
Just the failed node. When a node raises an exception, RetryPolicy retries that specific node with the same input state. The rest of the graph waits.
# Each node gets its own policy
builder.add_node("a", node_a, retry=RetryPolicy(max_attempts=5))
builder.add_node("b", node_b, retry=RetryPolicy(max_attempts=2))
How does handle_tool_errors interact with RetryPolicy?
They work at different levels. handle_tool_errors catches tool exceptions and converts them to ToolMessage responses. The graph doesn’t see a failure, so RetryPolicy doesn’t trigger. If the error happens outside the tool (serialization, network), RetryPolicy kicks in.
Can I add a delay between state-based retries?
Yes. Add a time.sleep() call at the start of your node when retry_count > 0. Calculate the delay with exponential backoff: delay = base * (2 ** state["retry_count"]).
def retry_aware_node(state):
if state.get("retry_count", 0) > 0:
delay = 1.0 * (2 ** state["retry_count"])
time.sleep(min(delay, 30))
# ... rest of node logic
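A common refinement, not shown in the snippet above, is to add jitter so that many concurrent retries don't all wake up at the same instant and hammer the API together:

```python
import random

def backoff_delay(retry_count: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff capped at `cap` seconds, with full jitter:
    the actual delay is drawn uniformly from [0, capped_backoff]."""
    return random.uniform(0, min(cap, base * (2 ** retry_count)))
```

Call `time.sleep(backoff_delay(state["retry_count"]))` in place of the fixed formula.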
What happens when RetryPolicy exhausts all attempts?
LangGraph raises the original exception. If your graph has no error handling around that node, execution stops. Design your graph with conditional edges to a fallback node for graceful handling.
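If you would rather not crash even then, wrap the top-level invoke in a safety net. A minimal sketch; the callable here is a hypothetical stand-in for a compiled graph's `invoke`:

```python
def invoke_with_safety_net(invoke, inputs, fallback_text):
    """Call a compiled graph, degrading to a canned answer if every
    retry attempt is exhausted and the exception propagates out."""
    try:
        return invoke(inputs)
    except Exception as e:
        return {"messages": [f"{fallback_text} (cause: {type(e).__name__})"]}

def exhausted_graph(inputs):
    # Simulates RetryPolicy re-raising after max_attempts failures.
    raise ConnectionError("Search API rate limited")

result = invoke_with_safety_net(exhausted_graph, {}, "Please try again later")
print(result["messages"][0])
# Please try again later (cause: ConnectionError)
```

This is a last resort, not a substitute for in-graph fallback nodes: by the time the exception escapes, all intermediate state is gone.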
References
- LangGraph Documentation — How to Add Node Retry Policies: https://langchain-ai.github.io/langgraph/how-tos/node-retries/
- LangGraph Types Reference — RetryPolicy: https://langchain-ai.github.io/langgraph/reference/types/
- LangChain Documentation — ToolNode and Tool Execution: https://python.langchain.com/docs/langgraph/prebuilt/toolnode
- LangChain Documentation — Model Fallbacks: https://python.langchain.com/docs/how_to/fallbacks/
- LangGraph Changelog — Enhanced State Management and Retries: https://changelog.langchain.com/announcements/enhanced-state-management-retries-in-langgraph-python
- Python Documentation — Logging HOWTO: https://docs.python.org/3/howto/logging.html
- LangGraph Documentation — Thinking in LangGraph: https://docs.langchain.com/oss/python/langgraph/thinking-in-langgraph