
LangGraph Error Handling — Retries, Fallbacks, and Recovery Strategies

Written by Selva Prabhakaran | 27 min read

Your LangGraph agent works great in dev. You show it off. Then you ship it, and within an hour an API times out, the LLM spits back bad JSON, and a tool throws something you never saw coming. The whole graph dies.

That story plays out in almost every project I’ve seen. The agents that last are the ones built to deal with failure — not dodge it. In this post, I’ll walk you through every error-handling trick LangGraph offers, from basic try/except blocks all the way up to resilient, multi-layered agents.

Why Does Error Handling Matter So Much in Agent Systems?

A normal Python script fails at one spot. You read the traceback, fix the bug, done. Agents are a different beast — they chain LLM calls, hit outside tools, and pass state from node to node.

Three things make this harder than regular error handling.

First, outside services are wild cards. Your agent talks to OpenAI, a search tool, and a database. Any one of them can blow up with rate limits, timeouts, or weird output — and you can’t control when.

Second, bad data snowballs. If node A returns junk, node B eats that junk and produces something worse. By the time you spot the problem, the real cause is three steps back.

Third, some failures heal on their own. A rate-limit error clears up if you wait 30 seconds. Crashing right away throws that chance in the trash.

Key Insight: Error handling for agents isn’t about stopping failures. It’s about deciding what happens next when something breaks. A solid agent stumbles and recovers. A brittle one falls flat.

Let me show you how to build that strength, step by step.

python
import os
import time
import random
from typing import TypedDict, Annotated
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.types import RetryPolicy

load_dotenv()

llm = ChatOpenAI(model="gpt-4o-mini")
print("Environment ready")
python
Environment ready

How Do You Use Try/Except in Node Functions?

The simplest way to keep your graph running when a node fails: wrap the risky part in try/except and hand back a useful error message instead of letting it crash.

Here’s what I mean. We have a node that calls a flaky API — one that sometimes times out or drops the connection. The safe_fetch_node function catches both problems and returns a message the LLM can work with.

python
def call_external_api(query: str) -> str:
    """Simulates an unreliable API."""
    roll = random.random()
    if roll < 0.3:
        raise TimeoutError("API request timed out after 30s")
    if roll < 0.5:
        raise ConnectionError("Could not reach api.example.com")
    return f"Results for '{query}': 42 matching records found"

def safe_fetch_node(state: MessagesState):
    """Fetch node with structured error handling."""
    query = state["messages"][-1].content
    try:
        result = call_external_api(query)
        return {"messages": [AIMessage(content=result)]}
    except (TimeoutError, ConnectionError) as e:
        error_type = type(e).__name__
        return {"messages": [AIMessage(
            content=f"[{error_type}] {str(e)}. Continuing with available info."
        )]}

random.seed(42)
test_state = {"messages": [HumanMessage(content="test query")]}
result = safe_fetch_node(test_state)
print(result["messages"][0].content)
python
Results for 'test query': 42 matching records found

See the pattern? Each except block hands back a valid state update — not a crash. The graph keeps going, and the next node gets an error message it can reason about.

Quick thought puzzle: what if call_external_api raised a ValueError? That one isn’t caught here, so the node would still crash. Always name the exact errors you expect.

Warning: Never use bare except: in node functions. Catch only the types you know about. A bare except swallows KeyboardInterrupt and SystemExit, which makes your agent almost impossible to kill during testing.
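To make that warning concrete, here's a minimal sketch of the pattern I use (a hypothetical node with a plain dict standing in for graph state): handle the errors you expect, and log-then-re-raise everything else so surprises crash loudly during testing.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent")

def cautious_node(state: dict) -> dict:
    """Hypothetical node: plain-dict state stands in for graph state."""
    try:
        result = state["payload"].strip()  # the "risky" work
        return {"messages": [f"Processed: {result}"]}
    except (TimeoutError, ConnectionError) as e:
        # Expected, transient failures become a normal state update.
        return {"messages": [f"[{type(e).__name__}] {e}"]}
    except Exception:
        # Anything unexpected gets logged with a traceback, then re-raised
        # so it fails loudly instead of being silently swallowed.
        logger.exception("Unexpected error in cautious_node")
        raise
```

A missing `payload` key here raises a KeyError, which the last branch logs and re-raises — exactly the behavior you want for bugs you didn't anticipate.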

How Do You Build Retry Loops with State Counters?

Sometimes the right move after a failure isn’t quitting — it’s trying again. You can do this by hand: keep a count of tries in your graph state.

Here’s the idea. You add retry_count and max_retries to the state. When the node fails, it bumps the counter. A routing function checks the counter and either loops back (retry) or moves to a give-up node (stop trying).

python
class RetryState(TypedDict):
    messages: Annotated[list, lambda a, b: a + b]
    retry_count: int
    max_retries: int
    last_error: str

def unreliable_node(state: RetryState):
    """A node that fails sometimes and tracks retries."""
    try:
        if random.random() < 0.6:
            raise ConnectionError("Service temporarily unavailable")
        return {
            "messages": [AIMessage(content="Operation succeeded!")],
            "retry_count": 0,
            "last_error": "",
        }
    except ConnectionError as e:
        new_count = state.get("retry_count", 0) + 1
        return {
            "messages": [],
            "retry_count": new_count,
            "last_error": str(e),
        }

The routing function picks the next step. If retry_count reaches max_retries, it routes to give_up. If there’s an error but retries remain, it loops back. If all went well, it moves forward.

python
def should_retry(state: RetryState) -> str:
    """Route to retry or give up based on attempt count."""
    max_retries = state.get("max_retries", 3)
    if state.get("retry_count", 0) >= max_retries:
        return "give_up"
    if state.get("last_error"):
        return "retry"
    return "continue"

def give_up_node(state: RetryState):
    """Fallback when retries are exhausted."""
    count = state.get("retry_count", 0)
    error = state.get("last_error", "Unknown error")
    return {"messages": [AIMessage(
        content=f"Failed after {count} attempts. Last error: {error}"
    )]}

builder = StateGraph(RetryState)
builder.add_node("attempt", unreliable_node)
builder.add_node("give_up", give_up_node)
builder.add_edge(START, "attempt")
builder.add_conditional_edges("attempt", should_retry, {
    "retry": "attempt",
    "give_up": "give_up",
    "continue": END,
})
builder.add_edge("give_up", END)
graph = builder.compile()
print("Retry graph compiled")
python
Retry graph compiled

Tip: Store the error type in state, not just a count. A TimeoutError is worth retrying — the server might just be slow. A ValueError from bad input won’t get better no matter how many times you try.
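Here's a sketch of that tip, using plain dicts in place of graph state (the field names are my own, not a LangGraph convention): transient network errors earn another attempt, while anything else goes straight to the give-up path.

```python
# Error types worth retrying -- the failing node would record
# type(e).__name__ into last_error_type alongside the counter.
TRANSIENT_ERRORS = {"TimeoutError", "ConnectionError"}

def route_by_error_type(state: dict) -> str:
    error_type = state.get("last_error_type", "")
    if not error_type:
        return "continue"                      # no error recorded
    retries_left = state.get("retry_count", 0) < state.get("max_retries", 3)
    if error_type in TRANSIENT_ERRORS and retries_left:
        return "retry"
    return "give_up"                           # permanent error, or out of retries
```

A ValueError routes to "give_up" immediately, no matter how many retries remain — it will fail the same way every time.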

What Is LangGraph’s Built-In RetryPolicy?

Hand-rolled retry logic gives you fine control, but LangGraph ships a RetryPolicy class that covers the common cases. Bolt it onto any node and you get retries with backoff — no extra code in your node body.

RetryPolicy lives in langgraph.types. Here’s what each setting does:

Parameter          Default           What it does
initial_interval   0.5               Seconds to wait before the first retry
backoff_factor     2.0               Multiplier: each wait is this times the last
max_attempts       3                 Total attempts (including the first) before giving up
max_interval       128.0             Cap on the wait time, in seconds
jitter             True              Adds random variation to each wait
retry_on           default_retry_on  Which errors trigger a retry

Hooking it up takes one line. Pass retry=RetryPolicy(...) when you call add_node().

python
def flaky_api_node(state: MessagesState):
    """Node that calls an unreliable API."""
    if random.random() < 0.5:
        raise ConnectionError("Service unavailable")
    return {"messages": [AIMessage(content="API call succeeded")]}

builder = StateGraph(MessagesState)
builder.add_node(
    "api_call",
    flaky_api_node,
    retry=RetryPolicy(max_attempts=5, initial_interval=1.0)
)
builder.add_edge(START, "api_call")
builder.add_edge("api_call", END)
retry_graph = builder.compile()
print("Graph with RetryPolicy compiled")
python
Graph with RetryPolicy compiled

Now LangGraph retries flaky_api_node up to 5 times. It waits 1 second, then 2, then 4, and so on. Your node stays clean — just raise the error and let the policy do the work.

Quick quiz: if initial_interval=1.0 and backoff_factor=2.0, how long is the wait before the 4th retry (no jitter)?

Answer: 1.0 × 2³ = 8.0 seconds. Each retry doubles the gap.
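To make the arithmetic concrete, here's a small helper that reproduces the schedule — a sketch of my own, not LangGraph's internal implementation. `attempt` is 1-indexed: attempt 1 is the first retry.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0,
                  cap: float = 60.0, jitter: bool = False) -> float:
    """Seconds to wait before retry number `attempt` (1-indexed)."""
    delay = base * (2 ** (attempt - 1))
    if jitter:
        delay *= 0.5 + random.random()  # spread simultaneous retries apart
    return min(delay, cap)

schedule = [backoff_delay(n) for n in range(1, 5)]
# schedule -> [1.0, 2.0, 4.0, 8.0]
```

The cap matters in practice: without it, attempt 10 would mean waiting 512 seconds.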

Key Insight: RetryPolicy moves retry logic out of your node code and into the graph config where it belongs. Your nodes focus on their job. The graph handles the safety net.

Take a closer look at retry_on. By default, LangGraph retries most errors. For HTTP errors (from requests or httpx), it only retries 5xx codes. You can plug in your own filter.

python
def should_retry_error(error: Exception) -> bool:
    """Only retry transient network errors."""
    return isinstance(error, (ConnectionError, TimeoutError))

builder = StateGraph(MessagesState)
builder.add_node(
    "selective_retry",
    flaky_api_node,
    retry=RetryPolicy(max_attempts=3, retry_on=should_retry_error)
)
builder.add_edge(START, "selective_retry")
builder.add_edge("selective_retry", END)
selective_graph = builder.compile()
print("Selective retry graph compiled")
python
Selective retry graph compiled

How Do Fallback Nodes and Edges Work?

Retrying only helps when the glitch is brief. If the service is flat-out dead, no number of retries will save you. That’s where fallback nodes come in — they give your graph a Plan B.

The setup is simple. A primary node tries the main approach. If it fails, a routing edge sends the graph to a fallback node that does something simpler or uses cached data. Both paths meet up at the same spot downstream.

python
class FallbackState(TypedDict):
    messages: Annotated[list, lambda a, b: a + b]
    primary_failed: bool

def primary_node(state: FallbackState):
    """Try the preferred approach first."""
    try:
        response = llm.invoke(state["messages"])
        return {"messages": [response], "primary_failed": False}
    except Exception as e:
        return {
            "messages": [AIMessage(content=f"Primary failed: {e}")],
            "primary_failed": True,
        }

def fallback_node(state: FallbackState):
    """Simpler approach when primary fails."""
    query = state["messages"][0].content
    return {"messages": [AIMessage(
        content=f"I couldn't process your request fully, "
        f"but here's a basic response to: '{query}'"
    )]}

def route_after_primary(state: FallbackState) -> str:
    if state.get("primary_failed", False):
        return "fallback"
    return "done"

builder = StateGraph(FallbackState)
builder.add_node("primary", primary_node)
builder.add_node("fallback", fallback_node)
builder.add_edge(START, "primary")
builder.add_conditional_edges("primary", route_after_primary, {
    "fallback": "fallback",
    "done": END,
})
builder.add_edge("fallback", END)
fallback_graph = builder.compile()
print("Fallback graph compiled")
python
Fallback graph compiled

This pattern is great for LLM model fallbacks. Say your main node calls GPT-4o. If that fails, the backup tries GPT-4o-mini. LangChain makes it even simpler with .with_fallbacks().

python
def create_resilient_llm():
    """Create an LLM with automatic fallback chain."""
    primary = ChatOpenAI(
        model="gpt-4o",
        timeout=30,
        max_retries=3,
    )
    fallback = ChatOpenAI(
        model="gpt-4o-mini",
        timeout=30,
        max_retries=3,
    )
    return primary.with_fallbacks([fallback])

resilient_llm = create_resilient_llm()
print("Resilient LLM with fallback chain ready")
python
Resilient LLM with fallback chain ready

I wire this into every agent I ship. The fallback model is cheaper too, so you save money on top of staying online.

Note: ChatOpenAI's max_retries handles HTTP-level retries (429, 500, 503). LangGraph’s RetryPolicy handles node-level retries. They sit at different layers. Use both for the best coverage.
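A stdlib-only sketch of how the two layers stack (the function names and counts are illustrative): the inner function plays the role of the client's own HTTP retries, the outer loop plays the role of node-level retries, and the `failures` list simulates a queue of transient 5xx responses.

```python
def client_call(failures: list, client_retries: int = 2) -> str:
    """Inner layer: absorb transient HTTP failures inside the client."""
    for _ in range(client_retries + 1):
        if failures:
            failures.pop(0)        # consume one simulated 503 and retry
            continue
        return "ok"
    raise ConnectionError("client retries exhausted")

def node_call(failures: list, node_attempts: int = 3) -> str:
    """Outer layer: if the whole client call still fails, retry the node."""
    for _ in range(node_attempts):
        try:
            return client_call(failures)
        except ConnectionError:
            continue
    raise ConnectionError("node retries exhausted")
```

With 3 client tries inside 3 node attempts, this setup survives up to eight consecutive transient failures — far more resilience than either layer alone.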

How Does ToolNode Handle Errors with handle_tool_errors?

LangGraph’s ToolNode has built-in error catching via handle_tool_errors. When a tool throws, ToolNode grabs the error and sends it back as a ToolMessage. The LLM reads that message and can fix its own mistake — a self-healing loop.

Here are two tools that can break. divide blows up on division by zero. fetch_price rejects unknown tickers.

python
@tool
def divide(a: float, b: float) -> float:
    """Divide a by b."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

@tool
def fetch_price(ticker: str) -> str:
    """Fetch stock price for a ticker symbol."""
    valid_tickers = {"AAPL": 185.50, "GOOGL": 142.30}
    if ticker not in valid_tickers:
        raise ValueError(f"Unknown ticker: {ticker}")
    return f"${valid_tickers[ticker]}"

tools = [divide, fetch_price]

You can set up handle_tool_errors four different ways. Here’s each one:

python
# Option 1: Catch all errors, return error text to LLM
tool_node_basic = ToolNode(tools, handle_tool_errors=True)

# Option 2: Custom error message for all errors
tool_node_custom = ToolNode(
    tools,
    handle_tool_errors="Tool failed. Try different parameters."
)

# Option 3: Custom handler function
def handle_tool_error(error: Exception) -> str:
    if isinstance(error, ValueError):
        return f"Invalid input: {error}. Check your parameters."
    return f"Tool error: {error}. Try a different approach."

tool_node_handler = ToolNode(tools, handle_tool_errors=handle_tool_error)

# Option 4: Catch only specific exception types
tool_node_selective = ToolNode(
    tools, handle_tool_errors=(ValueError,)
)

print("Four ToolNode error handling configurations ready")
python
Four ToolNode error handling configurations ready

When you set handle_tool_errors=True, the LLM gets the raw error text. If it tried to divide by zero, it sees “Cannot divide by zero” and can try again with better inputs. This self-fix loop is one of the handiest tricks in agent design.

Tip: Use handle_tool_errors=True while you develop. It catches everything and shows the real error. Once you ship, switch to a custom handler that cleans up the message — you don’t want to leak internal details to end users.
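Here's a sketch of that tip as one factory with two behaviors. The `prod` flag is illustrative — wire it to whatever config your deployment actually uses.

```python
def make_tool_error_handler(prod: bool):
    """Build a handler to pass as ToolNode(tools, handle_tool_errors=...)."""
    def handler(error: Exception) -> str:
        if not prod:
            # Development: surface the raw error so the LLM (and you) see it.
            return f"[{type(error).__name__}] {error}"
        # Production: map known errors to safe text, hide everything else.
        if isinstance(error, ValueError):
            return "Invalid input. Please adjust the parameters and retry."
        return "The tool is temporarily unavailable. Try a different approach."
    return handler
```

In development the LLM sees "[ValueError] Cannot divide by zero" and can self-correct; in production, internal details never reach the conversation.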

What Are Graceful Degradation Strategies?

What if retries fail and fallbacks fail too? The idea behind graceful degradation is simple: your agent still gives the user something helpful, even when it can’t deliver the full answer.

I like to think of it as three levels:

Level 1 — Retry and win. The call fails, retries, and works the next time. The user never knows anything went wrong.

Level 2 — Switch to a backup. The main tool or model is down. The agent uses a simpler one. The response is less rich but still helpful.

Level 3 — Own the failure. Nothing works. The agent says what happened and points the user to a next step.

python
class DegradationState(TypedDict):
    messages: Annotated[list, lambda a, b: a + b]
    degradation_level: int

def smart_response_node(state: DegradationState):
    """Generate response with graceful degradation."""
    level = state.get("degradation_level", 0)

    if level == 0:
        return {"messages": [AIMessage(
            content="Full response with all data sources."
        )]}
    elif level == 1:
        return {"messages": [AIMessage(
            content="Partial response. Some data sources were unavailable."
        )]}
    else:
        return {"messages": [AIMessage(
            content="I'm having trouble accessing my tools. "
            "Here's what I know from training data. "
            "Please verify this information independently."
        )]}

for level in [0, 1, 2]:
    test_state = {"messages": [], "degradation_level": level}
    result = smart_response_node(test_state)
    print(f"Level {level}: {result['messages'][0].content[:55]}...")
python
Level 0: Full response with all data sources....
Level 1: Partial response. Some data sources were unavailable...
Level 2: I'm having trouble accessing my tools. Here's what I...

The golden rule: be honest about it. Don’t act like everything is fine when it isn’t. Tell the user what broke and why.

How Do You Track and Propagate Errors Through State?

When a failure happens deep in your graph, the nodes that come after it need to know. If you pass error info through state, every later node can make a smart choice about what to do next.

The trick: add an errors list to your state. Each entry logs the node name, error type, message, and timestamp. Here’s a handy wrapper that bolts error tracking onto any node function.

python
from datetime import datetime

class ErrorTrackingState(TypedDict):
    messages: Annotated[list, lambda a, b: a + b]
    errors: Annotated[list[dict], lambda a, b: a + b]

def node_with_error_tracking(node_name: str, operation):
    """Wrap any node function with error tracking."""
    def wrapped(state: ErrorTrackingState):
        try:
            return operation(state)
        except Exception as e:
            error_info = {
                "node": node_name,
                "error_type": type(e).__name__,
                "message": str(e),
                "timestamp": datetime.now().isoformat(),
            }
            return {"messages": [], "errors": [error_info]}
    return wrapped

def step_one(state):
    return {"messages": [AIMessage(content="Step 1 done")], "errors": []}

def step_two(state):
    raise ValueError("Database connection refused")

builder = StateGraph(ErrorTrackingState)
builder.add_node("step1", node_with_error_tracking("step1", step_one))
builder.add_node("step2", node_with_error_tracking("step2", step_two))
builder.add_edge(START, "step1")
builder.add_edge("step1", "step2")
builder.add_edge("step2", END)
error_graph = builder.compile()

result = error_graph.invoke({"messages": [], "errors": []})
print(f"Errors collected: {len(result['errors'])}")
for err in result["errors"]:
    print(f"  {err['node']}: [{err['error_type']}] {err['message']}")
python
Errors collected: 1
  step2: [ValueError] Database connection refused

For production, pair this with Python’s logging module. Log at INFO for wins, WARNING for retries, and ERROR for real failures. If you set LANGCHAIN_TRACING_V2=true, LangSmith grabs these traces for you.
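Here's a minimal sketch of that logging convention wrapped around a generic retry loop (the logger name and operation are placeholders):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent.retries")

def run_with_logging(operation, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
            logger.info("Succeeded on attempt %d", attempt)      # INFO: wins
            return result
        except (ConnectionError, TimeoutError) as e:
            logger.warning("Attempt %d failed: %s", attempt, e)  # WARNING: retries
    logger.error("Gave up after %d attempts", max_attempts)      # ERROR: real failures
    return None
```

Scanning the logs then tells a clear story: WARNING lines are normal turbulence, ERROR lines need a human.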

How Do You Build a Resilient Agent End-to-End?

Let’s pull it all together. I’ll walk you through a research agent with three tools, three layers of safety, and a clean fallback path.

We set up tools that break in different ways. web_search hits rate limits at random. calculator rejects bad math. summarize_text refuses short input. The graph deals with all of them.

python
@tool
def web_search(query: str) -> str:
    """Search the web for information."""
    if random.random() < 0.3:
        raise ConnectionError("Search API rate limited")
    return f"Search results for '{query}': Found 5 relevant articles."

@tool
def calculator(expression: str) -> str:
    """Evaluate a mathematical expression safely."""
    allowed = set("0123456789+-*/.() ")
    if not all(c in allowed for c in expression):
        raise ValueError(f"Invalid characters in: '{expression}'")
    # The whitelist blocks names and attribute access, but eval can still
    # be abused (e.g. huge exponents). Prefer a real parser in production.
    result = eval(expression)
    return str(result)

@tool
def summarize_text(text: str) -> str:
    """Summarize a block of text."""
    if len(text) < 10:
        raise ValueError("Text too short to summarize")
    return f"Summary: {text[:100]}..."

research_tools = [web_search, calculator, summarize_text]
research_llm = ChatOpenAI(model="gpt-4o-mini").bind_tools(research_tools)

The agent node wraps the LLM call in try/except. The ToolNode uses handle_tool_errors=True. And we bolt RetryPolicy onto the tools node for network blips.

python
def agent_node(state: MessagesState):
    """Agent with error handling on LLM calls."""
    try:
        response = research_llm.invoke(state["messages"])
        return {"messages": [response]}
    except Exception as e:
        return {"messages": [AIMessage(
            content="I'm having trouble processing your request. "
            "Could you rephrase your question?"
        )]}

tool_node = ToolNode(research_tools, handle_tool_errors=True)

builder = StateGraph(MessagesState)
builder.add_node("agent", agent_node)
builder.add_node(
    "tools", tool_node,
    retry=RetryPolicy(max_attempts=3, initial_interval=1.0)
)
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", tools_condition)
builder.add_edge("tools", "agent")
resilient_agent = builder.compile()
print("Resilient research agent compiled")
python
Resilient research agent compiled

Here’s how those three layers team up:

Layer 1 (tool-level): handle_tool_errors=True catches tool crashes and feeds the error text to the LLM. The LLM reads it and can try again with better inputs.

Layer 2 (node-level): RetryPolicy on the tools node retries the full tool call on network-type errors. Rate limits and timeouts sort themselves out.

Layer 3 (agent-level): The agent_node try/except catches LLM failures and returns a polite “please try again” message.

What Are the Most Common Error-Handling Mistakes?

Mistake 1: Catching errors inside tools instead of using ToolNode

Wrong:

python
@tool
def my_tool(query: str) -> str:
    """Do something."""
    try:
        return do_work(query)
    except Exception:
        return "Something went wrong"

The problem: The LLM can’t tell an error from a real result. You’ve also hidden the failure from your logs and killed the self-fix loop.

Right:

python
@tool
def my_tool(query: str) -> str:
    """Do something."""
    return do_work(query)  # Let ToolNode handle errors

Let ToolNode with handle_tool_errors=True catch it. The LLM gets the actual error in a ToolMessage and can adjust.

Mistake 2: Retrying errors that will never go away

Wrong:

python
builder.add_node("parse", parse_node, retry=RetryPolicy(max_attempts=5))

The problem: If the input is garbage, retrying gives you the same crash five times. You burn time and API credits for nothing.

Right:

python
def should_retry(error: Exception) -> bool:
    return isinstance(error, (ConnectionError, TimeoutError))

builder.add_node("parse", parse_node, retry=RetryPolicy(
    max_attempts=5, retry_on=should_retry
))

Use retry_on to pick which errors deserve a second chance. Network blips? Yes. Bad input? No.

Mistake 3: Silent failures that break nodes downstream

Wrong:

python
def node_a(state):
    try:
        return {"result": risky_operation()}
    except Exception:
        return {"result": None}  # Silent failure

def node_b(state):
    processed = state["result"].upper()  # Crashes: NoneType has no .upper()

The problem: Node B has no idea node A failed. It blows up with an AttributeError that masks the real issue.

Right:

python
def node_a(state):
    try:
        return {"result": risky_operation(), "errors": []}
    except Exception as e:
        return {"result": None, "errors": [{"node": "a", "error": str(e)}]}

def node_b(state):
    if state.get("errors"):
        return {"result": "Skipped: upstream error"}
    return {"result": state["result"].upper(), "errors": []}

Pass errors through state so each node can check before it runs.

Quick Troubleshooting Guide

GraphRecursionError: Recursion limit of 25 reached

Your retry loop created a cycle that went too long. Fix it by raising the limit when you compile, or dial down max_retries in your state logic.

python
graph = builder.compile()
result = graph.invoke(state, {"recursion_limit": 50})

ToolException: Tool 'my_tool' not found

The LLM asked for a tool that isn’t in your ToolNode. Usually the model made up a name. With handle_tool_errors=True, the error goes back to the LLM so it can pick a real tool.

openai.RateLimitError: Rate limit reached

You’ve hit your API quota. ChatOpenAI retries on its own via max_retries. If it keeps happening, add .with_fallbacks() to drop to a cheaper model.

When Should You Skip These Patterns?

Not every graph needs layers of retry and fallback logic. Here’s when to keep it simple.

Early prototyping. While you’re exploring, let errors crash loud. You want to see what breaks. Adding safety nets too soon hides bugs.

Pure logic pipelines. If your graph never calls an API or LLM, failures are just bugs in your code. Fix the bug — don’t retry around it.

Tight budgets. Every retry costs money (LLM calls, API fees). If cash is short, fail fast and show the user the error instead of burning credits on repeat tries.

Exercises

{
type: 'exercise',
id: 'retry-node-exercise',
title: 'Exercise 1: Build a Retry-Aware Node',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Complete the retry_node function so it retries flaky_operation() up to 3 times with exponential backoff (base delay of 1 second). On success, return the result as an AIMessage. After all attempts fail, return a fallback message.',
starterCode: 'import time\nimport random\nfrom langchain_core.messages import AIMessage\n\ndef flaky_operation():\n    """Succeeds only 30% of the time."""\n    if random.random() > 0.3:\n        raise ConnectionError("Service unavailable")\n    return "Operation completed successfully"\n\ndef retry_node(state):\n    max_attempts = 3\n    # YOUR CODE HERE\n    # Try flaky_operation() with exponential backoff\n    # Return AIMessage with result on success\n    # Return AIMessage with fallback on failure\n    pass',
testCases: [
{ id: 'tc1', input: 'random.seed(10)\nresult = retry_node({"messages": []})\nprint(result["messages"][0].content)', expectedOutput: 'Operation completed successfully', description: 'Should succeed after retries with seed 10' },
{ id: 'tc2', input: 'random.seed(1)\nresult = retry_node({"messages": []})\nprint("fallback" in result["messages"][0].content.lower())', expectedOutput: 'True', description: 'Should return fallback message when all retries fail', hidden: true },
],
hints: [
'Use a for loop with range(max_attempts). Inside the loop, wrap flaky_operation() in try/except. Calculate delay as 1.0 * (2 ** attempt) and call time.sleep(delay) before the next attempt.',
'After the loop completes without returning, all attempts failed. Return {"messages": [AIMessage(content="All retries exhausted. Using fallback response.")]}',
],
solution: 'def retry_node(state):\n    max_attempts = 3\n    for attempt in range(max_attempts):\n        try:\n            result = flaky_operation()\n            return {"messages": [AIMessage(content=result)]}\n        except ConnectionError:\n            if attempt < max_attempts - 1:\n                delay = 1.0 * (2 ** attempt)\n                time.sleep(delay)\n    return {"messages": [AIMessage(content="All retries exhausted. Using fallback response.")]}',
solutionExplanation: 'The loop tries the operation up to 3 times. On each failure, it waits 1s, then 2s before the next attempt. After the final failure, it returns a fallback message instead of crashing.',
xpReward: 15,
}

{
type: 'exercise',
id: 'error-tracking-exercise',
title: 'Exercise 2: Add Error Tracking to a Graph',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Create a safe_wrap(name, fn) function that wraps any node function with error tracking. When the wrapped function catches an exception, it should append an error dict with "node" and "error" keys to the errors list in state. Then use it to build a 3-node graph where step_b always fails.',
starterCode: 'from typing import TypedDict, Annotated\nfrom langchain_core.messages import AIMessage\nfrom langgraph.graph import StateGraph, START, END\n\nclass TrackedState(TypedDict):\n    messages: Annotated[list, lambda a, b: a + b]\n    errors: Annotated[list[dict], lambda a, b: a + b]\n\ndef step_a(state):\n    return {"messages": [AIMessage(content="A done")], "errors": []}\n\ndef step_b(state):\n    raise ValueError("Database connection failed")\n\ndef step_c(state):\n    count = len(state["errors"])\n    return {"messages": [AIMessage(content=f"Done with {count} error(s)")], "errors": []}\n\ndef safe_wrap(name, fn):\n    # YOUR CODE HERE\n    pass\n\n# YOUR CODE HERE: Build graph with wrapped nodes',
testCases: [
{ id: 'tc1', input: 'result = graph.invoke({"messages": [], "errors": []})\nprint(result["messages"][-1].content)', expectedOutput: 'Done with 1 error(s)', description: 'Should collect 1 error from step_b' },
{ id: 'tc2', input: 'result = graph.invoke({"messages": [], "errors": []})\nprint(result["errors"][0]["node"])', expectedOutput: 'b', description: 'Error should identify the failing node' },
],
hints: [
'safe_wrap should return a new function that calls fn(state) inside try/except. On exception, return {"messages": [], "errors": [{"node": name, "error": str(e)}]}.',
'Build the graph: add_node("a", safe_wrap("a", step_a)), add_node("b", safe_wrap("b", step_b)), add_node("c", step_c). Chain edges: START -> a -> b -> c -> END.',
],
solution: 'def safe_wrap(name, fn):\n    def wrapped(state):\n        try:\n            return fn(state)\n        except Exception as e:\n            return {"messages": [], "errors": [{"node": name, "error": str(e)}]}\n    return wrapped\n\nbuilder = StateGraph(TrackedState)\nbuilder.add_node("a", safe_wrap("a", step_a))\nbuilder.add_node("b", safe_wrap("b", step_b))\nbuilder.add_node("c", step_c)\nbuilder.add_edge(START, "a")\nbuilder.add_edge("a", "b")\nbuilder.add_edge("b", "c")\nbuilder.add_edge("c", END)\ngraph = builder.compile()',
solutionExplanation: 'safe_wrap catches exceptions from any node and stores error info in state. Step_b fails with ValueError, but the wrapper catches it. Step_c sees 1 error in state and reports it.',
xpReward: 20,
}

Complete Code

Click to expand the full script (copy-paste and run)

python
# Complete code from: Error Handling, Retries, and Fallback Strategies in LangGraph
# Requires: pip install langchain-openai langgraph python-dotenv
# Python 3.10+
# Set OPENAI_API_KEY in your .env file

import os
import time
import random
from typing import TypedDict, Annotated
from datetime import datetime
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.types import RetryPolicy

load_dotenv()

# --- Setup ---
llm = ChatOpenAI(model="gpt-4o-mini")

# --- Try/Except in Node Functions ---
def call_external_api(query: str) -> str:
    roll = random.random()
    if roll < 0.3:
        raise TimeoutError("API request timed out after 30s")
    if roll < 0.5:
        raise ConnectionError("Could not reach api.example.com")
    return f"Results for '{query}': 42 matching records found"

def safe_fetch_node(state: MessagesState):
    query = state["messages"][-1].content
    try:
        result = call_external_api(query)
        return {"messages": [AIMessage(content=result)]}
    except (TimeoutError, ConnectionError) as e:
        error_type = type(e).__name__
        return {"messages": [AIMessage(
            content=f"[{error_type}] {str(e)}. Continuing with available info."
        )]}

# --- Retry with State Counters ---
class RetryState(TypedDict):
    messages: Annotated[list, lambda a, b: a + b]
    retry_count: int
    max_retries: int
    last_error: str

def unreliable_node(state: RetryState):
    try:
        if random.random() < 0.6:
            raise ConnectionError("Service temporarily unavailable")
        return {
            "messages": [AIMessage(content="Operation succeeded!")],
            "retry_count": 0, "last_error": "",
        }
    except ConnectionError as e:
        new_count = state.get("retry_count", 0) + 1
        return {"messages": [], "retry_count": new_count, "last_error": str(e)}

def should_retry(state: RetryState) -> str:
    if state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "give_up"
    if state.get("last_error"):
        return "retry"
    return "continue"

def give_up_node(state: RetryState):
    return {"messages": [AIMessage(
        content=f"Failed after {state.get('retry_count', 0)} attempts."
    )]}

# --- Exponential Backoff ---
def backoff_delay(attempt: int, base: float = 1.0, jitter: bool = True) -> float:
    delay = base * (2 ** attempt)
    if jitter:
        delay *= (0.5 + random.random())
    return min(delay, 60.0)
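A quick sanity check on the schedule this formula produces with jitter disabled: delays double per attempt and cap at 60 seconds.

```python
# Same math as backoff_delay(attempt, jitter=False): base * 2**attempt, capped at 60s
delays = [min(1.0 * (2 ** a), 60.0) for a in range(7)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

The jitter factor exists so that many clients retrying at once don't all hit the service at the exact same instant.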

# --- Fallback Pattern ---
class FallbackState(TypedDict):
    messages: Annotated[list, lambda a, b: a + b]
    primary_failed: bool

def primary_node(state: FallbackState):
    try:
        response = llm.invoke(state["messages"])
        return {"messages": [response], "primary_failed": False}
    except Exception as e:
        return {"messages": [AIMessage(content=f"Primary failed: {e}")],
                "primary_failed": True}

def fallback_node(state: FallbackState):
    query = state["messages"][0].content
    return {"messages": [AIMessage(
        content=f"Basic response to: '{query}'"
    )]}

# --- Error Tracking ---
class ErrorTrackingState(TypedDict):
    messages: Annotated[list, lambda a, b: a + b]
    errors: Annotated[list[dict], lambda a, b: a + b]

from datetime import datetime  # needed for the timestamps recorded below

def node_with_error_tracking(node_name: str, operation):
    def wrapped(state: ErrorTrackingState):
        try:
            return operation(state)
        except Exception as e:
            return {"messages": [], "errors": [{
                "node": node_name,
                "error_type": type(e).__name__,
                "message": str(e),
                "timestamp": datetime.now().isoformat(),
            }]}
    return wrapped

# --- Resilient Agent ---
@tool
def web_search(query: str) -> str:
    """Search the web for information."""
    if random.random() < 0.3:
        raise ConnectionError("Search API rate limited")
    return f"Search results for '{query}': Found 5 relevant articles."

@tool
def calculator(expression: str) -> str:
    """Evaluate a mathematical expression safely."""
    allowed = set("0123456789+-*/.() ")
    if not all(c in allowed for c in expression):
        raise ValueError(f"Invalid characters in: '{expression}'")
    return str(eval(expression))  # demo-grade safety: the whitelist above blocks names and attributes

@tool
def summarize_text(text: str) -> str:
    """Summarize a block of text."""
    if len(text) < 10:
        raise ValueError("Text too short to summarize")
    return f"Summary: {text[:100]}..."

research_tools = [web_search, calculator, summarize_text]
research_llm = ChatOpenAI(model="gpt-4o-mini").bind_tools(research_tools)

def agent_node(state: MessagesState):
    try:
        response = research_llm.invoke(state["messages"])
        return {"messages": [response]}
    except Exception:
        return {"messages": [AIMessage(
            content="I'm having trouble. Could you rephrase?"
        )]}

tool_node = ToolNode(research_tools, handle_tool_errors=True)

builder = StateGraph(MessagesState)
builder.add_node("agent", agent_node)
builder.add_node("tools", tool_node, retry=RetryPolicy(max_attempts=3))
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", tools_condition)
builder.add_edge("tools", "agent")
resilient_agent = builder.compile()

# Run the agent
result = resilient_agent.invoke({
    "messages": [HumanMessage(content="What is 25 * 4?")]
})
for msg in result["messages"]:
    if isinstance(msg, HumanMessage):
        print(f"User: {msg.content}")
    elif isinstance(msg, ToolMessage):
        print(f"Tool: {msg.content[:80]}")
    else:
        print(f"Agent: {msg.content[:120]}")

print("\nScript completed successfully.")

Summary

Error handling isn’t a nice bonus — it’s the gap between agents that live in the wild and agents that crash on day one. Here’s what we covered.

Try/except in nodes keeps the graph alive when a call goes wrong. Catch the specific errors you expect and return a useful message.

State-based retries hand you full control. Track counts in state and use routing edges to loop back or give up.

RetryPolicy does the heavy lifting for common cases. Bolt it onto any node with retry=RetryPolicy(...) and get backoff for free.

Fallback nodes give your graph an alternate route. Stack LLM models with .with_fallbacks() or build your own routing edges.

ToolNode’s handle_tool_errors turns tool crashes into messages the LLM can learn from — a self-healing loop.

Error passing through state lets later nodes react to earlier failures. Thread an errors list and check it before you process.

Practice exercise: wire up an agent with two tools — one reliable, one flaky. Add RetryPolicy on the tools node, handle_tool_errors=True on ToolNode, and a fallback agent node that answers without tools when all else fails.

Solution outline

python
# 1. Define reliable_tool and unreliable_tool
# 2. Create ToolNode with handle_tool_errors=True
# 3. Add tools node with RetryPolicy(max_attempts=3)
# 4. Add fallback_agent node that responds without tools
# 5. Conditional edge: route to fallback after repeated failures
# 6. Test with a query that triggers the unreliable tool

FAQ

Does RetryPolicy retry the whole graph or just the broken node?

Just the broken node. When a node raises an error, RetryPolicy reruns that one node with the same input. Everything else in the graph waits until it either succeeds or runs out of tries.

python
# Each node gets its own policy
builder.add_node("a", node_a, retry=RetryPolicy(max_attempts=5))
builder.add_node("b", node_b, retry=RetryPolicy(max_attempts=2))

How do handle_tool_errors and RetryPolicy work together?

They work at different layers and don’t overlap. handle_tool_errors catches errors inside tools and turns them into ToolMessage replies — the graph sees a success, so RetryPolicy never fires. If something breaks outside the tool (like a network error mid-call), RetryPolicy kicks in and retries the node.

Can I add a delay between state-based retries?

Yes. Drop a time.sleep() at the top of your node when retry_count is above 0. Use backoff math to space the waits: delay = base * (2 ** state["retry_count"]).

python
def retry_aware_node(state):
    if state.get("retry_count", 0) > 0:
        delay = 1.0 * (2 ** state["retry_count"])
        time.sleep(min(delay, 30))
    # ... rest of node logic

What happens when RetryPolicy uses up all its attempts?

LangGraph throws the original error again. If nothing else in your graph catches it, the run stops. Plan for this by adding a routing edge to a fallback node so you land safely.

References

  • LangGraph Documentation — How to Add Node Retry Policies: https://langchain-ai.github.io/langgraph/how-tos/node-retries/

  • LangGraph Types Reference — RetryPolicy: https://langchain-ai.github.io/langgraph/reference/types/

  • LangChain Documentation — ToolNode and Tool Execution: https://python.langchain.com/docs/langgraph/prebuilt/toolnode

  • LangChain Documentation — Model Fallbacks: https://python.langchain.com/docs/how_to/fallbacks/

  • LangGraph Changelog — Enhanced State Management and Retries: https://changelog.langchain.com/announcements/enhanced-state-management-retries-in-langgraph-python

  • Python Documentation — Logging HOWTO: https://docs.python.org/3/howto/logging.html

  • LangGraph Documentation — Thinking in LangGraph: https://docs.langchain.com/oss/python/langgraph/thinking-in-langgraph
