
LangGraph Code Generation Agent: Build & Run Python

Step-by-step guide to building a LangGraph agent that writes Python code, runs it safely, checks the output, and self-corrects until it works.

Written by Selva Prabhakaran | 29 min read


Picture this: you ask your LLM to “find the average salary by department from this CSV.” It writes code that looks fine. But when you run it, the column name is off and the whole script blows up. You fix it by hand, re-prompt, and cross your fingers.

What if the agent could spot its own errors, rewrite the code, and try again — all on its own? That is exactly what we are going to build in this post.

Let me walk you through how the data moves before we write a single line. A user sends a plain English request. The agent node picks it up and asks the LLM to write Python code. That code goes to a sandbox, where it runs in isolation.

The sandbox gives back either clean output or an error traceback. If the code worked, the agent checks whether the output truly answers the question. If it does, the graph wraps up with a final answer.

If the code broke or gave wrong results, the error flows back to the agent node. The LLM reads what went wrong, figures out the fix, and writes a new version. This loop keeps going until the code works — or we hit a retry cap.

Five parts make up the whole system: the state (which tracks messages, code, results, and retries), the code writer node, the runner node, a checker node, and the conditional routing that glues them together.

What Is a Code Generation Agent?

Think of a code generation agent as an LLM with a built-in test suite. It does not just spit out code and hope for the best. It writes code, runs it, looks at what happened, and tries again if the result is off. The agent takes charge of the entire workflow.

Here is how regular LLM code generation works:

text
User prompt → LLM → Code (might work, might not)

A code generation agent looks like this:

text
User prompt → Write → Run → Check → Fix if broken → Repeat → Answer

See the difference? The second flow has a feedback loop baked in. The agent runs its own code, reads any errors that pop up, and drafts a fix. It follows the same debug cycle you use as a developer — just faster and without the coffee breaks.
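That feedback loop can be sketched as a plain Python function. This is a toy skeleton only — the real nodes and routing come later in this post — with `write`, `run`, and `check` standing in for the LLM, the sandbox, and the checker:

```python
def agent_loop(write, run, check, max_retries: int = 3):
    """Toy write/run/check/fix loop; write receives feedback from the last failure."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        code = write(feedback)       # LLM drafts (or fixes) the code
        ok, result = run(code)       # sandbox executes it
        if ok and check(result):     # output actually answers the question
            return result
        feedback = result            # traceback or critique feeds the next draft
    return None                      # retry budget exhausted
```

Everything that follows is this loop, spelled out as LangGraph state, nodes, and conditional edges.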

Key Insight: The LLM is the programmer; the sandbox is the test lab. The agent treats every draft as a hypothesis. Write it, run it, see what breaks, patch it up. This “code, test, fix” rhythm is what turns a simple prompt into a reliable tool.

Prerequisites

  • Python version: 3.10+
  • Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+)
  • Install: pip install langgraph langchain-openai langchain-core
  • API key: An OpenAI API key set as OPENAI_API_KEY. See OpenAI’s docs to create one.
  • Time to complete: ~40 minutes
  • Prior knowledge: Basic LangGraph concepts (nodes, edges, state). If you are new, start with our LangGraph installation and setup guide.

Step 1 — Define the Agent State

The first thing any LangGraph agent needs is a state class. This typed dictionary rides along with every node in the graph. Nodes read from it, do their work, and drop results back in.

For our agent, plain message history is not enough. We also store the latest script, whatever the sandbox printed, a boolean that says pass or fail, and counters for the retry loop. With all of this in one place, every node can see the full picture.

python
import os
from typing import Annotated, TypedDict
from langchain_openai import ChatOpenAI
from langchain_core.messages import (
    HumanMessage,
    AIMessage,
    SystemMessage,
)
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    generated_code: str
    execution_result: str
    execution_succeeded: bool
    retry_count: int
    max_retries: int

Let me walk through each field. messages keeps the full chat and uses add_messages as its reducer, so new messages get appended rather than replacing old ones. generated_code holds whatever Python the LLM last produced. execution_result stores either the printed output (on success) or the traceback (on failure).

The bottom three fields power the loop. execution_succeeded is the flag that tells the router “keep going” or “try again.” retry_count says how many passes we have made. max_retries is the hard stop.

Quick check: Why does messages need a reducer but the rest do not? Messages pile up — every turn adds one. The other fields always show the current attempt. You care about the latest script, not the history of every draft.
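To see why this matters, here is a plain-Python illustration of reducer semantics — a deliberate simplification, not langgraph's actual internals. A field with a reducer merges each node's update into the old value; a plain field is simply overwritten:

```python
def apply_update(state: dict, update: dict, reducers: dict) -> dict:
    """Merge a node's partial update into the state, honoring per-field reducers."""
    new_state = dict(state)
    for key, value in update.items():
        if key in reducers:
            new_state[key] = reducers[key](state.get(key, []), value)
        else:
            new_state[key] = value
    return new_state

# "messages" appends (like add_messages); everything else replaces.
reducers = {"messages": lambda old, new: old + new}

state = {"messages": ["user: hi"], "generated_code": ""}
state = apply_update(state, {"messages": ["ai: draft 1"], "generated_code": "print(1)"}, reducers)
state = apply_update(state, {"messages": ["ai: draft 2"], "generated_code": "print(2)"}, reducers)
# messages now holds all three entries; generated_code holds only "print(2)".
```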

Step 2 — Build a Safe Code Runner

Stop and think: how do you safely run code an LLM just invented? Letting random Python loose on your machine is asking for trouble. One bad os.remove() and your data is gone.

The answer is a sandbox — a walled-off space where code can run without touching the rest of your system. In this guide I use Python’s subprocess with a timeout. It launches a separate process with a strict clock. For anything beyond learning, reach for Docker or langchain-sandbox, which offer real isolation.

Here is what the runner does: save the code to a temp file, execute it, and capture whatever comes back. The result falls into one of three buckets: clean output (stdout), an error (stderr), or a timeout (process killed).

python
import os
import subprocess
import sys
import tempfile


def execute_code_safely(code: str, timeout: int = 30) -> dict:
    """Execute Python code in a subprocess with a timeout."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", delete=False
    ) as f:
        f.write(code)
        temp_path = f.name

    try:
        result = subprocess.run(
            # sys.executable runs the same interpreter (and venv) as the agent
            [sys.executable, temp_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode == 0:
            return {
                "success": True,
                "output": result.stdout,
                "error": "",
            }
        else:
            return {
                "success": False,
                "output": result.stdout,
                "error": result.stderr,
            }
    except subprocess.TimeoutExpired:
        return {
            "success": False,
            "output": "",
            "error": f"Code timed out after {timeout} seconds.",
        }
    finally:
        os.unlink(temp_path)

The dict that comes back carries three pieces: success (a boolean), output (whatever the script printed), and error (the traceback when things break). No matter what happens, the finally block scrubs the temp file off disk.
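If you want to convince yourself what the success and failure buckets look like, this stdlib-only check mirrors what the runner sees. It uses sys.executable, which is generally safer than a bare "python" when you are inside a virtualenv:

```python
import subprocess
import sys

# Success bucket: return code 0, result on stdout.
ok = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    capture_output=True, text=True, timeout=10,
)
print(ok.returncode, ok.stdout.strip())            # 0 hello

# Failure bucket: non-zero return code, traceback on stderr.
bad = subprocess.run(
    [sys.executable, "-c", "1/0"],
    capture_output=True, text=True, timeout=10,
)
print(bad.returncode, "ZeroDivisionError" in bad.stderr)   # 1 True
```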

Warning: This subprocess runner is for learning only. It runs code with your user’s full rights — no file system walls, no network limits. For a real product, use Docker, `langchain-sandbox` (Pyodide-based), or a cloud sandbox like E2B. Never run code from an LLM without proper isolation.

Step 3 — Create the Code Writer Node

Now we get to the core: making the LLM write code. The writer node builds a prompt, sends it off, and carves the Python script out of the response.

If there is one lesson I have learned, it is this: the system prompt is everything. You have to spell out the rules — only runnable Python, always use print() for output, and include every import at the top. Skip these rules and the model hands back code that either prints nothing or crashes on a missing library.

When the agent is retrying, the prompt gets longer. It now includes the old code and the error. The model gets pointed feedback — not a vague “try again” but a detailed “line 12 threw a KeyError because the column name was wrong.”

python
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

SYSTEM_PROMPT = """You are a Python code generation assistant.
1. Write complete, executable Python that solves the request.
2. Always include print() statements to show results.
3. Include all necessary imports at the top.
4. Handle potential errors with try/except where needed.
5. Output ONLY Python code — no explanations, no markdown.

If you receive an error from a previous attempt:
- Read the error carefully.
- Fix the specific issue.
- Do NOT rewrite everything unless necessary.
- Output the corrected code."""


def generate_code(state: AgentState) -> dict:
    """Generate Python code based on the user request."""
    messages = state["messages"]
    retry_count = state.get("retry_count", 0)

    prompt_messages = [SystemMessage(content=SYSTEM_PROMPT)]

    if retry_count > 0 and state.get("generated_code"):
        error_context = (
            f"\n\nYour previous code:\n"
            f"```python\n{state['generated_code']}\n```\n\n"
            f"Error encountered:\n{state['execution_result']}\n\n"
            f"Fix the code. Attempt {retry_count + 1} of "
            f"{state['max_retries']}."
        )
        prompt_messages.extend(messages)
        prompt_messages.append(
            HumanMessage(content=error_context)
        )
    else:
        prompt_messages.extend(messages)

    response = model.invoke(prompt_messages)
    generated_code = response.content.strip()

    # Strip markdown code fences if the model adds them
    if generated_code.startswith("```python"):
        generated_code = generated_code[9:]
    if generated_code.startswith("```"):
        generated_code = generated_code[3:]
    if generated_code.endswith("```"):
        generated_code = generated_code[:-3]
    generated_code = generated_code.strip()

    return {
        "messages": [
            AIMessage(
                content=f"Generated code (attempt "
                f"{retry_count + 1}):\n```python\n"
                f"{generated_code}\n```"
            )
        ],
        "generated_code": generated_code,
    }

Look at the fence-stripping logic near the end. Even with “output ONLY code” in the prompt, models love to wrap replies in triple backticks. We peel those off so the sandbox gets pure Python.

During retries, the function tacks on error context as a new message. The model can now see the entire conversation plus the exact stack trace. That context is worth gold — starting from scratch would lose everything the model learned from the first attempt.

Step 4 — Build the Runner Node

This node is short and sweet. It reads generated_code from state, feeds it to the sandbox, and records what comes back. You can think of it as the middleman between the LLM’s imagination and cold hard reality.

python
def execute_code(state: AgentState) -> dict:
    """Execute the generated code and capture the result."""
    code = state["generated_code"]
    result = execute_code_safely(code)

    if result["success"]:
        output_text = (
            result["output"] if result["output"] else "(No output)"
        )
        return {
            "messages": [
                AIMessage(
                    content=f"Execution successful.\n"
                    f"Output:\n{output_text}"
                )
            ],
            "execution_result": output_text,
            "execution_succeeded": True,
        }
    else:
        error_text = result["error"]
        return {
            "messages": [
                AIMessage(
                    content=f"Execution failed.\n"
                    f"Error:\n{error_text}"
                )
            ],
            "execution_result": error_text,
            "execution_succeeded": False,
            "retry_count": state.get("retry_count", 0) + 1,
        }

The function splits into two roads. When the code works, we save the printed output and set the success flag. When it breaks, we save the traceback, clear the flag, and tick the retry counter up by one.

You might wonder why I bump retry_count here instead of in the writer. The reason is simple: the runner is where we discover the failure. The writer should stay focused on producing code — counting attempts is not its concern.

Step 5 — Add the Checker Node

A script that runs without errors can still produce garbage. Suppose your user wants “the top 5 products by revenue” and the code spits out an unsorted dump of every row in the table. No traceback in sight — but the answer is flat-out wrong.

That is why we need a checker. Once the code finishes clean, we ask the LLM a second question: “Does this output really answer what the user asked?” It costs an extra API call, but the payoff is huge — it keeps the agent from handing back confident nonsense.

python
def evaluate_result(state: AgentState) -> dict:
    """Check if the output answers the user's question."""
    user_request = ""
    for msg in state["messages"]:
        if isinstance(msg, HumanMessage):
            user_request = msg.content
            break

    eval_prompt = (
        f"The user asked: '{user_request}'\n\n"
        f"The code produced this output:\n"
        f"{state['execution_result']}\n\n"
        f"Does this output correctly and completely answer "
        f"the user's request?\n"
        f"Reply with exactly 'YES' or 'NO: <reason>'."
    )

    response = model.invoke(
        [
            SystemMessage(
                content="You evaluate code execution results. "
                "Be strict but fair."
            ),
            HumanMessage(content=eval_prompt),
        ]
    )

    evaluation = response.content.strip()
    is_correct = evaluation.upper().startswith("YES")

    if is_correct:
        return {
            "messages": [
                AIMessage(
                    content=f"Result verified.\n\n"
                    f"Final answer:\n"
                    f"{state['execution_result']}"
                )
            ],
            "execution_succeeded": True,
        }
    else:
        return {
            "messages": [
                AIMessage(
                    content=f"Output doesn't match the "
                    f"request. Reason: {evaluation}"
                )
            ],
            "execution_result": (
                f"Code ran but output was wrong. "
                f"Evaluation: {evaluation}"
            ),
            "execution_succeeded": False,
            "retry_count": state.get("retry_count", 0) + 1,
        }

Inside, the node digs up the original user request from the message log. It frames a yes-or-no question for the LLM and uses the answer to decide the next step. A “yes” means we are done. A “no” flips the success flag off and increments the retry counter.

Is that extra API call worth the pennies? Without question. The alternative is an agent that returns wrong data with full confidence. A single cheap check prevents that trap.

Tip: Control how strict the checker is by editing its system prompt. Working with tabular data? Add rules like “confirm the row count matches.” Doing basic math? The default wording is fine. My advice: tighten the prompt for any task where a plausible-looking wrong answer could fool a user.
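For example, a stricter checker prompt for tabular work might look like this. The wording is my suggestion, not a fixed API — swap it in for the SystemMessage content inside the checker node and tune the rules to your data:

```python
# Hypothetical stricter system prompt for the evaluation call (tabular tasks).
STRICT_TABULAR_CHECKER = (
    "You evaluate code execution results for tabular data tasks. "
    "Answer NO unless the output (1) shows a row count consistent with "
    "the request, (2) includes every requested column, and (3) labels "
    "each numeric result. Reply with exactly 'YES' or 'NO: <reason>'."
)
```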

Step 6 — Wire the Graph with Conditional Routing

Now we stitch the nodes into a single graph. Regular edges connect steps that always follow each other. Conditional edges inspect the state and pick a path. Together, these edges form the decision-making layer of the agent.

Three decisions shape the flow:

  1. After running: Did the code pass? Send it to the checker. Did it crash? Check the retry budget.
  2. After checking: Did the output look right? Finish. Was it wrong? Retry.
  3. Retry guard: Have we hit max_retries? If yes, stop. If no, write new code.

python
def route_after_execution(state: AgentState) -> str:
    """Decide what happens after code execution."""
    if state["execution_succeeded"]:
        return "evaluate"
    elif state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "end"
    else:
        return "retry"


def route_after_evaluation(state: AgentState) -> str:
    """Decide what happens after result evaluation."""
    if state["execution_succeeded"]:
        return "end"
    elif state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "end"
    else:
        return "retry"

Both functions are tiny. route_after_execution checks the flag first — clean runs head to the checker, crashes loop back to the writer (if retries remain) or exit. route_after_evaluation covers the subtler situation: the code ran but the answer was wrong. Same branching pattern, same retry ceiling.

Below is the full assembly. add_node registers each function. add_edge and add_conditional_edges spell out how data flows from one node to the next.

python
def build_code_agent() -> StateGraph:
    """Build and compile the code generation agent graph."""
    graph = StateGraph(AgentState)

    # Register nodes
    graph.add_node("generate", generate_code)
    graph.add_node("execute", execute_code)
    graph.add_node("evaluate", evaluate_result)

    # Start with code generation
    graph.add_edge(START, "generate")

    # After generation, always execute
    graph.add_edge("generate", "execute")

    # After execution, branch on success
    graph.add_conditional_edges(
        "execute",
        route_after_execution,
        {
            "evaluate": "evaluate",
            "retry": "generate",
            "end": END,
        },
    )

    # After evaluation, branch on correctness
    graph.add_conditional_edges(
        "evaluate",
        route_after_evaluation,
        {
            "end": END,
            "retry": "generate",
        },
    )

    return graph.compile()

Follow the path: START feeds into generate, which always feeds into execute. After execute, the graph branches — successful runs head to evaluate, while crashes either loop back to generate or bail out. After evaluate, good answers exit and bad answers trigger another round.

Three nodes. Two routers. One compiled graph. That is the whole agent.

Key Insight: The retry loop draws the line between an agent and a chain. A chain fires once and returns whatever lands. An agent looks at its own work, grades it, and takes a second pass. LangGraph conditional edges make every decision visible in the graph layout — nothing is hidden.

Step 7 — Run the Agent

Time to see the agent in action. We call invoke with a state dictionary that holds the user message, blank tracking fields, and a retry cap of 3.

python
agent = build_code_agent()

result = agent.invoke(
    {
        "messages": [
            HumanMessage(
                content="Write a Python script that generates "
                "a list of 10 random numbers between 1 and "
                "100, sorts them, and prints the sorted list "
                "along with the average."
            )
        ],
        "generated_code": "",
        "execution_result": "",
        "execution_succeeded": False,
        "retry_count": 0,
        "max_retries": 3,
    }
)

# Print the final result
for msg in result["messages"]:
    print(f"\n{'='*50}")
    print(f"[{msg.__class__.__name__}]")
    print(msg.content)

For a straightforward task like this, the agent almost always gets it right on the first pass. The message log shows the full journey: user request, generated script, execution output, and the checker’s verdict.

Now let me throw a curveball — a task that is set up to fail on the first try.

python
result_retry = agent.invoke(
    {
        "messages": [
            HumanMessage(
                content="Read a CSV file called 'sales.csv' "
                "with columns 'product', 'region', 'revenue'. "
                "Group by region, calculate total revenue, "
                "and print the region with the highest total."
            )
        ],
        "generated_code": "",
        "execution_result": "",
        "execution_succeeded": False,
        "retry_count": 0,
        "max_retries": 3,
    }
)

print(f"\nRetries used: {result_retry['retry_count']}")
print(f"Succeeded: {result_retry['execution_succeeded']}")

Watch what happens. Attempt one crashes with FileNotFoundError because sales.csv is nowhere on disk. The agent parses the traceback, figures out the file is missing, and rewrites the script to create sample data right inside the code. That ability to adapt is the entire point of building an agent rather than a simple chain.

How to Watch the Agent Think Step by Step

LangGraph has a streaming mode that is perfect for debugging. Instead of waiting for the final answer, you can watch each node fire and see its output the moment it lands.

python
agent = build_code_agent()

for event in agent.stream(
    {
        "messages": [
            HumanMessage(
                content="Calculate the first 20 Fibonacci "
                "numbers and print them."
            )
        ],
        "generated_code": "",
        "execution_result": "",
        "execution_succeeded": False,
        "retry_count": 0,
        "max_retries": 3,
    }
):
    for node_name, node_output in event.items():
        print(f"\n--- Node: {node_name} ---")
        if "generated_code" in node_output:
            code_preview = node_output["generated_code"][:200]
            print(f"Code:\n{code_preview}...")
        if "execution_result" in node_output:
            print(f"Result: {node_output['execution_result'][:200]}")
        if "execution_succeeded" in node_output:
            print(f"Success: {node_output['execution_succeeded']}")

You will see the nodes light up in order: generate produces the Fibonacci logic, execute prints 20 numbers, and evaluate gives the green light. When a task triggers retries, generate shows up again with a refined script each time.

How to Add Guard Rails for Production

Before you point real users at this agent, bolt on some safety measures. Without them, the agent could produce harmful scripts, spin in circles, or drain your API wallet. Here are four protections I recommend.

Guard rail 1: Code safety scan. Before running anything, scan for risky patterns. This is not a full security audit — just a blocklist for the obvious threats.

python
FORBIDDEN_PATTERNS = [
    "os.remove", "os.rmdir", "shutil.rmtree",
    "subprocess.call", "subprocess.run",
    "os.system", "__import__",
    "eval(", "exec(",
    "open(", "pathlib",
]


def check_code_safety(code: str) -> tuple[bool, str]:
    """Check generated code for dangerous patterns."""
    for pattern in FORBIDDEN_PATTERNS:
        if pattern in code:
            return False, f"Blocked: contains '{pattern}'"
    return True, "Code passed safety check"

Warning: Pattern matching is not real security. A clever LLM can dodge these checks with `getattr`, string tricks, or other workarounds. For production, use a container sandbox. This blocklist catches accidental dangers, not planned attacks.
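A step up from string matching — still not real security — is walking the code's AST, which catches imports and `eval`/`exec` calls regardless of spacing. A sketch; `scan_with_ast` is my naming and `RISKY_MODULES` is an example blocklist, not an exhaustive one:

```python
import ast

RISKY_MODULES = {"os", "subprocess", "shutil", "socket"}  # example blocklist


def scan_with_ast(code: str) -> list[str]:
    """Flag risky imports and calls by walking the AST instead of matching strings."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: {e}"]
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in RISKY_MODULES:
                    findings.append(f"import of '{alias.name}'")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in RISKY_MODULES:
                findings.append(f"import from '{node.module}'")
        elif isinstance(node, ast.Call):
            # Catches direct calls like eval(...); attribute tricks still slip by.
            if isinstance(node.func, ast.Name) and node.func.id in {"eval", "exec", "__import__"}:
                findings.append(f"call to '{node.func.id}'")
    return findings
```

Like the blocklist, treat this as a tripwire for accidents, not a substitute for a container sandbox.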

Guard rail 2: Cost tracking. Each retry adds another generation call and another evaluation call to the bill. Track spend and set a hard cap per request.

python
def create_cost_tracker(max_cost: float = 0.10):
    """Track estimated API costs per request."""
    total = 0.0

    def check_cost(cost_per_attempt: float = 0.005) -> bool:
        # ~$0.005 per generation + evaluation round
        nonlocal total
        total += cost_per_attempt
        return total <= max_cost

    return check_cost

Guard rail 3: Code length cap. If the model writes 500 lines for a simple task, something is off. Put a ceiling on script length.

python
MAX_CODE_LINES = 100

def validate_code_length(code: str) -> bool:
    """Reject suspiciously long generated code."""
    return len(code.strip().split("\n")) <= MAX_CODE_LINES

Guard rail 4: Retry ceiling. Three retries is a solid default. Beyond that, the model likely does not grasp the task — more tries will not help.

When Should You Use a Code Agent (and When Not)?

Code agents pack a punch, but they are not the answer to every problem. Let me share where I have seen them shine — and where they fall flat.

Works well:

  • Data analysis — “Find X from this dataset.” Short code, clear output, retries handle edge cases.
  • Format conversion — “Turn this JSON into CSV with these columns.” Clear input, clear success test.
  • Math problems — “Solve this system of equations.” The LLM writes NumPy code, the answer is easy to check.
  • Quick automation — “Rename files matching this pattern.” (With proper sandboxing.)

Poor fit:

  • Long jobs — ML training takes hours. You do not want a retry loop on a 3-hour task.
  • No clear output — “Write a web server.” There is no single result to judge.
  • Side effects — Database writes, API calls. Each retry could create duplicates or corrupt data.
  • Taste calls — “Make this chart look nice.” The checker node cannot judge style.

Tip: Start with tasks where you can define “correct” in one sentence. If the success test needs a whole paragraph, the checker will not be reliable. Simple, checkable tasks get the best results from code agents.

Common Mistakes and How to Fix Them

These are the issues that come up most when building code agents.

1. Infinite retry loops

text
GraphRecursionError (recursion limit reached) or the graph hangs

This shows up when max_retries is not enforced or the counter does not go up. Make sure the runner node bumps retry_count on failure. Also check your routing function — a missing "end" path creates an endless cycle.

2. Code fence leftovers

python
SyntaxError: invalid syntax (line 1: ```python)

The LLM wraps code in markdown fences even though the prompt says not to. The generate_code function strips them, but edge cases slip through. Build a tougher cleaner that handles backticks, language tags, and partial fences.
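Here is one sketch of such a cleaner, using a regex for the well-formed fence case and falling back to peeling partial fences. The name `strip_code_fences` is mine — adapt it into `generate_code` as you see fit:

```python
import re


def strip_code_fences(text: str) -> str:
    """Remove markdown code fences, with or without a language tag."""
    text = text.strip()
    # Well-formed case: opening fence (with optional language tag), body, closing fence.
    match = re.match(r"^```[\w+-]*\s*\n(.*?)\n?```$", text, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fallback: peel off stray fences at either end (partial or mismatched).
    text = re.sub(r"^```[\w+-]*\s*\n?", "", text)
    text = re.sub(r"\n?```$", "", text)
    return text.strip()
```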

3. Missing imports in the output

python
ModuleNotFoundError: No module named 'pandas'

Two separate problems hide here. Either the model left out the import (fix the system prompt) or the package is not in the sandbox. Your sandbox and your dev setup may have different packages.
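One cheap pre-flight check for the second problem is to scan the draft's imports and ask importlib whether each top-level module resolves — valid for the subprocess runner, since it shares the agent's interpreter. A sketch; `find_missing_modules` is a name I made up, and the naive regex only sees top-of-line imports:

```python
import importlib.util
import re


def find_missing_modules(code: str) -> list[str]:
    """Return top-level imported modules that the current interpreter cannot find."""
    modules = set()
    for m in re.finditer(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", code, re.M):
        modules.add(m.group(1))
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)
```

Run it before execution and feed any hits back to the writer as an error, so the model can switch to packages that actually exist in the sandbox.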

4. Stale state between calls

If you reuse the compiled graph, pass a fresh state every time. Leftover retry_count or generated_code from a past run will confuse the agent. Reset every field on each new request.
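A small factory keeps this from biting you — every call gets a brand-new dict. The helper and its name `make_initial_state` are mine; pass it your HumanMessage list:

```python
def make_initial_state(messages: list, max_retries: int = 3) -> dict:
    """Return a fresh state dict so no fields leak between invoke() calls."""
    return {
        "messages": messages,
        "generated_code": "",
        "execution_result": "",
        "execution_succeeded": False,
        "retry_count": 0,
        "max_retries": max_retries,
    }
```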

Complete Code

The full script (copy-paste and run):
python
# Complete code from: Building a Code Generation and Execution Agent
# Requires: pip install langgraph langchain-openai langchain-core
# Python 3.10+
# Set OPENAI_API_KEY environment variable before running

import os
import subprocess
import sys
import tempfile
from typing import Annotated, TypedDict

from langchain_openai import ChatOpenAI
from langchain_core.messages import (
    HumanMessage,
    AIMessage,
    SystemMessage,
)
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


# --- State Definition ---

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    generated_code: str
    execution_result: str
    execution_succeeded: bool
    retry_count: int
    max_retries: int


# --- Safe Code Executor ---

def execute_code_safely(code: str, timeout: int = 30) -> dict:
    """Execute Python code in a subprocess with a timeout."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", delete=False
    ) as f:
        f.write(code)
        temp_path = f.name

    try:
        result = subprocess.run(
            # sys.executable runs the same interpreter (and venv) as the agent
            [sys.executable, temp_path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode == 0:
            return {
                "success": True,
                "output": result.stdout,
                "error": "",
            }
        else:
            return {
                "success": False,
                "output": result.stdout,
                "error": result.stderr,
            }
    except subprocess.TimeoutExpired:
        return {
            "success": False,
            "output": "",
            "error": f"Code timed out after {timeout} seconds.",
        }
    finally:
        os.unlink(temp_path)


# --- Code Generation Node ---

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

SYSTEM_PROMPT = """You are a Python code generation assistant.
1. Write complete, executable Python that solves the request.
2. Always include print() statements to show results.
3. Include all necessary imports at the top.
4. Handle potential errors with try/except where needed.
5. Output ONLY Python code — no explanations, no markdown.

If you receive an error from a previous attempt:
- Read the error carefully.
- Fix the specific issue.
- Do NOT rewrite everything unless necessary.
- Output the corrected code."""


def generate_code(state: AgentState) -> dict:
    """Generate Python code based on the user request."""
    messages = state["messages"]
    retry_count = state.get("retry_count", 0)

    prompt_messages = [SystemMessage(content=SYSTEM_PROMPT)]

    if retry_count > 0 and state.get("generated_code"):
        error_context = (
            f"\n\nYour previous code:\n"
            f"```python\n{state['generated_code']}\n```\n\n"
            f"Error encountered:\n{state['execution_result']}\n\n"
            f"Fix the code. Attempt {retry_count + 1} of "
            f"{state['max_retries']}."
        )
        prompt_messages.extend(messages)
        prompt_messages.append(
            HumanMessage(content=error_context)
        )
    else:
        prompt_messages.extend(messages)

    response = model.invoke(prompt_messages)
    generated_code = response.content.strip()

    if generated_code.startswith("```python"):
        generated_code = generated_code[9:]
    if generated_code.startswith("```"):
        generated_code = generated_code[3:]
    if generated_code.endswith("```"):
        generated_code = generated_code[:-3]
    generated_code = generated_code.strip()

    return {
        "messages": [
            AIMessage(
                content=f"Generated code (attempt "
                f"{retry_count + 1}):\n```python\n"
                f"{generated_code}\n```"
            )
        ],
        "generated_code": generated_code,
    }


# --- Execution Node ---

def execute_code(state: AgentState) -> dict:
    """Execute the generated code and capture the result."""
    code = state["generated_code"]
    result = execute_code_safely(code)

    if result["success"]:
        output_text = (
            result["output"] if result["output"] else "(No output)"
        )
        return {
            "messages": [
                AIMessage(
                    content=f"Execution successful.\n"
                    f"Output:\n{output_text}"
                )
            ],
            "execution_result": output_text,
            "execution_succeeded": True,
        }
    else:
        error_text = result["error"]
        return {
            "messages": [
                AIMessage(
                    content=f"Execution failed.\n"
                    f"Error:\n{error_text}"
                )
            ],
            "execution_result": error_text,
            "execution_succeeded": False,
            "retry_count": state.get("retry_count", 0) + 1,
        }


# --- Evaluation Node ---

def evaluate_result(state: AgentState) -> dict:
    """Check if the output answers the user's question."""
    user_request = ""
    for msg in state["messages"]:
        if isinstance(msg, HumanMessage):
            user_request = msg.content
            break

    eval_prompt = (
        f"The user asked: '{user_request}'\n\n"
        f"The code produced this output:\n"
        f"{state['execution_result']}\n\n"
        f"Does this output correctly and completely answer "
        f"the user's request?\n"
        f"Reply with exactly 'YES' or 'NO: <reason>'."
    )

    response = model.invoke(
        [
            SystemMessage(
                content="You evaluate code execution results. "
                "Be strict but fair."
            ),
            HumanMessage(content=eval_prompt),
        ]
    )

    evaluation = response.content.strip()
    is_correct = evaluation.upper().startswith("YES")

    if is_correct:
        return {
            "messages": [
                AIMessage(
                    content=f"Result verified.\n\n"
                    f"Final answer:\n"
                    f"{state['execution_result']}"
                )
            ],
            "execution_succeeded": True,
        }
    else:
        return {
            "messages": [
                AIMessage(
                    content=f"Output doesn't match the "
                    f"request. Reason: {evaluation}"
                )
            ],
            "execution_result": (
                f"Code ran but output was wrong. "
                f"Evaluation: {evaluation}"
            ),
            "execution_succeeded": False,
            "retry_count": state.get("retry_count", 0) + 1,
        }


# --- Routing Functions ---

def route_after_execution(state: AgentState) -> str:
    """Decide what happens after code execution."""
    if state["execution_succeeded"]:
        return "evaluate"
    elif state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "end"
    else:
        return "retry"


def route_after_evaluation(state: AgentState) -> str:
    """Decide what happens after result evaluation."""
    if state["execution_succeeded"]:
        return "end"
    elif state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "end"
    else:
        return "retry"


# --- Graph Assembly ---

def build_code_agent():
    """Build and compile the code generation agent graph."""
    graph = StateGraph(AgentState)

    graph.add_node("generate", generate_code)
    graph.add_node("execute", execute_code)
    graph.add_node("evaluate", evaluate_result)

    graph.add_edge(START, "generate")
    graph.add_edge("generate", "execute")

    graph.add_conditional_edges(
        "execute",
        route_after_execution,
        {
            "evaluate": "evaluate",
            "retry": "generate",
            "end": END,
        },
    )

    graph.add_conditional_edges(
        "evaluate",
        route_after_evaluation,
        {
            "end": END,
            "retry": "generate",
        },
    )

    return graph.compile()


# --- Run the Agent ---

if __name__ == "__main__":
    agent = build_code_agent()

    result = agent.invoke(
        {
            "messages": [
                HumanMessage(
                    content="Write a Python script that generates "
                    "a list of 10 random numbers between 1 and "
                    "100, sorts them, and prints the sorted list "
                    "along with the average."
                )
            ],
            "generated_code": "",
            "execution_result": "",
            "execution_succeeded": False,
            "retry_count": 0,
            "max_retries": 3,
        }
    )

    for msg in result["messages"]:
        print(f"\n{'='*50}")
        print(f"[{msg.__class__.__name__}]")
        print(msg.content)

    print(f"\nRetries used: {result['retry_count']}")
    print(f"Succeeded: {result['execution_succeeded']}")

Exercise: Add a Code Safety Node

Here is a challenge that tests how well you understand the graph layout. Add a safety gate between generate and execute — a node that blocks risky code before it runs.

Your task: Create a check_safety node that:

  1. Reads state["generated_code"]
  2. Scans for blocked patterns (from the guard rails section)
  3. If risky, returns an error and bumps retry_count
  4. If safe, lets the code through to the runner

You will also need to rewire the graph: generate feeds into check_safety, and check_safety uses conditional edges to route to execute or back to generate.

Hints

**Hint 1:** The safety node needs its own routing function. Model it after `route_after_execution` — check a boolean flag.

**Hint 2:** Add a `code_is_safe` boolean field to `AgentState`. The safety node sets it. The conditional edge reads it.

Solution
python
# Add to AgentState:
# code_is_safe: bool

FORBIDDEN_PATTERNS = [
    "os.remove", "os.rmdir", "shutil.rmtree",
    "subprocess.call", "subprocess.run",
    "os.system", "__import__",
    "eval(", "exec(",
]


def check_safety(state: AgentState) -> dict:
    """Scan generated code for dangerous patterns."""
    code = state["generated_code"]
    for pattern in FORBIDDEN_PATTERNS:
        if pattern in code:
            return {
                "messages": [
                    AIMessage(
                        content=f"Safety check failed: "
                        f"code contains '{pattern}'. "
                        f"Regenerating..."
                    )
                ],
                "execution_result": (
                    f"Code blocked: contains '{pattern}'"
                ),
                "execution_succeeded": False,
                "code_is_safe": False,
                "retry_count": state.get("retry_count", 0) + 1,
            }
    return {"code_is_safe": True}


def route_after_safety(state: AgentState) -> str:
    if state.get("code_is_safe", False):
        return "execute"
    elif state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "end"
    else:
        return "retry"


# Updated graph wiring:
# graph.add_edge("generate", "check_safety")
# graph.add_conditional_edges(
#     "check_safety", route_after_safety,
#     {"execute": "execute", "retry": "generate", "end": END}
# )

The key change: `generate` no longer connects to `execute` directly. The safety node sits in between as a gatekeeper. If code passes, it moves to the runner. If it fails, the agent loops back with the violation as context — and the LLM rewrites to avoid the blocked pattern.

Exercise: Support Multi-Step Tasks

This exercise pushes you to extend what the agent can do. Some tasks need several code runs in a row — for example: “Create a CSV with sample data, then read it and find the statistics.”

Your task: Change the agent so it handles multi-step requests. The agent should:

  1. Break the request into ordered steps
  2. Write and run code for each step
  3. Check each step before moving on
  4. Return the final result after all steps are done

Hints

**Hint 1:** Add `task_steps: list[str]` and `current_step: int` to the state. Create a planning node that splits the request into steps.

**Hint 2:** After each good check, see if `current_step < len(task_steps) - 1`. If more steps remain, bump the counter and route back to `generate`.

Solution
python
# Add to AgentState:
# task_steps: list[str]
# current_step: int

def plan_task(state: AgentState) -> dict:
    """Break the user request into sequential steps."""
    user_request = state["messages"][-1].content

    plan_prompt = (
        f"Break this task into sequential Python steps:\n"
        f"'{user_request}'\n\n"
        f"Return each step on a new line, numbered. "
        f"Each step must be independently executable."
    )

    response = model.invoke(
        [SystemMessage(content="You are a task planner."),
         HumanMessage(content=plan_prompt)]
    )

    steps = [
        line.strip()
        for line in response.content.strip().split("\n")
        if line.strip() and line.strip()[0].isdigit()
    ]

    return {
        "task_steps": steps,
        "current_step": 0,
        "messages": [
            AIMessage(
                content=f"Plan: {len(steps)} steps identified."
            )
        ],
    }


# Modified graph: START → plan → generate → execute → evaluate
# After evaluation, check current_step vs len(task_steps).
# If more steps remain, increment and route to generate.
# If all done, route to END.

The planning node uses a separate LLM call to break the request apart. Each step gets its own write-run-check cycle. The `current_step` counter tracks where we are, and the routing logic decides whether to keep going or wrap up.
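That step-advancing route can be sketched as a small function, building on the hints above. The route-name strings ("next_step", "retry", "end") are illustrative and must match whatever mapping you pass to add_conditional_edges:

```python
# Sketch: routing after evaluation in the multi-step variant.
# Assumes AgentState now carries task_steps and current_step.
def route_after_step(state: dict) -> str:
    """Decide whether to retry, advance to the next step, or finish."""
    if not state["execution_succeeded"]:
        return "retry"
    if state["current_step"] < len(state["task_steps"]) - 1:
        return "next_step"  # more steps remain; bump current_step and regenerate
    return "end"            # every step verified; wrap up
```

The "next_step" branch would point at a tiny node that increments current_step before looping back to generate.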

Summary

That is a wrap. You now have a working code agent that writes Python, runs it safely, verifies the output, and loops back when something goes wrong — all laid out as an explicit LangGraph state machine.

Here is what we walked through:

  • State design — a TypedDict that carries code, results, flags, and counters
  • Safe execution — a subprocess sandbox with a timeout
  • Code writing — LLM prompts enriched with error context on retries
  • Output verification — a second LLM call that checks whether results match the request
  • Conditional routing — LangGraph edges that build the retry loop
  • Guard rails — blocklist scanning, spend tracking, and script length limits
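For quick reference, the state those bullets describe boils down to the TypedDict below. This is a recap sketch: the full version earlier in the post attaches langgraph's add_messages reducer to messages, while operator.add stands in here so the snippet has no external dependencies.

```python
from operator import add  # stand-in reducer; the post uses langgraph's add_messages
from typing import Annotated, TypedDict

class AgentState(TypedDict):
    messages: Annotated[list, add]  # conversation history, append-only
    generated_code: str             # latest code the LLM wrote
    execution_result: str           # stdout or error text from the sandbox
    execution_succeeded: bool       # did the last run (and check) pass?
    retry_count: int                # attempts used so far
    max_retries: int                # retry cap before giving up
```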

What we built is a foundation. Layer on tighter sandboxing (Docker, E2B), multi-step planning, long-term memory, or domain-tuned prompts to fit your own project.

FAQ

Can this agent handle tasks that need outside libraries?

Yes, as long as those libraries exist in the run environment. The agent writes imports — if pandas is in the sandbox, the code works. Otherwise, the agent sees ModuleNotFoundError and may rewrite to skip that library.

python
# Verify a library is available before running agent tasks
import importlib
try:
    importlib.import_module("pandas")
    print("pandas is available")
except ImportError:
    print("pandas is NOT installed")

How do I switch to a cloud sandbox?

Swap execute_code_safely with a call to your sandbox provider. E2B, Modal, and LangChain Sandbox all offer an execute() method that takes code and gives back output. The rest of the graph stays the same — only the backend changes.
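As a sketch of that swap, here is an adapter that keeps the execute_code_safely return shape but hands the code to a sandbox client. SandboxClient is a stand-in for your provider's SDK — the real class, method names, and auth will differ, so treat this as a shape to fill in, not working E2B or Modal code.

```python
class SandboxClient:
    """Stand-in for a cloud sandbox SDK (illustration only)."""

    def run(self, code: str) -> tuple[str, str]:
        # A real client would ship `code` to an isolated VM or container
        # and stream back stdout/stderr. Here we simulate a clean run.
        return ("(simulated output)\n", "")


def execute_code_remotely(code: str) -> dict:
    """Drop-in replacement for execute_code_safely with a remote backend."""
    stdout, stderr = SandboxClient().run(code)
    if stderr:
        return {"success": False, "output": "", "error": stderr}
    return {"success": True, "output": stdout, "error": ""}
```

Because the execute_code node only reads result["success"], result["output"], and result["error"], nothing else in the graph has to change.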

What does each request cost?

Each write call costs about $0.002-0.005 with GPT-4o-mini. The checker adds $0.001-0.002. A first-try success costs around $0.004 total. Three retries run about $0.015. The checker is the first thing to cut at scale — skip it for tasks with outputs that are easy to verify on their own.
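Those figures turn into a quick back-of-envelope estimator. The per-call prices below are illustrative midpoints of the ranges above, not billing data:

```python
WRITE_COST = 0.003  # rough midpoint per code-writing call (GPT-4o-mini)
CHECK_COST = 0.001  # rough midpoint per checker call


def estimate_cost(retries: int, use_checker: bool = True) -> float:
    """Estimate total LLM spend for one request with the given retries."""
    attempts = retries + 1
    checks = attempts if use_checker else 0
    return attempts * WRITE_COST + checks * CHECK_COST
```

estimate_cost(0) lands near the $0.004 first-try figure, and setting use_checker=False drops a first-try run to $0.003 — the saving the last sentence points at.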

Can I use a local model instead of OpenAI?

Swap ChatOpenAI for any LangChain-compatible chat model. Ollama, vLLM, and Anthropic all plug in. Code quality depends on the model — GPT-4o and Claude write solid code, while smaller models may need more retries.

Topic Cluster: LangGraph Agent Patterns

This article is part of the LangGraph series on MachineLearningPlus. Related articles:

  1. What Is LangGraph and Why Does It Exist?
  2. LangGraph Installation, Setup, and First Graph
  3. LangGraph State Management — TypedDict and Reducers
  4. LangGraph Conditional Edges and Routing
  5. Build a ReAct Agent from Scratch with LangGraph
  6. LangGraph Tool Calling and Agent Actions
  7. LangGraph Error Handling, Retries, and Fallbacks
  8. LangGraph Multi-Agent Systems — Supervisor, Swarm, Network
  9. LangGraph RAG Agent — Retrieval-Augmented Generation
  10. LangGraph Persistence and Checkpointing

References

  1. LangGraph documentation — Graph API overview. Link
  2. LangGraph documentation — StateGraph and conditional edges. Link
  3. LangChain documentation — Sandboxes for Deep Agents. Link
  4. langchain-sandbox — PyPI package for safe Python execution. Link
  5. E2B documentation — Code Interpreter with LangGraph. Link
  6. LangChain blog — Execute Code with Sandboxes. Link
  7. Modal documentation — Build a coding agent with LangGraph. Link
  8. Yao, S. et al. — “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Link