
Build an Autonomous Web Scraping Pipeline with LangGraph

Written by Selva Prabhakaran | 27 min read


You need data from a website. Not once — regularly. You write a scraper, it works for a week, then the site changes its layout and everything breaks. You patch CSS selectors, add error handling, bolt on pagination logic. Before long, you’ve got spaghetti code held together with try-except blocks. What if an AI agent could handle all of that — deciding what to scrape, recovering from failures, and summarizing what it found?

That’s what we’re building. A LangGraph pipeline where an LLM drives the entire scraping workflow autonomously.

Before we write any code, here’s how the pieces connect. The pipeline starts with a URL and a goal — something like “extract all product listings from this category page.” The first stage fetches the raw HTML. If the page fails to load, the agent retries or adjusts its approach.

Once HTML arrives, a parsing stage extracts structured data — product names, prices, ratings — based on what the LLM finds in the page structure. Then the agent checks: are there more pages? If pagination exists, it loops back to fetch the next page.

When all pages are scraped, the data flows into an analysis stage. The LLM summarizes trends, computes statistics, and produces a final report. Each stage feeds directly into the next through LangGraph’s state, and the whole thing runs as a single graph invocation.

The Pipeline Architecture — Five Stages, One Loop

The pipeline has five logical stages connected by conditional edges. Four of them become graph nodes; the Accumulate stage is handled by a state reducer rather than a standalone node. Each stage handles one responsibility:

| Stage | Purpose | Input | Output |
| --- | --- | --- | --- |
| Fetch | Download HTML, handle HTTP errors | URL from state | Raw HTML or error status |
| Parse | Extract structured data via LLM | Raw HTML + goal | List of JSON records |
| Pagination Check | Find “next page” links | Raw HTML | Next URL or “done” signal |
| Accumulate | Merge new data with existing (via reducer) | New + existing records | Combined dataset |
| Analyze | Produce statistical summary | All collected records | Report string |

The conditional edge after the pagination check creates the loop. If more pages exist, flow returns to Fetch. Otherwise, it moves to Analyze. This is the core pattern — and it’s surprisingly simple to implement.

Prerequisites

  • Python version: 3.10+
  • Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+), requests (2.31+), beautifulsoup4 (4.12+)
  • Install: pip install langgraph langchain-openai langchain-core requests beautifulsoup4
  • API Key: OpenAI API key (set as OPENAI_API_KEY environment variable). Create one at platform.openai.com/api-keys.
  • Prior knowledge: Basic LangGraph concepts — nodes, edges, state. Familiarity with Python’s requests library.
  • Time to complete: 35-40 minutes
python
import os
import json
import time
import requests
from bs4 import BeautifulSoup
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

The imports split into three groups: standard library, third-party scraping tools, and LangGraph/LangChain components. We’re using gpt-4o-mini because it’s fast, cheap, and more than capable for HTML analysis.

Define the Pipeline State

Every LangGraph graph needs a state schema — a shared data structure that nodes read from and write to. Think of it as the pipeline’s memory.

Our state tracks the scraping URL, the extraction goal, raw HTML content, accumulated data records, pagination counters, and the final analysis report. Here’s the full schema with an important detail to watch for in the Annotated types.

python
def merge_lists(existing: list, new: list) -> list:
    """Reducer that appends new items to existing list."""
    return existing + new

class ScraperState(TypedDict):
    url: str
    goal: str
    raw_html: str
    extracted_data: Annotated[list[dict], merge_lists]
    current_page: int
    max_pages: int
    next_page_url: str
    analysis_report: str
    error_log: Annotated[list[str], merge_lists]
    retry_count: int
    status: str

See the Annotated type on extracted_data and error_log? The merge_lists reducer tells LangGraph how to combine state updates. Without it, returning {"extracted_data": new_records} would replace the old list entirely. With the reducer, it appends. That’s critical for pagination — without it, each page’s data would overwrite the previous page’s data.

Key Insight: **LangGraph reducers control how state updates merge.** Choosing the right reducer is the difference between a pipeline that accumulates data across pages and one that silently loses everything except the last page.
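The behavior is easy to verify in isolation, because a reducer is just a plain function that LangGraph calls with the old and new values. A minimal sketch outside any graph, using illustrative record data:

```python
def merge_lists(existing: list, new: list) -> list:
    """Reducer that appends new items to existing list."""
    return existing + new

page1 = [{"title": "Energy engineer"}]
page2 = [{"title": "Legal executive"}]

# Without a reducer, a node's return value replaces the field:
replaced = page2

# With merge_lists, LangGraph calls the reducer to combine old and new:
merged = merge_lists(page1, page2)

print(len(replaced))  # 1: page 1's data is gone
print(len(merged))    # 2: both pages survive
```

This is exactly the failure mode the reducer prevents during pagination: every loop iteration would silently discard the previous page's records.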

Build the Fetch Node — Downloading Pages with Retries

What happens when you request a webpage and the server returns a 503? Or the connection times out? The fetch node handles all of that.

It reads the current URL from state, sends a GET request with a browser-like user-agent header, and stores the HTML. On failure, it logs the error and increments a retry counter. The status field tells downstream nodes whether the fetch worked.

python
def fetch_page(state: ScraperState) -> dict:
    """Fetch HTML content from the current URL."""
    current_page = state.get("current_page", 1)
    if current_page > 1:
        time.sleep(2)  # Polite delay between requests

    url = state.get("next_page_url") or state["url"]
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        return {
            "raw_html": response.text,
            "status": "fetched",
            "retry_count": 0,
        }
    except requests.RequestException as error:
        return {
            "status": "fetch_error",
            "error_log": [f"Fetch failed for {url}: {str(error)}"],
            "retry_count": state.get("retry_count", 0) + 1,
        }

Two design choices worth noting. The user-agent header mimics Chrome — without it, many sites block requests from scripts outright. The retry counter resets to zero on success because we only care about consecutive failures, not lifetime totals.

The 2-second delay before non-first-page fetches is a courtesy. Hammering a server with rapid requests is a fast way to get your IP banned.

Warning: **Always set a `timeout` on `requests.get()`.** Without it, a hanging server freezes your entire pipeline indefinitely. Fifteen seconds works for most pages. Production scrapers often use 10.

Build the Parse Node — LLM-Driven Data Extraction

Here’s where this pipeline diverges from traditional scrapers. Instead of hardcoded CSS selectors that break when a site redesigns, the LLM reads the page content and extracts data based on your goal description.

The parse node strips noise from the HTML (scripts, styles, navigation), converts the rest to plain text, and sends it to the LLM with extraction instructions. The LLM returns a JSON array of structured records. We truncate the text to 12,000 characters to stay within token limits — most useful content sits in the first third of a page anyway.

python
def parse_content(state: ScraperState) -> dict:
    """Use LLM to extract structured data from HTML."""
    html = state["raw_html"]
    goal = state["goal"]

    soup = BeautifulSoup(html, "html.parser")

    # Strip noise: scripts, styles, navigation, footer
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    clean_text = soup.get_text(separator="\n", strip=True)
    truncated = clean_text[:12000]

    prompt = f"""Extract structured data from this webpage.

Goal: {goal}

Return a JSON array of objects with consistent keys.
If no relevant data exists, return an empty array [].
Return ONLY valid JSON — no markdown, no explanation.

Webpage content:
{truncated}"""

    response = llm.invoke([
        SystemMessage(content="You are a data extraction specialist."),
        HumanMessage(content=prompt),
    ])

    try:
        records = json.loads(response.content)
        if not isinstance(records, list):
            records = [records]
    except json.JSONDecodeError:
        return {
            "extracted_data": [],
            "error_log": ["LLM returned invalid JSON during parsing"],
            "status": "parse_error",
        }

    return {"extracted_data": records, "status": "parsed"}

Why strip <script>, <style>, <nav>, and <footer> tags? A typical webpage is 50-100KB of HTML, but actual content might be 5KB. JavaScript, CSS rules, and repeated navigation links eat tokens without contributing useful data. Removing them cuts costs by 70-80% and improves extraction accuracy.
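You can measure the reduction yourself. A minimal sketch using a small synthetic page (the absolute sizes on a real page are much larger, but the ratio is similar):

```python
from bs4 import BeautifulSoup

# A tiny synthetic page: one product line buried in boilerplate
html = (
    "<html><head><style>body { margin: 0; }" + " .x { color: red; }" * 50
    + "</style></head><body>"
    + "<nav>" + "<a href='/c'>Category</a>" * 30 + "</nav>"
    + "<div class='product'>Widget - $19.99 - 4.5 stars</div>"
    + "<script>var analytics = [" + "'x'," * 100 + "'x'];</script>"
    + "<footer>Example Corp</footer></body></html>"
)

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()  # remove the tag and everything inside it
clean_text = soup.get_text(separator="\n", strip=True)

print(f"Raw HTML:   {len(html)} chars")
print(f"Clean text: {len(clean_text)} chars")
```

On this toy page the useful content survives while the scripts, styles, and navigation disappear entirely; the token savings scale the same way on real pages.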

| Approach | Selector Maintenance | Adapts to Layout Changes | Cost per Page |
| --- | --- | --- | --- |
| Hardcoded CSS selectors | Manual — breaks on redesign | No | $0 |
| LLM-driven extraction | Zero — LLM adapts | Yes | ~$0.002 |
| Visual scraping tools | GUI re-training needed | Partially | Varies |

The tradeoff is clear. You’re paying a fraction of a cent per page to eliminate the maintenance burden that makes traditional scrapers a headache.

Tip: **Trim HTML aggressively before sending it to the LLM.** Stripping boilerplate elements improves both cost efficiency and extraction quality because the model sees a higher signal-to-noise ratio.

Build the Pagination Node — Finding the Next Page

Ever noticed how many scraping tutorials skip pagination entirely? In practice, most data you’d want lives across multiple pages. This node solves that.

Instead of hardcoding a pagination URL pattern (which breaks the moment the site changes its URL scheme), we pre-filter anchor tags for pagination keywords and let the LLM pick the right one. The max_pages guard prevents infinite loops.

python
def check_pagination(state: ScraperState) -> dict:
    """Determine if there are more pages to scrape."""
    current = state.get("current_page", 1)
    max_pages = state.get("max_pages", 5)

    if current >= max_pages:
        return {"status": "all_pages_scraped"}

    soup = BeautifulSoup(state["raw_html"], "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        link_text = a_tag.get_text(strip=True).lower()
        href = a_tag["href"]
        if any(kw in link_text for kw in [
            "next", ">>", "\u203a", "older"
        ]):
            links.append(f"{link_text}: {href}")

    if not links:
        return {"status": "all_pages_scraped"}

    prompt = f"""Which link leads to the next page of results?

Links found:
{chr(10).join(links)}

Current URL: {state.get('next_page_url') or state['url']}

Return ONLY the full absolute URL. If no next page exists, return NONE.
If the href is relative, combine it with the base URL."""

    response = llm.invoke([HumanMessage(content=prompt)])
    answer = response.content.strip()

    if answer == "NONE" or not answer.startswith("http"):
        return {"status": "all_pages_scraped"}

    return {
        "next_page_url": answer,
        "current_page": current + 1,
        "status": "has_next_page",
    }

We pre-filter links before sending them to the LLM. A typical webpage has 50-200 anchor tags. Sending all of them wastes tokens and muddies the decision. Filtering for keywords like “next” and “>>” narrows the candidates to 1-3 links — a much easier choice for the model.
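One way to lighten the LLM’s job further: resolve relative hrefs deterministically with the standard library, before (or instead of) asking the model to combine them with the base URL. A sketch using `urllib.parse.urljoin`; the URLs are illustrative:

```python
from urllib.parse import urljoin

base_url = "https://example-store.com/electronics?page=2"

# hrefs as they might appear in pagination anchor tags
candidates = [
    "/electronics?page=3",                          # root-relative
    "?page=3",                                      # query-only
    "https://example-store.com/electronics?page=3", # already absolute
]

for href in candidates:
    # urljoin handles all three forms correctly
    print(urljoin(base_url, href))
```

All three candidates resolve to the same absolute URL, so the LLM only has to pick the right link text, not do URL arithmetic.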

Build the Analysis Node — From Raw Data to Insights

Scraping without analysis is just data hoarding. What patterns hide in the data? What’s the distribution? Are there outliers? The analysis node answers these questions.

It sends the first 50 records to the LLM (staying within token limits) and asks for a structured report covering totals, statistics, patterns, and data quality. For a production system, I’d compute the statistics in Python and send only the summary — LLMs are great at interpretation but occasionally miscalculate arithmetic.

python
def analyze_data(state: ScraperState) -> dict:
    """Generate analytical summary from all scraped data."""
    data = state.get("extracted_data", [])

    if not data:
        return {
            "analysis_report": "No data was extracted.",
            "status": "complete",
        }

    data_str = json.dumps(data[:50], indent=2)

    prompt = f"""Analyze this scraped dataset and produce a report.

Total records collected: {len(data)}
Sample data (first 50 records):
{data_str}

Include these sections:
1. **Summary**: Total records, fields present, data completeness
2. **Key Statistics**: Counts, averages, ranges where applicable
3. **Notable Patterns**: Trends, outliers, interesting findings
4. **Data Quality Notes**: Missing fields, inconsistencies

Be specific. Use actual numbers from the data."""

    response = llm.invoke([
        SystemMessage(content="You are a data analyst. Be concise and specific."),
        HumanMessage(content=prompt),
    ])

    return {
        "analysis_report": response.content,
        "status": "complete",
    }

Key Insight: **Use Python for computation and the LLM for interpretation.** Don’t ask GPT to average 500 prices — compute it in Python, then ask GPT what the number means. Faster, cheaper, and no arithmetic errors.
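Following that advice, a production version might precompute the numbers and hand only the summary to the LLM. A minimal sketch, assuming the records carry a numeric `price` field (the job-listing demo has no such field; this is illustrative):

```python
from statistics import mean, median

def summarize_numeric(data: list[dict], field: str) -> dict:
    """Compute stats in Python; let the LLM interpret them."""
    values = [r[field] for r in data if isinstance(r.get(field), (int, float))]
    if not values:
        return {"field": field, "count": 0}
    return {
        "field": field,
        "count": len(values),
        "mean": round(mean(values), 2),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }

records = [{"price": 19.99}, {"price": 24.99}, {"price": 99.0}, {"price": None}]
print(summarize_numeric(records, "price"))
```

The resulting dictionary is a few dozen tokens instead of thousands, and the numbers are exact. The LLM then gets prompts like “the median price is 24.99 — what does that suggest?” rather than a wall of raw JSON.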

Wire the Graph — Routing and Conditional Edges

This is the part that makes LangGraph click. We connect the nodes with edges, and two conditional edges create the branching logic: one for error recovery after fetch, one for the pagination loop.

The routing functions are deliberately simple. Each checks one state field and returns a string that maps to the next node.

python
def route_after_fetch(state: ScraperState) -> str:
    """Decide next step after fetching a page."""
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"

def route_after_pagination(state: ScraperState) -> str:
    """Continue scraping or move to analysis."""
    if state["status"] == "has_next_page":
        return "fetch_next"
    return "analyze"

Three scenarios for route_after_fetch: success goes to parsing, recoverable failure loops back to fetch, exhausted retries skip to analysis with whatever data we have. No nested conditions, no complex logic.

Here’s the full graph assembly. Watch how add_conditional_edges takes a routing function and an explicit mapping dictionary — every possible path is visible in the code.

python
graph = StateGraph(ScraperState)

# Add all nodes
graph.add_node("fetch", fetch_page)
graph.add_node("parse", parse_content)
graph.add_node("check_pagination", check_pagination)
graph.add_node("analyze", analyze_data)

# Entry point
graph.add_edge(START, "fetch")

# Conditional: parse on success, retry on failure, analyze on exhaustion
graph.add_conditional_edges(
    "fetch",
    route_after_fetch,
    {
        "parse": "parse",
        "retry_fetch": "fetch",
        "analyze": "analyze",
    },
)

# After parsing, always check for more pages
graph.add_edge("parse", "check_pagination")

# Conditional: fetch next page or finalize
graph.add_conditional_edges(
    "check_pagination",
    route_after_pagination,
    {
        "fetch_next": "fetch",
        "analyze": "analyze",
    },
)

# Analysis is the terminal node
graph.add_edge("analyze", END)

scraper_agent = graph.compile()

That routing map — the third argument to add_conditional_edges — is what makes the graph self-documenting. Anyone reading this code can trace every possible execution path without running it. I find this far more readable than deeply nested if-else chains.

Run the Pipeline — A Complete Example

Time to see it work. We’ll scrape job listings from Real Python’s fake jobs page — a static demo site designed for scraping practice. No rate limits, no terms-of-service concerns.

The initial state sets the target URL, an extraction goal describing what fields to capture, and a 3-page limit to keep the demo fast.

python
initial_state = {
    "url": "https://realpython.github.io/fake-jobs/",
    "goal": (
        "Extract all job listings. For each job, get: "
        "title, company, location, and posting date."
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 3,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}

result = scraper_agent.invoke(initial_state)

After the pipeline finishes, inspect the results. The extracted_data list contains structured dictionaries, and analysis_report holds the LLM’s summary.

python
print(f"Records extracted: {len(result['extracted_data'])}")
print(f"\nFirst 3 records:")
for record in result["extracted_data"][:3]:
    print(json.dumps(record, indent=2))

print(f"\n{'='*50}")
print("ANALYSIS REPORT")
print(f"{'='*50}")
print(result["analysis_report"])

if result["error_log"]:
    print(f"\nErrors: {result['error_log']}")

Your output will show structured job data and an analytical summary. The exact records depend on what the LLM extracts, but the structure looks like this:

python
Records extracted: 100

First 3 records:
{
  "title": "Energy engineer",
  "company": "Vasquez-Davidson",
  "location": "Christopherport, AA",
  "posting_date": "2021-04-08"
}
...
# OUTPUT — Record count and specific fields vary based on LLM extraction.
# The demo site has 100 fake job listings on a single page.

Note: **The site `realpython.github.io/fake-jobs/` is a static demo page with 100 fake listings on one page — no actual pagination.** The pipeline handles this correctly by finding no “next” links and proceeding to analysis. To test pagination, point the pipeline at a paginated site.

Add Error Recovery — Making It Production-Ready

The basic pipeline handles happy paths. But what about a 429 rate-limit response? A timeout on a slow server? A page that returns garbled HTML?

A dedicated error handler node classifies failures and sets a status the routing function can act on. The handler diagnoses. The router decides. The fetch node acts. Clean separation.

python
def handle_error(state: ScraperState) -> dict:
    """Classify errors and set recovery strategy."""
    errors = state.get("error_log", [])
    last_error = errors[-1] if errors else "Unknown error"

    if "429" in last_error or "rate" in last_error.lower():
        return {
            "status": "rate_limited",
            "error_log": ["Rate limited — backing off before retry"],
        }
    if "timeout" in last_error.lower():
        return {
            "status": "timeout_retry",
            "error_log": ["Timeout — retrying with longer wait"],
        }
    return {
        "status": "unrecoverable",
        "error_log": [f"Giving up after error: {last_error}"],
    }

In a production deployment, you’d also want exponential backoff for rate limits, proxy rotation for IP bans, and a dead-letter queue for persistent failures. Those are full topics on their own — the node architecture makes adding them straightforward because each concern lives in its own node.
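As one example, exponential backoff slots into the retry path as a small helper. A sketch; the `backoff_delay` function and its parameters are illustrative, not part of the pipeline above:

```python
def backoff_delay(retry_count: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff: 2s, 4s, 8s, ... capped at `cap` seconds."""
    return min(base * (2 ** retry_count), cap)

# Inside a retry-aware fetch node, before the request, you might do:
#   time.sleep(backoff_delay(state.get("retry_count", 0)))

for attempt in range(5):
    print(f"Retry {attempt}: wait {backoff_delay(attempt)}s")
```

Because `retry_count` already lives in state and resets to zero on success, the delay grows only across consecutive failures — exactly the behavior rate-limited servers expect.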

Exercise 1: Add a Data Validation Node

You’ve seen how each node handles one responsibility. Now add a validate_data node that checks extracted records for quality before they reach analysis.

The node should filter out records missing required keys and log how many it removed. If 90% of records get filtered, that signals a problem with the extraction prompt — not the data.

python
# Complete this function

def validate_data(state: ScraperState) -> dict:
    """Validate and clean extracted data."""
    data = state.get("extracted_data", [])
    required_keys = {"title", "company", "location"}

    # TODO: Filter records that contain all required keys
    # TODO: Count how many records were removed
    # TODO: Return cleaned data with appropriate status

    valid_records = []  # Your filtering logic here
    removed_count = 0   # Your count here

    return {
        "extracted_data": valid_records,
        "error_log": [
            f"Validation: kept {len(valid_records)}, "
            f"removed {removed_count} incomplete records"
        ],
        "status": "validated" if valid_records else "no_valid_data",
    }
Hint 1

Use `all()` inside a list comprehension to check if every required key exists: `all(key in record for key in required_keys)`.

Hint 2

Full filtering: `valid_records = [r for r in data if all(k in r for k in required_keys)]`. Then `removed_count = len(data) - len(valid_records)`.

Solution
python
def validate_data(state: ScraperState) -> dict:
    """Validate and clean extracted data."""
    data = state.get("extracted_data", [])
    required_keys = {"title", "company", "location"}

    valid_records = [
        record for record in data
        if all(key in record for key in required_keys)
    ]
    removed_count = len(data) - len(valid_records)

    return {
        "extracted_data": valid_records,
        "error_log": [
            f"Validation: kept {len(valid_records)}, "
            f"removed {removed_count} incomplete records"
        ],
        "status": "validated" if valid_records else "no_valid_data",
    }

**Why this works:** `all()` returns `True` only when every required key exists in the record. Records with missing fields get filtered out. The removed count helps you diagnose extraction issues — high removal rates mean the extraction prompt needs tuning.

Customize for Different Scraping Goals

The same pipeline scrapes any type of data. You don’t change the code — you change the goal string. The LLM adapts its extraction strategy at runtime.

Want product listings instead of jobs?

python
product_state = {
    "url": "https://example-store.com/electronics",
    "goal": (
        "Extract product listings: name, price in USD, "
        "star rating as a float, and availability status"
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 5,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}

Research paper metadata? Same pipeline, different goal:

python
research_state = {
    "url": "https://arxiv.org/list/cs.AI/recent",
    "goal": (
        "Extract paper listings: title, authors, "
        "abstract summary, and submission date"
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 2,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}

Tip: **Be specific in your extraction goal.** Vague goals like “get all data” produce messy, inconsistent JSON. Specific goals like “extract product name, price in USD, and star rating as a float” give the LLM clear targets and produce cleaner output.

Exercise 2: Add Configurable Rate Limiting

Web servers don’t appreciate rapid-fire requests. Modify the fetch logic to use a configurable delay from state instead of the hardcoded 2 seconds.

python
# Add a 'fetch_delay' field to the state and use it

def fetch_page_configurable(state: ScraperState) -> dict:
    """Fetch with configurable delay between pages."""
    current_page = state.get("current_page", 1)
    delay = state.get("fetch_delay", 2)  # Default 2 seconds

    # TODO: Apply delay for non-first pages
    # TODO: Fetch the URL with error handling

    url = state.get("next_page_url") or state["url"]
    pass  # Complete the implementation
Hint 1

Check `current_page > 1` before sleeping. The first page doesn’t need a delay.

Hint 2

Add `if current_page > 1: time.sleep(delay)` before the request. The rest follows the original `fetch_page` pattern.

Solution
python
def fetch_page_configurable(state: ScraperState) -> dict:
    """Fetch with configurable delay between pages."""
    current_page = state.get("current_page", 1)
    delay = state.get("fetch_delay", 2)

    if current_page > 1:
        time.sleep(delay)

    url = state.get("next_page_url") or state["url"]
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        return {
            "raw_html": response.text,
            "status": "fetched",
            "retry_count": 0,
        }
    except requests.RequestException as error:
        return {
            "status": "fetch_error",
            "error_log": [f"Fetch failed: {str(error)}"],
            "retry_count": state.get("retry_count", 0) + 1,
        }

**Why configurable delays matter:** Some sites tolerate 1-second gaps. Others need 5 seconds. Making it a state parameter lets you tune per-target without touching code.

Common Mistakes and How to Fix Them

Mistake 1: No Page Limit on Pagination

Wrong:

python
initial_state = {
    "max_pages": 999,  # Effectively unbounded
}

Why it’s dangerous: Some sites have thousands of pages. Your pipeline runs for hours, burns API credits, and might trigger an IP ban.

Correct:

python
initial_state = {
    "max_pages": 10,  # Start small, increase if needed
}

Mistake 2: Sending Raw HTML to the LLM

Wrong:

python
prompt = f"Extract data from: {state['raw_html']}"

Why it fails: Raw HTML is 80% noise. A product page has 150KB of HTML but 2KB of useful content. You waste tokens and confuse the model.

Correct:

python
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()
clean_text = soup.get_text(separator="\n", strip=True)[:12000]

Mistake 3: No Status Checks in Routing

Wrong:

python
def route_after_fetch(state):
    return "parse"  # Always parse, even on failure

Why it breaks: If the fetch failed, raw_html is either empty or stale from the previous page. The parse node crashes or produces duplicates.

Correct:

python
def route_after_fetch(state):
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"  # Graceful fallback

When NOT to Use This Approach

This pipeline isn’t always the right tool. Here are the scenarios where you should reach for something else:

Use an API instead if the site offers one. APIs return clean JSON — no parsing, no LLM costs. Always check for developer docs or /api/ endpoints first.

Use hardcoded selectors for high-volume scraping. At 100K pages with 12K tokens each, LLM costs reach roughly $180. Traditional CSS selectors cost zero for extraction. If the site structure is stable, selectors are the pragmatic choice.
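That $180 figure is straightforward arithmetic, using gpt-4o-mini’s input pricing of roughly $0.15 per million tokens at the time of writing (check current pricing before relying on it):

```python
pages = 100_000
tokens_per_page = 12_000      # the truncation limit used in parse_content
price_per_million = 0.15      # USD per 1M input tokens, gpt-4o-mini (approximate)

total_tokens = pages * tokens_per_page
cost = total_tokens / 1_000_000 * price_per_million
print(f"{total_tokens:,} tokens -> ${cost:.2f}")  # 1,200,000,000 tokens -> $180.00
```

Output tokens add a little on top, but input dominates at this truncation size.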

Use a deterministic scraper for real-time monitoring. The LLM adds 1-3 seconds latency per page. Sub-second scraping for price tracking needs hardcoded logic.

This pipeline shines when site layouts change frequently, you’re scraping diverse sites with different structures, or you’re prototyping a scraper that needs to work across multiple domains without custom selectors for each one.

Complete Code

The full script, ready to copy-paste and run:
python
# Complete code from: Autonomous Web Scraping Pipeline with LangGraph
# Requires: pip install langgraph langchain-openai langchain-core requests beautifulsoup4
# Python 3.10+
# Set OPENAI_API_KEY environment variable before running

import os
import json
import time
import requests
from bs4 import BeautifulSoup
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# --- Setup ---
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# --- State ---
def merge_lists(existing: list, new: list) -> list:
    return existing + new

class ScraperState(TypedDict):
    url: str
    goal: str
    raw_html: str
    extracted_data: Annotated[list[dict], merge_lists]
    current_page: int
    max_pages: int
    next_page_url: str
    analysis_report: str
    error_log: Annotated[list[str], merge_lists]
    retry_count: int
    status: str

# --- Nodes ---
def fetch_page(state: ScraperState) -> dict:
    current_page = state.get("current_page", 1)
    if current_page > 1:
        time.sleep(2)

    url = state.get("next_page_url") or state["url"]
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        return {
            "raw_html": response.text,
            "status": "fetched",
            "retry_count": 0,
        }
    except requests.RequestException as error:
        return {
            "status": "fetch_error",
            "error_log": [f"Fetch failed for {url}: {str(error)}"],
            "retry_count": state.get("retry_count", 0) + 1,
        }

def parse_content(state: ScraperState) -> dict:
    html = state["raw_html"]
    goal = state["goal"]
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    clean_text = soup.get_text(separator="\n", strip=True)
    truncated = clean_text[:12000]

    prompt = f"""Extract structured data from this webpage.

Goal: {goal}

Return a JSON array of objects with consistent keys.
If no relevant data exists, return an empty array [].
Return ONLY valid JSON — no markdown, no explanation.

Webpage content:
{truncated}"""

    response = llm.invoke([
        SystemMessage(content="You are a data extraction specialist."),
        HumanMessage(content=prompt),
    ])

    try:
        records = json.loads(response.content)
        if not isinstance(records, list):
            records = [records]
    except json.JSONDecodeError:
        return {
            "extracted_data": [],
            "error_log": ["LLM returned invalid JSON during parsing"],
            "status": "parse_error",
        }

    return {"extracted_data": records, "status": "parsed"}

def check_pagination(state: ScraperState) -> dict:
    current = state.get("current_page", 1)
    max_pages = state.get("max_pages", 5)
    if current >= max_pages:
        return {"status": "all_pages_scraped"}

    soup = BeautifulSoup(state["raw_html"], "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        link_text = a_tag.get_text(strip=True).lower()
        href = a_tag["href"]
        if any(kw in link_text for kw in ["next", ">>", "\u203a", "older"]):
            links.append(f"{link_text}: {href}")

    if not links:
        return {"status": "all_pages_scraped"}

    prompt = f"""Which link leads to the next page of results?

Links found:
{chr(10).join(links)}

Current URL: {state.get('next_page_url') or state['url']}

Return ONLY the full absolute URL. If no next page exists, return NONE.
If the href is relative, combine it with the base URL."""

    response = llm.invoke([HumanMessage(content=prompt)])
    answer = response.content.strip()

    if answer == "NONE" or not answer.startswith("http"):
        return {"status": "all_pages_scraped"}

    return {
        "next_page_url": answer,
        "current_page": current + 1,
        "status": "has_next_page",
    }

def analyze_data(state: ScraperState) -> dict:
    data = state.get("extracted_data", [])
    if not data:
        return {"analysis_report": "No data was extracted.", "status": "complete"}

    data_str = json.dumps(data[:50], indent=2)
    prompt = f"""Analyze this scraped dataset and produce a report.

Total records collected: {len(data)}
Sample data (first 50 records):
{data_str}

Include:
1. Summary: total records, fields present, data completeness
2. Key Statistics: counts, averages, ranges where applicable
3. Notable Patterns: trends, outliers, interesting findings
4. Data Quality Notes: missing fields, inconsistencies

Be specific. Use actual numbers from the data."""

    response = llm.invoke([
        SystemMessage(content="You are a data analyst. Be concise and specific."),
        HumanMessage(content=prompt),
    ])

    return {"analysis_report": response.content, "status": "complete"}

# --- Routing ---
def route_after_fetch(state: ScraperState) -> str:
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"

def route_after_pagination(state: ScraperState) -> str:
    if state["status"] == "has_next_page":
        return "fetch_next"
    return "analyze"

# --- Graph Assembly ---
graph = StateGraph(ScraperState)
graph.add_node("fetch", fetch_page)
graph.add_node("parse", parse_content)
graph.add_node("check_pagination", check_pagination)
graph.add_node("analyze", analyze_data)

graph.add_edge(START, "fetch")
graph.add_conditional_edges(
    "fetch",
    route_after_fetch,
    {"parse": "parse", "retry_fetch": "fetch", "analyze": "analyze"},
)
graph.add_edge("parse", "check_pagination")
graph.add_conditional_edges(
    "check_pagination",
    route_after_pagination,
    {"fetch_next": "fetch", "analyze": "analyze"},
)
graph.add_edge("analyze", END)

scraper_agent = graph.compile()

# --- Run ---
if __name__ == "__main__":
    result = scraper_agent.invoke({
        "url": "https://realpython.github.io/fake-jobs/",
        "goal": (
            "Extract all job listings. For each job, get: "
            "title, company, location, and posting date."
        ),
        "extracted_data": [],
        "current_page": 1,
        "max_pages": 3,
        "next_page_url": "",
        "analysis_report": "",
        "error_log": [],
        "retry_count": 0,
        "status": "ready",
    })

    print(f"Records extracted: {len(result['extracted_data'])}")
    for record in result["extracted_data"][:3]:
        print(json.dumps(record, indent=2))
    print(f"\n{'='*50}")
    print("ANALYSIS REPORT")
    print(f"{'='*50}")
    print(result["analysis_report"])
    if result["error_log"]:
        print(f"\nErrors encountered: {result['error_log']}")

Summary

You built an autonomous web scraping pipeline with LangGraph that fetches pages, extracts structured data using an LLM, follows pagination links, recovers from errors, and produces analytical reports.

The four design decisions that make it work:

  • State reducers (merge_lists) accumulate data across pagination loops without overwriting
  • Conditional edges create the pagination loop and error recovery branches
  • LLM-driven extraction adapts to any page structure without hardcoded selectors
  • Single-responsibility nodes — each does one thing, routing functions decide flow
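The first decision is worth seeing in code. A minimal sketch of the reducer pattern, using a stripped-down state with only the annotated channel (the full ScraperState appears earlier in the tutorial; the reducer body here is the straightforward list-concatenation version):

```python
from typing import Annotated, TypedDict

def merge_lists(existing: list, new: list) -> list:
    """Reducer: combine the channel's current value with a node's update."""
    return existing + new

class ScraperState(TypedDict, total=False):
    # Annotated tells LangGraph to merge updates into this channel
    # with merge_lists instead of overwriting the previous value.
    extracted_data: Annotated[list, merge_lists]
    status: str
```

With this annotation, a node that returns `{"extracted_data": records}` appends to the running dataset on every pagination loop rather than replacing it.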

The architecture extends naturally. Need validation? Add a node. Need CSV export? Add a node. Need deduplication? Add a node. Each plugs into the graph at the right spot without touching existing code.
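As one hypothetical example, a validation node could flag incomplete records by appending to error_log, leaving every existing node untouched (the required field names below are assumptions based on the job-listing goal, and the function takes a plain dict here to stay self-contained):

```python
def validate_records(state: dict) -> dict:
    """Flag records missing required fields; leaves extracted_data as-is."""
    required = {"title", "company", "location"}
    incomplete = [
        r for r in state.get("extracted_data", [])
        if not required <= set(r.keys())
    ]
    if incomplete:
        return {"error_log": [f"{len(incomplete)} record(s) missing required fields"]}
    return {}
```

Wiring it in is one `add_node` call plus an edge, the same pattern the practice exercise uses for CSV export.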

Practice Exercise

Extend the pipeline with an export_node that runs after analysis and writes extracted_data to a CSV file using Python’s csv.DictWriter.

Solution
python
import csv

def export_to_csv(state: ScraperState) -> dict:
    """Export extracted data to a CSV file."""
    data = state.get("extracted_data", [])
    if not data:
        return {"status": "no_data_to_export"}

    headers = list(data[0].keys())
    filename = "scraped_data.csv"

    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)

    return {
        "status": "exported",
        "error_log": [f"Exported {len(data)} records to {filename}"],
    }

Wire it into the graph by replacing the `analyze -> END` edge:

python
graph.add_node("export", export_to_csv)
graph.add_edge("analyze", "export")
graph.add_edge("export", END)

Frequently Asked Questions

Can this pipeline handle JavaScript-rendered pages?

No — requests fetches raw HTML only. For JS-heavy sites (React, Vue, Angular), swap requests.get() for Selenium or Playwright in the fetch node. The rest of the pipeline stays identical because each node is independent.

python
# Swap this into the fetch node for JS-rendered pages
from playwright.sync_api import sync_playwright

def fetch_with_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

How much does running this pipeline cost?

With gpt-4o-mini, each page costs roughly $0.001-$0.003 in API fees for parsing plus the pagination check. Scraping 100 pages runs about $0.15-$0.30 total. The analysis step adds $0.005-$0.01. For cost-sensitive work, swap ChatOpenAI for ChatOllama and run a local model like Llama 3.

Is web scraping legal?

It depends on jurisdiction and the specific site. In the US, the hiQ v. LinkedIn ruling established that scraping publicly available data doesn’t violate the CFAA. That said — always check robots.txt and terms of service. Respect rate limits. Don’t scrape personal data without consent.

How do I scrape sites that require login?

Add a login_node that authenticates first and stores session cookies in state. Subsequent fetch requests include those cookies. You can also pass authentication headers directly in the fetch node’s requests.get() call.
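A minimal sketch of the session approach, assuming a form-based login endpoint (the URL, form field names, and credentials below are placeholders for illustration, not part of the pipeline):

```python
import requests

session = requests.Session()  # cookie jar shared across all fetches

def login_node(state: dict) -> dict:
    """Authenticate once; the session keeps the cookies for later requests."""
    resp = session.post(
        "https://example.com/login",  # placeholder endpoint
        data={"username": "user", "password": "secret"},  # placeholder creds
        timeout=10,
    )
    resp.raise_for_status()
    return {"status": "authenticated"}

def fetch_authenticated(state: dict) -> dict:
    """Drop-in replacement fetch that reuses the session's auth cookies."""
    resp = session.get(state["url"], timeout=10)
    resp.raise_for_status()
    return {"raw_html": resp.text, "status": "fetched"}
```

Because the session object holds the cookies, only the fetch node changes; the graph wiring and the rest of the state stay the same.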

References

  1. LangGraph documentation — StateGraph, conditional edges, and state management. Link
  2. LangChain documentation — ChatOpenAI model integration. Link
  3. BeautifulSoup documentation — Parsing HTML and navigating the tree. Link
  4. Python requests library documentation. Link
  5. Cohorte Projects — How to Build a Smart Web-Scraping AI Agent with LangGraph and Selenium. Link
  6. Firecrawl — Building a Documentation Agent with LangGraph and Firecrawl. Link
  7. Real Python — LangGraph: Build Stateful AI Agents in Python. Link
  8. hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019) — Legal precedent for public data scraping.

Reviewed: March 2026 | LangGraph version: 0.4+ | Python: 3.10+
