LangGraph Web Scraping Agent: Autonomous Pipeline

Learn to build a LangGraph pipeline that scrapes websites on its own, pulls structured data, handles pagination and errors, and writes analytical reports.

Written by Selva Prabhakaran | 28 min read

This project shows you how to build a LangGraph agent that scrapes websites, pulls out clean data, deals with errors and page links, and writes a summary report — all on its own.

Picture this: you need data from a website. Not just once — on an ongoing basis. So you write a scraper. It runs fine for a week. Then the site’s layout shifts and the whole thing falls apart. You fix CSS selectors, bolt on error handling, and tack on page-link logic. Pretty soon your code is a tangled mess held together by try-except blocks.

What if an AI agent could handle all of that? One that decides what to scrape, bounces back from errors, and sums up what it found?

That is what we are building here. A LangGraph pipeline where an LLM runs the entire scraping workflow by itself.

Let me walk you through how the parts fit together before we write any code. The pipeline starts with a URL and a goal. The goal might be something like “pull all product listings from this page.” The first step grabs the raw HTML. If the page fails to load, the agent tries again or shifts its plan.

Once the HTML arrives, a parsing step pulls out the data you want — product names, prices, ratings — based on what the LLM finds in the page. Next, the agent asks: are there more pages to go? If so, it circles back and grabs the next one.

After all pages are done, the data moves to a review step. The LLM spots trends, runs some stats, and writes a final report. Each step feeds into the next through LangGraph’s state, and the whole thing runs as a single graph call.

How Is the Pipeline Set Up? Five Nodes, One Loop

The pipeline has five logical stages joined by conditional edges. Each stage does one job:

| Node | Purpose | Input | Output |
| --- | --- | --- | --- |
| Fetch | Download HTML, handle HTTP errors | URL from state | Raw HTML or error status |
| Parse | Pull out data via the LLM | Raw HTML + goal | List of JSON records |
| Pagination Check | Find “next page” links | Raw HTML | Next URL or “done” signal |
| Accumulate | Merge new data with what we have | New + old records | Full dataset |
| Analyze | Write a stats summary | All records | Report string |

The conditional edge after the pagination check makes the loop. If more pages exist, flow goes back to Fetch; if not, it moves to Analyze. (In the code, Accumulate is not a standalone node: a state reducer merges each page's records automatically, so the graph itself registers four nodes.) This is the core pattern, and it is quite simple to set up.
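Before any LangGraph code, the loop can be sketched in plain Python. This is an illustrative simulation, not LangGraph itself; the node and router names are made up, but the shape mirrors the pipeline: nodes update shared state, and a router function plays the role of the conditional edges.

```python
# Plain-Python sketch of the node/edge loop (no LangGraph; names are illustrative)
def run_pipeline(state, nodes, router):
    node = "fetch"
    while node != "END":
        state.update(nodes[node](state))  # each node returns a state update
        node = router(node, state)        # routing decides the next node
    return state

nodes = {
    "fetch": lambda s: {"page": s["page"] + 1},                  # pretend-download
    "parse": lambda s: {"records": s["records"] + [s["page"]]},  # pretend-extract
    "check": lambda s: {"more": s["page"] < s["max_pages"]},     # pagination check
    "analyze": lambda s: {"report": f"{len(s['records'])} pages scraped"},
}

def router(node, state):
    if node == "fetch":
        return "parse"
    if node == "parse":
        return "check"
    if node == "check":
        return "fetch" if state["more"] else "analyze"  # the loop edge
    return "END"

state = run_pipeline({"page": 0, "records": [], "max_pages": 3}, nodes, router)
print(state["report"])  # → 3 pages scraped
```

In the real graph below, `add_conditional_edges` takes over the router's job, and LangGraph manages the state merging.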

Prerequisites

  • Python version: 3.10+
  • Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+), requests (2.31+), beautifulsoup4 (4.12+)
  • Install: pip install langgraph langchain-openai langchain-core requests beautifulsoup4
  • API Key: OpenAI API key (set as OPENAI_API_KEY environment variable). Create one at platform.openai.com/api-keys.
  • What you should know: LangGraph basics — nodes, edges, and state. Some comfort with Python’s requests library.
  • How long it takes: 35-40 minutes
With the prerequisites covered, start with the imports and the model setup:

python
import os
import json
import time
import requests
from bs4 import BeautifulSoup
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

We split the imports into three clusters: standard library, third-party scraping tools, and LangGraph/LangChain bits. We go with gpt-4o-mini here — it is fast, cheap, and handles HTML work just fine.

How Do You Define the Pipeline State?

Every LangGraph graph needs a state schema. Think of it as shared memory that nodes read from and write to.

Our state holds the scraping URL, the goal, raw HTML, a growing list of data records, page counters, and the final report. Below is the full schema. Pay close attention to the Annotated types — they hold a key detail.

python
def merge_lists(existing: list, new: list) -> list:
    """Reducer that appends new items to existing list."""
    return existing + new

class ScraperState(TypedDict):
    url: str
    goal: str
    raw_html: str
    extracted_data: Annotated[list[dict], merge_lists]
    current_page: int
    max_pages: int
    next_page_url: str
    analysis_report: str
    error_log: Annotated[list[str], merge_lists]
    retry_count: int
    status: str

Notice the Annotated wrapper on extracted_data and error_log? That merge_lists reducer tells LangGraph what to do when a node writes new data. Without it, returning {"extracted_data": new_records} would wipe out the old list. With the reducer, new items land at the end of the list instead. This is vital for pagination — skip it and every new page erases the data from the page before it.

Key Insight: Reducers decide how LangGraph merges state updates. Choose well and your pipeline keeps all data across pages. Choose wrong and it quietly throws away everything but the last page.
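To make the reducer behavior concrete, here is a small plain-Python sketch of what LangGraph does when it merges a node's returned update into state. The `apply_update` helper is hypothetical, written only to illustrate the overwrite-vs-merge difference; only `merge_lists` comes from the article.

```python
# Hypothetical sketch of state merging: without a reducer the key is
# overwritten; with one, the reducer combines old and new values.
def merge_lists(existing, new):
    return existing + new

def apply_update(state, update, reducers):
    for key, value in update.items():
        if key in reducers and key in state:
            state[key] = reducers[key](state[key], value)  # merge via reducer
        else:
            state[key] = value                             # default: overwrite
    return state

state = {"extracted_data": [{"page": 1}]}
update = {"extracted_data": [{"page": 2}]}

# Without a reducer, page 1's records are lost:
print(apply_update(dict(state), update, {}))
# {'extracted_data': [{'page': 2}]}

# With merge_lists registered, page 2's records are appended:
print(apply_update(dict(state), update, {"extracted_data": merge_lists}))
# {'extracted_data': [{'page': 1}, {'page': 2}]}
```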

How Do You Build the Fetch Node?

What happens when you ask for a webpage and the server sends back a 503? Or the request times out? The fetch node takes care of all that.

Here is how it works: it grabs the current URL from state, fires off a GET request with a user-agent header that looks like a real browser, and saves the HTML. If the request fails, the node logs the error and adds one to the retry counter. The status field lets later nodes know if the fetch went through.

python
def fetch_page(state: ScraperState) -> dict:
    """Fetch HTML content from the current URL."""
    current_page = state.get("current_page", 1)
    if current_page > 1:
        time.sleep(2)  # Polite delay between requests

    url = state.get("next_page_url") or state["url"]
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        return {
            "raw_html": response.text,
            "status": "fetched",
            "retry_count": 0,
        }
    except requests.RequestException as error:
        return {
            "status": "fetch_error",
            "error_log": [f"Fetch failed for {url}: {str(error)}"],
            "retry_count": state.get("retry_count", 0) + 1,
        }

Two design choices are worth a closer look here. First, the user-agent header looks like Chrome. Without it, many sites block script-based requests right away. Second, the retry counter resets to zero on success. We only care about failures in a row, not the total count over time.

The 2-second pause before fetching pages after the first one is just good manners. Hitting a server with fast back-to-back requests is a quick way to get your IP banned.

Warning: Always set a `timeout` on `requests.get()`. Without one, a stalled server freezes your whole pipeline forever. Fifteen seconds works for most pages. Many real-world scrapers use 10.

How Does the Parse Node Work? LLM-Driven Data Pulling

Here is where this pipeline breaks away from the old way of scraping. Instead of fixed CSS selectors that stop working when a site gets a redesign, the LLM reads the page and pulls data based on your goal.

Let me explain what the parse node does step by step. First, it strips away the noise in the HTML — scripts, styles, nav bars. Then it turns what is left into plain text and sends it to the LLM with clear instructions. The LLM sends back a JSON array of records. We cap the text at 12,000 characters to stay within token limits. Most of the real content sits in the first third of a page anyway.

python
def parse_content(state: ScraperState) -> dict:
    """Use LLM to extract structured data from HTML."""
    html = state["raw_html"]
    goal = state["goal"]

    soup = BeautifulSoup(html, "html.parser")

    # Strip noise: scripts, styles, navigation, footer
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    clean_text = soup.get_text(separator="\n", strip=True)
    truncated = clean_text[:12000]

    prompt = f"""Extract structured data from this webpage.

Goal: {goal}

Return a JSON array of objects with consistent keys.
If no relevant data exists, return an empty array [].
Return ONLY valid JSON — no markdown, no explanation.

Webpage content:
{truncated}"""

    response = llm.invoke([
        SystemMessage(content="You are a data extraction specialist."),
        HumanMessage(content=prompt),
    ])

    try:
        records = json.loads(response.content)
        if not isinstance(records, list):
            records = [records]
    except json.JSONDecodeError:
        return {
            "extracted_data": [],
            "error_log": ["LLM returned invalid JSON during parsing"],
            "status": "parse_error",
        }

    return {"extracted_data": records, "status": "parsed"}

Why strip <script>, <style>, <nav>, and <footer> tags? A normal webpage weighs 50-100KB in HTML, but the real content might be just 5KB. Scripts, CSS rules, and repeated nav links eat up tokens without adding any value. Removing them cuts costs by 70-80% and makes the LLM’s output much better.
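You can see the signal-to-noise effect with nothing but the standard library. The sketch below is a simplified stand-in for the BeautifulSoup cleanup in `parse_content`: a tiny `HTMLParser` subclass (hypothetical, for illustration) that drops text inside the noisy tags, run on a made-up page snippet.

```python
# Stdlib-only sketch: strip script/style/nav/footer text and compare sizes.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

sample_html = (  # made-up page: mostly markup noise, little content
    "<html><head><style>body{margin:0}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Widget</h1><p>$19.99</p>"
    "<script>trackPageView();</script>"
    "</body></html>"
)
parser = TextExtractor()
parser.feed(sample_html)
clean = "\n".join(parser.chunks)
print(clean)                                    # only the real content survives
print(f"{len(clean)} of {len(sample_html)} chars kept")
```

On real 50-100KB pages the ratio is far more dramatic than in this toy snippet.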

| Approach | Selector Upkeep | Adapts to Layout Changes | Cost per Page |
| --- | --- | --- | --- |
| Fixed CSS selectors | Manual — breaks on redesign | No | $0 |
| LLM-driven pulling | Zero — LLM adapts | Yes | ~$0.002 |
| Visual scraping tools | GUI re-training needed | Partly | Varies |

The tradeoff is clear. You pay a tiny amount per page to get rid of the upkeep burden that makes old-school scrapers such a pain.

Tip: Trim HTML hard before you send it to the LLM. Stripping away filler tags makes the model cheaper to run and helps it find the right data, because the signal-to-noise ratio goes way up.

How Does the Pagination Node Work? Finding the Next Page

Have you noticed how many scraping guides skip pagination? In real life, the data you want often spans many pages. This node solves that problem.

Instead of hard-coding a URL pattern for page links (which breaks the moment the site changes its URL scheme), we first filter anchor tags for keywords like “next” and “>>” and then let the LLM pick the right link. The max_pages guard stops the agent from looping forever.

python
def check_pagination(state: ScraperState) -> dict:
    """Determine if there are more pages to scrape."""
    current = state.get("current_page", 1)
    max_pages = state.get("max_pages", 5)

    if current >= max_pages:
        return {"status": "all_pages_scraped"}

    soup = BeautifulSoup(state["raw_html"], "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        link_text = a_tag.get_text(strip=True).lower()
        href = a_tag["href"]
        if any(kw in link_text for kw in [
            "next", ">>", "\u203a", "older"
        ]):
            links.append(f"{link_text}: {href}")

    if not links:
        return {"status": "all_pages_scraped"}

    prompt = f"""Which link leads to the next page of results?

Links found:
{chr(10).join(links)}

Current URL: {state.get('next_page_url') or state['url']}

Return ONLY the full absolute URL. If no next page exists, return NONE.
If the href is relative, combine it with the base URL."""

    response = llm.invoke([HumanMessage(content=prompt)])
    answer = response.content.strip()

    if answer == "NONE" or not answer.startswith("http"):
        return {"status": "all_pages_scraped"}

    return {
        "next_page_url": answer,
        "current_page": current + 1,
        "status": "has_next_page",
    }

Why filter links first? Because a normal page has 50 to 200 anchor tags. Dumping all of them into the prompt wastes tokens and clouds the choice. When you filter for words like “next” and “>>”, you trim the list to maybe 1-3 links. The model then has an easy time picking the right one.
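One part of that prompt, turning a relative href into an absolute URL, does not need an LLM at all. As a deterministic alternative (or a safety net for the LLM's answer), the standard library's `urljoin` resolves relative links against the current page URL. The URLs below are made up for illustration.

```python
# Resolve relative "next" hrefs deterministically with urllib.parse.urljoin.
from urllib.parse import urljoin

base = "https://example.com/products?page=2"   # hypothetical current URL
candidates = [
    "?page=3",                                 # query-only relative href
    "/products?page=3",                        # root-relative href
    "https://example.com/products?page=3",     # already absolute
]
resolved = [urljoin(base, href) for href in candidates]
print(resolved)
# All three href forms resolve to the same absolute URL
```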

How Does the Analysis Node Turn Raw Data into Insights?

Scraping without a report is just data hoarding. What patterns hide in the data? What does the spread look like? Are there odd values? The analysis node answers these questions.

It sends the first 50 records to the LLM (to stay within token limits) and asks for a report with totals, stats, patterns, and data quality notes. For a real system, I would run the math in Python and only send the results to the LLM for review. LLMs are great at reading numbers but sometimes get the math wrong.

python
def analyze_data(state: ScraperState) -> dict:
    """Generate analytical summary from all scraped data."""
    data = state.get("extracted_data", [])

    if not data:
        return {
            "analysis_report": "No data was extracted.",
            "status": "complete",
        }

    data_str = json.dumps(data[:50], indent=2)

    prompt = f"""Analyze this scraped dataset and produce a report.

Total records collected: {len(data)}
Sample data (first 50 records):
{data_str}

Include these sections:
1. **Summary**: Total records, fields present, data completeness
2. **Key Statistics**: Counts, averages, ranges where applicable
3. **Notable Patterns**: Trends, outliers, interesting findings
4. **Data Quality Notes**: Missing fields, inconsistencies

Be specific. Use actual numbers from the data."""

    response = llm.invoke([
        SystemMessage(content="You are a data analyst. Be concise and specific."),
        HumanMessage(content=prompt),
    ])

    return {
        "analysis_report": response.content,
        "status": "complete",
    }
Key Insight: Use Python for math and the LLM for meaning. Do not ask GPT to average 500 prices. Compute it in Python, then ask GPT what the number tells you. It is faster, cheaper, and free of math errors.
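A minimal sketch of that division of labor, using made-up records and the stdlib `statistics` module: compute the numbers in Python, then hand only the compact summary to the model instead of the raw data.

```python
# Python for math, LLM for meaning: pre-compute stats before prompting.
from statistics import mean, median

records = [                                   # hypothetical scraped records
    {"title": "Ergo keyboard", "price": 19.99},
    {"title": "USB hub", "price": 24.50},
    {"title": "Cable tidy"},                  # missing price field
]

prices = [r["price"] for r in records if "price" in r]
summary = {
    "total_records": len(records),
    "with_price": len(prices),
    "mean_price": round(mean(prices), 2),
    "median_price": round(median(prices), 2),
}
print(summary)
# Feed `summary` (not the raw records) into the analysis prompt.
```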

How Do You Wire the Graph? Routing and Conditional Edges

This is the part that makes LangGraph click. We link the nodes with edges, and two conditional edges create the branching logic: one for error recovery after fetch, one for the page loop.

The routing functions are kept simple on purpose. Each one checks a single state field and returns a string that maps to the next node.

python
def route_after_fetch(state: ScraperState) -> str:
    """Decide next step after fetching a page."""
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"

def route_after_pagination(state: ScraperState) -> str:
    """Continue scraping or move to analysis."""
    if state["status"] == "has_next_page":
        return "fetch_next"
    return "analyze"

Three paths come out of route_after_fetch. On success, go to parsing. On a fixable failure, loop back to fetch. After too many retries, skip ahead to analysis with whatever data we have. No nested conditions. No tangled logic.

Now here is the full graph setup. Notice how add_conditional_edges takes a routing function plus a mapping dict. Every path is spelled out in the code so you can trace it at a glance.

python
graph = StateGraph(ScraperState)

# Add all nodes
graph.add_node("fetch", fetch_page)
graph.add_node("parse", parse_content)
graph.add_node("check_pagination", check_pagination)
graph.add_node("analyze", analyze_data)

# Entry point
graph.add_edge(START, "fetch")

# Conditional: parse on success, retry on failure, analyze on exhaustion
graph.add_conditional_edges(
    "fetch",
    route_after_fetch,
    {
        "parse": "parse",
        "retry_fetch": "fetch",
        "analyze": "analyze",
    },
)

# After parsing, always check for more pages
graph.add_edge("parse", "check_pagination")

# Conditional: fetch next page or finalize
graph.add_conditional_edges(
    "check_pagination",
    route_after_pagination,
    {
        "fetch_next": "fetch",
        "analyze": "analyze",
    },
)

# Analysis is the terminal node
graph.add_edge("analyze", END)

scraper_agent = graph.compile()

That routing map — the third argument to add_conditional_edges — is what makes the graph easy to read. Anyone looking at this code can follow every path the agent might take without running it. I find this much clearer than deep if-else chains.

How Do You Run the Pipeline? A Full Example

Time to see it in action. We will scrape job listings from Real Python’s fake jobs page. It is a static demo site made for scraping practice. No rate limits, no terms-of-service worries.

The starting state sets the target URL, a goal that says what fields to grab, and a 3-page cap to keep the demo quick.

python
initial_state = {
    "url": "https://realpython.github.io/fake-jobs/",
    "goal": (
        "Extract all job listings. For each job, get: "
        "title, company, location, and posting date."
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 3,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}

result = scraper_agent.invoke(initial_state)

Once the pipeline is done, check the results. The extracted_data list holds clean dicts, and analysis_report has the LLM’s summary.

python
print(f"Records extracted: {len(result['extracted_data'])}")
print(f"\nFirst 3 records:")
for record in result["extracted_data"][:3]:
    print(json.dumps(record, indent=2))

print(f"\n{'='*50}")
print("ANALYSIS REPORT")
print(f"{'='*50}")
print(result["analysis_report"])

if result["error_log"]:
    print(f"\nErrors: {result['error_log']}")

You will see job data and an overview. The exact records depend on what the LLM pulls out, but the layout looks like this:

python
Records extracted: 100

First 3 records:
{
  "title": "Energy engineer",
  "company": "Vasquez-Davidson",
  "location": "Christopherport, AA",
  "posting_date": "2021-04-08"
}
...
Note: The site `realpython.github.io/fake-jobs/` is a static demo page with 100 fake listings on one page — no real pagination. The pipeline handles this just fine. It finds no “next” links and goes straight to analysis. To test the page loop, point the pipeline at a site that does have pages.

How Do You Add Error Recovery to Make It Production-Ready?

The basic pipeline works for happy paths. But what about a 429 rate-limit reply? A timeout on a slow server? A page that sends back garbled HTML?

A separate error handler node looks at what went wrong and sets a status that the routing function can act on. The handler figures out the problem. The router picks the next step. The fetch node carries it out. Clean split of duties.

python
def handle_error(state: ScraperState) -> dict:
    """Classify errors and set recovery strategy."""
    errors = state.get("error_log", [])
    last_error = errors[-1] if errors else "Unknown error"

    if "429" in last_error or "rate" in last_error.lower():
        return {
            "status": "rate_limited",
            "error_log": ["Rate limited — backing off before retry"],
        }
    if "timeout" in last_error.lower():
        return {
            "status": "timeout_retry",
            "error_log": ["Timeout — retrying with longer wait"],
        }
    return {
        "status": "unrecoverable",
        "error_log": [f"Giving up after error: {last_error}"],
    }

In a real setup, you would also want slower retries for rate limits, proxy switching for IP bans, and a dead-letter queue for errors that keep coming back. Those are full topics on their own. The node setup makes adding them easy because each concern lives in its own node.
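For the rate-limit path, a common pattern is exponential backoff with jitter. The sketch below is an assumption, not part of the article's pipeline; the base delay, cap, and jitter range are arbitrary choices you would tune per target.

```python
# Hedged sketch: exponential backoff with jitter for rate-limited retries.
# The schedule (base 2s, cap 60s, ±50% jitter) is an illustrative assumption.
import random

def backoff_delay(retry_count: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry N: base * 2^N, capped, with jitter."""
    delay = min(cap, base * (2 ** retry_count))
    return delay * random.uniform(0.5, 1.5)   # jitter avoids synchronized retries

for attempt in range(5):
    print(f"retry {attempt}: ~{backoff_delay(attempt):.1f}s")
```

In the graph, the error handler would store `backoff_delay(state["retry_count"])` in state and the fetch node would sleep for it before retrying.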

Exercise 1: Add a Data Validation Node

You have seen how each node does one job. Now add a validate_data node that checks records for quality before they reach the analysis step.

The node should drop records that are missing required keys and log how many it removed. If 90% of records get dropped, that points to a problem with the prompt — not the data.

python
# Complete this function

def validate_data(state: ScraperState) -> dict:
    """Validate and clean extracted data."""
    data = state.get("extracted_data", [])
    required_keys = {"title", "company", "location"}

    # TODO: Filter records that contain all required keys
    # TODO: Count how many records were removed
    # TODO: Return cleaned data with appropriate status

    valid_records = []  # Your filtering logic here
    removed_count = 0   # Your count here

    return {
        "extracted_data": valid_records,
        "error_log": [
            f"Validation: kept {len(valid_records)}, "
            f"removed {removed_count} incomplete records"
        ],
        "status": "validated" if valid_records else "no_valid_data",
    }
Hint 1

Use `all()` inside a list comprehension to check if every required key exists: `all(key in record for key in required_keys)`.

Hint 2

Full filtering: `valid_records = [r for r in data if all(k in r for k in required_keys)]`. Then `removed_count = len(data) - len(valid_records)`.

Solution
python
def validate_data(state: ScraperState) -> dict:
    """Validate and clean extracted data."""
    data = state.get("extracted_data", [])
    required_keys = {"title", "company", "location"}

    valid_records = [
        record for record in data
        if all(key in record for key in required_keys)
    ]
    removed_count = len(data) - len(valid_records)

    return {
        "extracted_data": valid_records,
        "error_log": [
            f"Validation: kept {len(valid_records)}, "
            f"removed {removed_count} incomplete records"
        ],
        "status": "validated" if valid_records else "no_valid_data",
    }

**Why this works:** `all()` returns `True` only when every required key exists in the record. Records with missing fields get dropped. The removed count helps you spot issues. If a lot of records get cut, the prompt likely needs tuning.

How Do You Adapt the Pipeline for Different Goals?

The same pipeline scrapes any kind of data. You do not change the code — you change the goal string. The LLM adapts how it pulls data at runtime.

Want product listings instead of jobs?

python
product_state = {
    "url": "https://example-store.com/electronics",
    "goal": (
        "Extract product listings: name, price in USD, "
        "star rating as a float, and availability status"
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 5,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}

Research paper details? Same pipeline, different goal:

python
research_state = {
    "url": "https://arxiv.org/list/cs.AI/recent",
    "goal": (
        "Extract paper listings: title, authors, "
        "abstract summary, and submission date"
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 2,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}
Tip: Be very clear in your goal. Vague goals like “get all data” produce messy, uneven JSON. Goals like “extract product name, price in USD, and star rating as a float” give the LLM sharp targets and produce cleaner output.

Exercise 2: Add Rate Limiting You Can Tune

Web servers do not like rapid-fire requests. Change the fetch logic to use a delay you can set from state instead of the fixed 2 seconds.

python
# Add a 'fetch_delay' field to the state and use it

def fetch_page_configurable(state: ScraperState) -> dict:
    """Fetch with configurable delay between pages."""
    current_page = state.get("current_page", 1)
    delay = state.get("fetch_delay", 2)  # Default 2 seconds

    # TODO: Apply delay for non-first pages
    # TODO: Fetch the URL with error handling

    url = state.get("next_page_url") or state["url"]
    pass  # Complete the implementation
Hint 1

Check `current_page > 1` before sleeping. The first page does not need a delay.

Hint 2

Add `if current_page > 1: time.sleep(delay)` before the request. The rest follows the same pattern as `fetch_page`.

Solution
python
def fetch_page_configurable(state: ScraperState) -> dict:
    """Fetch with configurable delay between pages."""
    current_page = state.get("current_page", 1)
    delay = state.get("fetch_delay", 2)

    if current_page > 1:
        time.sleep(delay)

    url = state.get("next_page_url") or state["url"]
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        return {
            "raw_html": response.text,
            "status": "fetched",
            "retry_count": 0,
        }
    except requests.RequestException as error:
        return {
            "status": "fetch_error",
            "error_log": [f"Fetch failed: {str(error)}"],
            "retry_count": state.get("retry_count", 0) + 1,
        }

**Why this matters:** Some sites are fine with 1-second gaps. Others need 5. Making the delay a state value lets you tune it per target without changing any code.
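Going one step further than a fixed sleep, you could pace requests with a minimum-interval limiter that only waits out the remaining gap since the last request. This class is a hypothetical refinement, not part of the exercise solution.

```python
# Hypothetical pacing helper: sleep only for the remainder of the interval.
import time

class RateLimiter:
    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self.last_request = None

    def wait(self):
        now = time.monotonic()
        if self.last_request is not None:
            remaining = self.min_interval - (now - self.last_request)
            if remaining > 0:
                time.sleep(remaining)   # only the unspent part of the gap
        self.last_request = time.monotonic()

limiter = RateLimiter(min_interval=0.1)  # tiny interval to keep the demo fast
start = time.monotonic()
for _ in range(3):
    limiter.wait()   # first call is free; later calls enforce the gap
elapsed = time.monotonic() - start
print(f"3 requests took {elapsed:.2f}s (at least 0.2s)")
```

This way, time spent parsing a page counts toward the delay instead of being added on top of it.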

What Are the Most Common Mistakes?

Mistake 1: No Page Limit on Pagination

Wrong:

python
initial_state = {
    "max_pages": 999,  # Or omitting it entirely
}

Why this is risky: Some sites have thousands of pages. Your pipeline runs for hours, burns API credits, and might get your IP banned.

Correct:

python
initial_state = {
    "max_pages": 10,  # Start small, increase if needed
}

Mistake 2: Sending Raw HTML to the LLM

Wrong:

python
prompt = f"Extract data from: {state['raw_html']}"

Why it fails: Raw HTML is 80% junk. A product page might hold 150KB of HTML but only 2KB of real content. You waste tokens and confuse the model.

Correct:

python
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()
clean_text = soup.get_text(separator="\n", strip=True)[:12000]

Mistake 3: No Status Checks in Routing

Wrong:

python
def route_after_fetch(state):
    return "parse"  # Always parse, even on failure

Why it breaks: If the fetch failed, raw_html is either empty or left over from the previous page. The parse node then crashes or extracts the previous page's records a second time.

Correct:

python
def route_after_fetch(state):
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"  # Graceful fallback

When Should You NOT Use This Approach?

This pipeline is not the best tool for every job. Here is when you should reach for something else:

Use an API instead if the site has one. APIs give you clean JSON — no parsing, no LLM costs. Always look for dev docs or /api/ endpoints first.

Use fixed selectors for high-volume scraping. At 100K pages with 12K tokens each, LLM costs add up to about $180. CSS selectors cost nothing for pulling data. If the site layout is stable, selectors are the smart choice.

Use a plain scraper for real-time tracking. The LLM adds 1-3 seconds of delay per page. If you need sub-second scraping for price tracking, you need hard-coded logic.

This pipeline shines when site layouts change often, you are scraping many sites with different layouts, or you are building a quick prototype that needs to work across many domains without custom selectors for each one.

Complete Code

The full script below is ready to copy-paste and run.
python
# Complete code from: Autonomous Web Scraping Pipeline with LangGraph
# Requires: pip install langgraph langchain-openai langchain-core requests beautifulsoup4
# Python 3.10+
# Set OPENAI_API_KEY environment variable before running

import os
import json
import time
import requests
from bs4 import BeautifulSoup
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# --- Setup ---
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# --- State ---
def merge_lists(existing: list, new: list) -> list:
    return existing + new

class ScraperState(TypedDict):
    url: str
    goal: str
    raw_html: str
    extracted_data: Annotated[list[dict], merge_lists]
    current_page: int
    max_pages: int
    next_page_url: str
    analysis_report: str
    error_log: Annotated[list[str], merge_lists]
    retry_count: int
    status: str

# --- Nodes ---
def fetch_page(state: ScraperState) -> dict:
    current_page = state.get("current_page", 1)
    if current_page > 1:
        time.sleep(2)

    url = state.get("next_page_url") or state["url"]
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        return {
            "raw_html": response.text,
            "status": "fetched",
            "retry_count": 0,
        }
    except requests.RequestException as error:
        return {
            "status": "fetch_error",
            "error_log": [f"Fetch failed for {url}: {str(error)}"],
            "retry_count": state.get("retry_count", 0) + 1,
        }

def parse_content(state: ScraperState) -> dict:
    html = state["raw_html"]
    goal = state["goal"]
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    clean_text = soup.get_text(separator="\n", strip=True)
    truncated = clean_text[:12000]

    prompt = f"""Extract structured data from this webpage.

Goal: {goal}

Return a JSON array of objects with consistent keys.
If no relevant data exists, return an empty array [].
Return ONLY valid JSON — no markdown, no explanation.

Webpage content:
{truncated}"""

    response = llm.invoke([
        SystemMessage(content="You are a data extraction specialist."),
        HumanMessage(content=prompt),
    ])

    try:
        records = json.loads(response.content)
        if not isinstance(records, list):
            records = [records]
    except json.JSONDecodeError:
        return {
            "extracted_data": [],
            "error_log": ["LLM returned invalid JSON during parsing"],
            "status": "parse_error",
        }

    return {"extracted_data": records, "status": "parsed"}

def check_pagination(state: ScraperState) -> dict:
    current = state.get("current_page", 1)
    max_pages = state.get("max_pages", 5)
    if current >= max_pages:
        return {"status": "all_pages_scraped"}

    soup = BeautifulSoup(state["raw_html"], "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        link_text = a_tag.get_text(strip=True).lower()
        href = a_tag["href"]
        if any(kw in link_text for kw in ["next", ">>", "\u203a", "older"]):
            links.append(f"{link_text}: {href}")

    if not links:
        return {"status": "all_pages_scraped"}

    prompt = f"""Which link leads to the next page of results?

Links found:
{chr(10).join(links)}

Current URL: {state.get('next_page_url') or state['url']}

Return ONLY the full absolute URL. If no next page exists, return NONE.
If the href is relative, combine it with the base URL."""

    response = llm.invoke([HumanMessage(content=prompt)])
    answer = response.content.strip()

    if answer == "NONE" or not answer.startswith("http"):
        return {"status": "all_pages_scraped"}

    return {
        "next_page_url": answer,
        "current_page": current + 1,
        "status": "has_next_page",
    }

def analyze_data(state: ScraperState) -> dict:
    data = state.get("extracted_data", [])
    if not data:
        return {"analysis_report": "No data was extracted.", "status": "complete"}

    data_str = json.dumps(data[:50], indent=2)
    prompt = f"""Analyze this scraped dataset and produce a report.

Total records collected: {len(data)}
Sample data (first 50 records):
{data_str}

Include:
1. Summary: total records, fields present, data completeness
2. Key Statistics: counts, averages, ranges where applicable
3. Notable Patterns: trends, outliers, interesting findings
4. Data Quality Notes: missing fields, inconsistencies

Be specific. Use actual numbers from the data."""

    response = llm.invoke([
        SystemMessage(content="You are a data analyst. Be concise and specific."),
        HumanMessage(content=prompt),
    ])

    return {"analysis_report": response.content, "status": "complete"}

# --- Routing ---
def route_after_fetch(state: ScraperState) -> str:
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"

def route_after_pagination(state: ScraperState) -> str:
    if state["status"] == "has_next_page":
        return "fetch_next"
    return "analyze"

# --- Graph Assembly ---
graph = StateGraph(ScraperState)
graph.add_node("fetch", fetch_page)
graph.add_node("parse", parse_content)
graph.add_node("check_pagination", check_pagination)
graph.add_node("analyze", analyze_data)

graph.add_edge(START, "fetch")
graph.add_conditional_edges(
    "fetch",
    route_after_fetch,
    {"parse": "parse", "retry_fetch": "fetch", "analyze": "analyze"},
)
graph.add_edge("parse", "check_pagination")
graph.add_conditional_edges(
    "check_pagination",
    route_after_pagination,
    {"fetch_next": "fetch", "analyze": "analyze"},
)
graph.add_edge("analyze", END)

scraper_agent = graph.compile()

# --- Run ---
if __name__ == "__main__":
    result = scraper_agent.invoke({
        "url": "https://realpython.github.io/fake-jobs/",
        "goal": (
            "Extract all job listings. For each job, get: "
            "title, company, location, and posting date."
        ),
        "extracted_data": [],
        "current_page": 1,
        "max_pages": 3,
        "next_page_url": "",
        "analysis_report": "",
        "error_log": [],
        "retry_count": 0,
        "status": "ready",
    })

    print(f"Records extracted: {len(result['extracted_data'])}")
    for record in result["extracted_data"][:3]:
        print(json.dumps(record, indent=2))
    print(f"\n{'='*50}")
    print("ANALYSIS REPORT")
    print(f"{'='*50}")
    print(result["analysis_report"])
    if result["error_log"]:
        print(f"\nErrors encountered: {result['error_log']}")

Summary

You built a web scraping pipeline with LangGraph that fetches pages, pulls out clean data with an LLM, follows page links, bounces back from errors, and writes a report.

Four design choices make it work:

  • State reducers (merge_lists) accumulate data across page loops without wiping out earlier results
  • Conditional edges create the pagination loop and the error-recovery branch
  • LLM-driven extraction adapts to any page layout without hard-coded selectors
  • Single-job nodes: each does one thing, and routing functions decide the flow
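The reducer is the piece that most often trips people up. As a minimal sketch (the article defines merge_lists earlier; the fields shown here mirror only part of the full ScraperState), a reducer is attached to a state field with typing.Annotated:

```python
from typing import Annotated, TypedDict

def merge_lists(existing: list, new: list) -> list:
    # Reducer: concatenate new records onto the running list
    # instead of overwriting it on each node return
    return (existing or []) + (new or [])

class ScraperState(TypedDict):
    # extracted_data accumulates across page loops via the reducer;
    # fields without a reducer use the default last-write-wins behavior
    extracted_data: Annotated[list, merge_lists]
    status: str
```

When parse_content returns {"extracted_data": records}, LangGraph calls merge_lists with the old and new lists, so page 2's records append to page 1's instead of replacing them.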

The setup grows naturally. Need to check data quality? Add a node. Need CSV export? Add a node. Need to drop duplicate records? Add a node. Each one plugs into the graph at the right spot without touching the code that already works.
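As a sketch of that extension pattern, here is a hypothetical quality-check node. The required field names and the wiring are assumptions for illustration, not part of the pipeline above:

```python
def validate_records(state: dict) -> dict:
    # Hypothetical node: flag records missing required fields
    required = {"title", "company"}
    issues = [r for r in state.get("extracted_data", [])
              if not required.issubset(r)]
    if issues:
        return {
            "error_log": [f"{len(issues)} records missing required fields"],
            "status": "validated_with_issues",
        }
    return {"status": "validated"}

# Wiring sketch: slot it between parse and check_pagination
# graph.add_node("validate", validate_records)
# graph.add_edge("parse", "validate")
# graph.add_edge("validate", "check_pagination")
```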

Practice Exercise

Extend the pipeline with an export_node that runs after analysis and writes extracted_data to a CSV file using Python’s csv.DictWriter.

Solution
import csv

def export_to_csv(state: ScraperState) -> dict:
    """Export extracted data to a CSV file."""
    data = state.get("extracted_data", [])
    if not data:
        return {"status": "no_data_to_export"}

    # Union of keys across all records, since LLM-extracted rows can vary
    headers = sorted({key for record in data for key in record})
    filename = "scraped_data.csv"

    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers, restval="")
        writer.writeheader()
        writer.writerows(data)

    return {
        "status": "exported",
        "error_log": [f"Exported {len(data)} records to {filename}"],
    }

Wire it into the graph by swapping the `analyze -> END` edge:

graph.add_node("export", export_to_csv)
graph.add_edge("analyze", "export")
graph.add_edge("export", END)

Frequently Asked Questions

Can this pipeline handle pages rendered by JavaScript?

No. The requests library grabs raw HTML only. For JS-heavy sites (React, Vue, Angular), swap requests.get() for Selenium or Playwright in the fetch node. The rest of the pipeline stays the same because each node is self-contained.

# Swap this into the fetch node for JS-rendered pages
from playwright.sync_api import sync_playwright

def fetch_with_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

How much does it cost to run this pipeline?

With gpt-4o-mini, each page costs about $0.001-$0.003 in API fees for parsing plus the page-link check. Scraping 100 pages runs about $0.15-$0.30 total. The analysis step adds $0.005-$0.01. If costs matter a lot, swap ChatOpenAI for ChatOllama and run a local model like Llama 3.

Is web scraping legal?

The answer depends on your jurisdiction and the site you scrape. In the US, the hiQ v. LinkedIn ruling held that scraping publicly available data does not violate the CFAA. That said, always check robots.txt and the site's terms of service, respect rate limits, and do not scrape personal data without consent.

How do I scrape sites that need a login?

Add a login_node that signs in first and stores session cookies in state. Later fetch requests send those cookies along. You can also pass auth headers straight into the fetch node’s requests.get() call.
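A minimal sketch of that idea, assuming a form-based login. The login URL and form field names below are placeholders; inspect the real site's login form to find the actual ones:

```python
import requests

def fetch_with_session(url: str, username: str, password: str) -> str:
    # Hypothetical login flow: endpoint and field names are placeholders
    session = requests.Session()
    session.post(
        "https://example.com/login",
        data={"username": username, "password": password},
        timeout=10,
    )
    # The Session object now carries the auth cookies automatically
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text
```

In the pipeline, you would create the Session once in a login node and reuse it (or store its cookies in state) inside fetch_page.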

References

  1. LangGraph documentation — StateGraph, conditional edges, and state management. Link
  2. LangChain documentation — ChatOpenAI model integration. Link
  3. BeautifulSoup documentation — Parsing HTML and navigating the tree. Link
  4. Python requests library documentation. Link
  5. Cohorte Projects — How to Build a Smart Web-Scraping AI Agent with LangGraph and Selenium. Link
  6. Firecrawl — Building a Documentation Agent with LangGraph and Firecrawl. Link
  7. Real Python — LangGraph: Build Stateful AI Agents in Python. Link
  8. hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019) — Legal precedent for public data scraping.

Reviewed: March 2026 | LangGraph version: 0.4+ | Python: 3.10+
