LangGraph Web Scraping Agent: Autonomous Pipeline
Learn to build a LangGraph pipeline that scrapes websites on its own, pulls structured data, handles pagination and errors, and writes analytical reports.
Picture this: you need data from a website. Not just once — on an ongoing basis. So you write a scraper. It runs fine for a week. Then the site’s layout shifts and the whole thing falls apart. You fix CSS selectors, bolt on error handling, and tack on page-link logic. Pretty soon your code is a tangled mess held together by try-except blocks.
What if an AI agent could handle all of that? One that decides what to scrape, bounces back from errors, and sums up what it found?
That is what we are building here. A LangGraph pipeline where an LLM runs the entire scraping workflow by itself.
Let me walk you through how the parts fit together before we write any code. The pipeline starts with a URL and a goal. The goal might be something like “pull all product listings from this page.” The first step grabs the raw HTML. If the page fails to load, the agent tries again or shifts its plan.
Once the HTML arrives, a parsing step pulls out the data you want — product names, prices, ratings — based on what the LLM finds in the page. Next, the agent asks: are there more pages to go? If so, it circles back and grabs the next one.
After all pages are done, the data moves to a review step. The LLM spots trends, runs some stats, and writes a final report. Each step feeds into the next through LangGraph’s state, and the whole thing runs as a single graph call.
How Is the Pipeline Set Up? Five Nodes, One Loop
The pipeline has five nodes joined by conditional edges. Each node does one job:
| Node | Purpose | Input | Output |
|---|---|---|---|
| Fetch | Download HTML, handle HTTP errors | URL from state | Raw HTML or error status |
| Parse | Pull out data via the LLM | Raw HTML + goal | List of JSON records |
| Pagination Check | Find “next page” links | Raw HTML | Next URL or “done” signal |
| Accumulate | Merge new data with what we have | New + old records | Full dataset |
| Analyze | Write a stats summary | All records | Report string |
The conditional edge after the pagination check makes the loop. If more pages exist, flow goes back to Fetch. If not, it moves to Analyze. This is the core pattern — and it is quite simple to set up.
Prerequisites
- Python version: 3.10+
- Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+), requests (2.31+), beautifulsoup4 (4.12+)
- Install: `pip install langgraph langchain-openai langchain-core requests beautifulsoup4`
- API Key: OpenAI API key (set as the `OPENAI_API_KEY` environment variable). Create one at platform.openai.com/api-keys.
- What you should know: LangGraph basics — nodes, edges, and state. Some comfort with Python's `requests` library.
- How long it takes: 35-40 minutes
```python
import os
import json
import time
import requests
from bs4 import BeautifulSoup
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
```
We split the imports into three clusters: standard library, third-party scraping tools, and LangGraph/LangChain bits. We go with gpt-4o-mini here — it is fast, cheap, and handles HTML work just fine.
How Do You Define the Pipeline State?
Every LangGraph graph needs a state schema. Think of it as shared memory that nodes read from and write to.
Our state holds the scraping URL, the goal, raw HTML, a growing list of data records, page counters, and the final report. Below is the full schema. Pay close attention to the Annotated types — they hold a key detail.
```python
def merge_lists(existing: list, new: list) -> list:
    """Reducer that appends new items to the existing list."""
    return existing + new


class ScraperState(TypedDict):
    url: str
    goal: str
    raw_html: str
    extracted_data: Annotated[list[dict], merge_lists]
    current_page: int
    max_pages: int
    next_page_url: str
    analysis_report: str
    error_log: Annotated[list[str], merge_lists]
    retry_count: int
    status: str
```
Notice the Annotated wrapper on extracted_data and error_log? That merge_lists reducer tells LangGraph what to do when a node writes new data. Without it, returning {"extracted_data": new_records} would wipe out the old list. With the reducer, new items land at the end of the list instead. This is vital for pagination — skip it and every new page erases the data from the page before it.
Key Insight: Reducers decide how LangGraph merges state updates. Choose well and your pipeline keeps all data across pages. Choose wrong and it quietly throws away everything but the last page.
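To see what the reducer buys you, here is a plain-Python simulation of two state updates. The `merge_lists` function matches the schema above; the dict handling is a simplified stand-in for LangGraph's internal merge, not its actual code.

```python
def merge_lists(existing: list, new: list) -> list:
    """Same reducer as in the state schema: append, don't overwrite."""
    return existing + new

# Simulate two pagination passes writing to extracted_data
state = {"extracted_data": [{"title": "Job A"}]}
update = {"extracted_data": [{"title": "Job B"}]}

# With the reducer: records from page 1 survive
merged = merge_lists(state["extracted_data"], update["extracted_data"])
print(merged)  # [{'title': 'Job A'}, {'title': 'Job B'}]

# Without a reducer, a plain dict update overwrites
state.update(update)
print(state["extracted_data"])  # [{'title': 'Job B'}] — page 1 is gone
```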
How Do You Build the Fetch Node?
What happens when you ask for a webpage and the server sends back a 503? Or the request times out? The fetch node takes care of all that.
Here is how it works: it grabs the current URL from state, fires off a GET request with a user-agent header that looks like a real browser, and saves the HTML. If the request fails, the node logs the error and adds one to the retry counter. The status field lets later nodes know if the fetch went through.
```python
def fetch_page(state: ScraperState) -> dict:
    """Fetch HTML content from the current URL."""
    current_page = state.get("current_page", 1)
    if current_page > 1:
        time.sleep(2)  # Polite delay between requests

    url = state.get("next_page_url") or state["url"]
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        return {
            "raw_html": response.text,
            "status": "fetched",
            "retry_count": 0,
        }
    except requests.RequestException as error:
        return {
            "status": "fetch_error",
            "error_log": [f"Fetch failed for {url}: {str(error)}"],
            "retry_count": state.get("retry_count", 0) + 1,
        }
```
Two design choices are worth a closer look here. First, the user-agent header looks like Chrome. Without it, many sites block script-based requests right away. Second, the retry counter resets to zero on success. We only care about failures in a row, not the total count over time.
The 2-second pause before fetching pages after the first one is just good manners. Hitting a server with fast back-to-back requests is a quick way to get your IP banned.
Warning: Always set a `timeout` on `requests.get()`. Without one, a stalled server freezes your whole pipeline forever. Fifteen seconds works for most pages. Many real-world scrapers use 10.
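If you want the retry pause to grow with each consecutive failure, an exponential backoff helper is a natural extension. This is a sketch, not part of the pipeline above; `backoff_delay` and its defaults are hypothetical:

```python
def backoff_delay(retry_count: int, base: float = 2.0, cap: float = 30.0) -> float:
    """Exponential backoff: 2s, 4s, 8s, 16s, ... capped at 30s."""
    return min(base * (2 ** retry_count), cap)

for attempt in range(5):
    print(f"retry {attempt}: wait {backoff_delay(attempt)}s")
# retry 0 waits 2.0s, retry 3 waits 16.0s, retry 4 hits the 30.0s cap
```

Inside `fetch_page`, you could call `time.sleep(backoff_delay(state.get("retry_count", 0)))` before a retry instead of the fixed 2-second pause.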
How Does the Parse Node Work? LLM-Driven Data Pulling
Here is where this pipeline breaks away from the old way of scraping. Instead of fixed CSS selectors that stop working when a site gets a redesign, the LLM reads the page and pulls data based on your goal.
Let me explain what the parse node does step by step. First, it strips away the noise in the HTML — scripts, styles, nav bars. Then it turns what is left into plain text and sends it to the LLM with clear instructions. The LLM sends back a JSON array of records. We cap the text at 12,000 characters to stay within token limits. Most of the real content sits in the first third of a page anyway.
```python
def parse_content(state: ScraperState) -> dict:
    """Use LLM to extract structured data from HTML."""
    html = state["raw_html"]
    goal = state["goal"]

    soup = BeautifulSoup(html, "html.parser")
    # Strip noise: scripts, styles, navigation, footer
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    clean_text = soup.get_text(separator="\n", strip=True)
    truncated = clean_text[:12000]

    prompt = f"""Extract structured data from this webpage.
Goal: {goal}
Return a JSON array of objects with consistent keys.
If no relevant data exists, return an empty array [].
Return ONLY valid JSON — no markdown, no explanation.
Webpage content:
{truncated}"""

    response = llm.invoke([
        SystemMessage(content="You are a data extraction specialist."),
        HumanMessage(content=prompt),
    ])

    content = response.content.strip()
    # Models sometimes wrap output in ```json fences despite instructions
    if content.startswith("```"):
        content = content.split("\n", 1)[1].rsplit("```", 1)[0]

    try:
        records = json.loads(content)
        if not isinstance(records, list):
            records = [records]
    except json.JSONDecodeError:
        return {
            "extracted_data": [],
            "error_log": ["LLM returned invalid JSON during parsing"],
            "status": "parse_error",
        }

    return {"extracted_data": records, "status": "parsed"}
```
Why strip <script>, <style>, <nav>, and <footer> tags? A normal webpage weighs 50-100KB in HTML, but the real content might be just 5KB. Scripts, CSS rules, and repeated nav links eat up tokens without adding any value. Removing them cuts costs by 70-80% and makes the LLM’s output much better.
| Approach | Selector Upkeep | Adapts to Layout Changes | Cost per Page |
|---|---|---|---|
| Fixed CSS selectors | Manual — breaks on redesign | No | $0 |
| LLM-driven pulling | Zero — LLM adapts | Yes | ~$0.002 |
| Visual scraping tools | GUI re-training needed | Partly | Varies |
The tradeoff is clear. You pay a tiny amount per page to get rid of the upkeep burden that makes old-school scrapers such a pain.
Tip: Trim HTML hard before you send it to the LLM. Stripping away filler tags makes the model cheaper to run and helps it find the right data, because the signal-to-noise ratio goes way up.
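BeautifulSoup does this cleanup in two lines, as `parse_content` shows. For illustration, here is a dependency-free sketch of the same idea built on the standard library's `html.parser`, run against a made-up page:

```python
from html.parser import HTMLParser

class NoiseStripper(HTMLParser):
    """Collect text while skipping everything inside noise tags."""
    NOISE = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth inside noise tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = (
    "<html><head><style>body{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav><h1>Energy engineer</h1>"
    "<p>Vasquez-Davidson</p><footer>(c) Fake Jobs</footer></body></html>"
)

parser = NoiseStripper()
parser.feed(html)
clean_text = "\n".join(parser.chunks)
print(clean_text)  # Energy engineer\nVasquez-Davidson
```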
How Does the Pagination Node Follow “Next Page” Links?
Have you noticed how many scraping guides skip pagination? In real life, the data you want often spans many pages. This node solves that problem.
Instead of hard-coding a URL pattern for page links (which breaks the moment the site changes its URL scheme), we first filter anchor tags for keywords like “next” and “>>” and then let the LLM pick the right link. The max_pages guard stops the agent from looping forever.
```python
def check_pagination(state: ScraperState) -> dict:
    """Determine if there are more pages to scrape."""
    current = state.get("current_page", 1)
    max_pages = state.get("max_pages", 5)
    if current >= max_pages:
        return {"status": "all_pages_scraped"}

    soup = BeautifulSoup(state["raw_html"], "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        link_text = a_tag.get_text(strip=True).lower()
        href = a_tag["href"]
        if any(kw in link_text for kw in ["next", ">>", "\u203a", "older"]):
            links.append(f"{link_text}: {href}")

    if not links:
        return {"status": "all_pages_scraped"}

    prompt = f"""Which link leads to the next page of results?
Links found:
{chr(10).join(links)}
Current URL: {state.get('next_page_url') or state['url']}
Return ONLY the full absolute URL. If no next page exists, return NONE.
If the href is relative, combine it with the base URL."""

    response = llm.invoke([HumanMessage(content=prompt)])
    answer = response.content.strip()

    if answer == "NONE" or not answer.startswith("http"):
        return {"status": "all_pages_scraped"}

    return {
        "next_page_url": answer,
        "current_page": current + 1,
        "status": "has_next_page",
    }
```
Why filter links first? Because a normal page has 50 to 200 anchor tags. Dumping all of them into the prompt wastes tokens and clouds the choice. When you filter for words like “next” and “>>”, you trim the list to maybe 1-3 links. The model then has an easy time picking the right one.
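One robustness tweak worth considering: the prompt asks the LLM to resolve relative hrefs, but Python can do that deterministically with `urllib.parse.urljoin` before the link list ever reaches the model:

```python
from urllib.parse import urljoin

base = "https://example.com/jobs?page=2"

print(urljoin(base, "/jobs?page=3"))       # https://example.com/jobs?page=3
print(urljoin(base, "page-3.html"))        # https://example.com/page-3.html
print(urljoin(base, "https://other.com"))  # absolute URLs pass through unchanged
```

In `check_pagination`, appending `urljoin(current_url, href)` instead of the raw href would let the prompt drop the "combine it with the base URL" instruction entirely.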
How Does the Analysis Node Turn Raw Data into Insights?
Scraping without a report is just data hoarding. What patterns hide in the data? What does the spread look like? Are there odd values? The analysis node answers these questions.
It sends the first 50 records to the LLM (to stay within token limits) and asks for a report with totals, stats, patterns, and data quality notes. For a real system, I would run the math in Python and only send the results to the LLM for review. LLMs are great at reading numbers but sometimes get the math wrong.
```python
def analyze_data(state: ScraperState) -> dict:
    """Generate analytical summary from all scraped data."""
    data = state.get("extracted_data", [])
    if not data:
        return {
            "analysis_report": "No data was extracted.",
            "status": "complete",
        }

    data_str = json.dumps(data[:50], indent=2)
    prompt = f"""Analyze this scraped dataset and produce a report.
Total records collected: {len(data)}
Sample data (first 50 records):
{data_str}
Include these sections:
1. **Summary**: Total records, fields present, data completeness
2. **Key Statistics**: Counts, averages, ranges where applicable
3. **Notable Patterns**: Trends, outliers, interesting findings
4. **Data Quality Notes**: Missing fields, inconsistencies
Be specific. Use actual numbers from the data."""

    response = llm.invoke([
        SystemMessage(content="You are a data analyst. Be concise and specific."),
        HumanMessage(content=prompt),
    ])
    return {
        "analysis_report": response.content,
        "status": "complete",
    }
```
Key Insight: Use Python for math and the LLM for meaning. Do not ask GPT to average 500 prices. Compute it in Python, then ask GPT what the number tells you. It is faster, cheaper, and free of math errors.
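A sketch of that division of labor, with made-up sample records and the arithmetic handled by the standard library's `statistics` module:

```python
from statistics import mean, median

# Hypothetical scraped records
records = [
    {"name": "Widget", "price": 19.99},
    {"name": "Gadget", "price": 34.50},
    {"name": "Gizmo", "price": 12.00},
]

# Python does the math
prices = [r["price"] for r in records]
stats = {
    "count": len(prices),
    "mean": round(mean(prices), 2),
    "median": median(prices),
    "min": min(prices),
    "max": max(prices),
}
print(stats)

# The LLM only interprets the pre-computed numbers
prompt = f"Here are the price statistics: {stats}. What stands out?"
```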
How Do You Wire the Graph? Routing and Conditional Edges
This is the part that makes LangGraph click. We link the nodes with edges, and two conditional edges create the branching logic: one for error recovery after fetch, one for the page loop.
The routing functions are kept simple on purpose. Each one checks a single state field and returns a string that maps to the next node.
```python
def route_after_fetch(state: ScraperState) -> str:
    """Decide next step after fetching a page."""
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"


def route_after_pagination(state: ScraperState) -> str:
    """Continue scraping or move to analysis."""
    if state["status"] == "has_next_page":
        return "fetch_next"
    return "analyze"
```
Three paths come out of route_after_fetch. On success, go to parsing. On a fixable failure, loop back to fetch. After too many retries, skip ahead to analysis with whatever data we have. No nested conditions. No tangled logic.
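Because the routers are pure functions of state, every branch can be unit-tested without compiling the graph or touching the network. Here is `route_after_fetch` again, shown standalone with all three paths exercised:

```python
def route_after_fetch(state: dict) -> str:
    """Same logic as the pipeline's router, shown standalone."""
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"

print(route_after_fetch({"status": "fetched"}))                        # parse
print(route_after_fetch({"status": "fetch_error", "retry_count": 1}))  # retry_fetch
print(route_after_fetch({"status": "fetch_error", "retry_count": 3}))  # analyze
```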
Now here is the full graph setup. Notice how add_conditional_edges takes a routing function plus a mapping dict. Every path is spelled out in the code so you can trace it at a glance.
```python
graph = StateGraph(ScraperState)

# Add all nodes
graph.add_node("fetch", fetch_page)
graph.add_node("parse", parse_content)
graph.add_node("check_pagination", check_pagination)
graph.add_node("analyze", analyze_data)

# Entry point
graph.add_edge(START, "fetch")

# Conditional: parse on success, retry on failure, analyze on exhaustion
graph.add_conditional_edges(
    "fetch",
    route_after_fetch,
    {
        "parse": "parse",
        "retry_fetch": "fetch",
        "analyze": "analyze",
    },
)

# After parsing, always check for more pages
graph.add_edge("parse", "check_pagination")

# Conditional: fetch next page or finalize
graph.add_conditional_edges(
    "check_pagination",
    route_after_pagination,
    {
        "fetch_next": "fetch",
        "analyze": "analyze",
    },
)

# Analysis is the terminal node
graph.add_edge("analyze", END)

scraper_agent = graph.compile()
```
That routing map — the third argument to add_conditional_edges — is what makes the graph easy to read. Anyone looking at this code can follow every path the agent might take without running it. I find this much clearer than deep if-else chains.
How Do You Run the Pipeline? A Full Example
Time to see it in action. We will scrape job listings from Real Python’s fake jobs page. It is a static demo site made for scraping practice. No rate limits, no terms-of-service worries.
The starting state sets the target URL, a goal that says what fields to grab, and a 3-page cap to keep the demo quick.
```python
initial_state = {
    "url": "https://realpython.github.io/fake-jobs/",
    "goal": (
        "Extract all job listings. For each job, get: "
        "title, company, location, and posting date."
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 3,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}

result = scraper_agent.invoke(initial_state)
```
Once the pipeline is done, check the results. The extracted_data list holds clean dicts, and analysis_report has the LLM’s summary.
```python
print(f"Records extracted: {len(result['extracted_data'])}")
print("\nFirst 3 records:")
for record in result["extracted_data"][:3]:
    print(json.dumps(record, indent=2))

print(f"\n{'=' * 50}")
print("ANALYSIS REPORT")
print(f"{'=' * 50}")
print(result["analysis_report"])

if result["error_log"]:
    print(f"\nErrors: {result['error_log']}")
```
You will see job data and an overview. The exact records depend on what the LLM pulls out, but the layout looks like this:
```text
Records extracted: 100

First 3 records:
{
  "title": "Energy engineer",
  "company": "Vasquez-Davidson",
  "location": "Christopherport, AA",
  "posting_date": "2021-04-08"
}
...
```
Note: The site `realpython.github.io/fake-jobs/` is a static demo page with 100 fake listings on one page — no real pagination. The pipeline handles this just fine. It finds no “next” links and goes straight to analysis. To test the page loop, point the pipeline at a site that does have pages.
How Do You Add Error Recovery to Make It Production-Ready?
The basic pipeline works for happy paths. But what about a 429 rate-limit reply? A timeout on a slow server? A page that sends back garbled HTML?
A separate error handler node looks at what went wrong and sets a status that the routing function can act on. The handler figures out the problem. The router picks the next step. The fetch node carries it out. Clean split of duties.
```python
def handle_error(state: ScraperState) -> dict:
    """Classify errors and set recovery strategy."""
    errors = state.get("error_log", [])
    last_error = errors[-1] if errors else "Unknown error"

    if "429" in last_error or "rate" in last_error.lower():
        return {
            "status": "rate_limited",
            "error_log": ["Rate limited — backing off before retry"],
        }
    if "timeout" in last_error.lower():
        return {
            "status": "timeout_retry",
            "error_log": ["Timeout — retrying with longer wait"],
        }
    return {
        "status": "unrecoverable",
        "error_log": [f"Giving up after error: {last_error}"],
    }
```
In a real setup, you would also want slower retries for rate limits, proxy switching for IP bans, and a dead-letter queue for errors that keep coming back. Those are full topics on their own. The node setup makes adding them easy because each concern lives in its own node.
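To show how those statuses could drive routing, here is one hypothetical shape for the recovery logic. The `RECOVERY_DELAYS` table, `route_after_error`, and `recovery_delay` are all assumptions for illustration, not part of the pipeline above:

```python
# Hypothetical per-classification wait times, in seconds
RECOVERY_DELAYS = {"rate_limited": 30, "timeout_retry": 10}

def route_after_error(state: dict) -> str:
    """Map handle_error's status to the next node."""
    if state["status"] in RECOVERY_DELAYS:
        return "fetch"    # recoverable: loop back and retry
    return "analyze"      # unrecoverable: salvage what we have

def recovery_delay(state: dict) -> int:
    """Delay the fetch node should apply before the retry."""
    return RECOVERY_DELAYS.get(state["status"], 0)

print(route_after_error({"status": "rate_limited"}))   # fetch
print(recovery_delay({"status": "rate_limited"}))      # 30
print(route_after_error({"status": "unrecoverable"}))  # analyze
```

Keeping the delay lookup out of the router keeps the routing function pure, which makes it as easy to unit-test as the other routers in this pipeline.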
Exercise 1: Add a Data Validation Node
You have seen how each node does one job. Now add a validate_data node that checks records for quality before they reach the analysis step.
The node should drop records that are missing required keys and log how many it removed. If 90% of records get dropped, that points to a problem with the prompt — not the data.
```python
# Complete this function
def validate_data(state: ScraperState) -> dict:
    """Validate and clean extracted data."""
    data = state.get("extracted_data", [])
    required_keys = {"title", "company", "location"}

    # TODO: Filter records that contain all required keys
    # TODO: Count how many records were removed
    # TODO: Return cleaned data with appropriate status
    # Caveat: extracted_data uses the merge_lists reducer, so returning it
    # here APPENDS to the existing list rather than replacing it. Consider
    # writing valid records to a separate key such as validated_data.
    valid_records = []  # Your filtering logic here
    removed_count = 0   # Your count here

    return {
        "extracted_data": valid_records,
        "error_log": [
            f"Validation: kept {len(valid_records)}, "
            f"removed {removed_count} incomplete records"
        ],
        "status": "validated" if valid_records else "no_valid_data",
    }
```
How Do You Adapt the Pipeline for Different Goals?
The same pipeline scrapes any kind of data. You do not change the code — you change the goal string. The LLM adapts how it pulls data at runtime.
Want product listings instead of jobs?
```python
product_state = {
    "url": "https://example-store.com/electronics",
    "goal": (
        "Extract product listings: name, price in USD, "
        "star rating as a float, and availability status"
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 5,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}
```
Research paper details? Same pipeline, different goal:
```python
research_state = {
    "url": "https://arxiv.org/list/cs.AI/recent",
    "goal": (
        "Extract paper listings: title, authors, "
        "abstract summary, and submission date"
    ),
    "extracted_data": [],
    "current_page": 1,
    "max_pages": 2,
    "next_page_url": "",
    "analysis_report": "",
    "error_log": [],
    "retry_count": 0,
    "status": "ready",
}
```
Tip: Be very clear in your goal. Vague goals like “get all data” produce messy, uneven JSON. Goals like “extract product name, price in USD, and star rating as a float” give the LLM sharp targets and produce cleaner output.
Exercise 2: Add Rate Limiting You Can Tune
Web servers do not like rapid-fire requests. Change the fetch logic to use a delay you can set from state instead of the fixed 2 seconds.
```python
# Add a 'fetch_delay' field to the state and use it
def fetch_page_configurable(state: ScraperState) -> dict:
    """Fetch with configurable delay between pages."""
    current_page = state.get("current_page", 1)
    delay = state.get("fetch_delay", 2)  # Default 2 seconds

    # TODO: Apply delay for non-first pages
    # TODO: Fetch the URL with error handling
    url = state.get("next_page_url") or state["url"]
    pass  # Complete the implementation
```
What Are the Most Common Mistakes?
Mistake 1: No Page Limit on Pagination
❌ Wrong:
```python
initial_state = {
    "max_pages": 999,  # Or omitting it entirely
}
```
Why this is risky: Some sites have thousands of pages. Your pipeline runs for hours, burns API credits, and might get your IP banned.
✅ Correct:
```python
initial_state = {
    "max_pages": 10,  # Start small, increase if needed
}
```
Mistake 2: Sending Raw HTML to the LLM
❌ Wrong:
```python
prompt = f"Extract data from: {state['raw_html']}"
```
Why it fails: Raw HTML is 80% junk. A product page might hold 150KB of HTML but only 2KB of real content. You waste tokens and confuse the model.
✅ Correct:
```python
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()
clean_text = soup.get_text(separator="\n", strip=True)[:12000]
```
Mistake 3: No Status Checks in Routing
❌ Wrong:
```python
def route_after_fetch(state):
    return "parse"  # Always parse, even on failure
```
Why it breaks: If the fetch failed, raw_html is either empty or left over from the last page. The parse node crashes or makes copies.
✅ Correct:
```python
def route_after_fetch(state):
    if state["status"] == "fetched":
        return "parse"
    if state.get("retry_count", 0) < 3:
        return "retry_fetch"
    return "analyze"  # Graceful fallback
```
When Should You NOT Use This Approach?
This pipeline is not the best tool for every job. Here is when you should reach for something else:
Use an API instead if the site has one. APIs give you clean JSON — no parsing, no LLM costs. Always look for dev docs or /api/ endpoints first.
Use fixed selectors for high-volume scraping. At 100K pages with 12K tokens each, LLM costs add up to about $180. CSS selectors cost nothing for pulling data. If the site layout is stable, selectors are the smart choice.
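That $180 figure checks out with back-of-the-envelope arithmetic, assuming gpt-4o-mini input pricing of roughly $0.15 per million tokens (an assumed rate; check current pricing before relying on it):

```python
pages = 100_000
tokens_per_page = 12_000
price_per_million_input_tokens = 0.15  # assumed gpt-4o-mini input rate, USD

cost_per_page = tokens_per_page / 1_000_000 * price_per_million_input_tokens
total = pages * cost_per_page
print(f"${cost_per_page:.4f} per page, ${total:.0f} total")  # $0.0018 per page, $180 total
```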
Use a plain scraper for real-time tracking. The LLM adds 1-3 seconds of delay per page. If you need sub-second scraping for price tracking, you need hard-coded logic.
This pipeline shines when site layouts change often, you are scraping many sites with different layouts, or you are building a quick prototype that needs to work across many domains without custom selectors for each one.
Summary
You built a web scraping pipeline with LangGraph that fetches pages, pulls out clean data with an LLM, follows page links, bounces back from errors, and writes a report.
Four design choices make it work:
- State reducers (`merge_lists`) stack up data across page loops without wiping out old data
- Conditional edges create the page loop and error recovery branches
- LLM-driven pulling adapts to any page layout without fixed selectors
- Single-job nodes — each does one thing, and routing functions decide the flow
The setup grows in a natural way. Need to check data quality? Add a node. Need CSV export? Add a node. Need to remove copies? Add a node. Each one plugs into the graph at the right spot without touching the code that already works.
Practice Exercise
Extend the pipeline with an export_node that runs after analysis and writes extracted_data to a CSV file using Python’s csv.DictWriter.
Frequently Asked Questions
Can this pipeline handle pages rendered by JavaScript?
No — requests grabs raw HTML only. For JS-heavy sites (React, Vue, Angular), swap requests.get() for Selenium or Playwright in the fetch node. The rest of the pipeline stays the same because each node is self-contained.
```python
# Swap this into the fetch node for JS-rendered pages
from playwright.sync_api import sync_playwright

def fetch_with_playwright(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html
```
How much does it cost to run this pipeline?
With gpt-4o-mini, each page costs about $0.001-$0.003 in API fees for parsing plus the page-link check. Scraping 100 pages runs about $0.15-$0.30 total. The analysis step adds $0.005-$0.01. If costs matter a lot, swap ChatOpenAI for ChatOllama and run a local model like Llama 3.
Is web scraping legal?
It depends on where you are and which site you scrape. In the US, the hiQ v. LinkedIn ruling said that scraping data that is open to the public does not break the CFAA. That said — always check robots.txt and terms of service. Respect rate limits. Do not scrape personal data without consent.
How do I scrape sites that need a login?
Add a login_node that signs in first and stores session cookies in state. Later fetch requests send those cookies along. You can also pass auth headers straight into the fetch node’s requests.get() call.
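Here is a sketch of the state-based half of that idea. The cookie names and `login_node` body are hypothetical; a real login node would POST credentials and read the Set-Cookie response:

```python
def login_node(state: dict) -> dict:
    """Pretend login: in practice, POST credentials and capture cookies."""
    return {"session_cookies": {"sessionid": "abc123", "csrftoken": "xyz"}}

def cookie_header(state: dict) -> dict:
    """Build the Cookie header later fetches attach to requests.get()."""
    cookies = state.get("session_cookies", {})
    if not cookies:
        return {}
    return {"Cookie": "; ".join(f"{k}={v}" for k, v in cookies.items())}

state = {}
state.update(login_node(state))
print(cookie_header(state))  # {'Cookie': 'sessionid=abc123; csrftoken=xyz'}
```

In the fetch node, you would merge this dict into the existing `headers` before calling `requests.get()`.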
References
- LangGraph documentation — StateGraph, conditional edges, and state management
- LangChain documentation — ChatOpenAI model integration
- BeautifulSoup documentation — parsing HTML and navigating the tree
- Python requests library documentation
- Cohorte Projects — How to Build a Smart Web-Scraping AI Agent with LangGraph and Selenium
- Firecrawl — Building a Documentation Agent with LangGraph and Firecrawl
- Real Python — LangGraph: Build Stateful AI Agents in Python
- hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019) — legal precedent for public data scraping
Reviewed: March 2026 | LangGraph version: 0.4+ | Python: 3.10+