LangGraph Document Processing Agent: Multi-Modal

Build a LangGraph agent that reads PDFs, images, and text, cross-checks facts across sources, and writes a clean JSON report — with full code walkthrough.

Written by Selva Prabhakaran | 31 min read

Imagine this: you get a PDF spec, two whiteboard photos, and a text file full of meeting notes. Your boss wants a single summary that brings data from all three together. You could open each file, copy bits by hand, and piece a report together. Or you could build an agent that does it in under a minute. That is what we are going to build here.

Before we write any code, let me explain how the parts fit together.

The agent takes in a batch of files — PDFs, images, plain text, or any blend. It has no idea what each file holds. So the very first thing it does is look at each file’s type and hand it off to the matching reader. PDFs get parsed page by page. Images get sent to a vision model that can read text in photos and screenshots. Text files are read straight from disk.

Once every file has been turned into plain text, the agent puts all the content into one place. But raw text alone is not very useful. The agent needs to find links between the files — which ones mention the same people, dates, or facts. That is the cross-check step. It looks across all files and flags overlaps, clashes, and details that show up in only one source.

At the end, the agent shapes everything into a clean JSON report. The report has a summary, key facts from each source, and a map of which facts got backed up by more than one file. That is what your later systems can use.

We will build each piece as its own node in a LangGraph state graph. By the time you finish, you will have a working agent you can point at any folder of mixed files.

What Does the Layout Look Like? Five Nodes, One Graph

I like to start every project with a clear picture of the parts. Here is what we are going to build.

| Node | Input | Output | Purpose |
|------|-------|--------|---------|
| classify_documents | Raw file paths | Paths + found types | Routes each file to its reader |
| extract_content | Typed file paths | Text per file | Pulls text from PDFs, images, text files |
| cross_reference | All pulled texts | Cross-check results | Finds overlaps, clashes, unique facts |
| generate_report | Cross-checked data | JSON report | Writes the final output |
| quality_check | The report | Pass or retry flag | Catches gaps in the output |

On a typical run the graph moves in a straight line: classify, extract, cross-check, report, quality check. The twist comes at the end. The quality check node carries a conditional edge. When the report has gaps, it sends the flow back to generate_report for a second attempt. When the report looks complete, the graph wraps up.

Key Insight: One node, one job. The classifier never pulls text. The reader never runs analysis. Keeping these duties apart makes the agent simple to debug — when something fails, you can trace it to one spot right away.

Prerequisites

  • Python version: 3.10+
  • Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+), pymupdf (1.24+), Pillow (10.0+)
  • Install: pip install langgraph langchain-openai langchain-core pymupdf Pillow
  • API key: An OpenAI API key set as OPENAI_API_KEY. See OpenAI’s docs to create one.
  • How long it takes: ~45 minutes
  • What you should know: LangGraph basics (nodes, edges, state). If you are new to LangGraph, start with our LangGraph fundamentals guide.
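
Before the first run, it helps to fail fast if the API key is missing. A minimal check, assuming the key lives in the OPENAI_API_KEY environment variable listed above (the helper name is mine, not part of the agent):

python
import os

def require_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before running the agent.")
    return key

Calling require_api_key() at the top of your script turns a cryptic mid-run authentication error into an immediate, readable one.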

The code block below loads all the tools we will use: the LLM wrapper, message types, LangGraph graph helpers, and the libraries for reading files.

python
import os
import json
import base64
from typing import TypedDict, Annotated, Literal
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

import fitz  # PyMuPDF
from PIL import Image

Step 1 — How Do You Define the Agent State?

Every LangGraph graph begins with a state schema. Think of it as a shared bag of data that every node can read and write.

For our agent, the state tracks file paths, detected types, pulled text, and the finished report. We lay it out as a TypedDict with clear names. One detail to notice: the messages field uses the add_messages reducer. This means new messages get tacked on to the end of the list rather than replacing it. That is what we want as the nodes pass data back and forth.

python
class DocumentState(TypedDict):
    """State that flows through the document processing agent."""
    # Input
    file_paths: list[str]

    # Classification results
    classified_files: dict[str, str]  # {path: "pdf"|"image"|"text"}

    # Extraction results
    extracted_texts: dict[str, str]  # {path: extracted_text}

    # Cross-reference analysis
    cross_references: dict

    # Final output
    report: dict
    report_quality: str  # "pass" or "retry"

    # LLM conversation
    messages: Annotated[list, add_messages]

Every field has a single role. classified_files links each path to its type. extracted_texts links each path to the text we got from it. cross_references stores the review across sources. And report holds the end result.

Tip: Keep your state fields flat. Nesting dicts inside dicts makes debugging a pain. If you find yourself three levels deep, it is time to rethink the layout.

Step 2 — How Do You Build the File Classifier?

The classifier peeks at each file’s ending and gives it a type label. This is pure routing logic — it decides which reader each file will land on.

No LLM needed for this step. File endings are plenty. Why burn tokens when a basic string check does the trick?

python
def classify_documents(state: DocumentState) -> dict:
    """Classify each document by file type."""
    classified = {}
    image_extensions = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".webp"}
    text_extensions = {".txt", ".md", ".csv", ".json"}

    for file_path in state["file_paths"]:
        ext = Path(file_path).suffix.lower()
        if ext == ".pdf":
            classified[file_path] = "pdf"
        elif ext in image_extensions:
            classified[file_path] = "image"
        elif ext in text_extensions:
            classified[file_path] = "text"
        else:
            classified[file_path] = "text"  # fallback

    return {"classified_files": classified}

The function hands back a dict update. LangGraph folds it into the state behind the scenes. After this step, every file carries a type tag.
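Because the node is a pure function of state, you can unit-test it without compiling a graph. A self-contained sketch (repeating the node body from above so it runs on its own) that also shows the case-insensitive suffix check and the text fallback:

python
from pathlib import Path

def classify_documents(state: dict) -> dict:
    """The classifier node from above, repeated so this snippet runs standalone."""
    classified = {}
    image_extensions = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".webp"}
    text_extensions = {".txt", ".md", ".csv", ".json"}
    for file_path in state["file_paths"]:
        ext = Path(file_path).suffix.lower()
        if ext == ".pdf":
            classified[file_path] = "pdf"
        elif ext in image_extensions:
            classified[file_path] = "image"
        elif ext in text_extensions:
            classified[file_path] = "text"
        else:
            classified[file_path] = "text"  # fallback
    return {"classified_files": classified}

result = classify_documents({"file_paths": ["spec.pdf", "photo.JPG", "misc.parquet"]})
print(result["classified_files"])
# {'spec.pdf': 'pdf', 'photo.JPG': 'image', 'misc.parquet': 'text'}

Note that .JPG routes to the image reader despite the uppercase suffix, and the unknown .parquet extension falls back to the text reader.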

Step 3 — How Do You Build the Content Readers?

This is where the multimodal magic happens. We need three ways to read files — one for each type.

PDF reading uses PyMuPDF (fitz). It grabs text from every page and labels them. That makes it easy to trace where a fact came from later on.

python
# Initialize the vision-capable model
llm = ChatOpenAI(model="gpt-4o", temperature=0)


def extract_from_pdf(file_path: str) -> str:
    """Extract text from a PDF using PyMuPDF."""
    doc = fitz.open(file_path)
    text_parts = []
    for page_num, page in enumerate(doc):
        page_text = page.get_text()
        if page_text.strip():
            text_parts.append(f"[Page {page_num + 1}]\n{page_text}")
    doc.close()
    return "\n\n".join(text_parts)

Image reading sends the file to GPT-4o’s vision mode. The model can read printed text, screenshots, even handwriting — pretty much anything visual. We turn the image into base64 and pair it with a prompt that tells the model what to look for.

python
def extract_from_image(file_path: str) -> str:
    """Extract text from an image using GPT-4o vision."""
    with open(file_path, "rb") as f:
        image_bytes = f.read()

    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    ext = Path(file_path).suffix.lower().lstrip(".")
    mime_type = f"image/{ext}" if ext != "jpg" else "image/jpeg"

    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": (
                    "Extract ALL text and information from this image. "
                    "Include handwritten text, printed text, diagrams, "
                    "labels, numbers, and tables. Return the content "
                    "as structured text."
                ),
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:{mime_type};base64,{image_base64}"
                },
            },
        ]
    )

    response = llm.invoke([message])
    return response.content

Text reading is the easy one. Open the file and read what is inside.

python
def extract_from_text(file_path: str) -> str:
    """Read a text file directly."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

The reader node below ties all three methods into one place. It walks through the sorted files and calls the matching function for each type. Watch the try/except block — one corrupt file must not bring down the whole pipeline.

python
def extract_content(state: DocumentState) -> dict:
    """Extract text content from each classified document."""
    extracted = {}
    for file_path, file_type in state["classified_files"].items():
        try:
            if file_type == "pdf":
                extracted[file_path] = extract_from_pdf(file_path)
            elif file_type == "image":
                extracted[file_path] = extract_from_image(file_path)
            elif file_type == "text":
                extracted[file_path] = extract_from_text(file_path)
        except Exception as e:
            extracted[file_path] = f"[EXTRACTION ERROR: {str(e)}]"

    return {"extracted_texts": extracted}

Warning: Base64-encoding big images eats memory. A 10 MB photo turns into ~13.3 MB as base64. If you are working with lots of high-res images, shrink them first. The vision model does not need 4K to read text.
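
One way to honor that warning with Pillow, which is already installed above. This is a sketch; the max_side of 1024 and the JPEG quality are assumptions you can tune:

python
from io import BytesIO
from PIL import Image

def downscale_for_vision(image_bytes: bytes, max_side: int = 1024) -> bytes:
    """Shrink an image so its longest side is at most max_side pixels."""
    img = Image.open(BytesIO(image_bytes))
    img.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    return buf.getvalue()

You would call this on image_bytes inside extract_from_image, before the base64 encoding step.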

Step 4 — How Do You Cross-Check Across Files?

Here is where the agent goes from tool to analyst. Any script can pull text. Finding which claims get backed up by more than one source — that is what makes this an agent.

The cross-checker sends all the pulled text to the LLM in one prompt. It asks the model to find three things: facts that appear in two or more files, points where files clash, and facts that show up in only one source.

python
def cross_reference(state: DocumentState) -> dict:
    """Cross-reference information across all documents."""
    context_parts = []
    for file_path, text in state["extracted_texts"].items():
        filename = Path(file_path).name
        context_parts.append(f"=== Document: {filename} ===\n{text}")

    combined_context = "\n\n".join(context_parts)

    system_prompt = SystemMessage(content=(
        "You are a document analyst. You receive text extracted "
        "from multiple documents. Cross-reference them and return "
        "a JSON object with three keys:\n\n"
        "1. 'confirmed_facts': Facts in 2+ documents. Include "
        "which documents confirm each fact.\n"
        "2. 'contradictions': Facts where documents disagree. "
        "Include conflicting claims and sources.\n"
        "3. 'unique_facts': Facts in only one document. "
        "Include the source.\n\n"
        "Return ONLY valid JSON. No markdown, no explanation."
    ))

    human_prompt = HumanMessage(content=combined_context)
    response = llm.invoke([system_prompt, human_prompt])

    try:
        cross_ref_data = json.loads(response.content)
    except json.JSONDecodeError:
        cross_ref_data = {
            "confirmed_facts": [],
            "contradictions": [],
            "unique_facts": [],
            "raw_analysis": response.content,
        }

    return {
        "cross_references": cross_ref_data,
        "messages": [human_prompt, response],
    }

The function tags each file’s text with its filename, then passes the whole batch to the LLM in one shot. We try to parse the reply as JSON. If the model sends back broken JSON (and it does this more than you would expect), we save the raw text as a fallback.

Step 5 — How Do You Build the Report?

The report builder takes cross-checked data and shapes it into the final output. We hand the LLM a strict JSON schema to follow.

python
def generate_report(state: DocumentState) -> dict:
    """Generate a structured report from cross-referenced data."""
    system_prompt = SystemMessage(content=(
        "You are a report writer. Using the cross-reference "
        "analysis below, generate a JSON report with:\n\n"
        "{\n"
        '  "title": "Document Analysis Report",\n'
        '  "summary": "2-3 sentence executive summary",\n'
        '  "sources": ["list of source documents"],\n'
        '  "key_findings": [\n'
        '    {"finding": "...", "confidence": "high|medium|low", '
        '     "sources": ["doc1", "doc2"]}\n'
        "  ],\n"
        '  "contradictions": [\n'
        '    {"claim_a": "...", "source_a": "...", '
        '     "claim_b": "...", "source_b": "..."}\n'
        "  ],\n"
        '  "recommendations": ["actionable next steps"]\n'
        "}\n\n"
        "Return ONLY valid JSON."
    ))

    human_prompt = HumanMessage(
        content=json.dumps(state["cross_references"], indent=2)
    )

    response = llm.invoke([system_prompt, human_prompt])

    try:
        report = json.loads(response.content)
    except json.JSONDecodeError:
        report = {"raw_report": response.content, "error": "Invalid JSON"}

    return {"report": report, "messages": [response]}

One thing to note: the builder never goes back to the raw files. It feeds off the cross-checked data only. This keeps the prompt lean and the output sharp.

Step 6 — How Does the Quality Check Work with Conditional Routing?

The quality check opens up the report and asks a simple question: did we get everything? If key fields are gone or sections sit empty, the node flags a redo.

python
def quality_check(state: DocumentState) -> dict:
    """Validate the generated report for completeness."""
    report = state.get("report", {})
    required_fields = [
        "title", "summary", "sources",
        "key_findings", "recommendations",
    ]

    missing = [f for f in required_fields if f not in report]

    if missing:
        return {"report_quality": "retry"}

    if not report.get("key_findings"):
        return {"report_quality": "retry"}

    if isinstance(report.get("summary"), str) and len(report["summary"]) < 20:
        return {"report_quality": "retry"}

    return {"report_quality": "pass"}

The routing function below checks report_quality in the state. When it says “retry”, the graph loops back to report writing. When it says “pass”, the run is over.

python
def route_after_quality_check(
    state: DocumentState,
) -> Literal["generate_report", "__end__"]:
    """Route based on quality check results."""
    if state.get("report_quality") == "retry":
        return "generate_report"
    return "__end__"

This is a pattern you will see a lot in LangGraph. Conditional edges give you retry loops, branching paths, and multi-step flows. Our check is basic on purpose — just field presence and length. If you need more rigor, you could layer in fact-checking against the source texts.

Step 7 — How Do You Wire the Graph and Run It?

All five nodes are built. Now we connect the dots.

python
# Build the graph
workflow = StateGraph(DocumentState)

# Add nodes
workflow.add_node("classify_documents", classify_documents)
workflow.add_node("extract_content", extract_content)
workflow.add_node("cross_reference", cross_reference)
workflow.add_node("generate_report", generate_report)
workflow.add_node("quality_check", quality_check)

# Linear edges
workflow.add_edge(START, "classify_documents")
workflow.add_edge("classify_documents", "extract_content")
workflow.add_edge("extract_content", "cross_reference")
workflow.add_edge("cross_reference", "generate_report")
workflow.add_edge("generate_report", "quality_check")

# Conditional edge — retry or finish
workflow.add_conditional_edges(
    "quality_check",
    route_after_quality_check,
)

# Compile
app = workflow.compile()

That is it: five nodes, five edges, and one conditional edge. The compile() call checks the wiring and hands back a runnable app.

Tip: Draw your graph during dev work. Call `app.get_graph().draw_mermaid_png()` in a notebook to see the layout. It catches wiring mistakes right away.

Let’s try it out with some sample files. We will make three text files that mimic a real project — meeting notes, a project spec, and a budget memo. I planted a clash on purpose: the meeting notes list the budget as $150,000, while the project spec says $145,000. Let’s see if the agent spots it.

python
# Create sample documents
os.makedirs("sample_docs", exist_ok=True)

# Document 1: Meeting notes
with open("sample_docs/meeting_notes.txt", "w") as f:
    f.write(
        "Project Alpha Meeting Notes - March 2026\n"
        "=========================================\n\n"
        "Attendees: Sarah Chen, Mike Rivera, Priya Patel\n\n"
        "Key Decisions:\n"
        "- Budget approved: $150,000 for Q2\n"
        "- Launch date: June 15, 2026\n"
        "- Tech stack: Python + FastAPI + PostgreSQL\n"
        "- Mike will lead the backend team\n\n"
        "Action Items:\n"
        "- Sarah: finalize vendor contracts by March 20\n"
        "- Priya: complete UI mockups by March 25\n"
        "- Mike: set up CI/CD pipeline by April 1\n"
    )

# Document 2: Project spec
with open("sample_docs/project_spec.txt", "w") as f:
    f.write(
        "Project Alpha - Technical Specification\n"
        "=======================================\n\n"
        "Overview: Internal tool for automated report generation.\n"
        "Budget: $145,000 (approved by Finance on Feb 28)\n"
        "Timeline: Development starts March 1, launch June 2026\n\n"
        "Tech Stack:\n"
        "- Backend: Python 3.12, FastAPI\n"
        "- Database: PostgreSQL 16\n"
        "- Frontend: React + TypeScript\n"
        "- Deployment: AWS ECS\n\n"
        "Team:\n"
        "- Lead: Mike Rivera (backend)\n"
        "- Frontend: Priya Patel\n"
        "- PM: Sarah Chen\n"
    )

# Document 3: Budget memo
with open("sample_docs/budget_memo.txt", "w") as f:
    f.write(
        "Finance Department - Budget Memo\n"
        "================================\n\n"
        "Project: Alpha\n"
        "Approved Budget: $150,000\n"
        "Approval Date: February 28, 2026\n"
        "Approved By: CFO James Wilson\n\n"
        "Breakdown:\n"
        "- Development: $90,000\n"
        "- Infrastructure: $35,000\n"
        "- Testing & QA: $25,000\n\n"
        "Note: Budget includes 10% contingency.\n"
        "Q2 allocation confirmed.\n"
    )

print("Sample documents created.")
text
Sample documents created.

Time to run the agent on all three files.

python
result = app.invoke({
    "file_paths": [
        "sample_docs/meeting_notes.txt",
        "sample_docs/project_spec.txt",
        "sample_docs/budget_memo.txt",
    ],
    "classified_files": {},
    "extracted_texts": {},
    "cross_references": {},
    "report": {},
    "report_quality": "",
    "messages": [],
})

print(json.dumps(result["report"], indent=2))

The agent hands back a JSON report. The exact phrasing shifts from run to run, but the shape sticks to our schema.

json
{
  "title": "Document Analysis Report",
  "summary": "Analysis of three Project Alpha documents. Budget discrepancy found between meeting notes ($150K) and project spec ($145K). Team roles and tech stack are consistent across sources.",
  "sources": [
    "meeting_notes.txt",
    "project_spec.txt",
    "budget_memo.txt"
  ],
  "key_findings": [
    {
      "finding": "Launch date is June 2026",
      "confidence": "high",
      "sources": ["meeting_notes.txt", "project_spec.txt"]
    },
    {
      "finding": "Mike Rivera leads the backend team",
      "confidence": "high",
      "sources": ["meeting_notes.txt", "project_spec.txt"]
    },
    {
      "finding": "Tech stack: Python, FastAPI, PostgreSQL",
      "confidence": "high",
      "sources": ["meeting_notes.txt", "project_spec.txt"]
    }
  ],
  "contradictions": [
    {
      "claim_a": "Budget is $150,000",
      "source_a": "meeting_notes.txt",
      "claim_b": "Budget is $145,000",
      "source_b": "project_spec.txt"
    }
  ],
  "recommendations": [
    "Resolve the budget discrepancy between meeting notes and project spec",
    "Confirm the exact launch date (June 15 vs general June 2026)",
    "Add frontend details to the meeting notes for completeness"
  ]
}

Look at that — the agent found the $5,000 budget mismatch. It also confirmed team roles and the tech stack across sources. It even flagged the slight gap in launch date detail (June 15 vs. just “June 2026”).

How Do You Extend This to Real Image Files?

We tested with text files above. But remember, the agent already knows how to handle images via extract_from_image. Drop in an image path and the classifier routes it to the vision reader all by itself.

python
# Adding an image to the pipeline
# The classifier sees .png and routes to image extraction
image_paths = [
    "sample_docs/meeting_notes.txt",
    "sample_docs/project_spec.txt",
    "path/to/whiteboard_photo.png",  # uses GPT-4o vision
]

# classify_documents returns:
#   {"path/to/whiteboard_photo.png": "image"}
# extract_content calls extract_from_image() for that file

The best part of this design? Plugging in a new file type takes very little work. Want to handle Word docs? Toss .docx into the classifier, write an extract_from_docx helper, and the rest of the pipeline stays untouched.
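
A hedged sketch of that classifier change: the routing becomes a lookup table that gains one entry, and nothing downstream moves. The docx label and the TYPE_BY_EXT name are illustrative, not part of the original code:

python
from pathlib import Path

# Extension -> document type. Adding a format is one new entry here
# plus one new reader branch in extract_content.
TYPE_BY_EXT = {
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".txt": "text", ".md": "text", ".csv": "text", ".json": "text",
    ".docx": "docx",  # new: would route to a hypothetical extract_from_docx
}

def classify(file_paths: list[str]) -> dict[str, str]:
    """Table-driven variant of classify_documents with a text fallback."""
    return {p: TYPE_BY_EXT.get(Path(p).suffix.lower(), "text") for p in file_paths}

print(classify(["spec.docx", "notes.txt"]))
# {'spec.docx': 'docx', 'notes.txt': 'text'}

The table-driven form also makes the supported-formats list visible in one place, which is handy once you pass three or four types.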

Note: Vision calls cost more than text reads. Each one hits GPT-4o pricing. When your budget is tight, use local OCR (Tesseract) for clean printed text. Keep the vision API for handwritten notes or tricky visual layouts.

Exercise 1: Add a Document Summary Node

You have seen how the agent reads, cross-checks, and reports. Your turn. Add a node that writes short summaries of each file before the cross-check step.

Create a summarize_documents function that takes extracted_texts from the state and produces a 2-3 sentence summary per document using the LLM. Store the results in a dict called document_summaries. Wire this node between extract_content and cross_reference in the graph.

Starter code:

python
def summarize_documents(state: DocumentState) -> dict:
    """Summarize each extracted document."""
    summaries = {}
    for file_path, text in state["extracted_texts"].items():
        # Use the LLM to generate a 2-3 sentence summary
        # Hint: SystemMessage for instructions, HumanMessage for text
        pass
    return {"document_summaries": summaries}

Hint: Use llm.invoke() with a SystemMessage + HumanMessage pair for each document.

Solution:

python
def summarize_documents(state: DocumentState) -> dict:
    summaries = {}
    for file_path, text in state["extracted_texts"].items():
        response = llm.invoke([
            SystemMessage(content="Summarize this document in 2-3 sentences. Be specific about facts, dates, and numbers."),
            HumanMessage(content=text),
        ])
        summaries[file_path] = response.content
    return {"document_summaries": summaries}

The function loops through the extracted documents, sends each to the LLM with a summary prompt, and collects the results. Each summary captures the key facts from one document.

How Do You Handle Errors in a Real System?

In the real world, files break all the time. PDFs are locked with passwords. Images arrive corrupted. Text files use strange encodings. A solid agent must handle these bumps without taking the whole pipeline down.

Our reader already wraps each file in try/except. Below is an upgraded version that adds retry logic for fleeting problems — network timeouts, API rate limits, and the like.

python
def extract_content_with_retries(
    state: DocumentState, max_retries: int = 2
) -> dict:
    """Extract content with retry logic for failures."""
    extracted = {}

    for file_path, file_type in state["classified_files"].items():
        for attempt in range(max_retries + 1):
            try:
                if file_type == "pdf":
                    extracted[file_path] = extract_from_pdf(file_path)
                elif file_type == "image":
                    extracted[file_path] = extract_from_image(file_path)
                elif file_type == "text":
                    extracted[file_path] = extract_from_text(file_path)
                break  # success
            except Exception as e:
                if attempt == max_retries:
                    msg = f"{type(e).__name__}: {str(e)}"
                    extracted[file_path] = (
                        f"[FAILED after {max_retries + 1} attempts: {msg}]"
                    )

    return {"extracted_texts": extracted}

Temporary glitches get another chance. Lasting errors still get caught and logged. Either way the agent keeps going instead of coming to a halt.
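
The retry loop above tries again immediately, which rarely helps with rate limits. A small backoff wrapper is a common refinement; this is a sketch of mine, and the delay values are illustrative:

python
import random
import time

def with_backoff(fn, max_retries: int = 2, base_delay: float = 0.5):
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

Inside extract_content_with_retries you would wrap each extractor call, for example with_backoff(lambda: extract_from_image(file_path)).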

Exercise 2: Add Retry Limits to Quality Check

The quality check can send the graph back to generate_report for a retry. But what if the report keeps failing? Without a cap, the agent loops forever. Add a counter to stop that.

Modify quality_check to track retries. If retry_count reaches 2, force a "pass" even if the report is incomplete. Add a retry_count field to the state.

Starter code:

python
def quality_check_with_limit(state: DocumentState) -> dict:
    """Validate report with retry limit."""
    current_retries = state.get("retry_count", 0)
    report = state.get("report", {})
    required = ["title", "summary", "sources", "key_findings"]
    missing = [f for f in required if f not in report]

    # If fields missing AND retries < 2: retry
    # If retries >= 2: pass (give up gracefully)
    pass

Hint: Check current_retries >= 2 first. If true, return pass regardless.

Solution:

python
def quality_check_with_limit(state: DocumentState) -> dict:
    current_retries = state.get("retry_count", 0)
    report = state.get("report", {})
    required = ["title", "summary", "sources", "key_findings"]
    missing = [f for f in required if f not in report]
    if missing and current_retries < 2:
        return {"report_quality": "retry", "retry_count": current_retries + 1}
    return {"report_quality": "pass"}

The function checks two conditions: are fields missing, and have we retried fewer than 2 times? If both are true, it increments the counter and retries. Otherwise it passes — either the report is complete or we exhausted the retry budget.

What Are the Most Common Mistakes?

Mistake 1: Not setting up state fields at the start

Wrong:

python
result = app.invoke({
    "file_paths": ["doc.txt"],
})
# KeyError in downstream nodes

Why it breaks: Every node reads from the state. If classified_files does not exist yet, extract_content throws a KeyError and the whole run fails.

Fix:

python
result = app.invoke({
    "file_paths": ["doc.txt"],
    "classified_files": {},
    "extracted_texts": {},
    "cross_references": {},
    "report": {},
    "report_quality": "",
    "messages": [],
})
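
To avoid typing that boilerplate at every call site, you can wrap it in a small factory. This helper is an addition of mine, not part of the original agent:

python
def initial_state(file_paths: list[str]) -> dict:
    """Build a fully-initialized input state for app.invoke()."""
    return {
        "file_paths": file_paths,
        "classified_files": {},
        "extracted_texts": {},
        "cross_references": {},
        "report": {},
        "report_quality": "",
        "messages": [],
    }

Then every run is just result = app.invoke(initial_state(["doc.txt"])), and a new state field only needs to be added in one place.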

Mistake 2: Not catching bad JSON from the LLM

Wrong:

python
response = llm.invoke([system_prompt, human_prompt])
report = json.loads(response.content)  # crashes on markdown-wrapped JSON

Why it breaks: LLMs love to wrap JSON in code fences or drop in a sentence before the actual block. A bare json.loads() call chokes on the extra text.

Fix:

python
response = llm.invoke([system_prompt, human_prompt])
try:
    content = response.content.strip()
    if content.startswith("```"):
        content = content.split("\n", 1)[1].rsplit("```", 1)[0]
    report = json.loads(content)
except json.JSONDecodeError:
    report = {"raw_output": response.content, "parse_error": True}
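
Since both cross_reference and generate_report need this same dance, it is worth pulling into a helper. A sketch — the name parse_llm_json is mine:

python
import json

def parse_llm_json(raw: str) -> dict:
    """Best-effort JSON parse that tolerates markdown code fences."""
    content = raw.strip()
    if content.startswith("```"):
        # Drop the opening fence line and the trailing fence, if present
        content = content.split("\n", 1)[1].rsplit("```", 1)[0]
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return {"raw_output": raw, "parse_error": True}

print(parse_llm_json('```json\n{"a": 1}\n```'))
# {'a': 1}

Both LLM-calling nodes can then use report = parse_llm_json(response.content) and branch on the parse_error key.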

Mistake 3: Blowing past the LLM context window

Wrong:

python
# Concatenating 50-page PDFs without limits
combined = "\n".join(all_texts)  # could be 200K+ tokens
response = llm.invoke([SystemMessage(...), HumanMessage(content=combined)])

Why it breaks: GPT-4o caps out at 128K tokens. Three hefty PDFs can blow past that limit and trigger an API error.

Fix:

python
MAX_CHARS_PER_DOC = 15000  # ~3750 tokens per doc
for file_path, text in extracted_texts.items():
    if len(text) > MAX_CHARS_PER_DOC:
        text = text[:MAX_CHARS_PER_DOC] + "\n[TRUNCATED]"

When Should You NOT Use This Setup?

This agent works great for small to mid-sized batches (3-20 files, under 100 pages in total). But it is not the answer for every use case.

Choose a different path when:
– You have hundreds of files. The cross-check step dumps ALL text into one prompt. For huge collections, pair a vector database with RAG instead.
– You need instant results. Each LLM call adds 2-5 seconds. For timing-sensitive pipelines, stick with rule-based readers or local models.
– Your data must not leave your network. Try self-hosted models like Llama 3 or Mistral for on-site work.

Stronger options at scale:
LangChain + ChromaDB for RAG across massive file sets
Apache Tika + custom scripts for high-volume reading with zero LLM cost
Azure Document Intelligence or AWS Textract for enterprise-grade OCR

Complete Code

Click to expand the full script (copy-paste and run)
python
# Complete code from: Document Processing Agent with Multi-Modal Inputs
# Requires: pip install langgraph langchain-openai langchain-core pymupdf Pillow
# Python 3.10+
# Set OPENAI_API_KEY environment variable before running

import os
import json
import base64
from typing import TypedDict, Annotated, Literal
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

import fitz  # PyMuPDF
from PIL import Image

# --- State Definition ---

class DocumentState(TypedDict):
    file_paths: list[str]
    classified_files: dict[str, str]
    extracted_texts: dict[str, str]
    cross_references: dict
    report: dict
    report_quality: str
    messages: Annotated[list, add_messages]

# --- LLM Setup ---

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# --- Extraction Functions ---

def extract_from_pdf(file_path: str) -> str:
    doc = fitz.open(file_path)
    text_parts = []
    for page_num, page in enumerate(doc):
        page_text = page.get_text()
        if page_text.strip():
            text_parts.append(f"[Page {page_num + 1}]\n{page_text}")
    doc.close()
    return "\n\n".join(text_parts)

def extract_from_image(file_path: str) -> str:
    with open(file_path, "rb") as f:
        image_bytes = f.read()
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    ext = Path(file_path).suffix.lower().lstrip(".")
    mime_type = f"image/{ext}" if ext != "jpg" else "image/jpeg"
    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": (
                    "Extract ALL text and information from this image. "
                    "Include handwritten text, printed text, diagrams, "
                    "labels, numbers, and tables. Return the content "
                    "as structured text."
                ),
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{image_base64}"},
            },
        ]
    )
    response = llm.invoke([message])
    return response.content

def extract_from_text(file_path: str) -> str:
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

# --- Node Functions ---

def classify_documents(state: DocumentState) -> dict:
    classified = {}
    image_extensions = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".webp"}
    text_extensions = {".txt", ".md", ".csv", ".json"}
    for file_path in state["file_paths"]:
        ext = Path(file_path).suffix.lower()
        if ext == ".pdf":
            classified[file_path] = "pdf"
        elif ext in image_extensions:
            classified[file_path] = "image"
        elif ext in text_extensions:
            classified[file_path] = "text"
        else:
            classified[file_path] = "text"
    return {"classified_files": classified}

def extract_content(state: DocumentState) -> dict:
    extracted = {}
    for file_path, file_type in state["classified_files"].items():
        try:
            if file_type == "pdf":
                extracted[file_path] = extract_from_pdf(file_path)
            elif file_type == "image":
                extracted[file_path] = extract_from_image(file_path)
            elif file_type == "text":
                extracted[file_path] = extract_from_text(file_path)
        except Exception as e:
            extracted[file_path] = f"[EXTRACTION ERROR: {str(e)}]"
    return {"extracted_texts": extracted}

def cross_reference(state: DocumentState) -> dict:
    context_parts = []
    for file_path, text in state["extracted_texts"].items():
        filename = Path(file_path).name
        context_parts.append(f"=== Document: {filename} ===\n{text}")
    combined_context = "\n\n".join(context_parts)
    system_prompt = SystemMessage(content=(
        "You are a document analyst. Cross-reference the documents "
        "and return JSON with: 'confirmed_facts' (in 2+ docs), "
        "'contradictions' (disagreements), 'unique_facts' (in 1 doc). "
        "Return ONLY valid JSON."
    ))
    human_prompt = HumanMessage(content=combined_context)
    response = llm.invoke([system_prompt, human_prompt])
    try:
        cross_ref_data = json.loads(response.content)
    except json.JSONDecodeError:
        cross_ref_data = {
            "confirmed_facts": [],
            "contradictions": [],
            "unique_facts": [],
            "raw_analysis": response.content,
        }
    return {"cross_references": cross_ref_data, "messages": [human_prompt, response]}

def generate_report(state: DocumentState) -> dict:
    system_prompt = SystemMessage(content=(
        "Generate a JSON report with: title, summary, sources, "
        "key_findings (with confidence and sources), contradictions, "
        "and recommendations. Return ONLY valid JSON."
    ))
    human_prompt = HumanMessage(
        content=json.dumps(state["cross_references"], indent=2)
    )
    response = llm.invoke([system_prompt, human_prompt])
    try:
        report = json.loads(response.content)
    except json.JSONDecodeError:
        report = {"raw_report": response.content, "error": "Invalid JSON"}
    return {"report": report, "messages": [response]}

def quality_check(state: DocumentState) -> dict:
    report = state.get("report", {})
    required = ["title", "summary", "sources", "key_findings", "recommendations"]
    missing = [f for f in required if f not in report]
    if missing or not report.get("key_findings"):
        return {"report_quality": "retry"}
    return {"report_quality": "pass"}

def route_after_quality_check(
    state: DocumentState,
) -> Literal["generate_report", "__end__"]:
    if state.get("report_quality") == "retry":
        return "generate_report"
    return "__end__"

# --- Build the Graph ---

workflow = StateGraph(DocumentState)
workflow.add_node("classify_documents", classify_documents)
workflow.add_node("extract_content", extract_content)
workflow.add_node("cross_reference", cross_reference)
workflow.add_node("generate_report", generate_report)
workflow.add_node("quality_check", quality_check)

workflow.add_edge(START, "classify_documents")
workflow.add_edge("classify_documents", "extract_content")
workflow.add_edge("extract_content", "cross_reference")
workflow.add_edge("cross_reference", "generate_report")
workflow.add_edge("generate_report", "quality_check")
workflow.add_conditional_edges("quality_check", route_after_quality_check)

app = workflow.compile()

# --- Sample Documents ---

os.makedirs("sample_docs", exist_ok=True)

with open("sample_docs/meeting_notes.txt", "w") as f:
    f.write(
        "Project Alpha Meeting Notes - March 2026\n"
        "=========================================\n\n"
        "Attendees: Sarah Chen, Mike Rivera, Priya Patel\n\n"
        "Key Decisions:\n"
        "- Budget approved: $150,000 for Q2\n"
        "- Launch date: June 15, 2026\n"
        "- Tech stack: Python + FastAPI + PostgreSQL\n"
        "- Mike will lead the backend team\n\n"
        "Action Items:\n"
        "- Sarah: finalize vendor contracts by March 20\n"
        "- Priya: complete UI mockups by March 25\n"
        "- Mike: set up CI/CD pipeline by April 1\n"
    )

with open("sample_docs/project_spec.txt", "w") as f:
    f.write(
        "Project Alpha - Technical Specification\n"
        "=======================================\n\n"
        "Overview: Internal tool for automated report generation.\n"
        "Budget: $145,000 (approved by Finance on Feb 28)\n"
        "Timeline: Development starts March 1, launch June 2026\n\n"
        "Tech Stack:\n"
        "- Backend: Python 3.12, FastAPI\n"
        "- Database: PostgreSQL 16\n"
        "- Frontend: React + TypeScript\n"
        "- Deployment: AWS ECS\n\n"
        "Team:\n"
        "- Lead: Mike Rivera (backend)\n"
        "- Frontend: Priya Patel\n"
        "- PM: Sarah Chen\n"
    )

with open("sample_docs/budget_memo.txt", "w") as f:
    f.write(
        "Finance Department - Budget Memo\n"
        "================================\n\n"
        "Project: Alpha\n"
        "Approved Budget: $150,000\n"
        "Approval Date: February 28, 2026\n"
        "Approved By: CFO James Wilson\n\n"
        "Breakdown:\n"
        "- Development: $90,000\n"
        "- Infrastructure: $35,000\n"
        "- Testing & QA: $25,000\n\n"
        "Note: Budget includes 10% contingency.\n"
        "Q2 allocation confirmed.\n"
    )

# --- Run ---

result = app.invoke({
    "file_paths": [
        "sample_docs/meeting_notes.txt",
        "sample_docs/project_spec.txt",
        "sample_docs/budget_memo.txt",
    ],
    "classified_files": {},
    "extracted_texts": {},
    "cross_references": {},
    "report": {},
    "report_quality": "",
    "messages": [],
})

print(json.dumps(result["report"], indent=2))
print("\nScript completed successfully.")

Summary

You built a multimodal file-reading agent with LangGraph. Here is what it does.

Five nodes handle the whole pipeline: sort files by type, pull text from PDFs and images and plain text, cross-check facts across sources, write a JSON report, and check the output with a retry loop. The conditional edge from quality check back to report writing is the key pattern — it lets the agent fix its own work.

The multimodal power comes from routing each file to the right reader. PDFs use PyMuPDF. Images use GPT-4o vision. Text files get read straight from disk. Nodes further down the pipeline do not care how the text got there — they all work with plain text.
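Because every reader returns plain text, adding a new format is just one more branch. One way to keep that extensible (an alternative to the if/elif chain in the code above, not what the article's script does) is a dict-based dispatch table. The stub extractors below are placeholders standing in for the real ones:

```python
# Placeholder extractors — in the real agent these are extract_from_pdf,
# extract_from_image, and extract_from_text from the script above.
def extract_from_pdf(path: str) -> str:
    return f"pdf text from {path}"

def extract_from_image(path: str) -> str:
    return f"image text from {path}"

def extract_from_text(path: str) -> str:
    return f"text from {path}"

# Map file types to reader functions instead of chaining elif branches.
EXTRACTORS = {
    "pdf": extract_from_pdf,
    "image": extract_from_image,
    "text": extract_from_text,
}

def extract_one(file_path: str, file_type: str) -> str:
    # Unknown types fall back to the plain-text reader, mirroring the classifier.
    extractor = EXTRACTORS.get(file_type, extract_from_text)
    try:
        return extractor(file_path)
    except Exception as e:
        return f"[EXTRACTION ERROR: {e}]"
```

Supporting .docx (as in the exercise below) then becomes a single dict entry plus one function.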

Practice exercise: Extend the agent to handle .docx files. You will need the python-docx library. Add .docx to the classifier, write an extract_from_docx function, and test with a real Word file.

Solution
python
# pip install python-docx
from docx import Document as DocxDocument

def extract_from_docx(file_path: str) -> str:
    """Extract text from a Word document."""
    doc = DocxDocument(file_path)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    return "\n\n".join(paragraphs)

# In classify_documents, add:
# elif ext == ".docx":
#     classified[file_path] = "docx"

# In extract_content, add:
# elif file_type == "docx":
#     extracted[file_path] = extract_from_docx(file_path)

Frequently Asked Questions

Can this agent handle scanned PDFs?

Plain PDF reading with PyMuPDF only works on PDFs that have a text layer. Scanned PDFs are really just images and give back empty text. To handle them, watch for empty results and fall back to image reading — turn each page into an image with fitz, then send it through GPT-4o vision. This takes more time but covers both types.
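Here is a sketch of the detection half of that fallback. The 20-character threshold is a guess you should tune; the vision fallback is shown only in comments because it costs an API call per page, and `ocr_with_vision` is a hypothetical helper wrapping the GPT-4o vision call from extract_from_image:

```python
# Detect pages that came back (nearly) empty from PyMuPDF — a sign the PDF
# is a scan with no text layer. The min_chars threshold is an assumption.
def looks_scanned(page_text: str, min_chars: int = 20) -> bool:
    return len(page_text.strip()) < min_chars

# Sketch of the fallback inside extract_from_pdf (not run here):
#
#   if looks_scanned(page_text):
#       pix = page.get_pixmap(dpi=150)          # render the page to an image
#       png_bytes = pix.tobytes("png")          # PNG bytes for the vision model
#       page_text = ocr_with_vision(png_bytes)  # hypothetical helper around
#                                               # the GPT-4o vision call
```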

How many files can the agent handle at once?

The cap is the LLM’s context window. GPT-4o supports 128K tokens — about 96K words. The cross-check node sends all text in one prompt. For typical business documents, 10-20 files work well. Past that, break files into chunks or switch to a RAG setup.
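A minimal character-based chunker for the "break files into chunks" route might look like this. The sizes are assumptions (12,000 characters is roughly 3,000 tokens); the overlap keeps facts that straddle a boundary from being split across chunks:

```python
# Minimal character-based chunker. Overlap must stay smaller than chunk_size,
# or the loop would never advance.
def chunk_text(text: str, chunk_size: int = 12_000, overlap: int = 500) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

You would then cross-reference chunk by chunk and merge the results, rather than sending everything at once.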

Is it safe to send private files to the API?

OpenAI’s API rules say that inputs are not used for model training. But data does travel through their servers. For private files, use a self-hosted model (Llama 3, Mistral) or Azure OpenAI with private networking. Check your org’s data rules first.

Can I add spreadsheet support?

Yes. CSV files already work (the text reader handles them as plain text). For Excel, use openpyxl to read sheets and turn rows into text. Add .xlsx to the classifier and write an extract_from_excel function.
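The formatting half of that is plain Python. The sketch below turns rows (whatever openpyxl's `ws.iter_rows(values_only=True)` yields: tuples of cell values, with None for empty cells) into text the downstream nodes can use; the function names and `" | "` separator are my own choices:

```python
# Turn spreadsheet rows into plain text for the extraction pipeline.
# `rows` is an iterable of tuples of cell values (None for empty cells).
def rows_to_text(sheet_name: str, rows) -> str:
    lines = [f"[Sheet: {sheet_name}]"]
    for row in rows:
        cells = [str(c) for c in row if c is not None]
        if cells:  # skip fully empty rows
            lines.append(" | ".join(cells))
    return "\n".join(lines)

# Hooking it up (sketch — requires `pip install openpyxl`):
#
#   from openpyxl import load_workbook
#   def extract_from_excel(file_path: str) -> str:
#       wb = load_workbook(file_path, read_only=True, data_only=True)
#       parts = [rows_to_text(ws.title, ws.iter_rows(values_only=True))
#                for ws in wb]
#       return "\n\n".join(parts)
```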
