
Project — Build a Document Processing Agent with Multi-Modal Inputs

Written by Selva Prabhakaran | 29 min read


You get a PDF spec, two whiteboard screenshots, and a text file with meeting notes. Your boss wants one summary that pulls data from all three. You could open each file, copy sections by hand, and stitch a report together. Or you could build an agent that does it in under a minute. That’s what we’re building here.

Before we write any code, here’s how the pieces fit together.

The agent receives a batch of documents — PDFs, images, plain text, or a mix. It doesn’t know what’s inside yet. So the first thing it does is figure out each file’s type and route it to the right extraction tool. PDFs go through a parser. Images go through a vision model that reads text from photos and screenshots. Text files get read directly.

Once every document is converted to plain text, the agent merges all the content into a single context window. But raw text isn’t useful on its own. The agent needs to find connections — which documents mention the same entities, dates, or facts. That’s the cross-referencing step. It compares content across all documents and flags overlaps, contradictions, and unique details.

Finally, the agent structures everything into a clean JSON report. The report includes a summary, key facts from each source, and a section mapping which facts were confirmed by multiple documents. That’s what your downstream systems consume.

We’ll build each piece as a separate node in a LangGraph state graph. By the end, you’ll have a working agent you can point at any folder of mixed documents.

Architecture — Five Nodes, One Graph

I like to start every project with a clear picture of the components. Here’s what we’re building.

| Node | Input | Output | Purpose |
| --- | --- | --- | --- |
| classify_documents | Raw file paths | Paths + detected types | Routes each file to its extractor |
| extract_content | Typed file paths | Extracted text per doc | Pulls text from PDFs, images, text files |
| cross_reference | All extracted texts | Cross-ref analysis | Finds overlaps, contradictions, unique facts |
| generate_report | Cross-referenced data | Structured JSON report | Produces the final deliverable |
| quality_check | Generated report | Pass or retry flag | Catches incomplete output |

The graph flows linearly for most runs: classify, extract, cross-reference, report, quality check. But the quality check node has a conditional edge. If the report is incomplete, it routes back to generate_report for a retry. If it passes, the graph ends.

Key Insight: **Each node does exactly one thing.** Classification doesn’t extract. Extraction doesn’t analyze. This separation makes the agent debuggable — when something breaks, you know which node to inspect.

Prerequisites

  • Python version: 3.10+
  • Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+), pymupdf (1.24+), Pillow (10.0+)
  • Install: pip install langgraph langchain-openai langchain-core pymupdf Pillow
  • API key: An OpenAI API key set as OPENAI_API_KEY. See OpenAI’s docs to create one.
  • Time to complete: ~45 minutes
  • Prior knowledge: Basic LangGraph concepts (nodes, edges, state). If you’re new to LangGraph, start with our LangGraph fundamentals guide.
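
Before writing any agent code, a quick environment check saves debugging time later. This is an optional convenience script (not part of the tutorial proper); the package list mirrors the prerequisites above.

```python
import importlib.util
import os

# Check that the prerequisite packages are importable and the API key is set.
# Note: pymupdf imports as "fitz" and Pillow imports as "PIL".
required = ["langgraph", "langchain_openai", "langchain_core", "fitz", "PIL"]
missing = [name for name in required if importlib.util.find_spec(name) is None]

print("Missing packages:", ", ".join(missing) if missing else "none")
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))
```

If anything prints as missing, rerun the pip install line above before continuing.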

The first block imports everything we need. We pull in the LLM wrapper, message types, LangGraph’s graph utilities, and the document processing libraries.

python
import os
import json
import base64
from typing import TypedDict, Annotated, Literal
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

import fitz  # PyMuPDF
from PIL import Image

Step 1 — Define the Agent State

Every LangGraph graph starts with a state definition. The state holds all data that flows between nodes. For our agent, the state tracks input paths, detected types, extracted text, and the final report.

We use a TypedDict with clear field names. The messages field uses the add_messages reducer. This appends new messages instead of replacing the list — exactly what we want as the conversation grows across nodes.

python
class DocumentState(TypedDict):
    """State that flows through the document processing agent."""
    # Input
    file_paths: list[str]

    # Classification results
    classified_files: dict[str, str]  # {path: "pdf"|"image"|"text"}

    # Extraction results
    extracted_texts: dict[str, str]  # {path: extracted_text}

    # Cross-reference analysis
    cross_references: dict

    # Final output
    report: dict
    report_quality: str  # "pass" or "retry"

    # LLM conversation
    messages: Annotated[list, add_messages]

Each field serves one purpose. classified_files maps every path to its type. extracted_texts maps every path to the text we pulled from it. cross_references holds the analysis. And report holds the final output.
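
A node never returns the whole state, only the keys it changed, and LangGraph merges that partial update for us. Here is a plain-Python illustration of the merge semantics (no LangGraph involved; the values are invented): ordinary fields are overwritten, while a reducer field like messages (with add_messages) is appended to.

```python
# Plain-Python sketch of LangGraph's state merge, not the real implementation.
state = {"file_paths": ["a.pdf"], "classified_files": {}, "messages": ["hi"]}
update = {"classified_files": {"a.pdf": "pdf"}, "messages": ["analysis"]}

merged = {**state, **update}  # default behavior: new value overwrites old
merged["messages"] = state["messages"] + update["messages"]  # add_messages: append

print(merged["classified_files"])  # {'a.pdf': 'pdf'}
print(merged["messages"])          # ['hi', 'analysis']
```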

Tip: **Keep your state fields flat.** Nesting dicts inside dicts makes debugging harder. If you’re reaching three levels deep, rethink the structure.

Step 2 — Build the Document Classifier

The classifier checks each file’s extension and maps it to a type. This is routing logic — it decides which extraction method each file gets.

We don’t need an LLM here. File extensions are reliable enough. Why burn tokens on something a string comparison handles perfectly?

python
def classify_documents(state: DocumentState) -> dict:
    """Classify each document by file type."""
    classified = {}
    image_extensions = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".webp"}
    text_extensions = {".txt", ".md", ".csv", ".json"}

    for file_path in state["file_paths"]:
        ext = Path(file_path).suffix.lower()
        if ext == ".pdf":
            classified[file_path] = "pdf"
        elif ext in image_extensions:
            classified[file_path] = "image"
        elif ext in text_extensions:
            classified[file_path] = "text"
        else:
            classified[file_path] = "text"  # fallback

    return {"classified_files": classified}

The function returns a dictionary update. LangGraph merges it into the state automatically. Every file now has a type label.
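
You can sanity-check the routing rules without running the graph. This standalone snippet restates the same extension logic (the detect_type name and sample paths are illustrative, not part of the agent):

```python
from pathlib import Path

# Standalone restatement of the classifier's extension routing.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".webp"}
TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json"}

def detect_type(path: str) -> str:
    ext = Path(path).suffix.lower()  # .lower() handles "board.PNG" etc.
    if ext == ".pdf":
        return "pdf"
    if ext in IMAGE_EXTENSIONS:
        return "image"
    return "text"  # known text types and the unknown-extension fallback

for p in ["spec.pdf", "board.PNG", "notes.txt", "data.xyz"]:
    print(p, "->", detect_type(p))
# spec.pdf -> pdf
# board.PNG -> image
# notes.txt -> text
# data.xyz -> text
```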

Step 3 — Build the Content Extractors

This is where multimodal processing happens. We need three extraction methods — one for each document type.

PDF extraction uses PyMuPDF (fitz). It reads text from every page and labels them. This makes it easy to trace where a fact came from later.

python
# Initialize the vision-capable model
llm = ChatOpenAI(model="gpt-4o", temperature=0)


def extract_from_pdf(file_path: str) -> str:
    """Extract text from a PDF using PyMuPDF."""
    doc = fitz.open(file_path)
    text_parts = []
    for page_num, page in enumerate(doc):
        page_text = page.get_text()
        if page_text.strip():
            text_parts.append(f"[Page {page_num + 1}]\n{page_text}")
    doc.close()
    return "\n\n".join(text_parts)

Image extraction sends the file to GPT-4o’s vision capability. The model reads text from photos, screenshots, handwritten notes — anything visual. We encode the image as base64 and include a detailed instruction prompt.

python
def extract_from_image(file_path: str) -> str:
    """Extract text from an image using GPT-4o vision."""
    with open(file_path, "rb") as f:
        image_bytes = f.read()

    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    ext = Path(file_path).suffix.lower().lstrip(".")
    mime_type = f"image/{ext}" if ext != "jpg" else "image/jpeg"

    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": (
                    "Extract ALL text and information from this image. "
                    "Include handwritten text, printed text, diagrams, "
                    "labels, numbers, and tables. Return the content "
                    "as structured text."
                ),
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:{mime_type};base64,{image_base64}"
                },
            },
        ]
    )

    response = llm.invoke([message])
    return response.content
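
If you expect a wider range of image formats, the standard library's mimetypes module is a sturdier way to build the data-URL prefix than special-casing extensions. An optional refinement (not part of the tutorial's code):

```python
import mimetypes

def guess_image_mime(path: str) -> str:
    """Look up the MIME type from the filename; fall back if unknown."""
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"

print(guess_image_mime("photo.jpg"))   # image/jpeg
print(guess_image_mime("scan.tiff"))   # image/tiff
```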

Text extraction is the simplest case. Just read the file directly.

python
def extract_from_text(file_path: str) -> str:
    """Read a text file directly."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

The extractor node ties all three methods together. It loops through classified files and calls the right function for each type. Notice the try/except — corrupted files shouldn’t crash the whole pipeline.

python
def extract_content(state: DocumentState) -> dict:
    """Extract text content from each classified document."""
    extracted = {}
    for file_path, file_type in state["classified_files"].items():
        try:
            if file_type == "pdf":
                extracted[file_path] = extract_from_pdf(file_path)
            elif file_type == "image":
                extracted[file_path] = extract_from_image(file_path)
            elif file_type == "text":
                extracted[file_path] = extract_from_text(file_path)
        except Exception as e:
            extracted[file_path] = f"[EXTRACTION ERROR: {str(e)}]"

    return {"extracted_texts": extracted}
Warning: **Base64-encoding large images eats memory.** A 10 MB photo becomes ~13.3 MB as base64. If you’re processing dozens of high-res images, resize them first. The vision model doesn’t need 4K resolution to read text.
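
A small pre-processing step keeps payloads manageable. This sketch uses the already-imported Pillow; the 2000 px cap and JPEG quality are assumptions, so tune them for your documents:

```python
from io import BytesIO
from PIL import Image

def shrink_for_vision(file_path: str, max_side: int = 2000) -> bytes:
    """Downscale an image so no side exceeds max_side, re-encode as JPEG."""
    img = Image.open(file_path)
    img.thumbnail((max_side, max_side))  # in place, preserves aspect ratio
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    return buf.getvalue()  # feed these bytes to base64.b64encode instead
```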

Step 4 — Cross-Reference Across Documents

Here’s where the agent goes from tool to analyst. Any script can extract text. Finding which claims are confirmed by multiple sources — that’s what makes this an agent.

The cross-referencer sends all extracted texts to the LLM in one prompt. It asks the model to find three things: facts confirmed across documents, contradictions between documents, and facts unique to a single source.

python
def cross_reference(state: DocumentState) -> dict:
    """Cross-reference information across all documents."""
    context_parts = []
    for file_path, text in state["extracted_texts"].items():
        filename = Path(file_path).name
        context_parts.append(f"=== Document: {filename} ===\n{text}")

    combined_context = "\n\n".join(context_parts)

    system_prompt = SystemMessage(content=(
        "You are a document analyst. You receive text extracted "
        "from multiple documents. Cross-reference them and return "
        "a JSON object with three keys:\n\n"
        "1. 'confirmed_facts': Facts in 2+ documents. Include "
        "which documents confirm each fact.\n"
        "2. 'contradictions': Facts where documents disagree. "
        "Include conflicting claims and sources.\n"
        "3. 'unique_facts': Facts in only one document. "
        "Include the source.\n\n"
        "Return ONLY valid JSON. No markdown, no explanation."
    ))

    human_prompt = HumanMessage(content=combined_context)
    response = llm.invoke([system_prompt, human_prompt])

    try:
        cross_ref_data = json.loads(response.content)
    except json.JSONDecodeError:
        cross_ref_data = {
            "confirmed_facts": [],
            "contradictions": [],
            "unique_facts": [],
            "raw_analysis": response.content,
        }

    return {
        "cross_references": cross_ref_data,
        "messages": [human_prompt, response],
    }

The function labels each document’s content with its filename, then passes everything to the LLM. We parse the response as JSON. If the model returns malformed JSON (it happens more than you’d expect), we fall back to storing the raw text.

Step 5 — Generate the Structured Report

The report generator takes cross-referenced data and produces the final output. We give the LLM a strict JSON schema to follow.

python
def generate_report(state: DocumentState) -> dict:
    """Generate a structured report from cross-referenced data."""
    system_prompt = SystemMessage(content=(
        "You are a report writer. Using the cross-reference "
        "analysis below, generate a JSON report with:\n\n"
        "{\n"
        '  "title": "Document Analysis Report",\n'
        '  "summary": "2-3 sentence executive summary",\n'
        '  "sources": ["list of source documents"],\n'
        '  "key_findings": [\n'
        '    {"finding": "...", "confidence": "high|medium|low", '
        '     "sources": ["doc1", "doc2"]}\n'
        "  ],\n"
        '  "contradictions": [\n'
        '    {"claim_a": "...", "source_a": "...", '
        '     "claim_b": "...", "source_b": "..."}\n'
        "  ],\n"
        '  "recommendations": ["actionable next steps"]\n'
        "}\n\n"
        "Return ONLY valid JSON."
    ))

    human_prompt = HumanMessage(
        content=json.dumps(state["cross_references"], indent=2)
    )

    response = llm.invoke([system_prompt, human_prompt])

    try:
        report = json.loads(response.content)
    except json.JSONDecodeError:
        report = {"raw_report": response.content, "error": "Invalid JSON"}

    return {"report": report, "messages": [response]}

The generator doesn’t re-read the original documents. It works with the processed cross-reference data. This keeps the context window smaller and the output focused.
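
The savings are easy to see with a rough character count (characters stand in for tokens here; roughly 4 characters per token is a common rule of thumb, and the sizes below are invented for illustration):

```python
import json

# Invented stand-ins: three extracted documents vs. a distilled analysis.
extracted_texts = {"meeting_notes.txt": "x" * 12_000,
                   "project_spec.txt": "y" * 9_000,
                   "budget_memo.txt": "z" * 4_000}
cross_references = {"confirmed_facts": [{"fact": "Launch June 2026",
                                         "sources": ["meeting_notes.txt",
                                                     "project_spec.txt"]}],
                    "contradictions": [], "unique_facts": []}

raw_chars = sum(len(t) for t in extracted_texts.values())
distilled_chars = len(json.dumps(cross_references, indent=2))
print(f"raw: {raw_chars:,} chars, distilled: {distilled_chars:,} chars")
```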

Step 6 — Quality Check with Conditional Routing

The quality check inspects the report and decides: is it complete? If required fields are missing or sections are empty, it flags a retry.

python
def quality_check(state: DocumentState) -> dict:
    """Validate the generated report for completeness."""
    report = state.get("report", {})
    required_fields = [
        "title", "summary", "sources",
        "key_findings", "recommendations",
    ]

    missing = [f for f in required_fields if f not in report]

    if missing:
        return {"report_quality": "retry"}

    if not report.get("key_findings"):
        return {"report_quality": "retry"}

    if isinstance(report.get("summary"), str) and len(report["summary"]) < 20:
        return {"report_quality": "retry"}

    return {"report_quality": "pass"}

The routing function reads report_quality from state. “retry” loops back to report generation. “pass” ends the graph.

python
def route_after_quality_check(
    state: DocumentState,
) -> Literal["generate_report", "__end__"]:
    """Route based on quality check results."""
    if state.get("report_quality") == "retry":
        return "generate_report"
    return "__end__"

This is a common LangGraph pattern. Conditional edges let you build retry loops, branching logic, and multi-path workflows. The check here is simple — field presence and length. You could add semantic validation or fact-checking against source texts for more rigor.

Step 7 — Wire the Graph and Run It

We’ve built all five nodes. Time to connect them.

python
# Build the graph
workflow = StateGraph(DocumentState)

# Add nodes
workflow.add_node("classify_documents", classify_documents)
workflow.add_node("extract_content", extract_content)
workflow.add_node("cross_reference", cross_reference)
workflow.add_node("generate_report", generate_report)
workflow.add_node("quality_check", quality_check)

# Linear edges
workflow.add_edge(START, "classify_documents")
workflow.add_edge("classify_documents", "extract_content")
workflow.add_edge("extract_content", "cross_reference")
workflow.add_edge("cross_reference", "generate_report")
workflow.add_edge("generate_report", "quality_check")

# Conditional edge — retry or finish
workflow.add_conditional_edges(
    "quality_check",
    route_after_quality_check,
)

# Compile
app = workflow.compile()

Five nodes. Five edges. One conditional edge. The compile() call validates the graph structure and returns an executable app.

Tip: **Visualize your graph during development.** Call `app.get_graph().draw_mermaid_png()` in a notebook to see the layout. It catches wiring mistakes immediately.

Let’s test with sample documents. We’ll create three text files that simulate a real scenario — meeting notes, a project spec, and a budget memo. I’ve planted a deliberate contradiction: the meeting notes say the budget is $150,000, but the project spec says $145,000.

python
# Create sample documents
os.makedirs("sample_docs", exist_ok=True)

# Document 1: Meeting notes
with open("sample_docs/meeting_notes.txt", "w") as f:
    f.write(
        "Project Alpha Meeting Notes - March 2026\n"
        "=========================================\n\n"
        "Attendees: Sarah Chen, Mike Rivera, Priya Patel\n\n"
        "Key Decisions:\n"
        "- Budget approved: $150,000 for Q2\n"
        "- Launch date: June 15, 2026\n"
        "- Tech stack: Python + FastAPI + PostgreSQL\n"
        "- Mike will lead the backend team\n\n"
        "Action Items:\n"
        "- Sarah: finalize vendor contracts by March 20\n"
        "- Priya: complete UI mockups by March 25\n"
        "- Mike: set up CI/CD pipeline by April 1\n"
    )

# Document 2: Project spec
with open("sample_docs/project_spec.txt", "w") as f:
    f.write(
        "Project Alpha - Technical Specification\n"
        "=======================================\n\n"
        "Overview: Internal tool for automated report generation.\n"
        "Budget: $145,000 (approved by Finance on Feb 28)\n"
        "Timeline: Development starts March 1, launch June 2026\n\n"
        "Tech Stack:\n"
        "- Backend: Python 3.12, FastAPI\n"
        "- Database: PostgreSQL 16\n"
        "- Frontend: React + TypeScript\n"
        "- Deployment: AWS ECS\n\n"
        "Team:\n"
        "- Lead: Mike Rivera (backend)\n"
        "- Frontend: Priya Patel\n"
        "- PM: Sarah Chen\n"
    )

# Document 3: Budget memo
with open("sample_docs/budget_memo.txt", "w") as f:
    f.write(
        "Finance Department - Budget Memo\n"
        "================================\n\n"
        "Project: Alpha\n"
        "Approved Budget: $150,000\n"
        "Approval Date: February 28, 2026\n"
        "Approved By: CFO James Wilson\n\n"
        "Breakdown:\n"
        "- Development: $90,000\n"
        "- Infrastructure: $35,000\n"
        "- Testing & QA: $25,000\n\n"
        "Note: Budget includes 10% contingency.\n"
        "Q2 allocation confirmed.\n"
    )

print("Sample documents created.")
Output:

Sample documents created.

Run the agent on these three files.

python
result = app.invoke({
    "file_paths": [
        "sample_docs/meeting_notes.txt",
        "sample_docs/project_spec.txt",
        "sample_docs/budget_memo.txt",
    ],
    "classified_files": {},
    "extracted_texts": {},
    "cross_references": {},
    "report": {},
    "report_quality": "",
    "messages": [],
})

print(json.dumps(result["report"], indent=2))

The agent produces a structured JSON report. The exact wording varies between runs, but the structure follows our schema.

json
{
  "title": "Document Analysis Report",
  "summary": "Analysis of three Project Alpha documents. Budget discrepancy found between meeting notes (\(150K) and project spec (\)145K). Team roles and tech stack are consistent across sources.",
  "sources": [
    "meeting_notes.txt",
    "project_spec.txt",
    "budget_memo.txt"
  ],
  "key_findings": [
    {
      "finding": "Launch date is June 2026",
      "confidence": "high",
      "sources": ["meeting_notes.txt", "project_spec.txt"]
    },
    {
      "finding": "Mike Rivera leads the backend team",
      "confidence": "high",
      "sources": ["meeting_notes.txt", "project_spec.txt"]
    },
    {
      "finding": "Tech stack: Python, FastAPI, PostgreSQL",
      "confidence": "high",
      "sources": ["meeting_notes.txt", "project_spec.txt"]
    }
  ],
  "contradictions": [
    {
      "claim_a": "Budget is $150,000",
      "source_a": "meeting_notes.txt",
      "claim_b": "Budget is $145,000",
      "source_b": "project_spec.txt"
    }
  ],
  "recommendations": [
    "Resolve the budget discrepancy between meeting notes and project spec",
    "Confirm the exact launch date (June 15 vs general June 2026)",
    "Add frontend details to the meeting notes for completeness"
  ]
}

The agent caught the $5,000 budget discrepancy. It confirmed team assignments and tech stack across documents. It even flagged the slightly different launch date granularity.

Extending to Real Image Documents

So far we tested with text files. But the agent already supports images through extract_from_image. When you pass an image file, the classifier routes it to vision extraction automatically.

python
# Adding an image to the pipeline
# The classifier sees .png and routes to image extraction
image_paths = [
    "sample_docs/meeting_notes.txt",
    "sample_docs/project_spec.txt",
    "path/to/whiteboard_photo.png",  # uses GPT-4o vision
]

# classify_documents returns:
#   {"path/to/whiteboard_photo.png": "image"}
# extract_content calls extract_from_image() for that file

The beauty of this design? Adding a new document type is straightforward. Want to support Word docs? Add .docx to the classifier and write an extract_from_docx function. The rest of the pipeline doesn’t change at all.
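
Concretely, the extension point can be made explicit as a dispatch table, so a new type is one dict entry plus one function. The "docx" entry and extract_from_docx are hypothetical additions (python-docx would be one way to implement the extractor; it is not in this tutorial's requirements), and the other callables below are stand-ins that just echo their input:

```python
def extract_from_docx(file_path: str) -> str:
    # Hypothetical new extractor; implement with e.g. python-docx.
    raise NotImplementedError("docx support not implemented yet")

EXTRACTORS = {
    "pdf": lambda p: f"[pdf text from {p}]",       # stand-in for extract_from_pdf
    "image": lambda p: f"[vision text from {p}]",  # stand-in for extract_from_image
    "text": lambda p: f"[file text from {p}]",     # stand-in for extract_from_text
    "docx": extract_from_docx,                     # the new branch
}

def extract_one(file_path: str, file_type: str) -> str:
    return EXTRACTORS[file_type](file_path)

print(extract_one("spec.pdf", "pdf"))  # [pdf text from spec.pdf]
```

The classifier needs one matching branch (`elif ext == ".docx": classified[file_path] = "docx"`) and nothing else in the pipeline changes.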

Note: **Image extraction costs more than text extraction.** Each vision call uses GPT-4o pricing. For cost-sensitive apps, consider local OCR (Tesseract) for printed text. Reserve vision API calls for handwritten or complex visual documents.

Exercise 1: Add a Document Summary Node

You’ve seen how the agent extracts, cross-references, and reports. Your turn. Add a node that creates individual summaries before cross-referencing.

Difficulty: advanced

Instructions: Create a `summarize_documents` function that takes extracted_texts from the state and produces a 2-3 sentence summary per document using the LLM. Store results in a dict called `document_summaries`. Wire this node between `extract_content` and `cross_reference` in the graph.

Starter code:

python
def summarize_documents(state: DocumentState) -> dict:
    """Summarize each extracted document."""
    summaries = {}
    for file_path, text in state["extracted_texts"].items():
        # Use the LLM to generate a 2-3 sentence summary
        # Hint: SystemMessage for instructions, HumanMessage for text
        pass
    return {"document_summaries": summaries}

Hints:
  • Use llm.invoke() with a SystemMessage + HumanMessage pair for each document.
  • Full approach: response = llm.invoke([SystemMessage(content="Summarize in 2-3 sentences."), HumanMessage(content=text)]); summaries[file_path] = response.content

Solution:

python
def summarize_documents(state: DocumentState) -> dict:
    summaries = {}
    for file_path, text in state["extracted_texts"].items():
        response = llm.invoke([
            SystemMessage(content="Summarize this document in 2-3 sentences. Be specific about facts, dates, and numbers."),
            HumanMessage(content=text)
        ])
        summaries[file_path] = response.content
    return {"document_summaries": summaries}

The function loops through extracted documents, sends each to the LLM with a summary prompt, and collects results. Each summary captures key facts from one document.

Handling Errors in Production

Real documents break. PDFs have password protection. Images are corrupted. Text files use unexpected encodings. A production agent needs to handle these without crashing the pipeline.

Our extractor already wraps each file in try/except. Here’s a version with retry logic for transient failures like network timeouts and API rate limits.

python
def extract_content_with_retries(
    state: DocumentState, max_retries: int = 2
) -> dict:
    """Extract content with retry logic for failures."""
    extracted = {}

    for file_path, file_type in state["classified_files"].items():
        for attempt in range(max_retries + 1):
            try:
                if file_type == "pdf":
                    extracted[file_path] = extract_from_pdf(file_path)
                elif file_type == "image":
                    extracted[file_path] = extract_from_image(file_path)
                elif file_type == "text":
                    extracted[file_path] = extract_from_text(file_path)
                break  # success
            except Exception as e:
                if attempt == max_retries:
                    msg = f"{type(e).__name__}: {str(e)}"
                    extracted[file_path] = (
                        f"[FAILED after {max_retries + 1} attempts: {msg}]"
                    )

    return {"extracted_texts": extracted}

Transient failures get a second chance. Permanent failures still get caught and logged. The agent moves on instead of stopping.
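
For API rate limits specifically, spacing attempts out helps more than immediate retries. A generic backoff wrapper might look like this (the delays and retry count are illustrative; tune them for your provider):

```python
import time

def with_backoff(fn, max_retries: int = 2, base_delay: float = 1.0):
    """Call fn, retrying failures with exponential backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; let the caller's except handle it
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage inside the extractor loop, e.g.:
#     extracted[file_path] = with_backoff(lambda: extract_from_image(file_path))
```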

Exercise 2: Add Retry Limits to Quality Check

The quality check can route back to generate_report for a retry. But what if the report keeps failing? Without a limit, the agent loops forever. Add a counter to prevent that.

Difficulty: advanced

Instructions: Modify quality_check to track retries. If retry_count reaches 2, force a "pass" even if the report is incomplete. Add a retry_count field to the state.

Starter code:

python
def quality_check_with_limit(state: DocumentState) -> dict:
    """Validate report with retry limit."""
    current_retries = state.get("retry_count", 0)
    report = state.get("report", {})
    required = ["title", "summary", "sources", "key_findings"]
    missing = [f for f in required if f not in report]

    # If fields missing AND retries < 2: retry
    # If retries >= 2: pass (give up gracefully)
    pass

Hints:
  • Check current_retries >= 2 first. If true, return pass regardless.
  • Full: if missing and current_retries < 2: return {"report_quality": "retry", "retry_count": current_retries + 1}; else: return {"report_quality": "pass"}

Solution:

python
def quality_check_with_limit(state: DocumentState) -> dict:
    current_retries = state.get("retry_count", 0)
    report = state.get("report", {})
    required = ["title", "summary", "sources", "key_findings"]
    missing = [f for f in required if f not in report]
    if missing and current_retries < 2:
        return {"report_quality": "retry", "retry_count": current_retries + 1}
    return {"report_quality": "pass"}

The function checks two conditions: are fields missing AND have we retried less than 2 times? If both are true, it increments the counter and retries. Otherwise it passes: either the report is complete or we exhausted the retry budget.

Common Mistakes and How to Fix Them

Mistake 1: Forgetting to initialize state fields

Wrong:

python
result = app.invoke({
    "file_paths": ["doc.txt"],
})
# KeyError in downstream nodes

Why it breaks: LangGraph reads state fields in every node. If classified_files isn’t initialized, extract_content crashes with a KeyError.

Fix:

python
result = app.invoke({
    "file_paths": ["doc.txt"],
    "classified_files": {},
    "extracted_texts": {},
    "cross_references": {},
    "report": {},
    "report_quality": "",
    "messages": [],
})

Mistake 2: Not handling JSON parse errors from the LLM

Wrong:

python
response = llm.invoke([system_prompt, human_prompt])
report = json.loads(response.content)  # crashes on markdown-wrapped JSON

Why it breaks: LLMs sometimes wrap JSON in code fences or add text before the JSON. Direct json.loads() fails.

Fix:

python
response = llm.invoke([system_prompt, human_prompt])
try:
    content = response.content.strip()
    if content.startswith("```"):
        content = content.split("\n", 1)[1].rsplit("```", 1)[0]
    report = json.loads(content)
except json.JSONDecodeError:
    report = {"raw_output": response.content, "parse_error": True}

Mistake 3: Exceeding the LLM context window

Wrong:

python
# Concatenating 50-page PDFs without limits
combined = "\n".join(all_texts)  # could be 200K+ tokens
response = llm.invoke([SystemMessage(...), HumanMessage(content=combined)])

Why it breaks: GPT-4o has a 128K token context window. Three large PDFs can exceed this and cause an API error.

Fix:

python
MAX_CHARS_PER_DOC = 15000  # ~3750 tokens per doc
for file_path, text in extracted_texts.items():
    if len(text) > MAX_CHARS_PER_DOC:
        extracted_texts[file_path] = text[:MAX_CHARS_PER_DOC] + "\n[TRUNCATED]"

When NOT to Use This Architecture

This agent handles small-to-medium document sets well (3-20 files, under 100 pages total). But it’s not the right tool for every scenario.

Skip this approach when:
– You have hundreds of documents. The cross-reference step sends ALL text at once. For large corpora, use RAG with a vector database instead.
– You need real-time processing. Each LLM call adds 2-5 seconds. For speed-critical pipelines, use rule-based extractors or local models.
– Documents contain data that can’t leave your network. Consider self-hosted models like Llama 3 or Mistral for on-premise deployment.

Better alternatives for scale:
– LangChain + ChromaDB for RAG over large document collections
– Apache Tika + custom scripts for high-volume extraction without LLM costs
– Azure Document Intelligence or AWS Textract for enterprise OCR

Complete Code

Full script (copy-paste and run):
python
# Complete code from: Document Processing Agent with Multi-Modal Inputs
# Requires: pip install langgraph langchain-openai langchain-core pymupdf Pillow
# Python 3.10+
# Set OPENAI_API_KEY environment variable before running

import os
import json
import base64
from typing import TypedDict, Annotated, Literal
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

import fitz  # PyMuPDF
from PIL import Image

# --- State Definition ---

class DocumentState(TypedDict):
    file_paths: list[str]
    classified_files: dict[str, str]
    extracted_texts: dict[str, str]
    cross_references: dict
    report: dict
    report_quality: str
    messages: Annotated[list, add_messages]

# --- LLM Setup ---

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# --- Extraction Functions ---

def extract_from_pdf(file_path: str) -> str:
    doc = fitz.open(file_path)
    text_parts = []
    for page_num, page in enumerate(doc):
        page_text = page.get_text()
        if page_text.strip():
            text_parts.append(f"[Page {page_num + 1}]\n{page_text}")
    doc.close()
    return "\n\n".join(text_parts)

def extract_from_image(file_path: str) -> str:
    with open(file_path, "rb") as f:
        image_bytes = f.read()
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    ext = Path(file_path).suffix.lower().lstrip(".")
    mime_type = f"image/{ext}" if ext != "jpg" else "image/jpeg"
    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": (
                    "Extract ALL text and information from this image. "
                    "Include handwritten text, printed text, diagrams, "
                    "labels, numbers, and tables. Return the content "
                    "as structured text."
                ),
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{image_base64}"},
            },
        ]
    )
    response = llm.invoke([message])
    return response.content

def extract_from_text(file_path: str) -> str:
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

# --- Node Functions ---

def classify_documents(state: DocumentState) -> dict:
    classified = {}
    image_extensions = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".webp"}
    text_extensions = {".txt", ".md", ".csv", ".json"}
    for file_path in state["file_paths"]:
        ext = Path(file_path).suffix.lower()
        if ext == ".pdf":
            classified[file_path] = "pdf"
        elif ext in image_extensions:
            classified[file_path] = "image"
        elif ext in text_extensions:
            classified[file_path] = "text"
        else:
            classified[file_path] = "text"  # fallback: treat unknown types as plain text
    return {"classified_files": classified}

def extract_content(state: DocumentState) -> dict:
    extracted = {}
    for file_path, file_type in state["classified_files"].items():
        try:
            if file_type == "pdf":
                extracted[file_path] = extract_from_pdf(file_path)
            elif file_type == "image":
                extracted[file_path] = extract_from_image(file_path)
            elif file_type == "text":
                extracted[file_path] = extract_from_text(file_path)
        except Exception as e:
            extracted[file_path] = f"[EXTRACTION ERROR: {str(e)}]"
    return {"extracted_texts": extracted}

def cross_reference(state: DocumentState) -> dict:
    context_parts = []
    for file_path, text in state["extracted_texts"].items():
        filename = Path(file_path).name
        context_parts.append(f"=== Document: {filename} ===\n{text}")
    combined_context = "\n\n".join(context_parts)
    system_prompt = SystemMessage(content=(
        "You are a document analyst. Cross-reference the documents "
        "and return JSON with: 'confirmed_facts' (in 2+ docs), "
        "'contradictions' (disagreements), 'unique_facts' (in 1 doc). "
        "Return ONLY valid JSON."
    ))
    human_prompt = HumanMessage(content=combined_context)
    response = llm.invoke([system_prompt, human_prompt])
    try:
        cross_ref_data = json.loads(response.content)
    except json.JSONDecodeError:
        cross_ref_data = {
            "confirmed_facts": [],
            "contradictions": [],
            "unique_facts": [],
            "raw_analysis": response.content,
        }
    return {"cross_references": cross_ref_data, "messages": [human_prompt, response]}

def generate_report(state: DocumentState) -> dict:
    system_prompt = SystemMessage(content=(
        "Generate a JSON report with: title, summary, sources, "
        "key_findings (with confidence and sources), contradictions, "
        "and recommendations. Return ONLY valid JSON."
    ))
    human_prompt = HumanMessage(
        content=json.dumps(state["cross_references"], indent=2)
    )
    response = llm.invoke([system_prompt, human_prompt])
    try:
        report = json.loads(response.content)
    except json.JSONDecodeError:
        report = {"raw_report": response.content, "error": "Invalid JSON"}
    return {"report": report, "messages": [response]}

def quality_check(state: DocumentState) -> dict:
    report = state.get("report", {})
    required = ["title", "summary", "sources", "key_findings", "recommendations"]
    missing = [f for f in required if f not in report]
    if missing or not report.get("key_findings"):
        return {"report_quality": "retry"}
    return {"report_quality": "pass"}

def route_after_quality_check(
    state: DocumentState,
) -> Literal["generate_report", "__end__"]:
    # There is no explicit retry counter here; LangGraph's recursion limit
    # (25 steps by default) bounds the retry loop if the report keeps failing.
    if state.get("report_quality") == "retry":
        return "generate_report"
    return "__end__"

# --- Build the Graph ---

workflow = StateGraph(DocumentState)
workflow.add_node("classify_documents", classify_documents)
workflow.add_node("extract_content", extract_content)
workflow.add_node("cross_reference", cross_reference)
workflow.add_node("generate_report", generate_report)
workflow.add_node("quality_check", quality_check)

workflow.add_edge(START, "classify_documents")
workflow.add_edge("classify_documents", "extract_content")
workflow.add_edge("extract_content", "cross_reference")
workflow.add_edge("cross_reference", "generate_report")
workflow.add_edge("generate_report", "quality_check")
workflow.add_conditional_edges("quality_check", route_after_quality_check)

app = workflow.compile()

# --- Sample Documents ---

os.makedirs("sample_docs", exist_ok=True)

with open("sample_docs/meeting_notes.txt", "w") as f:
    f.write(
        "Project Alpha Meeting Notes - March 2026\n"
        "=========================================\n\n"
        "Attendees: Sarah Chen, Mike Rivera, Priya Patel\n\n"
        "Key Decisions:\n"
        "- Budget approved: $150,000 for Q2\n"
        "- Launch date: June 15, 2026\n"
        "- Tech stack: Python + FastAPI + PostgreSQL\n"
        "- Mike will lead the backend team\n\n"
        "Action Items:\n"
        "- Sarah: finalize vendor contracts by March 20\n"
        "- Priya: complete UI mockups by March 25\n"
        "- Mike: set up CI/CD pipeline by April 1\n"
    )

with open("sample_docs/project_spec.txt", "w") as f:
    f.write(
        "Project Alpha - Technical Specification\n"
        "=======================================\n\n"
        "Overview: Internal tool for automated report generation.\n"
        "Budget: $145,000 (approved by Finance on Feb 28)\n"
        "Timeline: Development starts March 1, launch June 2026\n\n"
        "Tech Stack:\n"
        "- Backend: Python 3.12, FastAPI\n"
        "- Database: PostgreSQL 16\n"
        "- Frontend: React + TypeScript\n"
        "- Deployment: AWS ECS\n\n"
        "Team:\n"
        "- Lead: Mike Rivera (backend)\n"
        "- Frontend: Priya Patel\n"
        "- PM: Sarah Chen\n"
    )

with open("sample_docs/budget_memo.txt", "w") as f:
    f.write(
        "Finance Department - Budget Memo\n"
        "================================\n\n"
        "Project: Alpha\n"
        "Approved Budget: $150,000\n"
        "Approval Date: February 28, 2026\n"
        "Approved By: CFO James Wilson\n\n"
        "Breakdown:\n"
        "- Development: $90,000\n"
        "- Infrastructure: $35,000\n"
        "- Testing & QA: $25,000\n\n"
        "Note: Budget includes 10% contingency.\n"
        "Q2 allocation confirmed.\n"
    )

# --- Run ---

result = app.invoke({
    "file_paths": [
        "sample_docs/meeting_notes.txt",
        "sample_docs/project_spec.txt",
        "sample_docs/budget_memo.txt",
    ],
    "classified_files": {},
    "extracted_texts": {},
    "cross_references": {},
    "report": {},
    "report_quality": "",
    "messages": [],
})

print(json.dumps(result["report"], indent=2))
print("\nScript completed successfully.")

Summary

You built a multimodal document processing agent with LangGraph. Here’s what it does.

Five nodes handle the full pipeline: classify files by type, extract text from PDFs/images/text, cross-reference facts across documents, generate a structured JSON report, and validate the output with a retry loop. The conditional edge from quality check back to report generation is the key pattern — it lets the agent self-correct.

The multimodal capability comes from routing each file to its extraction method. PDFs use PyMuPDF. Images use GPT-4o vision. Text files get read directly. Downstream nodes don’t care how the text arrived — they work with unified plain text.

Practice exercise: Extend the agent to support .docx files. You’ll need the python-docx library. Add .docx to the classifier, write an extract_from_docx function, and test with a real Word document.

Solution
python
# pip install python-docx
from docx import Document as DocxDocument

def extract_from_docx(file_path: str) -> str:
    """Extract text from a Word document."""
    doc = DocxDocument(file_path)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    return "\n\n".join(paragraphs)

# In classify_documents, add:
# elif ext == ".docx":
#     classified[file_path] = "docx"

# In extract_content, add:
# elif file_type == "docx":
#     extracted[file_path] = extract_from_docx(file_path)

Frequently Asked Questions

Can this agent handle scanned PDFs?

Standard PDF extraction with PyMuPDF only works on PDFs that have a text layer. Scanned PDFs are essentially images, so extraction returns empty text. To handle them, detect empty extraction results and fall back to image extraction: convert each page to an image with fitz, then send it through GPT-4o vision. This adds latency but handles both types.
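A minimal sketch of that fallback, assuming PyMuPDF's get_pixmap API; the function names and the 20-character threshold are my own choices, not from the article:

```python
import base64


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a near-empty extraction usually means a scanned PDF."""
    return len(extracted_text.strip()) < min_chars


def pdf_pages_as_base64_png(file_path: str, dpi: int = 150) -> list[str]:
    """Render each PDF page to PNG and base64-encode it for a vision prompt."""
    import fitz  # PyMuPDF

    doc = fitz.open(file_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)
        pages.append(base64.b64encode(pix.tobytes("png")).decode("utf-8"))
    doc.close()
    return pages
```

In extract_from_pdf, check needs_ocr on the result and, if it fires, send each encoded page through the same vision prompt that extract_from_image uses.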

How many documents can the agent process at once?

The limit is the LLM’s context window. GPT-4o supports 128K tokens (~96K words). The cross-reference node sends all text in one prompt. For most business documents, 10-20 files work well. Beyond that, chunk documents or switch to a RAG approach.
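If you do hit the limit, a simple character-based chunker is often enough before reaching for full RAG. The sketch below assumes the rough rule of thumb of ~4 characters per token; the function and its defaults are illustrative, not from the article.

```python
def chunk_text(text: str, max_tokens: int = 4000, overlap_tokens: int = 200) -> list[str]:
    """Split text into overlapping chunks sized by an approximate token budget.

    Assumes ~4 characters per token, which is a common rough estimate
    for English text with GPT-style tokenizers.
    """
    max_chars = max_tokens * 4
    overlap_chars = overlap_tokens * 4
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Overlap chunks so facts spanning a boundary appear in both.
        start = end - overlap_chars
    return chunks
```

You would then run cross_reference per chunk and merge the resulting fact lists, rather than sending everything in one prompt.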

Is it safe to send confidential documents to the API?

OpenAI’s API policy states that inputs aren’t used for model training. But data does transit through their servers. For sensitive documents, use a self-hosted model (Llama 3, Mistral) or Azure OpenAI with private networking. Check your org’s data governance policy first.

Can I add spreadsheet support?

Yes. CSV files already work (the text extractor reads them as plain text). For Excel, use openpyxl to read sheets and convert rows to text. Add .xlsx to the classifier and write an extract_from_excel function.
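Here is one way that extractor could look, assuming openpyxl's load_workbook/iter_rows API; the helper names mirror the article's other extractors but are my own sketch.

```python
def rows_to_text(rows) -> str:
    """Render an iterable of cell tuples as tab-separated lines, skipping empty rows."""
    lines = []
    for row in rows:
        cells = ["" if c is None else str(c) for c in row]
        if any(cells):
            lines.append("\t".join(cells))
    return "\n".join(lines)


def extract_from_excel(file_path: str) -> str:
    """Extract all sheets of an .xlsx workbook as plain text."""
    from openpyxl import load_workbook  # pip install openpyxl

    wb = load_workbook(file_path, read_only=True, data_only=True)
    parts = []
    for name in wb.sheetnames:
        ws = wb[name]
        parts.append(f"[Sheet: {name}]\n" + rows_to_text(ws.iter_rows(values_only=True)))
    wb.close()
    return "\n\n".join(parts)
```

Add ".xlsx" to the classifier and route it here, just like the docx exercise above.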
