
Building a RAG Agent with LangGraph — Retrieval-Augmented Generation Done Right

Written by Selva Prabhakaran | 31 min read

You ask your chatbot about your company’s HR policy. It confidently says employees get 30 vacation days. The actual policy says 20. That’s a hallucination. The RAG agent we’ll build with LangGraph is designed to catch it: it retrieves the right documents, grades their relevance, and verifies its own answer against the sources before responding.

What Is RAG and Why Do Agents Make It Better?

RAG stands for Retrieval-Augmented Generation. You give an LLM access to external documents so it answers questions from your data — not just its training knowledge.

The simplest version works like this: take the user’s question, search a vector database, stuff the top results into the prompt, and generate an answer. That’s “naive RAG.” It works well for straightforward questions over clean documents.

But it breaks down fast. What happens when the retrieved documents aren’t relevant? The LLM generates an answer anyway — often a confident-sounding wrong one. What if the question is ambiguous? Naive RAG doesn’t rephrase or retry.

KEY INSIGHT: Agentic RAG replaces the rigid retrieve-then-generate pipeline with a decision-making graph. The agent chooses whether to retrieve, evaluates what it got, and loops back to try again — just like a human researcher would.

Here’s how they compare:

Feature             | Naive RAG                       | Agentic RAG
--------------------|---------------------------------|-----------------------------------------
Retrieval           | Always retrieves, one pass      | Decides IF and WHEN to retrieve
Relevance check     | None — uses whatever comes back | Grades each document, discards junk
Query refinement    | None                            | Rewrites query if results are poor
Fallback sources    | None                            | Web search, alternative indexes
Hallucination check | None                            | Verifies answer against sources
Answer quality      | No verification                 | Checks if answer addresses the question
Error recovery      | Fails silently                  | Loops back and retries

The difference is control. Naive RAG is a straight pipe. Agentic RAG is a loop with decision points.
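
To make the contrast concrete, here is naive RAG reduced to its straight-pipe essence. This is a toy sketch: the keyword-overlap "retriever" and template "generator" are hypothetical stand-ins for a real vector store and LLM.

```python
# Toy sketch of naive RAG: retrieve once, generate once, no checks.
# The keyword-overlap retriever and template generator are stand-ins
# for a real vector store and LLM.

CORPUS = [
    "Employees receive 20 vacation days per year.",
    "Remote work is permitted 3 days per week.",
    "Health insurance covers medical, dental, and vision.",
]

def toy_retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        CORPUS,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def naive_rag(question: str) -> str:
    """Straight pipe: no grading, no retries, no verification."""
    context = "\n".join(toy_retrieve(question))
    return f"Answer based on:\n{context}"

print(naive_rag("How many vacation days do employees get?"))
```

Whatever the retriever returns goes straight into the answer. There is no point in this pipe where a bad match can be caught, which is exactly what the agentic version adds.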

Here’s the full pipeline we’ll build: route the question, retrieve documents, grade relevance, generate an answer, check for hallucinations, and verify answer quality. Six stages, all wired together in a LangGraph StateGraph.

Prerequisites

  • Python version: 3.10+
  • Required libraries: langgraph (0.4+), langchain (0.3+), langchain-openai, langchain-community, chromadb, tiktoken
  • Install: pip install langgraph langchain langchain-openai langchain-community chromadb tiktoken
  • API key: An OpenAI API key (set as OPENAI_API_KEY environment variable). See the OpenAI platform to create one.
  • Previous knowledge: Familiarity with LangGraph basics (nodes, edges, state). See our earlier posts on graph concepts and state management.
  • Time to complete: 35-40 minutes

Setting Up the Retrieval Pipeline

Before we build the agent, we need documents to retrieve from. We’ll create a small knowledge base, embed those documents, and store them in ChromaDB — a lightweight vector database that runs locally.

This first block imports everything we’ll use and sets up the API key. We’re pulling in LangGraph’s StateGraph for building the agent, LangChain’s document and embedding classes, and ChromaDB for vector storage.

python
import os
from typing import List, TypedDict
from langgraph.graph import StateGraph, END, START
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel, Field

os.environ["OPENAI_API_KEY"] = "your-api-key-here"

Next, we create sample documents representing an HR knowledge base. In a real project, you’d load these from PDFs, databases, or web pages. Each Document holds text content and metadata like the source file.

python
documents = [
    Document(
        page_content="Employees receive 20 vacation days per year. "
        "After 5 years of service, this increases to 25 days.",
        metadata={"source": "hr-policy.pdf", "section": "leave"},
    ),
    Document(
        page_content="The company matches 401(k) contributions up to 6% "
        "of the employee's salary. Vesting is immediate.",
        metadata={"source": "benefits-guide.pdf",
                  "section": "retirement"},
    ),
    Document(
        page_content="Remote work is permitted 3 days per week. Employees "
        "must be in-office on Tuesdays and Thursdays.",
        metadata={"source": "hr-policy.pdf",
                  "section": "remote-work"},
    ),
    Document(
        page_content="Performance reviews occur twice per year, in June "
        "and December. Managers use a 5-point rating scale.",
        metadata={"source": "hr-policy.pdf", "section": "reviews"},
    ),
    Document(
        page_content="Health insurance covers medical, dental, and vision. "
        "The company pays 80% of premiums for employees "
        "and 60% for dependents.",
        metadata={"source": "benefits-guide.pdf",
                  "section": "health"},
    ),
    Document(
        page_content="New employees complete a 90-day probation period. "
        "During probation, either party may terminate with "
        "one week's notice.",
        metadata={"source": "hr-policy.pdf",
                  "section": "onboarding"},
    ),
]

We embed those documents and store them in ChromaDB. The OpenAIEmbeddings model converts text into vectors. ChromaDB indexes those vectors for similarity search. The from_documents method handles embedding and indexing in one call.

python
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    collection_name="hr_knowledge_base",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

The retriever returns the top 3 most similar documents for any query. We’ll plug this retriever into our agent graph.

TIP: Choose k based on your context window budget. Each retrieved document eats tokens. With GPT-4, you have room. With smaller models, keep k at 2-3 to leave space for the system prompt.
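
A rough way to sanity-check that budget, using the common heuristic of about 4 characters per token (for exact counts, swap in a tokenizer like tiktoken; the function names here are hypothetical):

```python
# Rough token-budget check using the ~4 characters/token heuristic.
# For exact counts, use a real tokenizer such as tiktoken instead.

def estimate_tokens(text: str) -> int:
    """Approximate token count: about 4 characters per token."""
    return max(1, len(text) // 4)

def fits_budget(docs: list[str], budget: int = 3000) -> bool:
    """Check whether retrieved docs leave room for prompt + answer."""
    total = sum(estimate_tokens(d) for d in docs)
    return total <= budget

docs = ["Employees receive 20 vacation days per year."] * 3
print(fits_budget(docs, budget=3000))  # True — short docs fit easily
```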

Defining the Agent State

Every LangGraph application needs a state schema — a TypedDict that defines what data flows through the graph. Our RAG agent state tracks the question, retrieved documents, the generated answer, and control flags for routing.

python
class RAGState(TypedDict):
    question: str
    documents: List[Document]
    generation: str
    query_rewrite_count: int
    relevance_decision: str  # "relevant" or "not_relevant"
    hallucination_check: str  # "grounded" or "not_grounded"
    answer_quality: str  # "useful" or "not_useful"

Seven fields total. The first three hold data (question, documents, answer). The last four are control signals the agent uses to decide where to go next.
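
Each node returns only the keys it changed, and LangGraph merges that partial update into the full state. A simplified plain-Python sketch of that merge behavior (real LangGraph also supports reducer functions for keys that accumulate rather than overwrite):

```python
# Simplified sketch of how LangGraph applies a node's partial update:
# the node returns only the keys it changed, and those keys overwrite
# the matching entries in the current state.

state = {
    "question": "How many vacation days do I get?",
    "documents": [],
    "generation": "",
    "query_rewrite_count": 0,
}

# A node like grade_documents returns just its updated keys...
node_update = {
    "documents": ["<leave policy doc>"],
    "relevance_decision": "relevant",
}

# ...and the framework merges them over the existing state.
state = {**state, **node_update}

print(state["relevance_decision"])  # relevant
print(state["question"])            # unchanged
```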

Building the RAG Agent Graph Node by Node

This is where the real work happens. We’ll build six nodes — one for each pipeline stage. Then we wire them together with conditional edges.

Node 1: Query Router

Should this question go to the vector store, or can the LLM answer directly? A question like “How many vacation days do I get?” needs retrieval. “Hi, how are you?” doesn’t.

We use structured output to get a clean routing decision. The RouteDecision Pydantic model forces the LLM to return either “retrieve” or “direct_answer.”

python
class RouteDecision(BaseModel):
    """Route the question to retrieval or direct answer."""
    route: str = Field(
        description="Route to 'retrieve' for domain questions "
        "or 'direct_answer' for greetings and general chat"
    )

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

route_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a router. Given a user question, decide if it needs "
     "document retrieval or can be answered directly.\n"
     "- Use 'retrieve' for questions about company policies, "
     "benefits, HR topics, or anything domain-specific.\n"
     "- Use 'direct_answer' for greetings, general chat, or "
     "questions that don't need company documents.\n"
     "Respond with only the route."),
    ("human", "{question}"),
])

structured_llm_router = llm.with_structured_output(RouteDecision)

I prefer gpt-4o-mini for routing and grading calls. It’s fast, cheap, and reliable enough for binary decisions. Save the heavier models for generation.

Node 2: Retrieve Documents

The retrieval node queries the vector store and returns matching documents. It’s the simplest node in the graph.

python
def retrieve(state: RAGState) -> dict:
    """Retrieve documents from the vector store."""
    question = state["question"]
    docs = retriever.invoke(question)
    return {"documents": docs}

Short and clean. The retriever embeds the question, searches ChromaDB, and returns top-k results. We just pass them into the state.

Node 3: Grade Document Relevance

This node separates agentic RAG from naive RAG. Instead of accepting whatever the retriever returns, we check each document. The LLM reads each document alongside the question and gives a binary “yes” or “no.”

python
class RelevanceGrade(BaseModel):
    """Binary relevance grade for a retrieved document."""
    score: str = Field(
        description="'yes' if the document is relevant to the "
        "question, 'no' otherwise"
    )

grade_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a document relevance grader. Given a user question "
     "and a retrieved document, decide if the document contains "
     "information relevant to answering the question.\n"
     "Give a binary 'yes' or 'no' score."),
    ("human",
     "Question: {question}\n\nDocument: {document}"),
])

relevance_grader = grade_prompt | llm.with_structured_output(
    RelevanceGrade
)

The grade_documents function iterates over all retrieved documents, grades each one, and keeps only the relevant ones. If none survive, the agent rewrites the query.

python
def grade_documents(state: RAGState) -> dict:
    """Grade retrieved documents for relevance."""
    question = state["question"]
    docs = state["documents"]

    relevant_docs = []
    for doc in docs:
        result = relevance_grader.invoke({
            "question": question,
            "document": doc.page_content,
        })
        if result.score == "yes":
            relevant_docs.append(doc)

    decision = "relevant" if relevant_docs else "not_relevant"
    return {
        "documents": relevant_docs,
        "relevance_decision": decision,
    }

KEY INSIGHT: Document grading is the cheapest form of quality control in a RAG pipeline. A single grading call with GPT-4o-mini costs a fraction of a cent. Letting an irrelevant document pollute your generation costs the user’s trust.

Node 4: Generate Answer

When the agent has relevant documents, it generates an answer. The prompt instructs the LLM to use only the provided context. This “grounded generation” approach reduces hallucinations significantly.

python
generate_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are an assistant answering questions using the provided "
     "context. Use ONLY the information in the context to answer. "
     "If the context doesn't contain enough information, say so. "
     "Keep answers concise and direct."),
    ("human",
     "Question: {question}\n\nContext:\n{context}"),
])

generate_chain = generate_prompt | llm | StrOutputParser()


def generate(state: RAGState) -> dict:
    """Generate an answer from relevant documents."""
    question = state["question"]
    docs = state["documents"]
    context = "\n\n".join(doc.page_content for doc in docs)

    answer = generate_chain.invoke({
        "question": question,
        "context": context,
    })
    return {"generation": answer}

Node 5: Hallucination Check

The generated answer might sound right but say things the documents don’t support. This node compares the generation against source documents and asks: “Is every claim grounded?”

python
class HallucinationGrade(BaseModel):
    """Check if generation is grounded in the documents."""
    score: str = Field(
        description="'yes' if the answer is grounded in the "
        "documents, 'no' if it contains unsupported claims"
    )

hallucination_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a hallucination grader. Given a set of source "
     "documents and an LLM generation, determine if the "
     "generation is supported by the documents.\n"
     "Score 'yes' if all claims are grounded in the documents. "
     "Score 'no' if any claim is not supported."),
    ("human",
     "Documents:\n{documents}\n\nGeneration: {generation}"),
])

hallucination_grader = hallucination_prompt | llm.with_structured_output(
    HallucinationGrade
)


def check_hallucination(state: RAGState) -> dict:
    """Check if the generation is grounded in documents."""
    docs = state["documents"]
    generation = state["generation"]
    doc_text = "\n\n".join(doc.page_content for doc in docs)

    result = hallucination_grader.invoke({
        "documents": doc_text,
        "generation": generation,
    })

    return {
        "hallucination_check": (
            "grounded" if result.score == "yes"
            else "not_grounded"
        )
    }

Node 6: Answer Quality Check

An answer can be perfectly grounded yet still miss the point. This final check asks whether the response actually addresses what the user asked.

python
class AnswerGrade(BaseModel):
    """Check if the answer addresses the question."""
    score: str = Field(
        description="'yes' if the answer addresses the question, "
        "'no' if it misses the point"
    )

answer_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are an answer quality grader. Given a user question "
     "and an LLM generation, determine if the answer addresses "
     "the question.\n"
     "Score 'yes' if it is useful and relevant. "
     "Score 'no' if it doesn't answer what was asked."),
    ("human",
     "Question: {question}\n\nAnswer: {generation}"),
])

answer_grader = answer_prompt | llm.with_structured_output(
    AnswerGrade
)


def check_answer_quality(state: RAGState) -> dict:
    """Check if the generation answers the question."""
    question = state["question"]
    generation = state["generation"]

    result = answer_grader.invoke({
        "question": question,
        "generation": generation,
    })

    return {
        "answer_quality": (
            "useful" if result.score == "yes"
            else "not_useful"
        )
    }

The Query Rewrite Node

When relevance grading or quality checks fail, the agent rephrases the question for better retrieval. It also increments a counter so we don’t loop forever.

python
rewrite_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a query rewriter. Given a user question that "
     "didn't produce relevant search results, rewrite it to "
     "improve retrieval. Make it more specific or use "
     "different keywords. Return only the rewritten question."),
    ("human", "Original question: {question}"),
])

rewrite_chain = rewrite_prompt | llm | StrOutputParser()


def rewrite_query(state: RAGState) -> dict:
    """Rewrite the question for better retrieval."""
    question = state["question"]
    rewritten = rewrite_chain.invoke({"question": question})
    count = state.get("query_rewrite_count", 0)
    return {
        "question": rewritten,
        "query_rewrite_count": count + 1,
    }

WARNING: Always set a maximum rewrite limit. Without one, a question that genuinely can’t be answered from your documents will loop forever. Two retries is a good default — after that, return what you have or say “I don’t know.”

The Direct Answer Node

For questions that don’t need retrieval — greetings, general knowledge, off-topic chatter — we let the LLM answer directly.

python
def direct_answer(state: RAGState) -> dict:
    """Answer without retrieval for non-domain questions."""
    question = state["question"]
    response = llm.invoke(
        f"Answer this briefly: {question}"
    )
    return {"generation": response.content}

Wiring the Graph Together

All the nodes are built. Now we connect them with conditional edges. Three routing functions control the flow: route_question handles initial routing, decide_after_grading checks document relevance, and decide_after_checks manages hallucination and quality results.

python
def route_question(state: RAGState) -> str:
    """Route based on question type."""
    router_chain = route_prompt | structured_llm_router
    result = router_chain.invoke({"question": state["question"]})
    return result.route


def decide_after_grading(state: RAGState) -> str:
    """Decide next step based on document relevance."""
    if state["relevance_decision"] == "relevant":
        return "generate"
    count = state.get("query_rewrite_count", 0)
    if count >= 2:
        return "generate"  # proceed with what we have
    return "rewrite"


def decide_after_checks(state: RAGState) -> str:
    """Decide based on hallucination and quality checks."""
    if state["hallucination_check"] == "not_grounded":
        return "generate"  # regenerate
    if state["answer_quality"] == "not_useful":
        count = state.get("query_rewrite_count", 0)
        if count >= 2:
            return "finish"
        return "rewrite"
    return "finish"

Notice the correction strategy varies by failure type. A hallucinated answer means the documents were fine but generation went wrong, so we regenerate with the same documents. A low-quality answer suggests we retrieved the wrong documents, so we rewrite the query and re-retrieve. One caveat: the regenerate path has no counter of its own, so LangGraph’s recursion limit (25 steps by default) is the backstop against an answer that never grounds. For production, consider tracking a regeneration count in the state as well.
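
You can sanity-check this routing logic without any API calls by feeding the decision function mock states. The function is reproduced here from the graph code so the sketch is self-contained:

```python
# Exercise the routing logic with mock states — no LLM calls needed.
# decide_after_checks is reproduced from the graph code above.

def decide_after_checks(state: dict) -> str:
    """Decide based on hallucination and quality checks."""
    if state["hallucination_check"] == "not_grounded":
        return "generate"  # same docs, regenerate
    if state["answer_quality"] == "not_useful":
        if state.get("query_rewrite_count", 0) >= 2:
            return "finish"
        return "rewrite"
    return "finish"

# Hallucinated answer -> regenerate with the same documents.
print(decide_after_checks({"hallucination_check": "not_grounded",
                           "answer_quality": "useful"}))      # generate

# Grounded but off-target -> rewrite the query and re-retrieve.
print(decide_after_checks({"hallucination_check": "grounded",
                           "answer_quality": "not_useful"}))  # rewrite

# Grounded and useful -> done.
print(decide_after_checks({"hallucination_check": "grounded",
                           "answer_quality": "useful"}))      # finish
```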

Here’s the graph assembly. Each add_node registers a function. Each add_conditional_edges tells LangGraph how to route between nodes.

python
workflow = StateGraph(RAGState)

# Add all nodes
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("check_hallucination", check_hallucination)
workflow.add_node("check_answer_quality", check_answer_quality)
workflow.add_node("rewrite_query", rewrite_query)
workflow.add_node("direct_answer", direct_answer)

# Entry point: route the question
workflow.add_conditional_edges(
    START,
    route_question,
    {
        "retrieve": "retrieve",
        "direct_answer": "direct_answer",
    },
)

# After retrieval, grade documents
workflow.add_edge("retrieve", "grade_documents")

# After grading, decide: generate or rewrite
workflow.add_conditional_edges(
    "grade_documents",
    decide_after_grading,
    {
        "generate": "generate",
        "rewrite": "rewrite_query",
    },
)

# After rewriting, retrieve again
workflow.add_edge("rewrite_query", "retrieve")

# After generation, check hallucination
workflow.add_edge("generate", "check_hallucination")

# After hallucination check, check answer quality
workflow.add_edge("check_hallucination", "check_answer_quality")

# After quality check, decide: finish or retry
workflow.add_conditional_edges(
    "check_answer_quality",
    decide_after_checks,
    {
        "finish": END,
        "generate": "generate",
        "rewrite": "rewrite_query",
    },
)

# Direct answers go straight to END
workflow.add_edge("direct_answer", END)

# Compile the graph
rag_agent = workflow.compile()

That’s the complete graph: seven registered nodes, three routing functions, and conditional edges forming a self-correcting loop.

Running the RAG Agent in LangGraph

Let’s test with a domain question that triggers retrieval. The invoke method runs the full graph and returns the final state.

python
result = rag_agent.invoke({
    "question": "How many vacation days do employees get?",
    "documents": [],
    "generation": "",
    "query_rewrite_count": 0,
    "relevance_decision": "",
    "hallucination_check": "",
    "answer_quality": "",
})
print(result["generation"])
Output:

Employees receive 20 vacation days per year. After 5 years of service, this increases to 25 days.

The agent retrieved the leave policy document, graded it as relevant, generated a grounded answer, and passed both the hallucination and quality checks.

A greeting that should skip retrieval entirely:

python
result = rag_agent.invoke({
    "question": "Hello, how are you today?",
    "documents": [],
    "generation": "",
    "query_rewrite_count": 0,
    "relevance_decision": "",
    "hallucination_check": "",
    "answer_quality": "",
})
print(result["generation"])
Output:

Hello! I'm doing well, thanks for asking. How can I help you today?

This time the router sent the question straight to direct_answer, skipping the entire retrieval pipeline.

TIP: Use rag_agent.get_graph().draw_mermaid() to visualize your graph. It generates a Mermaid diagram showing all nodes and edges — invaluable for debugging flow issues.

Self-Corrective RAG Agent — The Loop That Fixes Itself

The self-corrective pattern is the most powerful idea here. When something goes wrong — irrelevant results, hallucinated answer, or a response that misses the point — the agent adjusts and retries.

The correction strategy depends on where the failure happened:

Failure Point              | Correction Strategy
---------------------------|------------------------------------------
No relevant documents      | Rewrite query, then re-retrieve
Hallucinated answer        | Re-generate with same documents
Answer misses the question | Rewrite query, re-retrieve, re-generate
Max retries exceeded       | Return best-effort answer with disclaimer

Why the different strategies? A hallucinated answer means the documents were fine but the LLM drifted. Re-generating usually fixes it. A low-quality answer means the wrong documents were retrieved. You need to go further back in the pipeline and search again.

KEY INSIGHT: Self-corrective RAG matches how experts actually research. They don’t stop at the first search result. They evaluate, refine their terms, and verify their conclusions. Building this loop into your agent makes it dramatically more reliable.

Adaptive RAG Agent — Routing to the Right Strategy

The agent we built handles one retrieval source. Real-world systems often need multiple sources depending on the question. Adaptive RAG extends the routing concept.

Instead of a binary “retrieve or don’t” decision, adaptive RAG routes to different strategies:

  • Vector search — for semantic similarity questions (“What’s our leave policy?”)
  • Web search — for current events not in your documents
  • SQL query — for structured data (“How many employees joined last quarter?”)
  • Direct LLM — for general knowledge questions

Here’s how to add web search as a fallback using Tavily, a search API designed for LLMs.

python
# pip install tavily-python
# os.environ["TAVILY_API_KEY"] = "your-tavily-key"

from langchain_community.tools.tavily_search import (
    TavilySearchResults,
)

web_search_tool = TavilySearchResults(max_results=3)


class AdaptiveRouteDecision(BaseModel):
    """Route to the appropriate retrieval strategy."""
    route: str = Field(
        description="One of: 'vectorstore', 'web_search', "
        "'direct_answer'"
    )

adaptive_route_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a query router for a company knowledge system.\n"
     "Route to 'vectorstore' for company policy, benefits, "
     "and HR questions.\n"
     "Route to 'web_search' for current events, industry "
     "trends, or information not in company docs.\n"
     "Route to 'direct_answer' for greetings and general chat."
     ),
    ("human", "{question}"),
])

The web search node converts results into Document objects so they flow through the same grading and generation pipeline. Regardless of where documents come from, downstream processing stays the same.

python
def web_search(state: RAGState) -> dict:
    """Search the web as a fallback retrieval source."""
    question = state["question"]
    results = web_search_tool.invoke({"query": question})
    web_docs = [
        Document(
            page_content=r["content"],
            metadata={"source": r["url"]},
        )
        for r in results
    ]
    return {"documents": web_docs}

RAG Agent with Source Citations

Let’s make the agent production-ready by adding source citations. Users need to verify answers, and citations give them a clear path to the original document.

The prompt tells the LLM to reference which documents it used. The function formats each document with its source metadata so the LLM can cite them properly.

python
citation_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a company knowledge assistant. Answer the "
     "question using ONLY the provided context.\n"
     "Rules:\n"
     "1. If the context answers the question, provide a "
     "clear, concise response.\n"
     "2. After your answer, list the sources you used.\n"
     "3. If the context doesn't contain enough information, "
     "say 'I don't have enough information to answer this "
     "question' and suggest who to contact.\n"
     "Format sources as: [Source: filename, section]"),
    ("human",
     "Question: {question}\n\nContext:\n{context}"),
])


def generate_with_citations(state: RAGState) -> dict:
    """Generate an answer with source citations."""
    question = state["question"]
    docs = state["documents"]

    context_parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        section = doc.metadata.get("section", "")
        context_parts.append(
            f"[From {source}, section: {section}]\n"
            f"{doc.page_content}"
        )
    context = "\n\n".join(context_parts)

    chain = citation_prompt | llm | StrOutputParser()
    answer = chain.invoke({
        "question": question,
        "context": context,
    })
    return {"generation": answer}

This produces answers like: “Employees receive 20 vacation days per year, increasing to 25 after 5 years. [Source: hr-policy.pdf, section: leave]”. Users can verify the answer against the original document.

WARNING: Don’t trust the LLM to get citations right 100% of the time. It sometimes attributes information to the wrong source. For critical applications, validate citations programmatically by checking which document actually contains the claimed text.
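
A minimal sketch of that programmatic check, assuming the [Source: filename, section] citation format used above: parse each cited source out of the answer and confirm it matches a document that was actually retrieved. The `Doc` class is a lightweight stand-in for LangChain's Document.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Lightweight stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def validate_citations(answer: str, docs: list) -> list:
    """Return cited sources that don't match any retrieved document."""
    cited = re.findall(r"\[Source:\s*([^,\]]+)", answer)
    known = {d.metadata.get("source") for d in docs}
    return [s.strip() for s in cited if s.strip() not in known]

docs = [Doc("Employees receive 20 vacation days per year.",
            {"source": "hr-policy.pdf", "section": "leave"})]

answer = ("Employees receive 20 vacation days per year. "
          "[Source: hr-policy.pdf, leave]")
print(validate_citations(answer, docs))  # [] -> all citations check out

bad = "You get 30 days. [Source: made-up.pdf, leave]"
print(validate_citations(bad, docs))     # ['made-up.pdf']
```

This only verifies that cited filenames exist among the retrieved documents; a stricter check would also confirm the claimed text actually appears in the cited document.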

RAG vs Fine-Tuning — When to Use Which

Should you use RAG or fine-tune the model on your data? They solve different problems.

Factor             | RAG                                       | Fine-Tuning
-------------------|-------------------------------------------|----------------------------------------------
Best for           | Factual Q&A over specific documents       | Changing the model’s style or domain knowledge
Data freshness     | Always current — update the vector store  | Static — retrain to update
Cost               | API calls per query + embedding storage   | One-time training cost, cheaper inference
Hallucination risk | Lower — answers grounded in documents     | Higher — generates from learned patterns
Setup complexity   | Moderate — needs vector store             | High — needs training pipeline

Use RAG when users ask questions about specific documents that change over time. Use fine-tuning when you want the model to follow a particular style or domain vocabulary. Many production systems combine both: fine-tune for the domain’s language, then use RAG for factual grounding.

Common Mistakes and How to Fix Them

Mistake 1: Skipping the relevance grading step

python
# WRONG: Directly use all retrieved docs — no filtering
def naive_generate(state):
    docs = state["documents"]
    context = "\n".join(d.page_content for d in docs)
    return generate_chain.invoke({
        "question": state["question"],
        "context": context,
    })

Why it’s wrong: The retriever returns documents by similarity score, not actual relevance. A document about “company day celebrations” might score high for “how many vacation days” because of the word “days.” Without grading, that noise pollutes your context.

Fix: Use the grade_documents node to filter before generating.

Mistake 2: No recursion limit on the correction loop

python
# WRONG: loops forever if docs don't exist
def decide_after_grading(state):
    if state["relevance_decision"] == "not_relevant":
        return "rewrite"  # no exit condition!
    return "generate"

Why it’s wrong: If the user asks about a topic not in your documents, the agent rewrites and retrieves endlessly. Each loop costs API calls and time.

Fix: Track query_rewrite_count in the state. After 2 retries, proceed with what you have or return “I don’t know.”

Mistake 3: Mismatched embedding models

python
# WRONG: indexing and querying with different models
index_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large"
)
vectorstore = Chroma.from_documents(docs, index_embeddings)

query_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)
retriever = vectorstore.as_retriever()
# Similarity scores will be meaningless!

Why it’s wrong: Different embedding models produce vectors in different spaces. Comparing a query vector from one model against document vectors from another gives meaningless similarity scores.

Fix: Always use the same embedding model for both indexing and querying.

Practice Exercise

Build an extended version of the RAG agent that adds a “confidence score” to each answer. The agent should rate its confidence as “high” (multiple relevant sources agree), “medium” (one source found), or “low” (answer is a best guess).

Solution:
python
class ConfidenceState(TypedDict):
    question: str
    documents: List[Document]
    generation: str
    query_rewrite_count: int
    relevance_decision: str
    hallucination_check: str
    answer_quality: str
    confidence: str


def assess_confidence(state: ConfidenceState) -> dict:
    """Assess answer confidence based on source coverage."""
    docs = state["documents"]
    hallucination = state["hallucination_check"]

    if hallucination == "not_grounded":
        return {"confidence": "low"}

    relevant_count = len(docs)
    if relevant_count >= 2:
        return {"confidence": "high"}
    elif relevant_count == 1:
        return {"confidence": "medium"}
    else:
        return {"confidence": "low"}

Add this node after `check_answer_quality` and before `END`. The confidence score tells users how much to trust the answer.

typescript
{
  type: 'exercise',
  id: 'rag-confidence-ex1',
  title: 'Exercise 1: Add Confidence Scoring',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function `assess_confidence` that takes the agent state and returns a confidence level. Return "high" if 2+ relevant documents were found and the answer is grounded, "medium" if exactly 1 relevant document was found, and "low" otherwise.',
  starterCode: 'def assess_confidence(state: dict) -> dict:\n    """Assess answer confidence based on sources."""\n    docs = state["documents"]\n    hallucination = state["hallucination_check"]\n    \n    # Your code here\n    # Return {"confidence": "high"/"medium"/"low"}\n    pass',
  testCases: [
    { id: 'tc1', input: 'result = assess_confidence({"documents": [doc1, doc2], "hallucination_check": "grounded"})\nprint(result["confidence"])', expectedOutput: 'high', description: '2 docs + grounded = high confidence' },
    { id: 'tc2', input: 'result = assess_confidence({"documents": [doc1], "hallucination_check": "grounded"})\nprint(result["confidence"])', expectedOutput: 'medium', description: '1 doc + grounded = medium confidence' },
    { id: 'tc3', input: 'result = assess_confidence({"documents": [], "hallucination_check": "not_grounded"})\nprint(result["confidence"])', expectedOutput: 'low', description: 'no docs or not grounded = low', hidden: true },
  ],
  hints: [
    'Check hallucination_check first — if "not_grounded", confidence is always "low"',
    'Then count docs: len(docs) >= 2 means "high", == 1 means "medium", else "low"',
  ],
  solution: 'def assess_confidence(state: dict) -> dict:\n    docs = state["documents"]\n    hallucination = state["hallucination_check"]\n    if hallucination == "not_grounded":\n        return {"confidence": "low"}\n    if len(docs) >= 2:\n        return {"confidence": "high"}\n    elif len(docs) == 1:\n        return {"confidence": "medium"}\n    return {"confidence": "low"}',
  solutionExplanation: 'The function checks grounding first. If the answer is not grounded, confidence is automatically low. Then it uses document count as a proxy — more corroborating sources means higher confidence.',
  xpReward: 15,
}

typescript
{
  type: 'exercise',
  id: 'rag-routing-ex2',
  title: 'Exercise 2: Build a Three-Way Router',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a `route_question` function that returns "vectorstore" for company/HR questions, "web_search" for current events or news, and "direct_answer" for greetings. Use simple keyword matching (no LLM needed).',
  starterCode: 'def route_question(question: str) -> str:\n    """Route question to the right retrieval strategy."""\n    question_lower = question.lower()\n    \n    # Your code here\n    # Return "vectorstore", "web_search", or "direct_answer"\n    pass',
  testCases: [
    { id: 'tc1', input: 'print(route_question("What is the vacation policy?"))', expectedOutput: 'vectorstore', description: 'HR question routes to vectorstore' },
    { id: 'tc2', input: 'print(route_question("What are the latest AI news?"))', expectedOutput: 'web_search', description: 'News question routes to web search' },
    { id: 'tc3', input: 'print(route_question("Hello!"))', expectedOutput: 'direct_answer', description: 'Greeting routes to direct answer' },
  ],
  hints: [
    'Check for HR-related keywords like "policy", "benefits", "vacation", "salary" for vectorstore routing',
    'Check for news keywords like "latest", "news", "current", "today" for web_search. Default to "direct_answer".',
  ],
  solution: 'def route_question(question: str) -> str:\n    question_lower = question.lower()\n    hr_keywords = ["policy", "benefits", "vacation", "salary", "hr", "leave", "insurance"]\n    web_keywords = ["latest", "news", "current", "today", "recent", "trending"]\n    if any(kw in question_lower for kw in hr_keywords):\n        return "vectorstore"\n    if any(kw in question_lower for kw in web_keywords):\n        return "web_search"\n    return "direct_answer"',
  solutionExplanation: 'This keyword-based router checks domain terms first, then news terms, and defaults to direct answer. In production, you would use an LLM for more nuanced routing.',
  xpReward: 15,
}

Complete Code

Click to expand the full script (copy-paste and run)
python
# Complete code: Building a RAG Agent with LangGraph
# Requires: pip install langgraph langchain langchain-openai
#           langchain-community chromadb tiktoken
# Python 3.10+

import os
from typing import List, TypedDict
from langgraph.graph import StateGraph, END, START
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel, Field

os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# --- Documents and Vector Store ---
documents = [
    Document(
        page_content="Employees receive 20 vacation days per year. "
        "After 5 years of service, this increases to 25 days.",
        metadata={"source": "hr-policy.pdf", "section": "leave"},
    ),
    Document(
        page_content="The company matches 401(k) contributions up to "
        "6% of the employee's salary. Vesting is immediate.",
        metadata={"source": "benefits-guide.pdf",
                  "section": "retirement"},
    ),
    Document(
        page_content="Remote work is permitted 3 days per week. "
        "Employees must be in-office Tuesdays and Thursdays.",
        metadata={"source": "hr-policy.pdf",
                  "section": "remote-work"},
    ),
    Document(
        page_content="Performance reviews occur twice per year, in "
        "June and December. Managers use a 5-point rating scale.",
        metadata={"source": "hr-policy.pdf",
                  "section": "reviews"},
    ),
    Document(
        page_content="Health insurance covers medical, dental, and "
        "vision. The company pays 80% of premiums for employees "
        "and 60% for dependents.",
        metadata={"source": "benefits-guide.pdf",
                  "section": "health"},
    ),
    Document(
        page_content="New employees complete a 90-day probation "
        "period. During probation, either party may terminate "
        "with one week's notice.",
        metadata={"source": "hr-policy.pdf",
                  "section": "onboarding"},
    ),
]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    collection_name="hr_knowledge_base",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# --- State Definition ---
class RAGState(TypedDict):
    question: str
    documents: List[Document]
    generation: str
    query_rewrite_count: int
    relevance_decision: str
    hallucination_check: str
    answer_quality: str

# --- LLM and Graders ---
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

class RouteDecision(BaseModel):
    route: str = Field(
        description="'retrieve' or 'direct_answer'"
    )

route_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a router. Route to 'retrieve' for company "
     "policy/HR questions, 'direct_answer' for greetings."),
    ("human", "{question}"),
])
structured_llm_router = llm.with_structured_output(
    RouteDecision
)

class RelevanceGrade(BaseModel):
    score: str = Field(description="'yes' or 'no'")

grade_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Grade if this document is relevant to the question. "
     "Binary 'yes' or 'no'."),
    ("human", "Question: {question}\nDocument: {document}"),
])
relevance_grader = grade_prompt | llm.with_structured_output(
    RelevanceGrade
)

generate_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer using ONLY the provided context. Be concise."),
    ("human", "Question: {question}\nContext:\n{context}"),
])
generate_chain = generate_prompt | llm | StrOutputParser()

class HallucinationGrade(BaseModel):
    score: str = Field(description="'yes' or 'no'")

hallucination_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Is this generation grounded in the documents? "
     "'yes' or 'no'."),
    ("human",
     "Documents:\n{documents}\nGeneration: {generation}"),
])
hallucination_grader = (
    hallucination_prompt
    | llm.with_structured_output(HallucinationGrade)
)

class AnswerGrade(BaseModel):
    score: str = Field(description="'yes' or 'no'")

answer_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Does this answer address the question? 'yes' or 'no'."),
    ("human",
     "Question: {question}\nAnswer: {generation}"),
])
answer_grader = (
    answer_prompt
    | llm.with_structured_output(AnswerGrade)
)

rewrite_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Rewrite this question for better search results. "
     "Return only the rewritten question."),
    ("human", "Original question: {question}"),
])
rewrite_chain = rewrite_prompt | llm | StrOutputParser()

# --- Node Functions ---
def retrieve(state: RAGState) -> dict:
    docs = retriever.invoke(state["question"])
    return {"documents": docs}

def grade_documents(state: RAGState) -> dict:
    question = state["question"]
    relevant = []
    for doc in state["documents"]:
        result = relevance_grader.invoke({
            "question": question,
            "document": doc.page_content,
        })
        if result.score == "yes":
            relevant.append(doc)
    decision = "relevant" if relevant else "not_relevant"
    return {"documents": relevant,
            "relevance_decision": decision}

def generate(state: RAGState) -> dict:
    context = "\n\n".join(
        d.page_content for d in state["documents"]
    )
    answer = generate_chain.invoke({
        "question": state["question"],
        "context": context,
    })
    return {"generation": answer}

def check_hallucination(state: RAGState) -> dict:
    doc_text = "\n\n".join(
        d.page_content for d in state["documents"]
    )
    result = hallucination_grader.invoke({
        "documents": doc_text,
        "generation": state["generation"],
    })
    check = (
        "grounded" if result.score == "yes"
        else "not_grounded"
    )
    return {"hallucination_check": check}

def check_answer_quality(state: RAGState) -> dict:
    result = answer_grader.invoke({
        "question": state["question"],
        "generation": state["generation"],
    })
    quality = (
        "useful" if result.score == "yes"
        else "not_useful"
    )
    return {"answer_quality": quality}

def rewrite_query(state: RAGState) -> dict:
    rewritten = rewrite_chain.invoke({
        "question": state["question"]
    })
    count = state.get("query_rewrite_count", 0)
    return {"question": rewritten,
            "query_rewrite_count": count + 1}

def direct_answer(state: RAGState) -> dict:
    response = llm.invoke(
        f"Answer briefly: {state['question']}"
    )
    return {"generation": response.content}

# --- Routing Functions ---
def route_question(state: RAGState) -> str:
    chain = route_prompt | structured_llm_router
    result = chain.invoke({"question": state["question"]})
    return result.route

def decide_after_grading(state: RAGState) -> str:
    if state["relevance_decision"] == "relevant":
        return "generate"
    count = state.get("query_rewrite_count", 0)
    if count >= 2:
        return "generate"
    return "rewrite"

def decide_after_checks(state: RAGState) -> str:
    count = state.get("query_rewrite_count", 0)
    if state["hallucination_check"] == "not_grounded":
        # At temperature=0, regenerating from identical inputs just
        # repeats the same output. Change the inputs instead: rewrite
        # the query and retrieve again, within the retry budget.
        return "finish" if count >= 2 else "rewrite"
    if state["answer_quality"] == "not_useful":
        return "finish" if count >= 2 else "rewrite"
    return "finish"

# --- Build Graph ---
workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("check_hallucination", check_hallucination)
workflow.add_node("check_answer_quality", check_answer_quality)
workflow.add_node("rewrite_query", rewrite_query)
workflow.add_node("direct_answer", direct_answer)

workflow.add_conditional_edges(
    START, route_question,
    {"retrieve": "retrieve",
     "direct_answer": "direct_answer"},
)
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents", decide_after_grading,
    {"generate": "generate", "rewrite": "rewrite_query"},
)
workflow.add_edge("rewrite_query", "retrieve")
workflow.add_edge("generate", "check_hallucination")
workflow.add_edge(
    "check_hallucination", "check_answer_quality"
)
workflow.add_conditional_edges(
    "check_answer_quality", decide_after_checks,
    {"finish": END, "generate": "generate",
     "rewrite": "rewrite_query"},
)
workflow.add_edge("direct_answer", END)

rag_agent = workflow.compile()

# --- Run ---
result = rag_agent.invoke({
    "question": "How many vacation days do employees get?",
    "documents": [],
    "generation": "",
    "query_rewrite_count": 0,
    "relevance_decision": "",
    "hallucination_check": "",
    "answer_quality": "",
})
print(result["generation"])

Summary

You’ve built a complete RAG agent with LangGraph. It goes far beyond simple retrieve-and-generate by adding intelligent routing, document grading, hallucination detection, and self-correction.

The key patterns to remember:

  • Route first — don’t retrieve when you don’t need to
  • Grade documents — never trust raw retriever results blindly
  • Check your answers — hallucination and quality checks cost pennies and save trust
  • Limit your loops — always set a maximum retry count
  • Cite sources — traceability builds user confidence

These patterns compose well. You can add web search, multi-source retrieval, or confidence scoring without restructuring the graph. Each new capability is just another node with the right conditional edges.
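LangGraph handles this wiring for you, but the underlying idea is simple enough to sketch in plain Python. The toy runner below is illustrative only (it is not LangGraph's API, and all names in it are made up): each capability is one entry in a node table plus one routing function, and the step cap shows why bounding loops matters.

```python
def run_graph(nodes, edges, state, start, max_steps=20):
    """Tiny state-machine runner: each node mutates state, then the
    matching edge function inspects state and names the next node."""
    current = start
    for _ in range(max_steps):
        nodes[current](state)
        nxt = edges[current](state)
        if nxt == "END":
            return state
        current = nxt
    # Always bound your loops: a stuck grader should fail loudly,
    # not spin until a timeout.
    raise RuntimeError("loop limit reached")

# Adding a capability = one more node entry + one more edge function.
nodes = {
    "retrieve": lambda s: s.update(docs=["policy text"]),
    "generate": lambda s: s.update(answer=f"{len(s['docs'])} doc(s) used"),
}
edges = {
    "retrieve": lambda s: "generate" if s["docs"] else "END",
    "generate": lambda s: "END",
}
state = run_graph(nodes, edges, {"docs": [], "answer": ""}, "retrieve")
print(state["answer"])  # -> 1 doc(s) used
```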

Frequently Asked Questions

How much does running a RAG agent cost compared to a simple LLM call?

Each invocation makes multiple LLM calls: routing, grading per document, generation, hallucination check, and quality check. With GPT-4o-mini, a typical query costs roughly 5-10x more than a single call. That’s still under $0.01 per query for most cases. The accuracy improvement usually justifies the cost.
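To see why this stays cheap, here is a back-of-envelope estimate. The per-token prices and token counts below are rough assumptions for illustration, not quoted rates:

```python
# Rough per-query cost for the agent's LLM calls. Prices and token
# counts are illustrative assumptions, not official OpenAI rates.
PRICE_IN = 0.15 / 1_000_000   # assumed $/input token (gpt-4o-mini class)
PRICE_OUT = 0.60 / 1_000_000  # assumed $/output token

# (input_tokens, output_tokens) per call: router, 3 document grades,
# generation, hallucination check, answer-quality check
calls = [(100, 10), (300, 5), (300, 5), (300, 5),
         (900, 120), (1000, 5), (200, 5)]

cost = sum(i * PRICE_IN + o * PRICE_OUT for i, o in calls)
print(f"~${cost:.4f} per query")  # well under a cent
```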

Can I use open-source models instead of OpenAI?

Yes. Replace ChatOpenAI with any LangChain-compatible chat model. Ollama works well for local models — use ChatOllama(model="llama3"). The grading nodes need reliable instruction-following, so test your model before deploying. Smaller models sometimes struggle with structured output.

How do I handle documents that are too long for the context window?

Split documents into chunks before indexing. LangChain’s RecursiveCharacterTextSplitter is the standard choice. Set chunk size to 500-1000 characters with 100-200 character overlap.

python
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
)
chunks = splitter.split_documents(documents)

What’s the difference between Corrective RAG and Adaptive RAG?

Corrective RAG focuses on fixing bad retrievals — it grades documents and rewrites queries when results are poor. Adaptive RAG adds intelligent routing — it picks the best retrieval strategy based on question type. Our agent combines both. The LangGraph docs call these “CRAG” and “Adaptive RAG” respectively.

How do I evaluate whether my RAG agent is performing well?

Track three metrics: retrieval precision (percentage of relevant retrieved docs), answer faithfulness (is the answer grounded), and answer relevance (does it address the question). Tools like RAGAS and LangSmith automate evaluation for all three. Start by logging every run and reviewing edge cases.
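Retrieval precision is simple enough to compute yourself from logged runs before reaching for a framework. A minimal sketch (the document IDs are hypothetical):

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)

# 2 of the 3 retrieved docs were labeled relevant
print(round(retrieval_precision(["d1", "d2", "d3"], {"d1", "d3"}), 2))  # 0.67
```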

References

  1. LangGraph documentation — Agentic RAG tutorial.
  2. LangGraph documentation — Adaptive RAG tutorial.
  3. LangChain blog — Self-Reflective RAG with LangGraph.
  4. Yan, S. et al. — Corrective Retrieval Augmented Generation (CRAG). arXiv:2401.15884 (2024).
  5. Jeong, S. et al. — Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. arXiv:2403.14403 (2024).
  6. Lewis, P. et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS (2020).
  7. ChromaDB documentation.
  8. OpenAI Embeddings documentation.
  9. LangChain documentation — Text Splitters.
  10. Es, S. et al. — RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217 (2023).

Reviewed: March 2026 | LangGraph version: 0.4+ | LangChain version: 0.3+
