Building a RAG Agent with LangGraph — Retrieval-Augmented Generation Done Right
You ask your chatbot about your company’s HR policy. It confidently says employees get 30 vacation days. The actual policy says 20. That’s a hallucination — and it’s exactly the kind of mistake a RAG agent built with LangGraph is designed to catch. It retrieves the right documents, grades their relevance, and verifies its own answer before responding.
What Is RAG and Why Do Agents Make It Better?
RAG stands for Retrieval-Augmented Generation. You give an LLM access to external documents so it answers questions from your data — not just its training knowledge.
The simplest version works like this: take the user’s question, search a vector database, stuff the top results into the prompt, and generate an answer. That’s “naive RAG.” It works well for straightforward questions over clean documents.
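Here’s that naive pipeline as a toy sketch, with keyword overlap standing in for vector similarity. Everything in it is illustrative; the real versions of these pieces come later in the post:

```python
# Toy naive RAG: retrieve by crude keyword overlap, stuff into a prompt.
# Illustrative only -- real vector search and an LLM come later.
corpus = [
    "Employees receive 20 vacation days per year.",
    "Remote work is permitted 3 days per week.",
    "Health insurance covers medical, dental, and vision.",
]

def retrieve_top(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank docs by how many query words they share."""
    q_words = set(query.lower().split())
    return sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def naive_rag(query: str) -> str:
    """One pass: retrieve, stuff, 'generate' (here, just echo context)."""
    context = "\n".join(retrieve_top(query, corpus))
    return f"Answer based on:\n{context}"

print(naive_rag("How many vacation days do employees get?"))
```

Note there is no decision point anywhere: whatever comes back from retrieval goes straight into the answer.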
But it breaks down fast. What happens when the retrieved documents aren’t relevant? The LLM generates an answer anyway — often a confident-sounding wrong one. What if the question is ambiguous? Naive RAG doesn’t rephrase or retry.
KEY INSIGHT: Agentic RAG replaces the rigid retrieve-then-generate pipeline with a decision-making graph. The agent chooses whether to retrieve, evaluates what it got, and loops back to try again — just like a human researcher would.
Here’s how they compare:
| Feature | Naive RAG | Agentic RAG |
|---|---|---|
| Retrieval | Always retrieves, one pass | Decides IF and WHEN to retrieve |
| Relevance check | None — uses whatever comes back | Grades each document, discards junk |
| Query refinement | None | Rewrites query if results are poor |
| Fallback sources | None | Web search, alternative indexes |
| Hallucination check | None | Verifies answer against sources |
| Answer quality | No verification | Checks if answer addresses the question |
| Error recovery | Fails silently | Loops back and retries |
The difference is control. Naive RAG is a straight pipe. Agentic RAG is a loop with decision points.
Here’s the full pipeline we’ll build: route the question, retrieve documents, grade relevance, generate an answer, check for hallucinations, and verify answer quality. Six stages, all wired together in a LangGraph StateGraph.
Prerequisites
- Python version: 3.10+
- Required libraries: langgraph (0.4+), langchain (0.3+), langchain-openai, langchain-community, chromadb, tiktoken
- Install: `pip install langgraph langchain langchain-openai langchain-community chromadb tiktoken`
- API key: An OpenAI API key (set as the `OPENAI_API_KEY` environment variable). See the OpenAI platform to create one.
- Previous knowledge: Familiarity with LangGraph basics (nodes, edges, state). See our earlier posts on graph concepts and state management.
- Time to complete: 35-40 minutes
Setting Up the Retrieval Pipeline
Before we build the agent, we need documents to retrieve from. We’ll create a small knowledge base, embed those documents, and store them in ChromaDB — a lightweight vector database that runs locally.
This first block imports everything we’ll use and sets up the API key. We’re pulling in LangGraph’s StateGraph for building the agent, LangChain’s document and embedding classes, and ChromaDB for vector storage.
import os
from typing import List, TypedDict
from langgraph.graph import StateGraph, END, START
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel, Field
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
Next, we create sample documents representing an HR knowledge base. In a real project, you’d load these from PDFs, databases, or web pages. Each Document holds text content and metadata like the source file.
documents = [
Document(
page_content="Employees receive 20 vacation days per year. "
"After 5 years of service, this increases to 25 days.",
metadata={"source": "hr-policy.pdf", "section": "leave"},
),
Document(
page_content="The company matches 401(k) contributions up to 6% "
"of the employee's salary. Vesting is immediate.",
metadata={"source": "benefits-guide.pdf",
"section": "retirement"},
),
Document(
page_content="Remote work is permitted 3 days per week. Employees "
"must be in-office on Tuesdays and Thursdays.",
metadata={"source": "hr-policy.pdf",
"section": "remote-work"},
),
Document(
page_content="Performance reviews occur twice per year, in June "
"and December. Managers use a 5-point rating scale.",
metadata={"source": "hr-policy.pdf", "section": "reviews"},
),
Document(
page_content="Health insurance covers medical, dental, and vision. "
"The company pays 80% of premiums for employees "
"and 60% for dependents.",
metadata={"source": "benefits-guide.pdf",
"section": "health"},
),
Document(
page_content="New employees complete a 90-day probation period. "
"During probation, either party may terminate with "
"one week's notice.",
metadata={"source": "hr-policy.pdf",
"section": "onboarding"},
),
]
We embed those documents and store them in ChromaDB. The OpenAIEmbeddings model converts text into vectors. ChromaDB indexes those vectors for similarity search. The from_documents method handles embedding and indexing in one call.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=documents,
embedding=embeddings,
collection_name="hr_knowledge_base",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
The retriever returns the top 3 most similar documents for any query. We’ll plug this retriever into our agent graph.
TIP: Choose `k` based on your context window budget. Each retrieved document eats tokens. With GPT-4, you have room. With smaller models, keep `k` at 2-3 to leave space for the system prompt.
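To put numbers on that budget, here’s a back-of-envelope sizing sketch using the common approximation of roughly 4 characters per token for English text. Use tiktoken when you need exact counts; everything below is a heuristic, not a real tokenizer:

```python
# Rough heuristic: ~4 characters per token for English text.
# For exact counts, use tiktoken instead.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def max_docs_for_budget(docs: list[str], budget_tokens: int) -> int:
    """How many documents fit before the context budget runs out."""
    used, count = 0, 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if used + cost > budget_tokens:
            break
        used += cost
        count += 1
    return count

docs = ["x" * 400, "y" * 400, "z" * 400]  # ~100 tokens each
print(max_docs_for_budget(docs, budget_tokens=250))  # 2
```

With a 250-token context budget, only two of the three ~100-token documents fit, so `k=2` would be the right setting here.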
Defining the Agent State
Every LangGraph application needs a state schema — a TypedDict that defines what data flows through the graph. Our RAG agent state tracks the question, retrieved documents, the generated answer, and control flags for routing.
class RAGState(TypedDict):
question: str
documents: List[Document]
generation: str
query_rewrite_count: int
relevance_decision: str # "relevant" or "not_relevant"
hallucination_check: str # "grounded" or "not_grounded"
answer_quality: str # "useful" or "not_useful"
Seven fields total. The first three hold data (question, documents, answer). The last four are control signals the agent uses to decide where to go next.
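A note on how these fields get updated: nodes never mutate the state directly. Each node returns a dict containing only the keys it changed, and LangGraph shallow-merges that dict into the state. A minimal plain-Python sketch of those merge semantics (illustrative names, no LangGraph required):

```python
from typing import List, TypedDict

class State(TypedDict, total=False):
    question: str
    documents: List[str]
    relevance_decision: str

def apply_update(state: State, update: dict) -> State:
    # LangGraph-style shallow merge: untouched keys carry over unchanged
    return {**state, **update}

s: State = {"question": "How many vacation days?", "documents": []}
s = apply_update(s, {"documents": ["leave policy excerpt"],
                     "relevance_decision": "relevant"})
print(s["relevance_decision"])  # relevant
print(s["question"])            # How many vacation days?
```

This is why each node below returns a small dict like `{"documents": docs}` rather than the whole state.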
Building the RAG Agent Graph Node by Node
This is where the real work happens. We’ll build six nodes, one for each pipeline stage, plus two helpers for query rewriting and direct answers. Then we wire them together with conditional edges.
Node 1: Query Router
Should this question go to the vector store, or can the LLM answer directly? A question like “How many vacation days do I get?” needs retrieval. “Hi, how are you?” doesn’t.
We use structured output to get a clean routing decision. The RouteDecision Pydantic model forces the LLM to return either “retrieve” or “direct_answer.”
class RouteDecision(BaseModel):
"""Route the question to retrieval or direct answer."""
route: str = Field(
description="Route to 'retrieve' for domain questions "
"or 'direct_answer' for greetings and general chat"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
route_prompt = ChatPromptTemplate.from_messages([
("system",
"You are a router. Given a user question, decide if it needs "
"document retrieval or can be answered directly.\n"
"- Use 'retrieve' for questions about company policies, "
"benefits, HR topics, or anything domain-specific.\n"
"- Use 'direct_answer' for greetings, general chat, or "
"questions that don't need company documents.\n"
"Respond with only the route."),
("human", "{question}"),
])
structured_llm_router = llm.with_structured_output(RouteDecision)
I prefer gpt-4o-mini for routing and grading calls. It’s fast, cheap, and reliable enough for binary decisions. Save the heavier models for generation.
Node 2: Retrieve Documents
The retrieval node queries the vector store and returns matching documents. It’s the simplest node in the graph.
def retrieve(state: RAGState) -> dict:
"""Retrieve documents from the vector store."""
question = state["question"]
docs = retriever.invoke(question)
return {"documents": docs}
Short and clean. The retriever embeds the question, searches ChromaDB, and returns top-k results. We just pass them into the state.
Node 3: Grade Document Relevance
This node separates agentic RAG from naive RAG. Instead of accepting whatever the retriever returns, we check each document. The LLM reads each document alongside the question and gives a binary “yes” or “no.”
class RelevanceGrade(BaseModel):
"""Binary relevance grade for a retrieved document."""
score: str = Field(
description="'yes' if the document is relevant to the "
"question, 'no' otherwise"
)
grade_prompt = ChatPromptTemplate.from_messages([
("system",
"You are a document relevance grader. Given a user question "
"and a retrieved document, decide if the document contains "
"information relevant to answering the question.\n"
"Give a binary 'yes' or 'no' score."),
("human",
"Question: {question}\n\nDocument: {document}"),
])
relevance_grader = grade_prompt | llm.with_structured_output(
RelevanceGrade
)
The grade_documents function iterates over all retrieved documents, grades each one, and keeps only the relevant ones. If none survive, the agent rewrites the query.
def grade_documents(state: RAGState) -> dict:
"""Grade retrieved documents for relevance."""
question = state["question"]
docs = state["documents"]
relevant_docs = []
for doc in docs:
result = relevance_grader.invoke({
"question": question,
"document": doc.page_content,
})
if result.score == "yes":
relevant_docs.append(doc)
decision = "relevant" if relevant_docs else "not_relevant"
return {
"documents": relevant_docs,
"relevance_decision": decision,
}
KEY INSIGHT: Document grading is the cheapest form of quality control in a RAG pipeline. A single grading call with GPT-4o-mini costs a fraction of a cent. Letting an irrelevant document pollute your generation costs the user’s trust.
Node 4: Generate Answer
When the agent has relevant documents, it generates an answer. The prompt instructs the LLM to use only the provided context. This “grounded generation” approach reduces hallucinations significantly.
generate_prompt = ChatPromptTemplate.from_messages([
("system",
"You are an assistant answering questions using the provided "
"context. Use ONLY the information in the context to answer. "
"If the context doesn't contain enough information, say so. "
"Keep answers concise and direct."),
("human",
"Question: {question}\n\nContext:\n{context}"),
])
generate_chain = generate_prompt | llm | StrOutputParser()
def generate(state: RAGState) -> dict:
"""Generate an answer from relevant documents."""
question = state["question"]
docs = state["documents"]
context = "\n\n".join(doc.page_content for doc in docs)
answer = generate_chain.invoke({
"question": question,
"context": context,
})
return {"generation": answer}
Node 5: Hallucination Check
The generated answer might sound right but say things the documents don’t support. This node compares the generation against source documents and asks: “Is every claim grounded?”
class HallucinationGrade(BaseModel):
"""Check if generation is grounded in the documents."""
score: str = Field(
description="'yes' if the answer is grounded in the "
"documents, 'no' if it contains unsupported claims"
)
hallucination_prompt = ChatPromptTemplate.from_messages([
("system",
"You are a hallucination grader. Given a set of source "
"documents and an LLM generation, determine if the "
"generation is supported by the documents.\n"
"Score 'yes' if all claims are grounded in the documents. "
"Score 'no' if any claim is not supported."),
("human",
"Documents:\n{documents}\n\nGeneration: {generation}"),
])
hallucination_grader = hallucination_prompt | llm.with_structured_output(
HallucinationGrade
)
def check_hallucination(state: RAGState) -> dict:
"""Check if the generation is grounded in documents."""
docs = state["documents"]
generation = state["generation"]
doc_text = "\n\n".join(doc.page_content for doc in docs)
result = hallucination_grader.invoke({
"documents": doc_text,
"generation": generation,
})
return {
"hallucination_check": (
"grounded" if result.score == "yes"
else "not_grounded"
)
}
Node 6: Answer Quality Check
An answer can be perfectly grounded yet still miss the point. This final check asks whether the response actually addresses what the user asked.
class AnswerGrade(BaseModel):
"""Check if the answer addresses the question."""
score: str = Field(
description="'yes' if the answer addresses the question, "
"'no' if it misses the point"
)
answer_prompt = ChatPromptTemplate.from_messages([
("system",
"You are an answer quality grader. Given a user question "
"and an LLM generation, determine if the answer addresses "
"the question.\n"
"Score 'yes' if it is useful and relevant. "
"Score 'no' if it doesn't answer what was asked."),
("human",
"Question: {question}\n\nAnswer: {generation}"),
])
answer_grader = answer_prompt | llm.with_structured_output(
AnswerGrade
)
def check_answer_quality(state: RAGState) -> dict:
"""Check if the generation answers the question."""
question = state["question"]
generation = state["generation"]
result = answer_grader.invoke({
"question": question,
"generation": generation,
})
return {
"answer_quality": (
"useful" if result.score == "yes"
else "not_useful"
)
}
The Query Rewrite Node
When relevance grading or quality checks fail, the agent rephrases the question for better retrieval. It also increments a counter so we don’t loop forever.
rewrite_prompt = ChatPromptTemplate.from_messages([
("system",
"You are a query rewriter. Given a user question that "
"didn't produce relevant search results, rewrite it to "
"improve retrieval. Make it more specific or use "
"different keywords. Return only the rewritten question."),
("human", "Original question: {question}"),
])
rewrite_chain = rewrite_prompt | llm | StrOutputParser()
def rewrite_query(state: RAGState) -> dict:
"""Rewrite the question for better retrieval."""
question = state["question"]
rewritten = rewrite_chain.invoke({"question": question})
count = state.get("query_rewrite_count", 0)
return {
"question": rewritten,
"query_rewrite_count": count + 1,
}
WARNING: Always set a maximum rewrite limit. Without one, a question that genuinely can’t be answered from your documents will loop forever. Two retries is a good default — after that, return what you have or say “I don’t know.”
The Direct Answer Node
For questions that don’t need retrieval — greetings, general knowledge, off-topic chatter — we let the LLM answer directly.
def direct_answer(state: RAGState) -> dict:
"""Answer without retrieval for non-domain questions."""
question = state["question"]
response = llm.invoke(
f"Answer this briefly: {question}"
)
return {"generation": response.content}
Wiring the Graph Together
All the nodes are built. Now we connect them with conditional edges. Three routing functions control the flow: route_question handles initial routing, decide_after_grading checks document relevance, and decide_after_checks manages hallucination and quality results.
def route_question(state: RAGState) -> str:
    """Route based on question type."""
    question = state["question"]
    result = (route_prompt | structured_llm_router).invoke(
        {"question": question}
    )
    return result.route
def decide_after_grading(state: RAGState) -> str:
"""Decide next step based on document relevance."""
if state["relevance_decision"] == "relevant":
return "generate"
count = state.get("query_rewrite_count", 0)
if count >= 2:
return "generate" # proceed with what we have
return "rewrite"
def decide_after_checks(state: RAGState) -> str:
"""Decide based on hallucination and quality checks."""
if state["hallucination_check"] == "not_grounded":
return "generate" # regenerate
if state["answer_quality"] == "not_useful":
count = state.get("query_rewrite_count", 0)
if count >= 2:
return "finish"
return "rewrite"
return "finish"
Notice the correction strategy varies by failure type. A hallucinated answer means the documents were fine but generation went wrong — so we regenerate with the same documents. A low-quality answer means we probably retrieved wrong documents — so we rewrite the query and re-retrieve. One caveat: with temperature 0, regeneration can reproduce the same output, so consider raising the temperature for retries or capping regeneration attempts; LangGraph’s recursion limit will stop a runaway loop either way.
Here’s the graph assembly. Each add_node registers a function. Each add_conditional_edges tells LangGraph how to route between nodes.
workflow = StateGraph(RAGState)
# Add all nodes
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("check_hallucination", check_hallucination)
workflow.add_node("check_answer_quality", check_answer_quality)
workflow.add_node("rewrite_query", rewrite_query)
workflow.add_node("direct_answer", direct_answer)
# Entry point: route the question
workflow.add_conditional_edges(
START,
route_question,
{
"retrieve": "retrieve",
"direct_answer": "direct_answer",
},
)
# After retrieval, grade documents
workflow.add_edge("retrieve", "grade_documents")
# After grading, decide: generate or rewrite
workflow.add_conditional_edges(
"grade_documents",
decide_after_grading,
{
"generate": "generate",
"rewrite": "rewrite_query",
},
)
# After rewriting, retrieve again
workflow.add_edge("rewrite_query", "retrieve")
# After generation, check hallucination
workflow.add_edge("generate", "check_hallucination")
# After hallucination check, check answer quality
workflow.add_edge("check_hallucination", "check_answer_quality")
# After quality check, decide: finish or retry
workflow.add_conditional_edges(
"check_answer_quality",
decide_after_checks,
{
"finish": END,
"generate": "generate",
"rewrite": "rewrite_query",
},
)
# Direct answers go straight to END
workflow.add_edge("direct_answer", END)
# Compile the graph
rag_agent = workflow.compile()
That’s the complete graph. Seven nodes, three routing functions, and conditional edges forming a self-correcting loop.
Running the RAG Agent in LangGraph
Let’s test with a domain question that triggers retrieval. The invoke method runs the full graph and returns the final state.
result = rag_agent.invoke({
"question": "How many vacation days do employees get?",
"documents": [],
"generation": "",
"query_rewrite_count": 0,
"relevance_decision": "",
"hallucination_check": "",
"answer_quality": "",
})
print(result["generation"])
Employees receive 20 vacation days per year. After 5 years of service, this increases to 25 days.
The agent retrieved the leave policy document, graded it as relevant, generated a grounded answer, and passed both the hallucination and quality checks.
A greeting that should skip retrieval entirely:
result = rag_agent.invoke({
"question": "Hello, how are you today?",
"documents": [],
"generation": "",
"query_rewrite_count": 0,
"relevance_decision": "",
"hallucination_check": "",
"answer_quality": "",
})
print(result["generation"])
Hello! I'm doing well, thanks for asking. How can I help you today?
This time the router sent the question straight to direct_answer, skipping the entire retrieval pipeline.
TIP: Use `rag_agent.get_graph().draw_mermaid()` to visualize your graph. It generates a Mermaid diagram showing all nodes and edges — invaluable for debugging flow issues.
Self-Corrective RAG Agent — The Loop That Fixes Itself
The self-corrective pattern is the most powerful idea here. When something goes wrong — irrelevant results, hallucinated answer, or a response that misses the point — the agent adjusts and retries.
The correction strategy depends on where the failure happened:
| Failure Point | Correction Strategy |
|---|---|
| No relevant documents | Rewrite query, then re-retrieve |
| Hallucinated answer | Re-generate with same documents |
| Answer misses the question | Rewrite query, re-retrieve, re-generate |
| Max retries exceeded | Return best-effort answer with disclaimer |
Why the different strategies? A hallucinated answer means the documents were fine but the LLM drifted. Re-generating usually fixes it. A low-quality answer means the wrong documents were retrieved. You need to go further back in the pipeline and search again.
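The failure-to-strategy mapping above can be condensed into a tiny dispatch function. This is an illustrative sketch, not part of the graph we built, and the failure labels are made up for the example:

```python
# Illustrative dispatch: map a failure point to a correction strategy,
# with a retry cap so the loop always terminates.
def correction_strategy(failure: str, retries: int,
                        max_retries: int = 2) -> str:
    if retries >= max_retries:
        return "return_best_effort"
    return {
        "no_relevant_docs": "rewrite_and_retrieve",
        "hallucination": "regenerate",
        "misses_question": "rewrite_and_retrieve",
    }.get(failure, "finish")

print(correction_strategy("hallucination", retries=0))     # regenerate
print(correction_strategy("no_relevant_docs", retries=2))  # return_best_effort
```

The retry cap is checked first so that every failure mode, no matter how persistent, eventually produces a terminal answer.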
KEY INSIGHT: Self-corrective RAG matches how experts actually research. They don’t stop at the first search result. They evaluate, refine their terms, and verify their conclusions. Building this loop into your agent makes it dramatically more reliable.
Adaptive RAG Agent — Routing to the Right Strategy
The agent we built handles one retrieval source. Real-world systems often need multiple sources depending on the question. Adaptive RAG extends the routing concept.
Instead of a binary “retrieve or don’t” decision, adaptive RAG routes to different strategies:
- Vector search — for semantic similarity questions (“What’s our leave policy?”)
- Web search — for current events not in your documents
- SQL query — for structured data (“How many employees joined last quarter?”)
- Direct LLM — for general knowledge questions
Here’s how to add web search as a fallback using Tavily, a search API designed for LLMs.
# pip install tavily-python
# os.environ["TAVILY_API_KEY"] = "your-tavily-key"
from langchain_community.tools.tavily_search import (
TavilySearchResults,
)
web_search_tool = TavilySearchResults(max_results=3)
class AdaptiveRouteDecision(BaseModel):
"""Route to the appropriate retrieval strategy."""
route: str = Field(
description="One of: 'vectorstore', 'web_search', "
"'direct_answer'"
)
adaptive_route_prompt = ChatPromptTemplate.from_messages([
("system",
"You are a query router for a company knowledge system.\n"
"Route to 'vectorstore' for company policy, benefits, "
"and HR questions.\n"
"Route to 'web_search' for current events, industry "
"trends, or information not in company docs.\n"
"Route to 'direct_answer' for greetings and general chat."
),
("human", "{question}"),
])
The web search node converts results into Document objects so they flow through the same grading and generation pipeline. Regardless of where documents come from, downstream processing stays the same.
def web_search(state: RAGState) -> dict:
"""Search the web as a fallback retrieval source."""
question = state["question"]
results = web_search_tool.invoke({"query": question})
web_docs = [
Document(
page_content=r["content"],
metadata={"source": r["url"]},
)
for r in results
]
return {"documents": web_docs}
RAG Agent with Source Citations
Let’s make the agent production-ready by adding source citations. Users need to verify answers, and citations give them a clear path to the original document.
The prompt tells the LLM to reference which documents it used. The function formats each document with its source metadata so the LLM can cite them properly.
citation_prompt = ChatPromptTemplate.from_messages([
("system",
"You are a company knowledge assistant. Answer the "
"question using ONLY the provided context.\n"
"Rules:\n"
"1. If the context answers the question, provide a "
"clear, concise response.\n"
"2. After your answer, list the sources you used.\n"
"3. If the context doesn't contain enough information, "
"say 'I don't have enough information to answer this "
"question' and suggest who to contact.\n"
"Format sources as: [Source: filename, section]"),
("human",
"Question: {question}\n\nContext:\n{context}"),
])
def generate_with_citations(state: RAGState) -> dict:
"""Generate an answer with source citations."""
question = state["question"]
docs = state["documents"]
context_parts = []
for doc in docs:
source = doc.metadata.get("source", "unknown")
section = doc.metadata.get("section", "")
context_parts.append(
f"[From {source}, section: {section}]\n"
f"{doc.page_content}"
)
context = "\n\n".join(context_parts)
chain = citation_prompt | llm | StrOutputParser()
answer = chain.invoke({
"question": question,
"context": context,
})
return {"generation": answer}
This produces answers like: “Employees receive 20 vacation days per year, increasing to 25 after 5 years. [Source: hr-policy.pdf, section: leave]”. Users can verify the answer against the original document.
WARNING: Don’t trust the LLM to get citations right 100% of the time. It sometimes attributes information to the wrong source. For critical applications, validate citations programmatically by checking which document actually contains the claimed text.
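Here’s one way to do that programmatic check, as a hedged sketch: extract `[Source: ...]` citations with a regex and flag any that don’t match a retrieved document’s `source` metadata. The plain-dict document shape below mirrors the metadata used in this post; adapt it to your own Document objects:

```python
import re

def validate_citations(answer: str, docs: list[dict]) -> list[str]:
    """Return cited sources that do NOT match any retrieved document."""
    known = {d["metadata"]["source"] for d in docs}
    # Capture the filename after "[Source:" up to the first comma or "]"
    cited = re.findall(r"\[Source:\s*([^,\]]+)", answer)
    return [src.strip() for src in cited if src.strip() not in known]

docs = [{"metadata": {"source": "hr-policy.pdf"}}]
answer = ("Employees get 20 vacation days. "
          "[Source: hr-policy.pdf, section: leave] "
          "[Source: payroll-faq.pdf, section: pay]")
print(validate_citations(answer, docs))  # ['payroll-faq.pdf']
```

An empty list means every citation points at a document the agent actually retrieved; anything else is a fabricated or misattributed source worth flagging to the user.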
RAG vs Fine-Tuning — When to Use Which
Should you use RAG or fine-tune the model on your data? They solve different problems.
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Best for | Factual Q&A over specific documents | Changing the model’s style or domain knowledge |
| Data freshness | Always current — update the vector store | Static — retrain to update |
| Cost | API calls per query + embedding storage | One-time training cost, cheaper inference |
| Hallucination risk | Lower — answers grounded in documents | Higher — generates from learned patterns |
| Setup complexity | Moderate — needs vector store | High — needs training pipeline |
Use RAG when users ask questions about specific documents that change over time. Use fine-tuning when you want the model to follow a particular style or domain vocabulary. Many production systems combine both: fine-tune for the domain’s language, then use RAG for factual grounding.
Common Mistakes and How to Fix Them
Mistake 1: Skipping the relevance grading step
# WRONG: Directly use all retrieved docs — no filtering
def naive_generate(state):
docs = state["documents"]
context = "\n".join(d.page_content for d in docs)
return generate_chain.invoke({
"question": state["question"],
"context": context,
})
Why it’s wrong: The retriever returns documents by similarity score, not actual relevance. A document about “company day celebrations” might score high for “how many vacation days” because of the word “days.” Without grading, that noise pollutes your context.
Fix: Use the grade_documents node to filter before generating.
Mistake 2: No recursion limit on the correction loop
# WRONG: loops forever if docs don't exist
def decide_after_grading(state):
if state["relevance_decision"] == "not_relevant":
return "rewrite" # no exit condition!
return "generate"
Why it’s wrong: If the user asks about a topic not in your documents, the agent rewrites and retrieves endlessly. Each loop costs API calls and time.
Fix: Track query_rewrite_count in the state. After 2 retries, proceed with what you have or return “I don’t know.”
Mistake 3: Mismatched embedding models
# WRONG: indexing and querying with different models
index_embeddings = OpenAIEmbeddings(
model="text-embedding-3-large"
)
vectorstore = Chroma.from_documents(docs, index_embeddings)
query_embeddings = OpenAIEmbeddings(
model="text-embedding-3-small"
)
retriever = vectorstore.as_retriever()
# Similarity scores will be meaningless!
Why it’s wrong: Different embedding models produce vectors in different spaces. Comparing a query vector from one model against document vectors from another gives meaningless similarity scores.
Fix: Always use the same embedding model for both indexing and querying.
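To make the failure concrete, here’s a plain-Python cosine similarity with a dimension guard. The guard catches the obvious case (text-embedding-3-large and text-embedding-3-small produce different vector dimensions by default); note that two different models with the *same* dimension would still produce meaningless scores, silently:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity with a guard against mismatched dimensions."""
    if len(a) != len(b):
        # Different dimensions usually means different embedding models
        raise ValueError("Vectors come from incompatible embedding spaces")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 2))  # 1.0
```

Vector stores like ChromaDB raise a similar dimension error at query time, which is your cue to check which embedding model built the index.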
Practice Exercise
Build an extended version of the RAG agent that adds a “confidence score” to each answer. The agent should rate its confidence as “high” (multiple relevant sources agree), “medium” (one source found), or “low” (answer is a best guess).
Click to see the solution
class ConfidenceState(TypedDict):
question: str
documents: List[Document]
generation: str
query_rewrite_count: int
relevance_decision: str
hallucination_check: str
answer_quality: str
confidence: str
def assess_confidence(state: ConfidenceState) -> dict:
"""Assess answer confidence based on source coverage."""
docs = state["documents"]
hallucination = state["hallucination_check"]
if hallucination == "not_grounded":
return {"confidence": "low"}
relevant_count = len(docs)
if relevant_count >= 2:
return {"confidence": "high"}
elif relevant_count == 1:
return {"confidence": "medium"}
else:
return {"confidence": "low"}
Add this node after `check_answer_quality` and before `END`. The confidence score tells users how much to trust the answer.
Complete Code
Click to expand the full script (copy-paste and run)
# Complete code: Building a RAG Agent with LangGraph
# Requires: pip install langgraph langchain langchain-openai
# langchain-community chromadb tiktoken
# Python 3.10+
import os
from typing import List, TypedDict
from langgraph.graph import StateGraph, END, START
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel, Field
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
# --- Documents and Vector Store ---
documents = [
Document(
page_content="Employees receive 20 vacation days per year. "
"After 5 years of service, this increases to 25 days.",
metadata={"source": "hr-policy.pdf", "section": "leave"},
),
Document(
page_content="The company matches 401(k) contributions up to "
"6% of the employee's salary. Vesting is immediate.",
metadata={"source": "benefits-guide.pdf",
"section": "retirement"},
),
Document(
page_content="Remote work is permitted 3 days per week. "
"Employees must be in-office Tuesdays and Thursdays.",
metadata={"source": "hr-policy.pdf",
"section": "remote-work"},
),
Document(
page_content="Performance reviews occur twice per year, in "
"June and December. Managers use a 5-point rating scale.",
metadata={"source": "hr-policy.pdf",
"section": "reviews"},
),
Document(
page_content="Health insurance covers medical, dental, and "
"vision. The company pays 80% of premiums for employees "
"and 60% for dependents.",
metadata={"source": "benefits-guide.pdf",
"section": "health"},
),
Document(
page_content="New employees complete a 90-day probation "
"period. During probation, either party may terminate "
"with one week's notice.",
metadata={"source": "hr-policy.pdf",
"section": "onboarding"},
),
]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=documents,
embedding=embeddings,
collection_name="hr_knowledge_base",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# --- State Definition ---
class RAGState(TypedDict):
question: str
documents: List[Document]
generation: str
query_rewrite_count: int
relevance_decision: str
hallucination_check: str
answer_quality: str
# --- LLM and Graders ---
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
class RouteDecision(BaseModel):
route: str = Field(
description="'retrieve' or 'direct_answer'"
)
route_prompt = ChatPromptTemplate.from_messages([
("system",
"You are a router. Route to 'retrieve' for company "
"policy/HR questions, 'direct_answer' for greetings."),
("human", "{question}"),
])
structured_llm_router = llm.with_structured_output(
RouteDecision
)
class RelevanceGrade(BaseModel):
score: str = Field(description="'yes' or 'no'")
grade_prompt = ChatPromptTemplate.from_messages([
("system",
"Grade if this document is relevant to the question. "
"Binary 'yes' or 'no'."),
("human", "Question: {question}\nDocument: {document}"),
])
relevance_grader = grade_prompt | llm.with_structured_output(
RelevanceGrade
)
generate_prompt = ChatPromptTemplate.from_messages([
("system",
"Answer using ONLY the provided context. Be concise."),
("human", "Question: {question}\nContext:\n{context}"),
])
generate_chain = generate_prompt | llm | StrOutputParser()
class HallucinationGrade(BaseModel):
score: str = Field(description="'yes' or 'no'")
hallucination_prompt = ChatPromptTemplate.from_messages([
("system",
"Is this generation grounded in the documents? "
"'yes' or 'no'."),
("human",
"Documents:\n{documents}\nGeneration: {generation}"),
])
hallucination_grader = (
hallucination_prompt
| llm.with_structured_output(HallucinationGrade)
)
class AnswerGrade(BaseModel):
score: str = Field(description="'yes' or 'no'")
answer_prompt = ChatPromptTemplate.from_messages([
("system",
"Does this answer address the question? 'yes' or 'no'."),
("human",
"Question: {question}\nAnswer: {generation}"),
])
answer_grader = (
answer_prompt
| llm.with_structured_output(AnswerGrade)
)
rewrite_prompt = ChatPromptTemplate.from_messages([
("system",
"Rewrite this question for better search results. "
"Return only the rewritten question."),
("human", "Original question: {question}"),
])
rewrite_chain = rewrite_prompt | llm | StrOutputParser()
# --- Node Functions ---
def retrieve(state: RAGState) -> dict:
docs = retriever.invoke(state["question"])
return {"documents": docs}
def grade_documents(state: RAGState) -> dict:
question = state["question"]
relevant = []
for doc in state["documents"]:
result = relevance_grader.invoke({
"question": question,
"document": doc.page_content,
})
if result.score == "yes":
relevant.append(doc)
decision = "relevant" if relevant else "not_relevant"
return {"documents": relevant,
"relevance_decision": decision}
def generate(state: RAGState) -> dict:
context = "\n\n".join(
d.page_content for d in state["documents"]
)
answer = generate_chain.invoke({
"question": state["question"],
"context": context,
})
return {"generation": answer}
def check_hallucination(state: RAGState) -> dict:
doc_text = "\n\n".join(
d.page_content for d in state["documents"]
)
result = hallucination_grader.invoke({
"documents": doc_text,
"generation": state["generation"],
})
check = (
"grounded" if result.score == "yes"
else "not_grounded"
)
return {"hallucination_check": check}
def check_answer_quality(state: RAGState) -> dict:
result = answer_grader.invoke({
"question": state["question"],
"generation": state["generation"],
})
quality = (
"useful" if result.score == "yes"
else "not_useful"
)
return {"answer_quality": quality}
def rewrite_query(state: RAGState) -> dict:
rewritten = rewrite_chain.invoke({
"question": state["question"]
})
count = state.get("query_rewrite_count", 0)
return {"question": rewritten,
"query_rewrite_count": count + 1}
def direct_answer(state: RAGState) -> dict:
response = llm.invoke(
f"Answer briefly: {state['question']}"
)
return {"generation": response.content}
# --- Routing Functions ---
def route_question(state: RAGState) -> str:
result = structured_llm_router.invoke(
route_prompt.invoke(
{"question": state["question"]}
)
)
return result.route
def decide_after_grading(state: RAGState) -> str:
if state["relevance_decision"] == "relevant":
return "generate"
count = state.get("query_rewrite_count", 0)
if count >= 2:
return "generate"
return "rewrite"
def decide_after_checks(state: RAGState) -> str:
    count = state.get("query_rewrite_count", 0)
    if state["hallucination_check"] == "not_grounded":
        # With temperature=0 a second generation pass repeats
        # the same answer, so rewrite the query instead of
        # looping on generate -- capped like everywhere else.
        if count >= 2:
            return "finish"
        return "rewrite"
    if state["answer_quality"] == "not_useful":
        if count >= 2:
            return "finish"
        return "rewrite"
    return "finish"
# --- Build Graph ---
workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("check_hallucination", check_hallucination)
workflow.add_node("check_answer_quality", check_answer_quality)
workflow.add_node("rewrite_query", rewrite_query)
workflow.add_node("direct_answer", direct_answer)
workflow.add_conditional_edges(
START, route_question,
{"retrieve": "retrieve",
"direct_answer": "direct_answer"},
)
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
"grade_documents", decide_after_grading,
{"generate": "generate", "rewrite": "rewrite_query"},
)
workflow.add_edge("rewrite_query", "retrieve")
workflow.add_edge("generate", "check_hallucination")
workflow.add_edge(
"check_hallucination", "check_answer_quality"
)
workflow.add_conditional_edges(
"check_answer_quality", decide_after_checks,
{"finish": END, "generate": "generate",
"rewrite": "rewrite_query"},
)
workflow.add_edge("direct_answer", END)
rag_agent = workflow.compile()
# --- Run ---
result = rag_agent.invoke({
"question": "How many vacation days do employees get?",
"documents": [],
"generation": "",
"query_rewrite_count": 0,
"relevance_decision": "",
"hallucination_check": "",
"answer_quality": "",
})
print(result["generation"])
Summary
You’ve built a complete RAG agent with LangGraph. It goes far beyond simple retrieve-and-generate by adding intelligent routing, document grading, hallucination detection, and self-correction.
The key patterns to remember:
- Route first — don’t retrieve when you don’t need to
- Grade documents — never trust raw retriever results blindly
- Check your answers — hallucination and quality checks cost pennies and save trust
- Limit your loops — always set a maximum retry count
- Cite sources — traceability builds user confidence
These patterns compose well. You can add web search, multi-source retrieval, or confidence scoring without restructuring the graph. Each new capability is just another node with the right conditional edges.
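The "just another node" claim is concrete: in LangGraph, a node is any function that takes the state and returns a partial update, which the graph merges for you. A minimal sketch of what a hypothetical web-search fallback node could look like — `search_web` is a stand-in helper, not a real API, and real documents would be `Document` objects rather than strings:

```python
# Sketch: a new capability is one function returning a partial
# state update. `search_web` is a hypothetical placeholder --
# in practice you would wire in Tavily, SerpAPI, etc.
def search_web(query: str) -> list[str]:
    # Placeholder: pretend we called a web search API.
    return [f"Web result for: {query}"]

def web_search_node(state: dict) -> dict:
    """Node: fetch web results and merge them into `documents`."""
    results = search_web(state["question"])
    # Return ONLY the keys we changed; LangGraph merges this
    # partial update into the full state.
    return {"documents": state.get("documents", []) + results}

update = web_search_node({"question": "latest AI news", "documents": []})
print(update["documents"])  # → ['Web result for: latest AI news']
```

Wiring it in is then one `workflow.add_node("web_search", web_search_node)` plus a conditional edge from the router.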
Frequently Asked Questions
How much does running a RAG agent cost compared to a simple LLM call?
Each invocation makes multiple LLM calls: routing, grading per document, generation, hallucination check, and quality check. With GPT-4o-mini, a typical query costs roughly 5-10x more than a single call. That’s still under $0.01 per query for most cases. The accuracy improvement usually justifies the cost.
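To make that multiplier concrete, here is a back-of-the-envelope estimate. The per-call token counts are guesses for a short HR question, and the gpt-4o-mini prices ($0.15 per 1M input tokens, $0.60 per 1M output tokens at the time of writing) are assumptions — substitute your own numbers:

```python
# Rough per-query cost across the agent's LLM calls.
# Prices are per token (assumed gpt-4o-mini pricing).
PRICE_IN = 0.15 / 1_000_000
PRICE_OUT = 0.60 / 1_000_000

# Guessed (input_tokens, output_tokens) per call.
calls = {
    "route": (80, 5),
    "grade_doc_1": (200, 5),
    "grade_doc_2": (200, 5),
    "grade_doc_3": (200, 5),
    "generate": (600, 80),
    "hallucination_check": (700, 5),
    "answer_check": (150, 5),
}

total = sum(i * PRICE_IN + o * PRICE_OUT for i, o in calls.values())
print(f"~${total:.5f} per query")  # well under a cent
```

Seven calls instead of one, yet the total stays a fraction of a cent — the "5-10x" multiplier sounds worse than it costs.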
Can I use open-source models instead of OpenAI?
Yes. Replace ChatOpenAI with any LangChain-compatible chat model. Ollama works well for local models — use ChatOllama(model="llama3"). The grading nodes need reliable instruction-following, so test your model before deploying. Smaller models sometimes struggle with structured output.
How do I handle documents that are too long for the context window?
Split documents into chunks before indexing. LangChain’s RecursiveCharacterTextSplitter is the standard choice. Set chunk size to 500-1000 characters with 100-200 character overlap.
from langchain_text_splitters import (
RecursiveCharacterTextSplitter,
)
splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=150,
)
chunks = splitter.split_documents(documents)
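To see what the overlap parameter buys you, here is a minimal sliding-window sketch — not the real RecursiveCharacterTextSplitter, which splits on separators like paragraphs and sentences first, but the same overlap idea:

```python
def naive_chunk(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size windows; consecutive chunks share `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_chunk("a" * 100, chunk_size=40, overlap=10)
print([len(c) for c in chunks])  # → [40, 40, 40, 10]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, so retrieval doesn't lose it.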
What’s the difference between Corrective RAG and Adaptive RAG?
Corrective RAG focuses on fixing bad retrievals — it grades documents and rewrites queries when results are poor. Adaptive RAG adds intelligent routing — it picks the best retrieval strategy based on question type. Our agent combines both. The LangGraph docs call these “CRAG” and “Adaptive RAG” respectively.
How do I evaluate whether my RAG agent is performing well?
Track three metrics: retrieval precision (percentage of relevant retrieved docs), answer faithfulness (is the answer grounded), and answer relevance (does it address the question). Tools like RAGAS and LangSmith automate evaluation for all three. Start by logging every run and reviewing edge cases.
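These metrics can be computed by hand on a small labeled set before reaching for RAGAS. A sketch with hypothetical document IDs and human relevance labels — retrieval precision shown here; faithfulness and answer relevance are typically LLM-judged yes/no verdicts averaged the same way:

```python
def retrieval_precision(retrieved_ids: list[str],
                        relevant_ids: list[str]) -> float:
    """Fraction of retrieved docs a human marked as relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for d in retrieved_ids if d in relevant)
    return hits / len(retrieved_ids)

# Hypothetical run: the retriever returned 3 docs, 2 were relevant.
p = retrieval_precision(["doc1", "doc2", "doc5"], ["doc1", "doc2"])
print(f"precision@3 = {p:.2f}")  # → precision@3 = 0.67
```

Logging `state["documents"]` on every run gives you the retrieved IDs for free; the relevance labels are the part that takes human effort.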
References
- LangGraph documentation — Agentic RAG tutorial. Link
- LangGraph documentation — Adaptive RAG tutorial. Link
- LangChain blog — Self-Reflective RAG with LangGraph. Link
- Yan, S. et al. — Corrective Retrieval Augmented Generation (CRAG). arXiv:2401.15884 (2024).
- Jeong, S. et al. — Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. arXiv:2403.14403 (2024).
- Lewis, P. et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS (2020). Link
- ChromaDB documentation. Link
- OpenAI Embeddings documentation. Link
- LangChain documentation — Text Splitters. Link
- Es, S. et al. — RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217 (2023).
Reviewed: March 2026 | LangGraph version: 0.4+ | LangChain version: 0.3+