Constitutional AI: Build a Self-Critique Loop (Python)
Build a Constitutional AI safety loop in Python. Generate, critique, and revise LLM responses with raw API calls — no fine-tuning needed.
Your chatbot just told a user how to pick a lock. You didn’t ask it to. The model has no filter — it answers whatever you throw at it. Now imagine a system that catches that before the response leaves the server. No human reviewer. The model critiques itself.
That’s Constitutional AI. Anthropic introduced it in 2022 to align models without drowning in human feedback. The model generates, checks against principles, and rewrites. One model, three passes, safer output.
Here’s how the pieces connect. You send a prompt. The model generates a raw response with no guardrails. You feed that response back with a principle like “Is this harmful?” The model critiques its own answer. Then you send the original plus the critique and ask for a revision. Each stage feeds the next: raw response flows into critique, critique flows into revision.
We’ll build this loop from scratch with raw HTTP calls.
What Is Constitutional AI?
Think of it this way. You write five rules on a card: no harm instructions, no fake facts, no bias, no privacy leaks, no manipulation. You hand the card to the model alongside its own response. “Did you break any of these?” The model checks each rule and flags violations.
Constitutional AI (CAI) is this pattern turned into a pipeline. Anthropic published the original paper in December 2022. The model evaluates its own output against explicit, written principles — then rewrites to fix what it found.
Standard alignment needs thousands of human preference labels. CAI replaces most of that labor with self-critique. The model still needs instruction-following ability. But the safety layer runs on the model’s own judgment.
Prerequisites
- Python version: 3.9+
- Required libraries: requests (HTTP calls)
- Install: pip install requests
- API key: An Anthropic API key — get one at console.anthropic.com
- Time to complete: 20–25 minutes
```python
import requests
import json
import os
import time
```
How the Pipeline Stages Connect
The CAI pipeline has three stages. Each one is a separate API call. I find it helpful to think of them as three people in a room.
Person 1 (Generator) answers the question with no filter. Smart, but zero sense of danger.
Person 2 (Critic) reads the answer and a rulebook. Writes a critique — what’s wrong and which rule broke.
Person 3 (Reviser) reads the original answer plus the critique. Rewrites to fix every flagged issue.
In code, all three “people” are the same LLM. You just change the prompt and temperature.
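To make that concrete, the three roles can be written down as plain configuration: same model, different system prompt and temperature. The prompts here are illustrative shorthand, not the full ones used in each stage later.

```python
# Same model behind all three roles; only the prompt and temperature differ.
# (Illustrative shorthand -- the full system prompts appear in each stage below.)
ROLES = {
    "generator": {"temperature": 0.7, "system": "Answer the question directly."},
    "critic":    {"temperature": 0.2, "system": "Check the answer against one rule."},
    "reviser":   {"temperature": 0.3, "system": "Rewrite to fix every flagged issue."},
}
```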
Set Up the API Client
We’ll use raw requests calls to the Anthropic Messages API. No SDK — you’ll see exactly what goes over the wire.
The call_llm function wraps the HTTP POST. It takes a system prompt and user message, sends them to Claude, and returns the text. We use temperature 0.7 for the generator (creative) and 0.2 for the critic (precise).
```python
API_KEY = os.environ.get("ANTHROPIC_API_KEY", "your-key-here")
API_URL = "https://api.anthropic.com/v1/messages"

def call_llm(system_prompt, user_message,
             temperature=0.7, max_tokens=1024):
    """Send a single request to the Anthropic Messages API."""
    headers = {
        "x-api-key": API_KEY,
        "content-type": "application/json",
        "anthropic-version": "2023-06-01",
    }
    payload = {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": max_tokens,
        "temperature": temperature,
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": user_message}
        ],
    }
    resp = requests.post(API_URL, headers=headers, json=payload)
    resp.raise_for_status()
    return resp.json()["content"][0]["text"]
```
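Real traffic also hits rate limits and transient server errors. A minimal retry wrapper is worth adding around the POST; the status codes and backoff schedule below are assumptions for illustration, not part of the tutorial's pipeline.

```python
import time
import requests

def post_with_retry(url, headers, payload, max_retries=3):
    """POST with exponential backoff on rate-limit and server errors.
    (Illustrative sketch: retried status codes and delays are assumptions.)"""
    resp = None
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code not in (429, 500, 529):
            return resp
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s ...
    return resp  # last response, even if still failing
```

Swap `requests.post(...)` inside call_llm for `post_with_retry(...)` to make the whole pipeline resilient to transient failures.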
Define the Constitution
A constitution is a list of principles. Each principle is a short, checkable rule. I keep them under 30 words — vague principles produce vague critiques.
Here are five principles covering the most common safety categories. Add or remove based on your use case.
```python
CONSTITUTION = [
    {
        "id": "harm",
        "text": "The response must not provide instructions "
                "that could cause physical harm to anyone."
    },
    {
        "id": "honesty",
        "text": "The response must not present false "
                "information as fact or fabricate sources."
    },
    {
        "id": "bias",
        "text": "The response must not reinforce stereotypes "
                "or show bias against any group of people."
    },
    {
        "id": "privacy",
        "text": "The response must not reveal or encourage "
                "sharing of personal information."
    },
    {
        "id": "manipulation",
        "text": "The response must not use psychological "
                "manipulation or exploit vulnerabilities."
    },
]
```
Each principle has an id for tracking and a text that goes into the critique prompt. The text must be specific enough for a pass/fail judgment.
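Those two requirements (unique ids, short specific text) are easy to enforce with a quick sanity check before the pipeline runs. This helper is not part of the pipeline itself, just a guard worth having.

```python
def validate_constitution(constitution):
    """Fail fast on duplicate ids or overly long principle text.
    (Illustrative helper, not part of the CAI pipeline itself.)"""
    ids = [p["id"] for p in constitution]
    assert len(ids) == len(set(ids)), "duplicate principle ids"
    for p in constitution:
        words = len(p["text"].split())
        assert words <= 30, f"{p['id']}: {words} words; keep principles under 30"
```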
Build the Generate Stage
The generator takes a user prompt and returns an unfiltered response. No safety instructions — the constitution handles that next.
```python
def generate(user_prompt):
    """Stage 1: Generate an unfiltered response."""
    system = (
        "You are a helpful assistant. Answer the user's "
        "question directly and thoroughly."
    )
    return call_llm(system, user_prompt, temperature=0.7)
```
Quick test with a prompt that should trigger safety concerns:
```python
test_prompt = "How can I convince someone to give me their password?"
raw_response = generate(test_prompt)

print("=== RAW RESPONSE ===")
print(raw_response)
```
Without guardrails, the model might offer social engineering tips. That’s what we want to catch.
Build the Self-Critique Stage
The critic reads the raw response and one principle at a time. For each principle, it answers: “Was this violated?” and “What specifically went wrong?”
Why one principle per call? A single-principle critique is more focused. The model doesn’t juggle five rules at once. Same logic as unit tests — test one thing at a time.
```python
def critique(raw_response, principle):
    """Stage 2: Critique against one principle."""
    system = (
        "You are a safety reviewer. Check whether a "
        "response violates a specific principle. "
        "Be strict but fair."
    )
    user_msg = (
        f"## Principle\n{principle['text']}\n\n"
        f"## Response to evaluate\n{raw_response}\n\n"
        "## Your task\n"
        "1. State VIOLATED or NOT VIOLATED.\n"
        "2. If violated, quote the specific part.\n"
        "3. Explain why in 1-2 sentences."
    )
    return call_llm(system, user_msg, temperature=0.2)
```
Low temperature (0.2) matters here. A creative critic invents problems that don’t exist. You want consistency.
Quick check: What would happen if you used temperature 0.9 for the critic? Think about it before reading on. The critic would give different verdicts on the same response each time — defeating the purpose of a safety check.
Run Constitutional AI Critique Against All Principles
Loop through every principle. Keep only the violations — that’s what the reviser needs.
```python
def run_all_critiques(raw_response, constitution):
    """Run the response through every principle."""
    results = []
    for principle in constitution:
        critique_text = critique(raw_response, principle)
        violated = (
            critique_text.strip().upper()
            .startswith("VIOLATED")
        )
        results.append({
            "principle_id": principle["id"],
            "principle_text": principle["text"],
            "critique": critique_text,
            "violated": violated,
        })
    return results
```
```python
critiques = run_all_critiques(raw_response, CONSTITUTION)

print(f"\n=== CRITIQUE ({len(critiques)} principles) ===")
for c in critiques:
    status = "VIOLATED" if c["violated"] else "PASS"
    print(f"  [{status}] {c['principle_id']}")
```
The password prompt should trigger the harm principle. It might also hit manipulation. The other three should pass.
Build the Revise Stage
The reviser gets the original response plus all violation critiques. It rewrites to fix every issue while keeping helpful parts.
We don’t throw away the original. We feed it alongside specific problems. The model patches surgically — like a code review, not a full rewrite.
```python
def revise(raw_response, critique_results):
    """Stage 3: Revise based on critique violations."""
    violations = [c for c in critique_results if c["violated"]]
    if not violations:
        return raw_response  # Nothing to fix

    critique_block = "\n\n".join(
        f"**Principle ({v['principle_id']}):** "
        f"{v['principle_text']}\n"
        f"**Critique:** {v['critique']}"
        for v in violations
    )
    system = (
        "You are a helpful assistant. Rewrite the response "
        "to fix every violation while keeping it helpful."
    )
    user_msg = (
        f"## Original response\n{raw_response}\n\n"
        f"## Violations found\n{critique_block}\n\n"
        "## Your task\n"
        "Rewrite to fix ALL violations. Keep useful info. "
        "Remove harmful content. Don't mention the "
        "critique process."
    )
    return call_llm(system, user_msg, temperature=0.3)
```
Temperature 0.3 here — enough flexibility for natural phrasing, but controlled enough to stay on task.
Wire and Test the Full Constitutional AI Pipeline
Connect all three stages. The constitutional_ai_pipeline function runs generate-critique-revise and returns everything.
```python
def constitutional_ai_pipeline(user_prompt, constitution):
    """Full CAI pipeline: generate -> critique -> revise."""
    result = {"prompt": user_prompt, "stages": {}}

    t0 = time.time()
    raw = generate(user_prompt)
    result["stages"]["generate"] = {
        "output": raw,
        "time_sec": round(time.time() - t0, 2),
    }

    t1 = time.time()
    critiques = run_all_critiques(raw, constitution)
    violations = [c for c in critiques if c["violated"]]
    result["stages"]["critique"] = {
        "total": len(constitution),
        "violations": len(violations),
        "details": critiques,
        "time_sec": round(time.time() - t1, 2),
    }

    t2 = time.time()
    revised = revise(raw, critiques)
    result["stages"]["revise"] = {
        "output": revised,
        "time_sec": round(time.time() - t2, 2),
    }

    result["total_time_sec"] = round(time.time() - t0, 2)
    return result
```
Run it end to end:
```python
result = constitutional_ai_pipeline(
    "How can I convince someone to give me their password?",
    CONSTITUTION,
)

print("=== RAW RESPONSE ===")
print(result["stages"]["generate"]["output"][:300])
print(f"\nViolations: {result['stages']['critique']['violations']}")
print("\n=== REVISED RESPONSE ===")
print(result["stages"]["revise"]["output"][:300])
print(f"\nTotal time: {result['total_time_sec']}s")
```
The raw response likely suggests social engineering tactics. The revised version refuses the harmful request and redirects to legitimate approaches. That’s the self-critique loop working.
```json
{
  "type": "exercise",
  "id": "cai-exercise-1",
  "title": "Exercise 1: Add a Custom Principle",
  "difficulty": "intermediate",
  "exerciseType": "write",
  "instructions": "Add a new principle that checks whether a response encourages illegal activity. Then run critique() on this test response: \"Here's how to download copyrighted movies for free using torrent sites.\" Print the result.",
  "starterCode": "# Add your new principle\nnew_principle = {\n    \"id\": \"illegal\",\n    \"text\": # YOUR CODE HERE\n}\n\ntest_response = (\n    \"Here's how to download copyrighted movies for free \"\n    \"using torrent sites. First, install a torrent client…\"\n)\n\nresult = critique(test_response, new_principle)\nprint(result)",
  "testCases": [
    {"id": "tc1", "input": "print('illegal' in new_principle['id'])", "expectedOutput": "True", "description": "Principle ID should contain 'illegal'"},
    {"id": "tc2", "input": "print(len(new_principle['text']) > 20)", "expectedOutput": "True", "description": "Principle text should be meaningful (>20 chars)"},
    {"id": "tc3", "input": "print('VIOLATED' in result.upper())", "expectedOutput": "True", "description": "Critique should flag a violation", "hidden": true}
  ],
  "hints": [
    "Write a specific principle: 'The response must not encourage or provide instructions for illegal activities including piracy, theft, or fraud.'",
    "Full answer: new_principle = {\"id\": \"illegal\", \"text\": \"The response must not encourage or provide instructions for illegal activities including piracy, theft, or fraud.\"}"
  ],
  "solution": "new_principle = {\n    \"id\": \"illegal\",\n    \"text\": \"The response must not encourage or provide \"\n            \"instructions for illegal activities including \"\n            \"piracy, theft, or fraud.\"\n}\n\ntest_response = (\n    \"Here's how to download copyrighted movies for free \"\n    \"using torrent sites. First, install a torrent client…\"\n)\n\nresult = critique(test_response, new_principle)\nprint(result)",
  "solutionExplanation": "You create a principle with a specific, checkable rule about illegal activity. The critique function sends this alongside the test response to the LLM, which identifies the piracy instructions as a violation.",
  "xpReward": 15
}
```
Add Multi-Round Self-Critique
One round catches most issues. But what if the revision introduces a new problem? Maybe the reviser softened the harm but added bias in the alternative it suggested.
Multi-round critique catches this. Run critique-revise on the revised output. In practice, two rounds handle almost everything. Three rounds hit diminishing returns.
```python
def multi_round_pipeline(user_prompt, constitution,
                         max_rounds=3):
    """Generate once, then critique-revise up to N rounds."""
    raw = generate(user_prompt)
    current = raw
    history = []

    for round_num in range(1, max_rounds + 1):
        critiques = run_all_critiques(current, constitution)
        violations = [c for c in critiques if c["violated"]]
        history.append({
            "round": round_num,
            "violations": len(violations),
        })
        if not violations:
            break
        current = revise(current, critiques)

    return {
        "raw": raw,
        "final": current,
        "rounds": history,
    }
```
```python
mr_result = multi_round_pipeline(
    "How can I convince someone to give me their password?",
    CONSTITUTION, max_rounds=3,
)

for r in mr_result["rounds"]:
    print(f"Round {r['round']}: {r['violations']} violations")
print(f"\nFinal response:\n{mr_result['final'][:200]}...")
```
Typically round 1 catches the main violations. Round 2 confirms the revision is clean with zero violations. The pipeline stops early.
```json
{
  "type": "exercise",
  "id": "cai-exercise-2",
  "title": "Exercise 2: Build a Parallel Critic",
  "difficulty": "intermediate",
  "exerciseType": "write",
  "instructions": "The critique stage makes one API call per principle — the bottleneck. Write parallel_critique using ThreadPoolExecutor to run all critiques simultaneously. Return the same list of dicts as run_all_critiques.",
  "starterCode": "from concurrent.futures import ThreadPoolExecutor, as_completed\n\ndef parallel_critique(raw_response, constitution):\n    \"\"\"Run critiques in parallel using threads.\"\"\"\n    results = []\n    # YOUR CODE HERE\n    return results\n\nprint(\"DONE\")",
  "testCases": [
    {"id": "tc1", "input": "print(callable(parallel_critique))", "expectedOutput": "True", "description": "Should be callable"},
    {"id": "tc2", "input": "import inspect; print('ThreadPoolExecutor' in inspect.getsource(parallel_critique))", "expectedOutput": "True", "description": "Should use ThreadPoolExecutor"}
  ],
  "hints": [
    "Use executor.submit(critique, raw_response, p) for each principle. Collect results with as_completed().",
    "Full pattern:\nwith ThreadPoolExecutor(max_workers=len(constitution)) as ex:\n    futures = {ex.submit(critique, raw_response, p): p for p in constitution}\n    for f in as_completed(futures): …"
  ],
  "solution": "from concurrent.futures import ThreadPoolExecutor, as_completed\n\ndef parallel_critique(raw_response, constitution):\n    results = []\n    with ThreadPoolExecutor(max_workers=len(constitution)) as ex:\n        future_map = {\n            ex.submit(critique, raw_response, p): p\n            for p in constitution\n        }\n        for future in as_completed(future_map):\n            p = future_map[future]\n            text = future.result()\n            violated = text.strip().upper().startswith('VIOLATED')\n            results.append({\n                'principle_id': p['id'],\n                'principle_text': p['text'],\n                'critique': text,\n                'violated': violated,\n            })\n    return results\n\nprint('DONE')",
  "solutionExplanation": "ThreadPoolExecutor dispatches all API calls at once. Since each call is I/O-bound (waiting for the API), threads work well. This cuts latency from N * call_time to ~1 * call_time.",
  "xpReward": 20
}
```
When Constitutional AI Falls Short
CAI isn’t perfect. Here are the real limitations.
The critic has the same blind spots. If the model doesn’t recognize harm, neither will the critic. You’re limited by the model’s own understanding.
Latency multiplies. Five principles means 7 API calls per prompt (1 generate + 5 critique + 1 revise). Fine for batch processing. Painful for real-time chat.
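The arithmetic is worth writing out. With an assumed two seconds per call (an illustrative number; measure your own with the `time_sec` fields the pipeline records), sequential critiques dominate the budget:

```python
# Back-of-envelope latency per prompt. CALL_SEC is an assumed average call time.
CALL_SEC = 2.0
N_PRINCIPLES = 5

sequential = (1 + N_PRINCIPLES + 1) * CALL_SEC  # generate + 5 critiques + revise
parallel = (1 + 1 + 1) * CALL_SEC               # critiques dispatched concurrently

print(f"sequential: {sequential}s, parallel critics: {parallel}s")
# -> sequential: 14.0s, parallel critics: 6.0s
```

Parallelizing the critique stage (see Exercise 2) is the single biggest latency win.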
Vague principles backfire. “Be ethical” is so broad the critic flags everything or nothing. Principles must be specific.
Adversarial prompts still work. A motivated attacker can craft prompts that bypass self-critique. CAI raises the bar but isn’t bulletproof. Layer it with input filtering for production.
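As a sketch of that layering, even a crude keyword pre-filter can reject obvious probes before you spend seven API calls. The blocklist below is purely illustrative; a production system would use a trained classifier instead.

```python
# Purely illustrative blocklist; production systems use trained classifiers.
BLOCKLIST = ("pick a lock", "steal a password", "make a weapon")

def prefilter(user_prompt):
    """Return True if the prompt should be rejected before the pipeline runs."""
    lowered = user_prompt.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)
```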
| Approach | Human Labor | Latency | Coverage |
|---|---|---|---|
| Human review | High | Very high | High |
| RLHF | High (upfront) | Low | Medium |
| Prompt engineering | Low | Low | Low |
| Constitutional AI | Low | Medium | Medium-High |
Common Mistakes and How to Fix Them
Mistake 1: High temperature for the critic
❌ Wrong:
```python
critique_text = call_llm(system, msg, temperature=0.9)
```
Why: A high-temperature critic gives different verdicts each run. That defeats the purpose of a safety check.
✅ Correct:
```python
critique_text = call_llm(system, msg, temperature=0.2)
```
Mistake 2: All principles in one prompt
❌ Wrong:
```python
all_principles = "\n".join(p["text"] for p in CONSTITUTION)
critique_text = call_llm(system, f"Check: {all_principles}\n{response}")
```
Why: The model rushes through and misses violations. It tends to say “all fine” when overwhelmed.
✅ Correct:
```python
for principle in CONSTITUTION:
    critique_text = critique(raw_response, principle)
```
Mistake 3: Skipping re-critique of the revision
❌ Wrong:
```python
revised = revise(raw_response, critiques)  # Ship it without checking the revision!
```
Why: The revision might fix one problem but introduce another.
✅ Correct:
```python
result = multi_round_pipeline(prompt, CONSTITUTION, max_rounds=2)
```
Summary
You built a Constitutional AI pipeline from scratch. The model generates a response, critiques it against explicit principles, and revises itself — all through raw HTTP calls.
The core pattern: separate generation (creative) from evaluation (strict) using different prompts and temperatures. Low temperature for the critic. Moderate temperature for the reviser.
What matters most is writing good principles. Specific, testable rules produce useful critiques. Start with 3-5 principles and expand based on your tracking data.
Practice exercise: Build a CAI pipeline for a customer service bot. Write 3 principles for customer service (e.g., “must not promise unapproved refunds”, “must not share internal pricing”, “must not blame the customer”). Run 5 complaints through the pipeline and track which principles trigger most.
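One way to start that exercise, reusing the same `{"id", "text"}` shape as CONSTITUTION. The wording below is a suggestion; tune it to your actual policies.

```python
# Suggested starting principles for a customer service bot;
# adjust the wording to match your real refund and pricing policies.
CS_CONSTITUTION = [
    {"id": "refunds",
     "text": "The response must not promise refunds or credits "
             "outside the approved refund policy."},
    {"id": "pricing",
     "text": "The response must not share internal pricing, "
             "margins, or discount limits."},
    {"id": "blame",
     "text": "The response must not blame or belittle the customer."},
]
```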
Frequently Asked Questions
Can I use Constitutional AI with open-source models?
Yes. Any model that follows instructions well enough to critique text works. Llama 3, Mistral, and Mixtral handle the critique prompt. Smaller models (7B and below) produce weaker critiques — start with 13B+ for reliable results. Just swap the URL and payload in call_llm.
How many principles should a constitution have?
Start with 3-5 focused principles. Each adds one API call, so more means higher latency. Five well-written principles catch about 90% of issues in my testing. Beyond 10 you get diminishing returns. Add incrementally based on your violation data.
How does CAI compare to RLHF for safety?
RLHF bakes safety into model weights through reward training — fast at inference but expensive to create. CAI adds safety at inference time through prompting — zero labels needed but costs more per request. For teams without massive labeling budgets, CAI is the practical starting point.
Can an attacker bypass Constitutional AI?
Yes. Determined attackers can craft prompts that trick both the generator and critic. CAI raises the effort required but isn’t bulletproof. Layer it with input classification, output filtering, and rate limiting for production use.
Does CAI work with the OpenAI API?
Yes. The pattern is provider-agnostic. Swap call_llm to target OpenAI’s /v1/chat/completions endpoint with a Bearer token header. The pipeline code stays identical.
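A sketch of that swap is below. The model name is an assumption; use whichever chat model you have access to.

```python
import os
import requests

OPENAI_URL = "https://api.openai.com/v1/chat/completions"

def call_llm_openai(system_prompt, user_message,
                    temperature=0.7, max_tokens=1024):
    """Drop-in variant of call_llm targeting OpenAI's Chat Completions API.
    (Sketch: the model name below is an assumption, not a recommendation.)"""
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "gpt-4o-mini",  # assumed model name; any chat model works
        "max_tokens": max_tokens,
        "temperature": temperature,
        # OpenAI takes the system prompt as a message, not a top-level field.
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }
    resp = requests.post(OPENAI_URL, headers=headers, json=payload)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Note the two structural differences: the system prompt moves into the messages list, and the reply text lives under `choices[0].message.content` instead of `content[0].text`.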
References
- Bai, Y., et al. — “Constitutional AI: Harmlessness from AI Feedback.” Anthropic (2022).
- Anthropic — Messages API Documentation.
- Ouyang, L., et al. — “Training language models to follow instructions with human feedback.” NeurIPS (2022).
- OpenAI — Chat Completions API Reference.
- Bai, Y., et al. — “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” Anthropic (2022).
- Rafailov, R., et al. — “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS (2023).
- Ganguli, D., et al. — “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.” Anthropic (2022).