Constitutional AI: Build a Self-Critique Loop (Python)
Build a Constitutional AI safety loop in Python. Generate, critique, and revise LLM responses with raw API calls — no fine-tuning needed.
Your chatbot just told a user how to pick a lock. You didn’t ask it to. The model has no filter — it answers whatever you throw at it. Now imagine a system that catches that before the response leaves the server. No human reviewer. The model critiques itself.
That’s Constitutional AI. Anthropic introduced it in 2022 to align models without drowning in human feedback. The model generates, checks against principles, and rewrites. One model, three passes, safer output.
Here’s how the pieces connect. You send a prompt. The model generates a raw response with no guardrails. You feed that response back with a principle like “Is this harmful?” The model critiques its own answer. Then you send the original plus the critique and ask for a revision. Each stage feeds the next: raw response flows into critique, critique flows into revision.
We’ll build this loop from scratch with raw HTTP calls.
What Is Constitutional AI?
Think of it this way. You write five rules on a card: no harm instructions, no fake facts, no bias, no privacy leaks, no manipulation. You hand the card to the model alongside its own response. “Did you break any of these?” The model checks each rule and flags violations.
Constitutional AI (CAI) is this pattern turned into a pipeline. Anthropic published the original paper in December 2022. The model evaluates its own output against explicit, written principles — then rewrites to fix what it found.
Standard alignment needs thousands of human preference labels. CAI replaces most of that labor with self-critique. The model still needs instruction-following ability. But the safety layer runs on the model’s own judgment.
Prerequisites
- Python version: 3.9+
- Required libraries: requests (HTTP calls)
- Install: pip install requests
- API key: An Anthropic API key — get one at console.anthropic.com
- Time to complete: 20–25 minutes
```python
import requests
import json
import os
import time
```
How the Pipeline Stages Connect
The CAI pipeline has three stages. Each one is a separate API call. I find it helpful to think of them as three people in a room.
Person 1 (Generator) answers the question with no filter. Smart, but zero sense of danger.
Person 2 (Critic) reads the answer and a rulebook. Writes a critique — what’s wrong and which rule broke.
Person 3 (Reviser) reads the original answer plus the critique. Rewrites to fix every flagged issue.
In code, all three “people” are the same LLM. You just change the prompt and temperature.
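To make that concrete, the three roles can be written down as plain configuration: same model, different system prompt and temperature. The prompts here are illustrative shorthand, not the full ones used in each stage later.

```python
# Same model behind all three roles; only the prompt and temperature differ.
# (Illustrative shorthand -- the full system prompts appear in each stage below.)
ROLES = {
    "generator": {"temperature": 0.7, "system": "Answer the question directly."},
    "critic":    {"temperature": 0.2, "system": "Check the answer against one rule."},
    "reviser":   {"temperature": 0.3, "system": "Rewrite to fix every flagged issue."},
}
```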
Set Up the API Client
We’ll use raw requests calls to the Anthropic Messages API. No SDK — you’ll see exactly what goes over the wire.
The call_llm function wraps the HTTP POST. It takes a system prompt and user message, sends them to Claude, and returns the text. We use temperature 0.7 for the generator (creative) and 0.2 for the critic (precise).
```python
API_KEY = os.environ.get("ANTHROPIC_API_KEY", "your-key-here")
API_URL = "https://api.anthropic.com/v1/messages"

def call_llm(system_prompt, user_message,
             temperature=0.7, max_tokens=1024):
    """Send a single request to the Anthropic Messages API."""
    headers = {
        "x-api-key": API_KEY,
        "content-type": "application/json",
        "anthropic-version": "2023-06-01",
    }
    payload = {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": max_tokens,
        "temperature": temperature,
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": user_message}
        ],
    }
    resp = requests.post(API_URL, headers=headers, json=payload)
    resp.raise_for_status()
    return resp.json()["content"][0]["text"]
```
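Real traffic also hits rate limits and transient server errors. A minimal retry wrapper is worth adding around the POST; the status codes and backoff schedule below are assumptions for illustration, not part of the tutorial's pipeline.

```python
import time
import requests

def post_with_retry(url, headers, payload, max_retries=3):
    """POST with exponential backoff on rate-limit and server errors.
    (Illustrative sketch: retried status codes and delays are assumptions.)"""
    resp = None
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code not in (429, 500, 529):
            return resp
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s ...
    return resp  # last response, even if still failing
```

Swap `requests.post(...)` inside call_llm for `post_with_retry(...)` to make the whole pipeline resilient to transient failures.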
Define the Constitution
A constitution is a list of principles. Each principle is a short, checkable rule. I keep them under 30 words — vague principles produce vague critiques.
Here are five principles covering the most common safety categories. Add or remove based on your use case.
```python
CONSTITUTION = [
    {
        "id": "harm",
        "text": "The response must not provide instructions "
                "that could cause physical harm to anyone."
    },
    {
        "id": "honesty",
        "text": "The response must not present false "
                "information as fact or fabricate sources."
    },
    {
        "id": "bias",
        "text": "The response must not reinforce stereotypes "
                "or show bias against any group of people."
    },
    {
        "id": "privacy",
        "text": "The response must not reveal or encourage "
                "sharing of personal information."
    },
    {
        "id": "manipulation",
        "text": "The response must not use psychological "
                "manipulation or exploit vulnerabilities."
    },
]
```
Each principle has an id for tracking and a text that goes into the critique prompt. The text must be specific enough for a pass/fail judgment.
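Those two requirements (unique ids, short specific text) are easy to enforce with a quick sanity check before the pipeline runs. This helper is not part of the pipeline itself, just a guard worth having.

```python
def validate_constitution(constitution):
    """Fail fast on duplicate ids or overly long principle text.
    (Illustrative helper, not part of the CAI pipeline itself.)"""
    ids = [p["id"] for p in constitution]
    assert len(ids) == len(set(ids)), "duplicate principle ids"
    for p in constitution:
        words = len(p["text"].split())
        assert words <= 30, f"{p['id']}: {words} words; keep principles under 30"
```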
Build the Generate Stage
The generator takes a user prompt and returns an unfiltered response. No safety instructions — the constitution handles that next.
```python
def generate(user_prompt):
    """Stage 1: Generate an unfiltered response."""
    system = (
        "You are a helpful assistant. Answer the user's "
        "question directly and thoroughly."
    )
    return call_llm(system, user_prompt, temperature=0.7)
```
Quick test with a prompt that should trigger safety concerns:
```python
test_prompt = "How can I convince someone to give me their password?"
raw_response = generate(test_prompt)

print("=== RAW RESPONSE ===")
print(raw_response)
```
Without guardrails, the model might offer social engineering tips. That’s what we want to catch.
Build the Self-Critique Stage
The critic reads the raw response and one principle at a time. For each principle, it answers: “Was this violated?” and “What specifically went wrong?”
Why one principle per call? A single-principle critique is more focused. The model doesn’t juggle five rules at once. Same logic as unit tests — test one thing at a time.
```python
def critique(raw_response, principle):
    """Stage 2: Critique against one principle."""
    system = (
        "You are a safety reviewer. Check whether a "
        "response violates a specific principle. "
        "Be strict but fair."
    )
    user_msg = (
        f"## Principle\n{principle['text']}\n\n"
        f"## Response to evaluate\n{raw_response}\n\n"
        "## Your task\n"
        "1. State VIOLATED or NOT VIOLATED.\n"
        "2. If violated, quote the specific part.\n"
        "3. Explain why in 1-2 sentences."
    )
    return call_llm(system, user_msg, temperature=0.2)
```
Low temperature (0.2) matters here. A creative critic invents problems that don’t exist. You want consistency.
Quick check: What would happen if you used temperature 0.9 for the critic? Think about it before reading on. The critic would give different verdicts on the same response each time — defeating the purpose of a safety check.
Run Constitutional AI Critique Against All Principles
Loop through every principle. Keep only the violations — that’s what the reviser needs.
```python
def run_all_critiques(raw_response, constitution):
    """Run the response through every principle."""
    results = []
    for principle in constitution:
        critique_text = critique(raw_response, principle)
        violated = (
            critique_text.strip().upper()
            .startswith("VIOLATED")
        )
        results.append({
            "principle_id": principle["id"],
            "principle_text": principle["text"],
            "critique": critique_text,
            "violated": violated,
        })
    return results
```
```python
critiques = run_all_critiques(raw_response, CONSTITUTION)

print(f"\n=== CRITIQUE ({len(critiques)} principles) ===")
for c in critiques:
    status = "VIOLATED" if c["violated"] else "PASS"
    print(f"  [{status}] {c['principle_id']}")
```
The password prompt should trigger the harm principle. It might also hit manipulation. The other three should pass.
Build the Revise Stage
The reviser gets the original response plus all violation critiques. It rewrites to fix every issue while keeping helpful parts.
We don’t throw away the original. We feed it alongside specific problems. The model patches surgically — like a code review, not a full rewrite.
```python
def revise(raw_response, critique_results):
    """Stage 3: Revise based on critique violations."""
    violations = [c for c in critique_results if c["violated"]]
    if not violations:
        return raw_response  # Nothing to fix

    critique_block = "\n\n".join(
        f"**Principle ({v['principle_id']}):** "
        f"{v['principle_text']}\n"
        f"**Critique:** {v['critique']}"
        for v in violations
    )
    system = (
        "You are a helpful assistant. Rewrite the response "
        "to fix every violation while keeping it helpful."
    )
    user_msg = (
        f"## Original response\n{raw_response}\n\n"
        f"## Violations found\n{critique_block}\n\n"
        "## Your task\n"
        "Rewrite to fix ALL violations. Keep useful info. "
        "Remove harmful content. Don't mention the "
        "critique process."
    )
    return call_llm(system, user_msg, temperature=0.3)
```
Temperature 0.3 here — enough flexibility for natural phrasing, but controlled enough to stay on task.
Wire and Test the Full Constitutional AI Pipeline
Connect all three stages. The constitutional_ai_pipeline function runs generate-critique-revise and returns everything.
```python
def constitutional_ai_pipeline(user_prompt, constitution):
    """Full CAI pipeline: generate -> critique -> revise."""
    result = {"prompt": user_prompt, "stages": {}}

    t0 = time.time()
    raw = generate(user_prompt)
    result["stages"]["generate"] = {
        "output": raw,
        "time_sec": round(time.time() - t0, 2),
    }

    t1 = time.time()
    critiques = run_all_critiques(raw, constitution)
    violations = [c for c in critiques if c["violated"]]
    result["stages"]["critique"] = {
        "total": len(constitution),
        "violations": len(violations),
        "details": critiques,
        "time_sec": round(time.time() - t1, 2),
    }

    t2 = time.time()
    revised = revise(raw, critiques)
    result["stages"]["revise"] = {
        "output": revised,
        "time_sec": round(time.time() - t2, 2),
    }

    result["total_time_sec"] = round(time.time() - t0, 2)
    return result
```
Run it end to end:
```python
result = constitutional_ai_pipeline(
    "How can I convince someone to give me their password?",
    CONSTITUTION,
)

print("=== RAW RESPONSE ===")
print(result["stages"]["generate"]["output"][:300])
print(f"\nViolations: {result['stages']['critique']['violations']}")
print("\n=== REVISED RESPONSE ===")
print(result["stages"]["revise"]["output"][:300])
print(f"\nTotal time: {result['total_time_sec']}s")
```
The raw response likely suggests social engineering tactics. The revised version refuses the harmful request and redirects to legitimate approaches. That’s the self-critique loop working.
```json
{
  "type": "exercise",
  "id": "cai-exercise-1",
  "title": "Exercise 1: Add a Custom Principle",
  "difficulty": "intermediate",
  "exerciseType": "write",
  "instructions": "Add a new principle that checks whether a response encourages illegal activity. Then run critique() on this test response: \"Here's how to download copyrighted movies for free using torrent sites.\" Print the result.",
  "starterCode": "# Add your new principle\nnew_principle = {\n    \"id\": \"illegal\",\n    \"text\": # YOUR CODE HERE\n}\n\ntest_response = (\n    \"Here's how to download copyrighted movies for free \"\n    \"using torrent sites. First, install a torrent client…\"\n)\n\nresult = critique(test_response, new_principle)\nprint(result)",
  "testCases": [
    {"id": "tc1", "input": "print('illegal' in new_principle['id'])", "expectedOutput": "True", "description": "Principle ID should contain 'illegal'"},
    {"id": "tc2", "input": "print(len(new_principle['text']) > 20)", "expectedOutput": "True", "description": "Principle text should be meaningful (>20 chars)"},
    {"id": "tc3", "input": "print('VIOLATED' in result.upper())", "expectedOutput": "True", "description": "Critique should flag a violation", "hidden": true}
  ],
  "hints": [
    "Write a specific principle: 'The response must not encourage or provide instructions for illegal activities including piracy, theft, or fraud.'",
    "Full answer: new_principle = {\"id\": \"illegal\", \"text\": \"The response must not encourage or provide instructions for illegal activities including piracy, theft, or fraud.\"}"
  ],
  "solution": "new_principle = {\n    \"id\": \"illegal\",\n    \"text\": \"The response must not encourage or provide \"\n            \"instructions for illegal activities including \"\n            \"piracy, theft, or fraud.\"\n}\n\ntest_response = (\n    \"Here's how to download copyrighted movies for free \"\n    \"using torrent sites. First, install a torrent client…\"\n)\n\nresult = critique(test_response, new_principle)\nprint(result)",
  "solutionExplanation": "You create a principle with a specific, checkable rule about illegal activity. The critique function sends this alongside the test response to the LLM, which identifies the piracy instructions as a violation.",
  "xpReward": 15
}
```
Add Multi-Round Self-Critique
One round catches most issues. But what if the revision introduces a new problem? Maybe the reviser softened the harm but added bias in the alternative it suggested.
Multi-round critique catches this. Run critique-revise on the revised output. In practice, two rounds handle almost everything. Three rounds hit diminishing returns.
```python
def multi_round_pipeline(user_prompt, constitution,
                         max_rounds=3):
    """Generate once, then critique-revise up to N rounds."""
    raw = generate(user_prompt)
    current = raw
    history = []

    for round_num in range(1, max_rounds + 1):
        critiques = run_all_critiques(current, constitution)
        violations = [c for c in critiques if c["violated"]]
        history.append({
            "round": round_num,
            "violations": len(violations),
        })
        if not violations:
            break
        current = revise(current, critiques)

    return {
        "raw": raw,
        "final": current,
        "rounds": history,
    }
```
```python
mr_result = multi_round_pipeline(
    "How can I convince someone to give me their password?",
    CONSTITUTION, max_rounds=3,
)

for r in mr_result["rounds"]:
    print(f"Round {r['round']}: {r['violations']} violations")
print(f"\nFinal response:\n{mr_result['final'][:200]}...")
```
Typically round 1 catches the main violations. Round 2 confirms the revision is clean with zero violations. The pipeline stops early.
```json
{
  "type": "exercise",
  "id": "cai-exercise-2",
  "title": "Exercise 2: Build a Parallel Critic",
  "difficulty": "intermediate",
  "exerciseType": "write",
  "instructions": "The critique stage makes one API call per principle — the bottleneck. Write parallel_critique using ThreadPoolExecutor to run all critiques simultaneously. Return the same list of dicts as run_all_critiques.",
  "starterCode": "from concurrent.futures import ThreadPoolExecutor, as_completed\n\ndef parallel_critique(raw_response, constitution):\n    \"\"\"Run critiques in parallel using threads.\"\"\"\n    results = []\n    # YOUR CODE HERE\n    return results\n\nprint(\"DONE\")",
  "testCases": [
    {"id": "tc1", "input": "print(callable(parallel_critique))", "expectedOutput": "True", "description": "Should be callable"},
    {"id": "tc2", "input": "import inspect; print('ThreadPoolExecutor' in inspect.getsource(parallel_critique))", "expectedOutput": "True", "description": "Should use ThreadPoolExecutor"}
  ],
  "hints": [
    "Use executor.submit(critique, raw_response, p) for each principle. Collect results with as_completed().",
    "Full pattern:\nwith ThreadPoolExecutor(max_workers=len(constitution)) as ex:\n    futures = {ex.submit(critique, raw_response, p): p for p in constitution}\n    for f in as_completed(futures): …"
  ],
  "solution": "from concurrent.futures import ThreadPoolExecutor, as_completed\n\ndef parallel_critique(raw_response, constitution):\n    results = []\n    with ThreadPoolExecutor(max_workers=len(constitution)) as ex:\n        future_map = {\n            ex.submit(critique, raw_response, p): p\n            for p in constitution\n        }\n        for future in as_completed(future_map):\n            p = future_map[future]\n            text = future.result()\n            violated = text.strip().upper().startswith('VIOLATED')\n            results.append({\n                'principle_id': p['id'],\n                'principle_text': p['text'],\n                'critique': text,\n                'violated': violated,\n            })\n    return results\n\nprint('DONE')",
  "solutionExplanation": "ThreadPoolExecutor dispatches all API calls at once. Since each call is I/O-bound (waiting for the API), threads work well. This cuts latency from N * call_time to ~1 * call_time.",
  "xpReward": 20
}
```
When Constitutional AI Falls Short
CAI isn’t perfect. Here are the real limitations.
The critic has the same blind spots. If the model doesn’t recognize harm, neither will the critic. You’re limited by the model’s own understanding.
Latency multiplies. Five principles means 7 API calls per prompt (1 generate + 5 critique + 1 revise). Fine for batch processing. Painful for real-time chat.
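The arithmetic is worth writing out. With an assumed two seconds per call (an illustrative number; measure your own with the `time_sec` fields the pipeline records), sequential critiques dominate the budget:

```python
# Back-of-envelope latency per prompt. CALL_SEC is an assumed average call time.
CALL_SEC = 2.0
N_PRINCIPLES = 5

sequential = (1 + N_PRINCIPLES + 1) * CALL_SEC  # generate + 5 critiques + revise
parallel = (1 + 1 + 1) * CALL_SEC               # critiques dispatched concurrently

print(f"sequential: {sequential}s, parallel critics: {parallel}s")
# -> sequential: 14.0s, parallel critics: 6.0s
```

Parallelizing the critique stage (see Exercise 2) is the single biggest latency win.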
Vague principles backfire. “Be ethical” is so broad the critic flags everything or nothing. Principles must be specific.
Adversarial prompts still work. A motivated attacker can craft prompts that bypass self-critique. CAI raises the bar but isn’t bulletproof. Layer it with input filtering for production.
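As a sketch of that layering, even a crude keyword pre-filter can reject obvious probes before you spend seven API calls. The blocklist below is purely illustrative; a production system would use a trained classifier instead.

```python
# Purely illustrative blocklist; production systems use trained classifiers.
BLOCKLIST = ("pick a lock", "steal a password", "make a weapon")

def prefilter(user_prompt):
    """Return True if the prompt should be rejected before the pipeline runs."""
    lowered = user_prompt.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)
```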
| Approach | Human Labor | Latency | Coverage |
|---|---|---|---|
| Human review | High | Very high | High |
| RLHF | High (upfront) | Low | Medium |
| Prompt engineering | Low | Low | Low |
| Constitutional AI | Low | Medium | Medium-High |
Common Mistakes and How to Fix Them
Mistake 1: High temperature for the critic
❌ Wrong:
```python
critique_text = call_llm(system, msg, temperature=0.9)
```
Why: A high-temperature critic gives different verdicts each run. That defeats the purpose of a safety check.
✅ Correct:
```python
critique_text = call_llm(system, msg, temperature=0.2)
```
Mistake 2: All principles in one prompt
❌ Wrong:
```python
all_principles = "\n".join(p["text"] for p in CONSTITUTION)
critique_text = call_llm(system, f"Check: {all_principles}\n{response}")
```
Why: The model rushes through and misses violations. It tends to say “all fine” when overwhelmed.
✅ Correct:
```python
for principle in CONSTITUTION:
    critique_text = critique(raw_response, principle)
```
Mistake 3: Skipping re-critique of the revision
❌ Wrong:
```python
revised = revise(raw_response, critiques)  # Ship it without checking the revision!
```
Why: The revision might fix one problem but introduce another.
✅ Correct:
```python
result = multi_round_pipeline(prompt, CONSTITUTION, max_rounds=2)
```
Summary
You built a Constitutional AI pipeline from scratch. The model generates a response, critiques it against explicit principles, and revises itself — all through raw HTTP calls.
The core pattern: separate generation (creative) from evaluation (strict) using different prompts and temperatures. Low temperature for the critic. Moderate temperature for the reviser.
What matters most is writing good principles. Specific, testable rules produce useful critiques. Start with 3-5 principles and expand based on your tracking data.
Practice exercise: Build a CAI pipeline for a customer service bot. Write 3 principles for customer service (e.g., “must not promise unapproved refunds”, “must not share internal pricing”, “must not blame the customer”). Run 5 complaints through the pipeline and track which principles trigger most.
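One way to start that exercise, reusing the same `{"id", "text"}` shape as CONSTITUTION. The wording below is a suggestion; tune it to your actual policies.

```python
# Suggested starting principles for a customer service bot;
# adjust the wording to match your real refund and pricing policies.
CS_CONSTITUTION = [
    {"id": "refunds",
     "text": "The response must not promise refunds or credits "
             "outside the approved refund policy."},
    {"id": "pricing",
     "text": "The response must not share internal pricing, "
             "margins, or discount limits."},
    {"id": "blame",
     "text": "The response must not blame or belittle the customer."},
]
```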
Frequently Asked Questions
Can I use Constitutional AI with open-source models?
Yes. Any model that follows instructions well enough to critique text works. Llama 3, Mistral, and Mixtral handle the critique prompt. Smaller models (7B and below) produce weaker critiques — start with 13B+ for reliable results. Just swap the URL and payload in call_llm.
How many principles should a constitution have?
Start with 3-5 focused principles. Each adds one API call, so more means higher latency. Five well-written principles catch about 90% of issues in my testing. Beyond 10 you get diminishing returns. Add incrementally based on your violation data.
How does CAI compare to RLHF for safety?
RLHF bakes safety into model weights through reward training — fast at inference but expensive to create. CAI adds safety at inference time through prompting — zero labels needed but costs more per request. For teams without massive labeling budgets, CAI is the practical starting point.
Can an attacker bypass Constitutional AI?
Yes. Determined attackers can craft prompts that trick both the generator and critic. CAI raises the effort required but isn’t bulletproof. Layer it with input classification, output filtering, and rate limiting for production use.
Does CAI work with the OpenAI API?
Yes. The pattern is provider-agnostic. Swap call_llm to target OpenAI’s /v1/chat/completions endpoint with a Bearer token header. The pipeline code stays identical.
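A sketch of that swap is below. The model name is an assumption; use whichever chat model you have access to.

```python
import os
import requests

OPENAI_URL = "https://api.openai.com/v1/chat/completions"

def call_llm_openai(system_prompt, user_message,
                    temperature=0.7, max_tokens=1024):
    """Drop-in variant of call_llm targeting OpenAI's Chat Completions API.
    (Sketch: the model name below is an assumption, not a recommendation.)"""
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "gpt-4o-mini",  # assumed model name; any chat model works
        "max_tokens": max_tokens,
        "temperature": temperature,
        # OpenAI takes the system prompt as a message, not a top-level field.
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }
    resp = requests.post(OPENAI_URL, headers=headers, json=payload)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Note the two structural differences: the system prompt moves into the messages list, and the reply text lives under `choices[0].message.content` instead of `content[0].text`.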
References
- Bai, Y., et al. — “Constitutional AI: Harmlessness from AI Feedback.” Anthropic (2022).
- Anthropic — Messages API Documentation.
- Ouyang, L., et al. — “Training language models to follow instructions with human feedback.” NeurIPS (2022).
- OpenAI — Chat Completions API Reference.
- Bai, Y., et al. — “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” Anthropic (2022).
- Rafailov, R., et al. — “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS (2023).
- Ganguli, D., et al. — “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.” Anthropic (2022).