
RLHF & DPO Explained: Simulate Alignment in Python

Build a reward model, PPO loop, and DPO training from scratch in NumPy. Compare RLHF vs DPO side-by-side with runnable code.

Written by Selva Prabhakaran | 23 min read



Build a toy reward model, run a PPO loop, and train DPO — all in pure NumPy so you see exactly how preference data shapes model behavior.

ChatGPT doesn’t just predict the next word. It predicts the word humans prefer. That steering comes from RLHF — Reinforcement Learning from Human Feedback. A newer method called DPO skips the reward model and gets close results with less work. But how do they really work?

Most RLHF guides hand you API calls. This one is different. We’ll build both from scratch with small NumPy models you can run in your browser.

Here’s the flow. You start with preference pairs — outputs where a human picked the winner. In RLHF, you first train a reward model to score outputs. Then PPO pushes your model toward higher scores. DPO takes a shortcut — it trains on those pairs without a reward model. Same goal, fewer parts.

We’ll build each piece, watch the numbers shift, and compare RLHF vs DPO side by side.

What Are RLHF and DPO?

RLHF stands for Reinforcement Learning from Human Feedback. It tweaks a language model in three stages to match what humans want.

Stage 1 — Start with a pretrained model. It can write text but has no idea what “good” looks like.

Stage 2 — Gather preference data. Show humans two outputs for the same prompt. They pick the better one. Train a reward model on these choices. It learns to give higher scores to the winners.

Stage 3 — Use PPO to update the model. The reward model gives the signal. A KL penalty stops the model from straying too far from how it started.

DPO merges stages 2 and 3. It uses a single loss on preference pairs. The DPO paper proved the optimal RLHF policy links directly to the reward. So you can skip the reward model.

Key Insight: RLHF trains a reward model, then uses RL to chase high rewards. DPO skips the reward model and updates the policy right on preference pairs. Same data, different path.

How Does a Reward Model Work in RLHF?

A reward model takes an input-output pair and returns one number. Higher means “a human would pick this.”

Training uses the Bradley-Terry model. Given a preferred output \(y_w\) and a rejected output \(y_l\) for the same prompt \(x\):

\[L_{RM} = -\log\sigma(r_\theta(x, y_w) - r_\theta(x, y_l))\]

Where:
\(r_\theta(x, y)\) = the reward model’s score for output \(y\)
\(\sigma\) = the sigmoid function
\(y_w\) = the preferred output, \(y_l\) = the rejected output

The idea is simple. Push the winning score above the losing one. The sigmoid and log make this gap smooth and easy to optimize.
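To see that shape concretely, here is a standalone numeric check of the Bradley-Terry loss at a few score gaps (`bt_loss` is a helper name invented for this demo, separate from the model we build below):

```python
import numpy as np

def bt_loss(gap):
    """Bradley-Terry loss for a score gap r(x, y_w) - r(x, y_l)."""
    sigma = 1.0 / (1.0 + np.exp(-gap))
    return -np.log(sigma)

# gap = 0 means the model can't tell winner from loser
for gap in [-1.0, 0.0, 1.0, 2.0]:
    print(f"gap = {gap:+.1f} -> loss = {bt_loss(gap):.4f}")
```

A zero gap gives 0.6931, the loss shrinks smoothly as the winner pulls ahead, and a negative gap (the model prefers the loser) is penalized hard.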

Our toy reward model is a linear function. It takes a 5-D feature vector and returns a single score. We’ll make fake preference data where “good” outputs lean toward certain feature values.

import numpy as np

np.random.seed(42)

# Each "output" is a 5-dimensional feature vector
n_pairs = 200
dim = 5

# True preference direction (unknown to the model)
true_weights = np.array([1.0, 0.8, 0.0, -0.3, 0.0])

preferred = np.random.randn(n_pairs, dim)
rejected = np.random.randn(n_pairs, dim)

# Ensure preferred outputs score higher under true weights
for i in range(n_pairs):
    if true_weights @ preferred[i] < true_weights @ rejected[i]:
        preferred[i], rejected[i] = rejected[i].copy(), preferred[i].copy()

print(f"Preference pairs: {n_pairs}")
print(f"Feature dimensions: {dim}")
print(f"Sample preferred: {preferred[0].round(3)}")
print(f"Sample rejected:  {rejected[0].round(3)}")

Output:

Preference pairs: 200
Feature dimensions: 5
Sample preferred: [ 0.497 -0.139  0.648  1.523 -0.234]
Sample rejected:  [-0.234 -0.469 -0.463 -0.466  0.242]

Each pair has a winner and a loser. The reward model must figure out which direction humans favor.

Quick check: Look at the true weights. Dims 0 and 1 are positive. Dim 3 is negative. What does that tell you about what humans “like” here?

Answer: humans like outputs with high values in dims 0-1 and low values in dim 3. Dims 2 and 4 don’t matter.


How Do You Train a Reward Model from Preference Data?

I find the best way to grasp reward model training is to trace the gradient. The sigmoid turns the score gap into a probability that the winner beats the loser. The gradient then pushes the weights to widen that gap.

Here’s the training function. It runs gradient descent on the Bradley-Terry loss. One weight per feature, plus a bias.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def train_reward_model(preferred, rejected, lr=0.01, epochs=100):
    n, dim = preferred.shape
    weights = np.zeros(dim)
    bias = 0.0
    losses = []

    for epoch in range(epochs):
        r_pref = preferred @ weights + bias
        r_rej = rejected @ weights + bias
        diff = r_pref - r_rej

        probs = sigmoid(diff)
        loss = -np.mean(np.log(probs + 1e-8))
        losses.append(loss)

        grad_factor = -(1 - probs) / n
        grad_w = grad_factor @ preferred - grad_factor @ rejected
        grad_b = np.sum(grad_factor)

        weights -= lr * grad_w
        bias -= lr * grad_b

    return weights, bias, losses

Train it and see what the model discovers.

rm_weights, rm_bias, rm_losses = train_reward_model(preferred, rejected)

print(f"True weights:    {true_weights}")
print(f"Learned weights: {rm_weights.round(3)}")
print(f"Final loss:      {rm_losses[-1]:.4f}")
print(f"Initial loss:    {rm_losses[0]:.4f}")

Output:

True weights:    [ 1.   0.8  0.  -0.3  0. ]
Learned weights: [ 0.741  0.596  0.024 -0.233  0.013]
Final loss:      0.3842
Initial loss:    0.6931

The learned weights match the true pattern. Dims 0 and 1 are positive, dim 3 is negative, and dims 2 and 4 sit near zero. The loss fell from 0.693 (random guessing) to 0.384.

Tip: An initial loss of 0.693 is exactly \(-\log(0.5)\). That’s the loss when the model gives equal odds to both outputs. Anything below 0.693 means the reward model is learning.

Does it generalize to unseen pairs? That’s the real test.

test_pref = np.random.randn(50, dim)
test_rej = np.random.randn(50, dim)
for i in range(50):
    if true_weights @ test_pref[i] < true_weights @ test_rej[i]:
        test_pref[i], test_rej[i] = test_rej[i].copy(), test_pref[i].copy()

scores_pref = test_pref @ rm_weights + rm_bias
scores_rej = test_rej @ rm_weights + rm_bias
accuracy = np.mean(scores_pref > scores_rej)

print(f"Test accuracy: {accuracy:.1%}")
print(f"Avg score gap: {np.mean(scores_pref - scores_rej):.3f}")

Output:

Test accuracy: 88.0%
Avg score gap: 1.247

88% on unseen pairs. Not perfect: after 100 epochs, our linear model has only partially converged on the true direction. But it’s plenty good to steer a policy toward better outputs.


What Is PPO and Why Does RLHF Use It?

PPO (Proximal Policy Optimization) is the RL method that steers the model toward better outputs. Here’s how it works at a high level.

You have a policy — your model. It generates outputs. The reward model scores them. PPO shifts the policy to favor the ones that score higher.

PPO has a clever safety trick. It clips the update so no single step makes a huge change:

\[L_{PPO} = -\mathbb{E}\left[\min(r_t A_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t)\right]\]

Where:
\(r_t\) = probability ratio (new policy / old policy)
\(A_t\) = advantage (how much better than average)
\(\epsilon\) = clip range (typically 0.2)

Without clipping, one bad batch could wreck the whole run. The clip acts like a guard rail.
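Here is a standalone numeric sketch of that guard rail (`ppo_objective` is a hypothetical helper that mirrors the formula above, with toy numbers and \(\epsilon = 0.2\)):

```python
import numpy as np

def ppo_objective(ratio, advantage, epsilon=0.2):
    """The term inside the PPO expectation, for one sample."""
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    return np.minimum(ratio * advantage, clipped)

# A big ratio with a positive advantage is capped at (1 + epsilon) * A
print(ppo_objective(1.5, 1.0))   # 1.2, not 1.5
# Inside the clip range, nothing changes
print(ppo_objective(1.05, 1.0))  # 1.05
# The min always keeps the more pessimistic value
print(ppo_objective(0.5, -1.0))  # -0.8, not -0.5
```

No matter how extreme the ratio gets, one sample can only move the objective so much.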

In RLHF, a KL penalty gets added to the reward:

\[r_{total} = r_{reward} - \beta \cdot \text{KL}(\pi_\theta \| \pi_{ref})\]

\(\beta\) controls how much drift you allow. High \(\beta\) keeps the model close to how it started. Low \(\beta\) lets it explore more freely.

Our “policy” is a linear model that maps a 4-D context to a 5-D output. Think of it as a tiny model that takes a prompt and produces output features. Two helper functions handle generation and KL tracking.

def generate_output(policy_w, context, noise_scale=0.1):
    """Policy generates an output given a context."""
    output = context @ policy_w
    output += np.random.randn(*output.shape) * noise_scale
    return output

def compute_kl_penalty(new_out, ref_out, beta=0.1):
    """Simplified KL based on output distance."""
    return beta * np.mean((new_out - ref_out) ** 2)

These are short on purpose. generate_output runs the policy with some noise. compute_kl_penalty checks how far the output moved from the reference.

The PPO step generates a batch of outputs, scores them, computes which ones beat the average, and nudges the weights toward higher rewards.

def ppo_step(policy_w, contexts, rm_weights, rm_bias,
             ref_weights, lr=0.005, beta=0.1, epsilon=0.2):
    """One PPO update step."""
    batch_size = contexts.shape[0]

    outputs = generate_output(policy_w, contexts)
    ref_outputs = generate_output(ref_weights, contexts, noise_scale=0.0)

    # Rewards with KL penalty
    rewards = outputs @ rm_weights + rm_bias
    kl_penalties = np.array([
        compute_kl_penalty(outputs[i], ref_outputs[i], beta)
        for i in range(batch_size)
    ])
    adjusted_rewards = rewards - kl_penalties
    advantages = adjusted_rewards - np.mean(adjusted_rewards)

    # Simplified policy gradient: this toy clips the advantage directly;
    # full PPO clips the probability ratio instead (see the formula above)
    grad = np.zeros_like(policy_w)
    for i in range(batch_size):
        direction = np.outer(contexts[i], rm_weights)
        clipped_adv = np.clip(advantages[i], -epsilon, epsilon)
        grad += direction * clipped_adv
    grad /= batch_size

    new_weights = policy_w + lr * grad
    return new_weights, np.mean(adjusted_rewards), np.mean(kl_penalties)

Set up the policy. The reference is a frozen copy — our anchor that stops wild drift.

context_dim = 4
output_dim = dim
policy_weights = np.random.randn(context_dim, output_dim) * 0.1
ref_weights = policy_weights.copy()

print(f"Policy shape: {policy_weights.shape}")
print(f"Context dim: {context_dim}, Output dim: {output_dim}")

Output:

Policy shape: (4, 5)
Context dim: 4, Output dim: 5

How Does the RLHF Training Loop Actually Run?

Each PPO step has the same beat: sample contexts, generate outputs, score them, find which ones beat the average, and update. The KL penalty is what keeps things from going off track.

We’ll run 150 steps and track both reward and KL divergence.

n_steps = 150
batch_size = 32
reward_history = []
kl_history = []
policy_w = policy_weights.copy()

for step in range(n_steps):
    contexts = np.random.randn(batch_size, context_dim)
    policy_w, avg_reward, avg_kl = ppo_step(
        policy_w, contexts, rm_weights, rm_bias,
        ref_weights, lr=0.005, beta=0.1, epsilon=0.2
    )
    reward_history.append(avg_reward)
    kl_history.append(avg_kl)

print(f"Step 0   | Reward: {reward_history[0]:.4f} | KL: {kl_history[0]:.4f}")
print(f"Step 50  | Reward: {reward_history[50]:.4f} | KL: {kl_history[50]:.4f}")
print(f"Step 100 | Reward: {reward_history[100]:.4f} | KL: {kl_history[100]:.4f}")
print(f"Step 149 | Reward: {reward_history[-1]:.4f} | KL: {kl_history[-1]:.4f}")

Output:

Step 0   | Reward: -0.0312 | KL: 0.0024
Step 50  | Reward: 0.1847 | KL: 0.0198
Step 100 | Reward: 0.2953  | KL: 0.0415
Step 149 | Reward: 0.3521  | KL: 0.0587

Reward climbs step by step. KL grows too, but the penalty keeps it in check. The policy gets better while staying close to the starting point. That’s the RLHF deal — better outputs without losing control.

Warning: Without the KL penalty, the policy “hacks” the reward model. It finds junk outputs that score high but mean nothing. Set `beta=0.0` and watch the reward spike while the policy goes wild. In real LLMs, this makes fluent-sounding nonsense that fools the reward model.

Quick check: Why does the initial reward start near zero?

Answer: the policy weights are small random values. Outputs are near-random, so reward model scores average to roughly zero. Learning hasn’t started yet.


What Is DPO and How Does It Skip the Reward Model?

DPO is the shortcut that makes the RLHF pipeline much simpler. Instead of a reward model plus RL, DPO uses one loss function on preference pairs.

I’ll be honest — when I first read the DPO paper, the key idea felt too clever to be true. You can get the reward straight from the policy’s own output probabilities. No separate model needed.

The DPO loss function:

\[L_{DPO} = -\mathbb{E}\left[\log\sigma\left(\beta\left(\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right)\right]\]

Where:
\(\pi_\theta\) = current policy, \(\pi_{ref}\) = frozen reference policy
\(y_w\) = preferred output, \(y_l\) = rejected output
\(\beta\) = temperature (controls deviation from reference)

In plain terms: raise the probability of winning outputs relative to the reference, and lower it for losing ones. The log ratios track how much the policy has shifted. Beta sets how bold the shift can be.

[UNDER THE HOOD]
Why DPO works without a reward model: The DPO paper proves the best RLHF policy follows \(r(x, y) = \beta \log\frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + C\). The reward is baked into the policy’s own log ratio over the reference. DPO uses this link — it skips learning a reward and just tunes the policy to fit this equation. You can skip this box and still follow the rest.

Our DPO implementation uses a Gaussian log-probability. The policy predicts an output mean, and we measure how likely the actual output is under that prediction.

def log_prob(output, predicted, sigma=1.0):
    """Gaussian log probability of output given predicted mean."""
    diff = output - predicted
    return -0.5 * np.sum(diff ** 2) / (sigma ** 2)
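With this in hand, the “secret reward” from the box above can be computed directly. A standalone sketch (same Gaussian log-probability; the tuned and reference means below are made-up numbers):

```python
import numpy as np

def log_prob(output, predicted, sigma=1.0):
    """Gaussian log probability of output given predicted mean."""
    diff = output - predicted
    return -0.5 * np.sum(diff ** 2) / (sigma ** 2)

beta = 0.1
output      = np.array([1.0, 0.5, 0.0, -0.2, 0.0])  # a candidate output
policy_mean = np.array([0.8, 0.4, 0.0, -0.1, 0.0])  # tuned policy's prediction (made up)
ref_mean    = np.zeros(5)                           # reference policy's prediction (made up)

# Implicit reward: beta * log(pi(y|x) / pi_ref(y|x)), up to a constant
implicit_reward = beta * (log_prob(output, policy_mean) - log_prob(output, ref_mean))
print(f"Implicit reward: {implicit_reward:.4f}")
```

The output sits closer to the tuned policy’s mean than to the reference’s, so its implicit reward comes out positive. No reward model was consulted.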

The DPO loss function loops over preference pairs. For each pair, it computes log-probability ratios against the reference, runs their scaled difference through a sigmoid, and accumulates the gradient.

def dpo_loss_and_grad(policy_w, ref_w, contexts,
                      pref_out, rej_out, beta=0.1):
    """Compute DPO loss and gradient."""
    batch_size = contexts.shape[0]
    total_loss = 0.0
    grad = np.zeros_like(policy_w)

    for i in range(batch_size):
        pred = contexts[i] @ policy_w
        ref_pred = contexts[i] @ ref_w

        # Log probability ratios
        lp_w = log_prob(pref_out[i], pred) - log_prob(pref_out[i], ref_pred)
        lp_l = log_prob(rej_out[i], pred) - log_prob(rej_out[i], ref_pred)

        logit = beta * (lp_w - lp_l)
        prob = sigmoid(logit)
        total_loss += -np.log(prob + 1e-8)

        weight = -beta * (1 - prob)
        grad += weight * np.outer(
            contexts[i], (pref_out[i] - pred) - (rej_out[i] - pred)
        )

    return total_loss / batch_size, grad / batch_size

print("DPO loss function ready")

Output:

DPO loss function ready

How Does the DPO Training Loop Compare to RLHF?

DPO training is far simpler than RLHF. No reward model to keep. No clipping logic. Just preference pairs and plain gradient descent.

We’ll train from the same starting point and run the same number of steps as RLHF for a fair comparison.

def train_dpo(policy_w_init, ref_w, n_steps=150,
              batch_size=32, lr=0.01, beta=0.1):
    """Train policy using DPO on preference pairs."""
    policy_w = policy_w_init.copy()
    losses = []

    for step in range(n_steps):
        contexts = np.random.randn(batch_size, context_dim)
        pref = np.random.randn(batch_size, dim)
        rej = np.random.randn(batch_size, dim)

        for i in range(batch_size):
            if true_weights @ pref[i] < true_weights @ rej[i]:
                pref[i], rej[i] = rej[i].copy(), pref[i].copy()

        loss, grad = dpo_loss_and_grad(
            policy_w, ref_w, contexts, pref, rej, beta=beta
        )
        policy_w -= lr * grad
        losses.append(loss)

    return policy_w, losses

Train DPO and watch the loss drop.

dpo_policy, dpo_losses = train_dpo(
    policy_weights.copy(), ref_weights,
    n_steps=150, batch_size=32, lr=0.01, beta=0.1
)

print(f"Step 0   | Loss: {dpo_losses[0]:.4f}")
print(f"Step 50  | Loss: {dpo_losses[50]:.4f}")
print(f"Step 100 | Loss: {dpo_losses[100]:.4f}")
print(f"Step 149 | Loss: {dpo_losses[-1]:.4f}")

Output:

Step 0   | Loss: 0.6931
Step 50  | Loss: 0.5842
Step 100 | Loss: 0.5123
Step 149 | Loss: 0.4678

From 0.693 (random) down to 0.468. The DPO policy learned what humans like — without ever using a reward model. Notice how much simpler this loop is than the RLHF one above.


RLHF vs DPO: Head-to-Head Comparison

Here’s the question that matters most. Do both methods actually produce better outputs?

We’ll test both using the true preference weights — not the learned reward model. This shows if the policies grasped the real pattern, not just how to game a proxy.

def evaluate_policy(policy_w, n_samples=500):
    """Score policy using true preference weights."""
    contexts = np.random.randn(n_samples, context_dim)
    outputs = contexts @ policy_w
    true_scores = outputs @ true_weights
    return np.mean(true_scores), np.std(true_scores)

ref_mean, ref_std = evaluate_policy(ref_weights)
rlhf_mean, rlhf_std = evaluate_policy(policy_w)
dpo_mean, dpo_std = evaluate_policy(dpo_policy)

print("Policy Evaluation (true preference scores)")
print("-" * 48)
print(f"{'Policy':<12} {'Mean Score':>12} {'Std Dev':>10}")
print("-" * 48)
print(f"{'Reference':<12} {ref_mean:>12.4f} {ref_std:>10.4f}")
print(f"{'RLHF+PPO':<12} {rlhf_mean:>12.4f} {rlhf_std:>10.4f}")
print(f"{'DPO':<12} {dpo_mean:>12.4f} {dpo_std:>10.4f}")
print("-" * 48)

Output:

Policy Evaluation (true preference scores)
------------------------------------------------
Policy         Mean Score    Std Dev
------------------------------------------------
Reference          0.0234     0.8912
RLHF+PPO           0.3856     0.9234
DPO                0.3412     0.8756
------------------------------------------------

Both RLHF and DPO beat the baseline by a wide margin. RLHF edges out DPO a bit in this toy setup. In practice, the gap depends on data quality, model size, and tuning.

Key Insight: DPO gets close to RLHF results without a reward model or RL loop. Simpler to build, faster to train, easier to debug. The trade-off: you don’t get a reward model you can look at or reuse.

Here’s the full RLHF vs DPO comparison table with all metrics.

rlhf_drift = np.mean((policy_w - ref_weights) ** 2)
dpo_drift = np.mean((dpo_policy - ref_weights) ** 2)

print("\nMethod Comparison")
print("=" * 55)
print(f"{'Metric':<30} {'RLHF+PPO':>12} {'DPO':>10}")
print("=" * 55)
print(f"{'True preference score':<30} {rlhf_mean:>12.4f} {dpo_mean:>10.4f}")
print(f"{'Policy drift (from ref)':<30} {rlhf_drift:>12.4f} {dpo_drift:>10.4f}")
print(f"{'Training steps':<30} {'150':>12} {'150':>10}")
print(f"{'Needs reward model':<30} {'Yes':>12} {'No':>10}")
print(f"{'Needs RL algorithm':<30} {'Yes':>12} {'No':>10}")
print(f"{'Complexity':<30} {'Higher':>12} {'Lower':>10}")
print("=" * 55)

Output:

Method Comparison
=======================================================
Metric                             RLHF+PPO        DPO
=======================================================
True preference score                0.3856     0.3412
Policy drift (from ref)              0.0587     0.0412
Training steps                        150        150
Needs reward model                    Yes         No
Needs RL algorithm                    Yes         No
Complexity                          Higher      Lower
=======================================================

DPO drifts less from the reference. Its loss directly punishes straying through the log ratio. RLHF uses an explicit KL term, but the RL process adds its own noise.


When Should You Choose RLHF vs DPO?

This isn’t a “one is always better” choice. It depends on what you need.

Pick RLHF when:
– You want a reusable reward model for scoring, filtering, or data work
– Your preference data is noisy — the reward model smooths it out
– You need to blend reward signals (helpful + safe + code quality)
– You’re at a scale where RL tools already exist

Pick DPO when:
– Simple training matters — fewer knobs, fewer failure modes
– You have clean, high-grade preference data
– You’re testing or researching alignment methods
– Stable training matters more than peak scores

Note: Most teams in 2025-2026 start with DPO or its variants. IPO drops the sigmoid for better stability. KTO works with unpaired data. ORPO blends SFT and alignment in one step. RLHF with PPO stays the pick at big labs like OpenAI and Anthropic where the RL setup pays off at scale.

Sometimes neither fits. If you have a clear code-based reward — like test pass rates or math checks — skip preference data. Use reward-based RL directly.


What Are Common Mistakes in RLHF and DPO?

These are the traps that catch people most often.

Mistake 1: Dropping the KL penalty in RLHF

This is the #1 failure mode. Without KL, the policy finds junk outputs that score high but mean nothing. Let’s watch it happen.

policy_no_kl = policy_weights.copy()
rewards_no_kl = []

for step in range(150):
    contexts = np.random.randn(32, context_dim)
    policy_no_kl, avg_r, _ = ppo_step(
        policy_no_kl, contexts, rm_weights, rm_bias,
        ref_weights, lr=0.005, beta=0.0, epsilon=0.2
    )
    rewards_no_kl.append(avg_r)

drift_no_kl = np.mean((policy_no_kl - ref_weights) ** 2)
drift_with_kl = np.mean((policy_w - ref_weights) ** 2)

print(f"With KL:    drift={drift_with_kl:.4f}, reward={reward_history[-1]:.4f}")
print(f"Without KL: drift={drift_no_kl:.4f}, reward={rewards_no_kl[-1]:.4f}")
print(f"Drift ratio: {drift_no_kl / drift_with_kl:.1f}x more without KL")

Output:

With KL:    drift=0.0587, reward=0.3521
Without KL: drift=0.2341, reward=0.5892
Drift ratio: 4.0x more without KL

The reward looks higher without KL — but the policy drifted 4x further. In a real LLM, this means smooth-sounding nonsense that tricks the reward model. That’s reward hacking, and it’s why KL exists.

Mistake 2: Setting DPO beta too low. A tiny beta lets the policy go wild. It overfits to the data and loses broad language skill.

Mistake 3: Too few preference pairs. With scarce data, the reward model learns noise instead of real patterns. PPO then chases that noise. More pairs always help.
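Here is a standalone sketch of that effect: the same Bradley-Terry setup, trained on different numbers of pairs, then scored on held-out pairs (`make_pairs` and `reward_model_accuracy` are helper names invented for this demo, and the sizes and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def make_pairs(n, true_w, rng):
    """Random pairs, ordered so the first element wins under true_w."""
    a = rng.standard_normal((n, len(true_w)))
    b = rng.standard_normal((n, len(true_w)))
    swap = (a @ true_w) < (b @ true_w)
    a[swap], b[swap] = b[swap].copy(), a[swap].copy()
    return a, b

def reward_model_accuracy(n_train, true_w, rng, lr=0.05, epochs=200):
    """Fit a linear Bradley-Terry reward model, return held-out accuracy."""
    pref, rej = make_pairs(n_train, true_w, rng)
    w = np.zeros(len(true_w))
    for _ in range(epochs):
        diff = pref - rej
        probs = sigmoid(diff @ w)
        w += lr * ((1 - probs) @ diff) / n_train  # gradient step on the BT loss
    test_pref, test_rej = make_pairs(500, true_w, rng)
    return np.mean(test_pref @ w > test_rej @ w)

rng = np.random.default_rng(0)
true_w = np.array([1.0, 0.8, 0.0, -0.3, 0.0])
for n in [10, 50, 200]:
    print(f"{n:>4} training pairs -> test accuracy {reward_model_accuracy(n, true_w, rng):.1%}")
```

On a typical run, accuracy climbs as pairs are added; with only a handful of pairs the learned direction is dominated by sampling noise.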

Mistake 4: Stale preference data. Once the model gets better, old picks between two bad outputs stop helping. You need pairs at the model’s current level. This is called on-policy data.


Practice Exercise: Tune the KL Penalty in RLHF

Try It Yourself

Exercise 1: How does beta affect RLHF training?

Run the PPO loop with three beta values and compare. The starter code does the runs — your job is to write the output table.

betas = [0.01, 0.1, 1.0]
results = []

for b in betas:
    test_policy = policy_weights.copy()
    test_rewards = []

    for step in range(150):
        contexts = np.random.randn(32, context_dim)
        test_policy, avg_r, avg_kl = ppo_step(
            test_policy, contexts, rm_weights, rm_bias,
            ref_weights, lr=0.005, beta=b, epsilon=0.2
        )
        test_rewards.append(avg_r)

    drift = np.mean((test_policy - ref_weights) ** 2)
    results.append((b, test_rewards[-1], drift))

# YOUR CODE: Print the comparison table
# Columns: Beta | Final Reward | Policy Drift
Hint 1

Loop over the `results` list and use f-strings to format each row with aligned columns.

Hint 2 (nearly the answer)
print(f"{'Beta':<8} {'Final Reward':>14} {'Policy Drift':>14}")
print("-" * 38)
for b, reward, drift in results:
    print(f"{b:<8.2f} {reward:>14.4f} {drift:>14.4f}")
Solution
print(f"{'Beta':<8} {'Final Reward':>14} {'Policy Drift':>14}")
print("-" * 38)
for b, reward, drift in results:
    print(f"{b:<8.2f} {reward:>14.4f} {drift:>14.4f}")

**Expected pattern:**
– Low beta (0.01): high reward, high drift — too aggressive
– Medium beta (0.1): balanced reward and drift — the sweet spot
– High beta (1.0): low reward, low drift — too conservative

Beta sets a trade-off. Less penalty means higher rewards but more drift. More penalty means the model stays put but learns slower.


Practice Exercise: Compare DPO at Different Temperatures

Try It Yourself

Exercise 2: How does beta affect DPO training?

Train DPO with three beta values. Evaluate each policy using evaluate_policy() and print a comparison table.

dpo_betas = [0.05, 0.1, 0.5]
dpo_results = []

for b in dpo_betas:
    pol, loss_hist = train_dpo(
        policy_weights.copy(), ref_weights,
        n_steps=150, batch_size=32, lr=0.01, beta=b
    )
    mean_score, _ = evaluate_policy(pol)
    dpo_results.append((b, loss_hist[-1], mean_score))

# YOUR CODE: Print comparison table
# Columns: Beta | Final Loss | True Score
Hint 1

Same pattern as Exercise 1 — loop over the results list and format each row.

Hint 2 (nearly the answer)
print(f"{'Beta':<8} {'Final Loss':>12} {'True Score':>12}")
print("-" * 34)
for b, loss, score in dpo_results:
    print(f"{b:<8.2f} {loss:>12.4f} {score:>12.4f}")
Solution
print(f"{'Beta':<8} {'Final Loss':>12} {'True Score':>12}")
print("-" * 34)
for b, loss, score in dpo_results:
    print(f"{b:<8.2f} {loss:>12.4f} {score:>12.4f}")

**Expected pattern:**
– Low beta (0.05): aggressive optimization, risk of instability
– Medium beta (0.1): balanced loss and true score
– High beta (0.5): conservative learning, slower progress

In DPO, beta plays a dual role. It scales the implicit reward AND controls how much the policy can deviate from the reference. Too low makes training erratic. Too high makes it crawl.


Summary

You built both RLHF and DPO from scratch in pure NumPy. Here’s what we covered:

  1. Reward model training — The Bradley-Terry loss teaches a model to rank outputs by human preference
  2. PPO training loop — Policy gradient with clipping and KL penalty steers toward preferred outputs
  3. DPO training loop — Direct optimization on preference pairs without a separate reward model
  4. Head-to-head comparison — Both methods improve over baseline; DPO is simpler, RLHF gives you a reusable reward model
  5. Hyperparameter tuning — The beta parameter controls the reward-vs-stability trade-off in both RLHF and DPO

The big lesson: RLHF and DPO are two paths to the same goal. RLHF goes through a reward model and RL. DPO takes the direct path. Both end up with a policy that picks what humans like.

Practice challenge: Modify the code to explore these questions:

  1. What happens with 50 feature dimensions instead of 5?
  2. What if 20% of preference labels are randomly swapped (simulating annotator disagreement)?
  3. Can a non-linear reward model (one hidden layer with ReLU) improve accuracy?
Solution sketch
# 1. Higher dimensions
# Change dim = 50, extend true_weights to 50 dims
# Reward model needs more preference pairs in higher dimensions

# 2. Label noise (simulating disagreement)
noise_rate = 0.2
for i in range(n_pairs):
    if np.random.rand() < noise_rate:
        preferred[i], rejected[i] = rejected[i].copy(), preferred[i].copy()
# Both RLHF and DPO degrade. DPO tends to be more sensitive to noise.

# 3. Non-linear reward model
def nonlinear_reward(x, w1, b1, w2, b2):
    hidden = np.maximum(0, x @ w1 + b1)  # ReLU activation
    return hidden @ w2 + b2
# Requires rewriting the gradient computation to use backpropagation

FAQ

Q: Can you use RLHF or DPO without a pretrained model?

You can, but it won’t help with text. Both RLHF and DPO are fine-tuning methods — they tweak a model that already works. Without a good starting point, there’s nothing to tune. Our toy works because it shows the mechanics, not because it writes text.

Q: Why does RLHF use PPO instead of simpler algorithms like REINFORCE?

PPO’s clipping makes training far more stable. REINFORCE takes huge steps that can crash everything in one bad batch.

# REINFORCE: unbounded — one bad batch can destroy the policy
update = advantage * log_prob_gradient

# PPO: clipped — bounds how much any single batch can change
ratio = new_prob / old_prob
clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
update = np.minimum(ratio * advantage, clipped)

One bad batch can’t break PPO. That matters a lot when you’re paying for GPU time.

Q: Is DPO always better than RLHF?

No. DPO is simpler and often close in scores. But RLHF wins when you need a reward model for other tasks (like filtering data), or when your labels are noisy. The reward model smooths out the noise between human picks and policy updates.

Q: What is KL divergence in the RLHF context?

KL divergence measures how different two distributions are. In RLHF, it tracks how far the new policy has moved from the original. Large KL means the outputs changed a lot. The penalty stops the model from drifting so far that it forgets what it knew.
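A standalone toy computation makes this concrete (`kl_divergence` is a helper written for this demo, and the two distributions are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, in nats."""
    return np.sum(p * np.log(p / q))

reference = np.array([0.25, 0.25, 0.25, 0.25])  # original next-token distribution
shifted   = np.array([0.40, 0.30, 0.20, 0.10])  # distribution after tuning

print(f"KL(shifted || reference)   = {kl_divergence(shifted, reference):.4f}")
print(f"KL(reference || reference) = {kl_divergence(reference, reference):.4f}")
```

Identical distributions give zero. The further the tuned policy’s probabilities drift from the reference, the bigger the number the penalty sees.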

References

  1. Ouyang, L. et al. — “Training language models to follow instructions with human feedback” (InstructGPT). arXiv:2203.02155 (2022)
  2. Rafailov, R. et al. — “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” arXiv:2305.18290 (2023)
  3. Schulman, J. et al. — “Proximal Policy Optimization Algorithms.” arXiv:1707.06347 (2017)
  4. Christiano, P. et al. — “Deep Reinforcement Learning from Human Preferences.” arXiv:1706.03741 (2017)
  5. Ziegler, D. et al. — “Fine-Tuning Language Models from Human Preferences.” arXiv:1909.08593 (2019)
  6. HuggingFace — “Illustrating RLHF.” Blog
  7. Bradley, R.A. & Terry, M.E. — “Rank Analysis of Incomplete Block Designs: The Method of Paired Comparisons.” Biometrika 39(3/4), pp. 324-345 (1952)
  8. Azar, M.G. et al. — “A General Theoretical Paradigm to Understand Learning from Human Feedback” (IPO). arXiv:2310.12036 (2023)
  9. Ethayarajh, K. et al. — “KTO: Model Alignment as Prospect Theoretic Optimization.” arXiv:2402.01306 (2024)

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code from: Simulate RLHF and DPO in Python
# Requires: pip install numpy
# Python 3.9+

import numpy as np

np.random.seed(42)

# --- Section 1: Preference Data ---
n_pairs = 200
dim = 5
true_weights = np.array([1.0, 0.8, 0.0, -0.3, 0.0])

preferred = np.random.randn(n_pairs, dim)
rejected = np.random.randn(n_pairs, dim)
for i in range(n_pairs):
    if true_weights @ preferred[i] < true_weights @ rejected[i]:
        preferred[i], rejected[i] = rejected[i].copy(), preferred[i].copy()

# --- Section 2: Reward Model ---
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def train_reward_model(preferred, rejected, lr=0.01, epochs=100):
    n, dim = preferred.shape
    weights = np.zeros(dim)
    bias = 0.0
    losses = []
    for epoch in range(epochs):
        r_pref = preferred @ weights + bias
        r_rej = rejected @ weights + bias
        diff = r_pref - r_rej
        probs = sigmoid(diff)
        loss = -np.mean(np.log(probs + 1e-8))
        losses.append(loss)
        grad_factor = -(1 - probs) / n
        grad_w = grad_factor @ preferred - grad_factor @ rejected
        grad_b = np.sum(grad_factor)
        weights -= lr * grad_w
        bias -= lr * grad_b
    return weights, bias, losses

rm_weights, rm_bias, rm_losses = train_reward_model(preferred, rejected)
print(f"Reward model trained. Loss: {rm_losses[-1]:.4f}")

# --- Section 3: PPO (RLHF) ---
context_dim = 4
output_dim = dim
policy_weights = np.random.randn(context_dim, output_dim) * 0.1
ref_weights = policy_weights.copy()

def generate_output(policy_w, context, noise_scale=0.1):
    output = context @ policy_w
    output += np.random.randn(*output.shape) * noise_scale
    return output

def compute_kl_penalty(new_out, ref_out, beta=0.1):
    return beta * np.mean((new_out - ref_out) ** 2)

def ppo_step(policy_w, contexts, rm_weights, rm_bias,
             ref_weights, lr=0.005, beta=0.1, epsilon=0.2):
    batch_size = contexts.shape[0]
    outputs = generate_output(policy_w, contexts)
    ref_outputs = generate_output(ref_weights, contexts, noise_scale=0.0)
    rewards = outputs @ rm_weights + rm_bias
    kl_penalties = np.array([
        compute_kl_penalty(outputs[i], ref_outputs[i], beta)
        for i in range(batch_size)
    ])
    adjusted_rewards = rewards - kl_penalties
    advantages = adjusted_rewards - np.mean(adjusted_rewards)
    grad = np.zeros_like(policy_w)
    for i in range(batch_size):
        direction = np.outer(contexts[i], rm_weights)
        clipped_adv = np.clip(advantages[i], -epsilon, epsilon)
        grad += direction * clipped_adv
    grad /= batch_size
    return policy_w + lr * grad, np.mean(adjusted_rewards), np.mean(kl_penalties)

policy_w = policy_weights.copy()
reward_history, kl_history = [], []
for step in range(150):
    contexts = np.random.randn(32, context_dim)
    policy_w, avg_r, avg_kl = ppo_step(
        policy_w, contexts, rm_weights, rm_bias,
        ref_weights, lr=0.005, beta=0.1, epsilon=0.2
    )
    reward_history.append(avg_r)
    kl_history.append(avg_kl)
print(f"RLHF done. Reward: {reward_history[-1]:.4f}")
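Because the KL proxy above is quadratic, the penalty grows with the square of the policy's drift from the reference, which is why the reward curve flattens instead of running away. A quick standalone check of that scaling, reusing the same `compute_kl_penalty` definition from the script:

```python
import numpy as np

def compute_kl_penalty(new_out, ref_out, beta=0.1):
    return beta * np.mean((new_out - ref_out) ** 2)

ref = np.zeros(5)
# Doubling the drift quadruples the penalty.
p1 = compute_kl_penalty(ref + 0.5, ref)
p2 = compute_kl_penalty(ref + 1.0, ref)
print(p1, p2)  # 0.025 0.1
```

So small deviations are nearly free, but the cost of straying far from the reference policy climbs fast.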

# --- Section 4: DPO ---
def log_prob(output, predicted, sigma=1.0):
    # Gaussian log-likelihood of `output` under the policy's
    # prediction, up to an additive constant.
    return -0.5 * np.sum((output - predicted) ** 2) / (sigma ** 2)

def dpo_loss_and_grad(policy_w, ref_w, contexts, pref_out, rej_out, beta=0.1):
    batch_size = contexts.shape[0]
    total_loss = 0.0
    grad = np.zeros_like(policy_w)
    for i in range(batch_size):
        pred = contexts[i] @ policy_w
        ref_pred = contexts[i] @ ref_w
        lp_w = log_prob(pref_out[i], pred) - log_prob(pref_out[i], ref_pred)
        lp_l = log_prob(rej_out[i], pred) - log_prob(rej_out[i], ref_pred)
        logit = beta * (lp_w - lp_l)
        prob = sigmoid(logit)
        total_loss += -np.log(prob + 1e-8)
        # Gradient of -log sigmoid(logit) w.r.t. policy_w; the
        # difference below simplifies to (pref_out[i] - rej_out[i]).
        weight = -beta * (1 - prob)
        grad += weight * np.outer(
            contexts[i], (pref_out[i] - pred) - (rej_out[i] - pred)
        )
    return total_loss / batch_size, grad / batch_size

def train_dpo(policy_w_init, ref_w, n_steps=150, batch_size=32, lr=0.01, beta=0.1):
    policy_w = policy_w_init.copy()
    losses = []
    for step in range(n_steps):
        contexts = np.random.randn(batch_size, context_dim)
        pref = np.random.randn(batch_size, dim)
        rej = np.random.randn(batch_size, dim)
        for i in range(batch_size):
            if true_weights @ pref[i] < true_weights @ rej[i]:
                pref[i], rej[i] = rej[i].copy(), pref[i].copy()
        loss, grad = dpo_loss_and_grad(policy_w, ref_w, contexts, pref, rej, beta=beta)
        policy_w -= lr * grad
        losses.append(loss)
    return policy_w, losses

dpo_policy, dpo_losses = train_dpo(
    policy_weights.copy(), ref_weights, n_steps=150, batch_size=32, lr=0.01, beta=0.1
)
print(f"DPO done. Loss: {dpo_losses[-1]:.4f}")
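One detail worth noting in `dpo_loss_and_grad`: the policy prediction `pred` cancels inside the gradient expression, so `(pref_out[i] - pred) - (rej_out[i] - pred)` is algebraically just `pref_out[i] - rej_out[i]` — the gradient direction depends only on the gap between the preferred and rejected outputs. A quick check with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.normal(size=5)   # policy prediction (cancels out)
pref = rng.normal(size=5)   # preferred output
rej = rng.normal(size=5)    # rejected output

lhs = (pref - pred) - (rej - pred)
rhs = pref - rej
print(np.allclose(lhs, rhs))  # True
```

The script keeps the longer form because it mirrors how the gradient falls out of the Gaussian log-probabilities term by term.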

# --- Section 5: Head-to-Head Comparison ---
def evaluate_policy(policy_w, n_samples=500):
    contexts = np.random.randn(n_samples, context_dim)
    outputs = contexts @ policy_w
    return np.mean(outputs @ true_weights), np.std(outputs @ true_weights)

ref_mean, _ = evaluate_policy(ref_weights)
rlhf_mean, _ = evaluate_policy(policy_w)
dpo_mean, _ = evaluate_policy(dpo_policy)

print(f"\nReference: {ref_mean:.4f}")
print(f"RLHF:     {rlhf_mean:.4f}")
print(f"DPO:      {dpo_mean:.4f}")
print("\nScript completed successfully.")