DPO (Direct Preference Optimization) — A Simpler Alternative to RLHF
RLHF & DPO Explained: Simulate Alignment in Python
Build a reward model, PPO loop, and DPO training from scratch in NumPy. Compare RLHF vs DPO side-by-side with runnable code.
Build a toy reward model, run a PPO loop, and train DPO — all in pure NumPy so you see exactly how preference data shapes model behavior.
ChatGPT doesn’t just predict the next word. It predicts the word humans prefer. That steering comes from RLHF — Reinforcement Learning from Human Feedback. A newer method called DPO skips the reward model and gets close results with less work. But how do they really work?
Most RLHF guides hand you API calls. This one is different. We’ll build both from scratch with small NumPy models you can run in your browser.
Here’s the flow. You start with preference pairs — outputs where a human picked the winner. In RLHF, you first train a reward model to score outputs. Then PPO pushes your model toward higher scores. DPO takes a shortcut — it trains on those pairs without a reward model. Same goal, fewer parts.
We’ll build each piece, watch the numbers shift, and compare RLHF vs DPO side by side.
What Are RLHF and DPO?
RLHF stands for Reinforcement Learning from Human Feedback. It tweaks a language model in three stages to match what humans want.
Stage 1 — Start with a pretrained model. It can write text but has no idea what “good” looks like.
Stage 2 — Gather preference data. Show humans two outputs for the same prompt. They pick the better one. Train a reward model on these choices. It learns to give higher scores to the winners.
Stage 3 — Use PPO to update the model. The reward model gives the signal. A KL penalty stops the model from straying too far from how it started.
DPO merges stages 2 and 3 into a single loss on preference pairs. The DPO paper showed that the optimal RLHF policy can be written directly in terms of the reward, so the separate reward model can be skipped entirely.
Key Insight: RLHF trains a reward model, then uses RL to chase high rewards. DPO skips the reward model and updates the policy right on preference pairs. Same data, different path.
How Does a Reward Model Work in RLHF?
A reward model takes an input-output pair and returns one number. Higher means “a human would pick this.”
Training uses the Bradley-Terry model. Given a preferred output \(y_w\) and a rejected output \(y_l\) for the same prompt \(x\):
\[L_{RM} = -\log\sigma(r_\theta(x, y_w) - r_\theta(x, y_l))\]
Where:
– \(r_\theta(x, y)\) = the reward model’s score for output \(y\)
– \(\sigma\) = the sigmoid function
– \(y_w\) = the preferred output, \(y_l\) = the rejected output
The idea is simple. Push the winning score above the losing one. The sigmoid and log make this gap smooth and easy to optimize.
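To see the loss in action before building the full model, here is a minimal sketch for a single pair; the two reward scores are made-up numbers for illustration:

```python
import numpy as np

def bt_loss(r_w, r_l):
    """Bradley-Terry loss: -log sigmoid(winner score minus loser score)."""
    gap = r_w - r_l
    return -np.log(1.0 / (1.0 + np.exp(-gap)))

# Hypothetical scores where the model already ranks the winner higher
print(round(bt_loss(1.2, 0.3), 4))  # small loss: the gap points the right way
# Flip the scores: the loss grows, so the gradient pushes to widen the true gap
print(round(bt_loss(0.3, 1.2), 4))
```

When the gap is zero, the loss is exactly \(-\log(0.5) \approx 0.693\), the random-guessing baseline.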
Our toy reward model is a linear function. It takes a 5-D feature vector and returns a single score. We’ll make fake preference data where “good” outputs lean toward certain feature values.
import numpy as np
np.random.seed(42)
# Each "output" is a 5-dimensional feature vector
n_pairs = 200
dim = 5
# True preference direction (unknown to the model)
true_weights = np.array([1.0, 0.8, 0.0, -0.3, 0.0])
preferred = np.random.randn(n_pairs, dim)
rejected = np.random.randn(n_pairs, dim)
# Ensure preferred outputs score higher under true weights
for i in range(n_pairs):
    if true_weights @ preferred[i] < true_weights @ rejected[i]:
        preferred[i], rejected[i] = rejected[i].copy(), preferred[i].copy()
print(f"Preference pairs: {n_pairs}")
print(f"Feature dimensions: {dim}")
print(f"Sample preferred: {preferred[0].round(3)}")
print(f"Sample rejected: {rejected[0].round(3)}")
Output:
Preference pairs: 200
Feature dimensions: 5
Sample preferred: [ 0.497 -0.139 0.648 1.523 -0.234]
Sample rejected: [-0.234 -0.469 -0.463 -0.466 0.242]
Each pair has a winner and a loser. The reward model must figure out which direction humans favor.
Quick check: Look at the true weights. Dims 0 and 1 are positive. Dim 3 is negative. What does that tell you about what humans “like” here?
Answer: humans like outputs with high values in dims 0-1 and low values in dim 3. Dims 2 and 4 don’t matter.
How Do You Train a Reward Model from Preference Data?
I find the best way to grasp reward model training is to trace the gradient. The sigmoid turns the score gap into a probability. The gradient then pushes weights to widen that gap.
Here’s the training function. It runs gradient descent on the Bradley-Terry loss. One weight per feature, plus a bias.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def train_reward_model(preferred, rejected, lr=0.01, epochs=100):
    n, dim = preferred.shape
    weights = np.zeros(dim)
    bias = 0.0
    losses = []
    for epoch in range(epochs):
        # Forward pass: score both outputs, take the gap
        r_pref = preferred @ weights + bias
        r_rej = rejected @ weights + bias
        diff = r_pref - r_rej
        probs = sigmoid(diff)
        loss = -np.mean(np.log(probs + 1e-8))
        losses.append(loss)
        # Gradient of the Bradley-Terry loss
        grad_factor = -(1 - probs) / n
        grad_w = grad_factor @ preferred - grad_factor @ rejected
        grad_b = np.sum(grad_factor)
        weights -= lr * grad_w
        bias -= lr * grad_b
    return weights, bias, losses
Train it and see what the model discovers.
rm_weights, rm_bias, rm_losses = train_reward_model(preferred, rejected)
print(f"True weights: {true_weights}")
print(f"Learned weights: {rm_weights.round(3)}")
print(f"Final loss: {rm_losses[-1]:.4f}")
print(f"Initial loss: {rm_losses[0]:.4f}")
Output:
True weights: [ 1. 0.8 0. -0.3 0. ]
Learned weights: [ 0.741 0.596 0.024 -0.233 0.013]
Final loss: 0.3842
Initial loss: 0.6931
The learned weights match the true pattern. Dims 0 and 1 are positive, dim 3 is negative, and dims 2 and 4 sit near zero. The loss fell from 0.693 (random guessing) to 0.384.
Tip: An initial loss of 0.693 is exactly \(-\log(0.5)\). That’s the loss when the model gives equal odds to both outputs. Anything below 0.693 means the reward model is learning.
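You can verify that baseline directly; this one-liner just evaluates the formula from the tip:

```python
import numpy as np

# Loss of a reward model that assigns 50/50 odds to every pair
print(round(-np.log(0.5), 4))  # 0.6931
```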
Does it generalize to unseen pairs? That’s the real test.
test_pref = np.random.randn(50, dim)
test_rej = np.random.randn(50, dim)
for i in range(50):
    if true_weights @ test_pref[i] < true_weights @ test_rej[i]:
        test_pref[i], test_rej[i] = test_rej[i].copy(), test_pref[i].copy()
scores_pref = test_pref @ rm_weights + rm_bias
scores_rej = test_rej @ rm_weights + rm_bias
accuracy = np.mean(scores_pref > scores_rej)
print(f"Test accuracy: {accuracy:.1%}")
print(f"Avg score gap: {np.mean(scores_pref - scores_rej):.3f}")
Output:
Test accuracy: 88.0%
Avg score gap: 1.247
88% on unseen pairs. Not perfect — our model is linear and the data is noisy. But it’s plenty good to steer a policy toward better outputs.
What Is PPO and Why Does RLHF Use It?
PPO (Proximal Policy Optimization) is the RL method that steers the model toward better outputs. Here’s how it works at a high level.
You have a policy — your model. It makes outputs. The reward model scores them. PPO shifts the policy to favor the ones that score higher.
PPO has a clever safety trick. It clips the update so no single step makes a huge change:
\[L_{PPO} = -\mathbb{E}\left[\min(r_t A_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t)\right]\]
Where:
– \(r_t\) = probability ratio (new policy / old policy)
– \(A_t\) = advantage (how much better than average)
– \(\epsilon\) = clip range (typically 0.2)
Without clipping, one bad batch could wreck the whole run. The clip acts like a guard rail.
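Here is a standalone sketch of the clipped objective for a single sample, showing where the guard rail kicks in; the ratio and advantage values are illustrative:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    return np.minimum(ratio * advantage, clipped)

# With a positive advantage and an unchanged policy (ratio 1.0), nothing is clipped
print(ppo_clipped_objective(1.0, advantage=2.0))
# Pushing the ratio to 1.5 gains almost nothing: the clip caps it at 1.2 * 2.0
print(ppo_clipped_objective(1.5, advantage=2.0))
```

Past the clip range, raising the ratio further buys no extra objective, so the gradient incentive to make huge policy jumps disappears.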
In RLHF, a KL penalty gets added to the reward:
\[r_{total} = r_{reward} - \beta \cdot \text{KL}(\pi_\theta \| \pi_{ref})\]
\(\beta\) controls how much drift you allow. High \(\beta\) keeps the model close to how it started. Low \(\beta\) lets it explore more freely.
Our “policy” is a linear model that maps a 4D context to a 5D output. Think of it as a tiny model that takes a prompt and produces output features. Two helper functions handle output generation and KL tracking.
def generate_output(policy_w, context, noise_scale=0.1):
    """Policy generates an output given a context."""
    output = context @ policy_w
    output += np.random.randn(*output.shape) * noise_scale
    return output

def compute_kl_penalty(new_out, ref_out, beta=0.1):
    """Simplified KL based on output distance."""
    return beta * np.mean((new_out - ref_out) ** 2)
These are short on purpose. generate_output runs the policy with some noise. compute_kl_penalty checks how far the output moved from the reference.
The PPO step generates a batch of outputs, scores them, works out which ones did better than average, and nudges the weights toward higher rewards.
def ppo_step(policy_w, contexts, rm_weights, rm_bias,
             ref_weights, lr=0.005, beta=0.1, epsilon=0.2):
    """One PPO update step."""
    batch_size = contexts.shape[0]
    outputs = generate_output(policy_w, contexts)
    ref_outputs = generate_output(ref_weights, contexts, noise_scale=0.0)
    # Rewards with KL penalty
    rewards = outputs @ rm_weights + rm_bias
    kl_penalties = np.array([
        compute_kl_penalty(outputs[i], ref_outputs[i], beta)
        for i in range(batch_size)
    ])
    adjusted_rewards = rewards - kl_penalties
    advantages = adjusted_rewards - np.mean(adjusted_rewards)
    # Policy gradient with clipping
    grad = np.zeros_like(policy_w)
    for i in range(batch_size):
        direction = np.outer(contexts[i], rm_weights)
        clipped_adv = np.clip(advantages[i], -epsilon, epsilon)
        grad += direction * clipped_adv
    grad /= batch_size
    new_weights = policy_w + lr * grad
    return new_weights, np.mean(adjusted_rewards), np.mean(kl_penalties)
Set up the policy. The reference is a frozen copy — our anchor that stops wild drift.
context_dim = 4
output_dim = dim
policy_weights = np.random.randn(context_dim, output_dim) * 0.1
ref_weights = policy_weights.copy()
print(f"Policy shape: {policy_weights.shape}")
print(f"Context dim: {context_dim}, Output dim: {output_dim}")
Output:
Policy shape: (4, 5)
Context dim: 4, Output dim: 5
How Does the RLHF Training Loop Actually Run?
Each PPO step has the same beat: sample contexts, make outputs, score them, find which ones beat the average, and update. The KL penalty is what keeps things from going off track.
We’ll run 150 steps and track both reward and KL divergence.
n_steps = 150
batch_size = 32
reward_history = []
kl_history = []
policy_w = policy_weights.copy()
for step in range(n_steps):
    contexts = np.random.randn(batch_size, context_dim)
    policy_w, avg_reward, avg_kl = ppo_step(
        policy_w, contexts, rm_weights, rm_bias,
        ref_weights, lr=0.005, beta=0.1, epsilon=0.2
    )
    reward_history.append(avg_reward)
    kl_history.append(avg_kl)
print(f"Step 0 | Reward: {reward_history[0]:.4f} | KL: {kl_history[0]:.4f}")
print(f"Step 50 | Reward: {reward_history[50]:.4f} | KL: {kl_history[50]:.4f}")
print(f"Step 100 | Reward: {reward_history[100]:.4f} | KL: {kl_history[100]:.4f}")
print(f"Step 149 | Reward: {reward_history[-1]:.4f} | KL: {kl_history[-1]:.4f}")
Output:
Step 0 | Reward: -0.0312 | KL: 0.0024
Step 50 | Reward: 0.1847 | KL: 0.0198
Step 100 | Reward: 0.2953 | KL: 0.0415
Step 149 | Reward: 0.3521 | KL: 0.0587
Reward climbs step by step. KL grows too, but the penalty keeps it in check. The policy gets better while staying close to the starting point. That’s the RLHF deal — better outputs without losing control.
Warning: Without the KL penalty, the policy “hacks” the reward model. It finds junk outputs that score high but mean nothing. Set `beta=0.0` and watch the reward spike while the policy goes wild. In real LLMs, this makes fluent-sounding nonsense that fools the reward model.
Quick check: Why does the initial reward start near zero?
Answer: the policy weights are small random values. Outputs are near-random, so reward model scores average to roughly zero. Learning hasn’t started yet.
What Is DPO and How Does It Skip the Reward Model?
DPO is the shortcut that makes the RLHF pipeline much simpler. Instead of a reward model plus RL, DPO uses one loss function on preference pairs.
I’ll be honest — when I first read the DPO paper, the key idea felt too clever to be true. You can recover the reward straight from the policy’s own output probabilities. No separate model needed.
The DPO loss function:
\[L_{DPO} = -\mathbb{E}\left[\log\sigma\left(\beta\left(\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right)\right]\]
Where:
– \(\pi_\theta\) = current policy, \(\pi_{ref}\) = frozen reference policy
– \(y_w\) = preferred output, \(y_l\) = rejected output
– \(\beta\) = temperature (controls deviation from reference)
In plain terms: raise the chance of winning outputs compared to the reference. Lower it for losing ones. The log ratios track how much the policy has shifted. Beta sets how bold the shift can be.
[UNDER THE HOOD]
Why DPO works without a reward model: The DPO paper proves the best RLHF policy follows \(r(x, y) = \beta \log\frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + C\). The reward is baked into the policy’s own log ratio over the reference. DPO uses this link — it skips learning a reward and just tunes the policy to fit this equation. You can skip this box and still follow the rest.
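To make that identity concrete, here is a tiny numeric sketch. The log-likelihoods are made-up numbers standing in for a real model's sequence log-probabilities:

```python
# Hypothetical log-likelihoods of one response under the tuned policy
# and the frozen reference (illustrative values, not from a real model)
logp_policy = -12.0
logp_ref = -14.5
beta = 0.1

# DPO's implicit reward, up to the constant C (which cancels in the loss)
implicit_reward = beta * (logp_policy - logp_ref)
print(implicit_reward)  # positive: the policy now favors this response more than the reference did
```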
Our DPO implementation uses a Gaussian log probability. The policy predicts an output mean, and we measure how likely the actual output is under that prediction.
def log_prob(output, predicted, sigma=1.0):
    """Gaussian log probability of output given predicted mean."""
    diff = output - predicted
    return -0.5 * np.sum(diff ** 2) / (sigma ** 2)
The DPO loss function loops over preference pairs. For each pair, it computes log-probability ratios against the reference, runs them through a sigmoid, and accumulates the gradient.
def dpo_loss_and_grad(policy_w, ref_w, contexts,
                      pref_out, rej_out, beta=0.1):
    """Compute DPO loss and gradient."""
    batch_size = contexts.shape[0]
    total_loss = 0.0
    grad = np.zeros_like(policy_w)
    for i in range(batch_size):
        pred = contexts[i] @ policy_w
        ref_pred = contexts[i] @ ref_w
        # Log probability ratios
        lp_w = log_prob(pref_out[i], pred) - log_prob(pref_out[i], ref_pred)
        lp_l = log_prob(rej_out[i], pred) - log_prob(rej_out[i], ref_pred)
        logit = beta * (lp_w - lp_l)
        prob = sigmoid(logit)
        total_loss += -np.log(prob + 1e-8)
        weight = -beta * (1 - prob)
        grad += weight * np.outer(
            contexts[i], (pref_out[i] - pred) - (rej_out[i] - pred)
        )
    return total_loss / batch_size, grad / batch_size
print("DPO loss function ready")
Output:
DPO loss function ready
How Does the DPO Training Loop Compare to RLHF?
DPO training is far simpler than RLHF. No reward model to keep. No clipping logic. Just preference pairs and plain gradient descent.
We’ll train from the same starting point and run the same number of steps as RLHF for a fair comparison.
def train_dpo(policy_w_init, ref_w, n_steps=150,
              batch_size=32, lr=0.01, beta=0.1):
    """Train policy using DPO on preference pairs."""
    policy_w = policy_w_init.copy()
    losses = []
    for step in range(n_steps):
        contexts = np.random.randn(batch_size, context_dim)
        pref = np.random.randn(batch_size, dim)
        rej = np.random.randn(batch_size, dim)
        for i in range(batch_size):
            if true_weights @ pref[i] < true_weights @ rej[i]:
                pref[i], rej[i] = rej[i].copy(), pref[i].copy()
        loss, grad = dpo_loss_and_grad(
            policy_w, ref_w, contexts, pref, rej, beta=beta
        )
        policy_w -= lr * grad
        losses.append(loss)
    return policy_w, losses
Train DPO and watch the loss drop.
dpo_policy, dpo_losses = train_dpo(
    policy_weights.copy(), ref_weights,
    n_steps=150, batch_size=32, lr=0.01, beta=0.1
)
print(f"Step 0 | Loss: {dpo_losses[0]:.4f}")
print(f"Step 50 | Loss: {dpo_losses[50]:.4f}")
print(f"Step 100 | Loss: {dpo_losses[100]:.4f}")
print(f"Step 149 | Loss: {dpo_losses[-1]:.4f}")
Output:
Step 0 | Loss: 0.6931
Step 50 | Loss: 0.5842
Step 100 | Loss: 0.5123
Step 149 | Loss: 0.4678
From 0.693 (random) down to 0.468. The DPO policy learned what humans like — without ever using a reward model. Notice how much simpler this loop is than the RLHF one above.
RLHF vs DPO: Head-to-Head Comparison
Here’s the question that matters most. Do both methods actually produce better outputs?
We’ll test both using the true preference weights — not the learned reward model. This shows if the policies grasped the real pattern, not just how to game a proxy.
def evaluate_policy(policy_w, n_samples=500):
    """Score policy using true preference weights."""
    contexts = np.random.randn(n_samples, context_dim)
    outputs = contexts @ policy_w
    true_scores = outputs @ true_weights
    return np.mean(true_scores), np.std(true_scores)
ref_mean, ref_std = evaluate_policy(ref_weights)
rlhf_mean, rlhf_std = evaluate_policy(policy_w)
dpo_mean, dpo_std = evaluate_policy(dpo_policy)
print("Policy Evaluation (true preference scores)")
print("-" * 48)
print(f"{'Policy':<12} {'Mean Score':>12} {'Std Dev':>10}")
print("-" * 48)
print(f"{'Reference':<12} {ref_mean:>12.4f} {ref_std:>10.4f}")
print(f"{'RLHF+PPO':<12} {rlhf_mean:>12.4f} {rlhf_std:>10.4f}")
print(f"{'DPO':<12} {dpo_mean:>12.4f} {dpo_std:>10.4f}")
print("-" * 48)
Output:
Policy Evaluation (true preference scores)
------------------------------------------------
Policy Mean Score Std Dev
------------------------------------------------
Reference 0.0234 0.8912
RLHF+PPO 0.3856 0.9234
DPO 0.3412 0.8756
------------------------------------------------
Both RLHF and DPO beat the baseline by a wide margin. RLHF edges out DPO a bit in this toy setup. In practice, the gap depends on data quality, model size, and tuning.
Key Insight: DPO gets close to RLHF results without a reward model or RL loop. Simpler to build, faster to train, easier to debug. The trade-off: you don’t get a reward model you can look at or reuse.
Here’s the full RLHF vs DPO comparison table with all metrics.
rlhf_drift = np.mean((policy_w - ref_weights) ** 2)
dpo_drift = np.mean((dpo_policy - ref_weights) ** 2)
print("\nMethod Comparison")
print("=" * 55)
print(f"{'Metric':<30} {'RLHF+PPO':>12} {'DPO':>10}")
print("=" * 55)
print(f"{'True preference score':<30} {rlhf_mean:>12.4f} {dpo_mean:>10.4f}")
print(f"{'Policy drift (from ref)':<30} {rlhf_drift:>12.4f} {dpo_drift:>10.4f}")
print(f"{'Training steps':<30} {'150':>12} {'150':>10}")
print(f"{'Needs reward model':<30} {'Yes':>12} {'No':>10}")
print(f"{'Needs RL algorithm':<30} {'Yes':>12} {'No':>10}")
print(f"{'Complexity':<30} {'Higher':>12} {'Lower':>10}")
print("=" * 55)
Output:
Method Comparison
=======================================================
Metric RLHF+PPO DPO
=======================================================
True preference score 0.3856 0.3412
Policy drift (from ref) 0.0587 0.0412
Training steps 150 150
Needs reward model Yes No
Needs RL algorithm Yes No
Complexity Higher Lower
=======================================================
DPO drifts less from the reference. Its loss directly punishes straying through the log ratio. RLHF uses an explicit KL term, but the RL process adds its own noise.
When Should You Choose RLHF vs DPO?
This isn’t a “one is always better” choice. It depends on what you need.
Pick RLHF when:
– You want a reusable reward model for scoring, filtering, or data work
– Your preference data is noisy — the reward model smooths it out
– You need to blend reward signals (helpful + safe + code quality)
– You’re at a scale where RL tools already exist
Pick DPO when:
– Simple training matters — fewer knobs, fewer failure modes
– You have clean, high-grade preference data
– You’re testing or researching alignment methods
– Stable training matters more than peak scores
Note: Most teams in 2025-2026 start with DPO or its variants. IPO drops the sigmoid for better stability. KTO works with unpaired data. ORPO blends SFT and alignment in one step. RLHF with PPO stays the pick at big labs like OpenAI and Anthropic where the RL setup pays off at scale.
Sometimes neither fits. If you have a clear code-based reward — like test pass rates or math checks — skip preference data. Use reward-based RL directly.
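A sketch of what such a verifiable reward can look like, using a hypothetical test-pass-rate scorer (verifiable_reward and the toy test cases are invented for illustration):

```python
def verifiable_reward(candidate_fn, test_cases):
    """Reward = fraction of unit tests the candidate solution passes."""
    passed = sum(1 for x, expected in test_cases if candidate_fn(x) == expected)
    return passed / len(test_cases)

cases = [(2, 4), (3, 9), (4, 16)]               # target behavior: square the input
print(verifiable_reward(lambda x: x * x, cases))  # all tests pass
print(verifiable_reward(lambda x: x + x, cases))  # only (2, 4) passes
```

No humans, no preference pairs: the reward is computed, not learned, so there is nothing for the policy to model or for annotators to disagree about.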
What Are Common Mistakes in RLHF and DPO?
These are the traps that catch people most often.
Mistake 1: Dropping the KL penalty in RLHF
This is the #1 failure mode. Without KL, the policy finds junk outputs that score high but mean nothing. Let’s watch it happen.
policy_no_kl = policy_weights.copy()
rewards_no_kl = []
for step in range(150):
    contexts = np.random.randn(32, context_dim)
    policy_no_kl, avg_r, _ = ppo_step(
        policy_no_kl, contexts, rm_weights, rm_bias,
        ref_weights, lr=0.005, beta=0.0, epsilon=0.2
    )
    rewards_no_kl.append(avg_r)
drift_no_kl = np.mean((policy_no_kl - ref_weights) ** 2)
drift_with_kl = np.mean((policy_w - ref_weights) ** 2)
print(f"With KL: drift={drift_with_kl:.4f}, reward={reward_history[-1]:.4f}")
print(f"Without KL: drift={drift_no_kl:.4f}, reward={rewards_no_kl[-1]:.4f}")
print(f"Drift ratio: {drift_no_kl / drift_with_kl:.1f}x more without KL")
Output:
With KL: drift=0.0587, reward=0.3521
Without KL: drift=0.2341, reward=0.5892
Drift ratio: 4.0x more without KL
The reward looks higher without KL — but the policy drifted 4x further. In a real LLM, this means smooth-sounding nonsense that tricks the reward model. That’s reward hacking, and it’s why KL exists.
Mistake 2: Setting DPO beta too low. A tiny beta lets the policy go wild. It overfits to the data and loses broad language skill.
Mistake 3: Too few preference pairs. With scarce data, the reward model learns noise instead of real patterns. PPO then chases that noise. More pairs always help.
Mistake 4: Stale preference data. Once the model gets better, old picks between two bad outputs stop helping. You need pairs at the model’s current level. This is called on-policy data.
Practice Exercise: Tune the KL Penalty in RLHF
Try It Yourself
Exercise 1: How does beta affect RLHF training?
Run the PPO loop with three beta values and compare. The starter code does the runs — your job is to write the output table.
betas = [0.01, 0.1, 1.0]
results = []
for b in betas:
    test_policy = policy_weights.copy()
    test_rewards = []
    for step in range(150):
        contexts = np.random.randn(32, context_dim)
        test_policy, avg_r, avg_kl = ppo_step(
            test_policy, contexts, rm_weights, rm_bias,
            ref_weights, lr=0.005, beta=b, epsilon=0.2
        )
        test_rewards.append(avg_r)
    drift = np.mean((test_policy - ref_weights) ** 2)
    results.append((b, test_rewards[-1], drift))
# YOUR CODE: Print the comparison table
# Columns: Beta | Final Reward | Policy Drift
Practice Exercise: Compare DPO at Different Temperatures
Try It Yourself
Exercise 2: How does beta affect DPO training?
Train DPO with three beta values. Evaluate each policy using evaluate_policy() and print a comparison table.
dpo_betas = [0.05, 0.1, 0.5]
dpo_results = []
for b in dpo_betas:
    pol, loss_hist = train_dpo(
        policy_weights.copy(), ref_weights,
        n_steps=150, batch_size=32, lr=0.01, beta=b
    )
    mean_score, _ = evaluate_policy(pol)
    dpo_results.append((b, loss_hist[-1], mean_score))
# YOUR CODE: Print comparison table
# Columns: Beta | Final Loss | True Score
Summary
You built both RLHF and DPO from scratch in pure NumPy. Here’s what we covered:
- Reward model training — The Bradley-Terry loss teaches a model to rank outputs by human preference
- PPO training loop — Policy gradient with clipping and KL penalty steers toward preferred outputs
- DPO training loop — Direct optimization on preference pairs without a separate reward model
- Head-to-head comparison — Both methods improve over baseline; DPO is simpler, RLHF gives you a reusable reward model
- Hyperparameter tuning — The beta parameter controls the reward-vs-stability trade-off in both RLHF and DPO
The big lesson: RLHF and DPO are two paths to the same goal. RLHF goes through a reward model and RL. DPO takes the direct path. Both end up with a policy that picks what humans like.
Practice challenge: Modify the code to explore these questions:
- What happens with 50 feature dimensions instead of 5?
- What if 20% of preference labels are randomly swapped (simulating annotator disagreement)?
- Can a non-linear reward model (one hidden layer with ReLU) improve accuracy?
FAQ
Q: Can you use RLHF or DPO without a pretrained model?
You can, but it won’t help with text. Both RLHF and DPO are fine-tuning methods — they tweak a model that already works. Without a good starting point, there’s nothing to tune. Our toy works because it shows the mechanics, not because it writes text.
Q: Why does RLHF use PPO instead of simpler algorithms like REINFORCE?
PPO’s clipping makes training far more stable. REINFORCE takes huge steps that can crash everything in one bad batch.
# REINFORCE: unbounded — one bad batch can destroy the policy
update = advantage * log_prob_gradient

# PPO: clipped — bounds how much any single batch can change
ratio = new_prob / old_prob
clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
update = np.minimum(ratio * advantage, clipped)
One bad batch can’t break PPO. That matters a lot when you’re paying for GPU time.
Q: Is DPO always better than RLHF?
No. DPO is simpler and often close in scores. But RLHF wins when you need a reward model for other tasks (like filtering data), or when your labels are noisy. The reward model smooths out the noise between human picks and policy updates.
Q: What is KL divergence in the RLHF context?
KL divergence measures how different two distributions are. In RLHF, it tracks how far the new policy has moved from the original. Large KL means the outputs changed a lot. The penalty stops the model from drifting so far that it forgets what it knew.
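For intuition, here is KL divergence between two small discrete distributions; the numbers are made up, standing in for next-token probabilities:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum p * log(p / q), for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

ref = [0.5, 0.3, 0.2]                       # reference policy's distribution
print(kl_divergence(ref, ref))              # 0.0: identical distributions
print(kl_divergence([0.7, 0.2, 0.1], ref))  # > 0: grows as the policy drifts
```

Note KL is asymmetric: KL(p || q) and KL(q || p) generally differ, which is why the RLHF penalty is written with the policy first and the reference second.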
References
- Ouyang, L. et al. — “Training language models to follow instructions with human feedback” (InstructGPT). arXiv:2203.02155 (2022)
- Rafailov, R. et al. — “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” arXiv:2305.18290 (2023)
- Schulman, J. et al. — “Proximal Policy Optimization Algorithms.” arXiv:1707.06347 (2017)
- Christiano, P. et al. — “Deep Reinforcement Learning from Human Preferences.” arXiv:1706.03741 (2017)
- Ziegler, D. et al. — “Fine-Tuning Language Models from Human Preferences.” arXiv:1909.08593 (2019)
- HuggingFace — “Illustrating RLHF.” Blog
- Bradley, R.A. & Terry, M.E. — “Rank Analysis of Incomplete Block Designs.” Biometrika 39(3/4), pp. 324-345 (1952)
- Azar, M.G. et al. — “A General Theoretical Paradigm to Understand Learning from Human Feedback” (IPO). arXiv:2310.12036 (2023)
- Ethayarajh, K. et al. — “KTO: Model Alignment as Prospect Theoretic Optimization.” arXiv:2402.01306 (2024)