Grounding Language Models in Human Values: The RLHF–PPO Alignment Pipeline

As a preview of the reward-model machinery covered in Section 3, here is a minimal reward head and its pairwise ranking loss:

```python
import torch
import torch.nn as nn


class RewardHead(nn.Module):
    """
    Minimal reward model head: maps hidden states to a scalar reward.
    In practice this sits on top of a frozen or fine-tuned transformer.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Use the last token's hidden state as the sequence representation
        # (assumes left padding; with right padding, index the last non-pad token)
        return self.proj(last_hidden_state[:, -1, :]).squeeze(-1)


def reward_model_loss(
    reward_chosen: torch.Tensor,
    reward_rejected: torch.Tensor,
) -> torch.Tensor:
    """
    Bradley-Terry pairwise ranking loss.
    reward_chosen: scalar rewards for preferred responses
    reward_rejected: scalar rewards for rejected responses
    """
    # logsigmoid is the numerically stable form of log(sigmoid(x))
    return -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
```
Why Alignment After Pre-training?
In Scaling Stability: Proximal Policy Optimization from Theory to LLM Alignment, we followed the evolution from REINFORCE through TRPO to PPO and saw how clipped surrogate objectives make policy updates stable. That post ended with a preview: PPO was the algorithm behind InstructGPT’s RLHF stage.
This post picks up that thread. The question is no longer “how does PPO work?” but “how do you use PPO to teach a language model what humans actually want?”
Pre-training gives you a model that’s fluent, knowledgeable, and dangerously undirected. A model trained on internet-scale text has absorbed both the best explanations and the worst toxicity the corpus contains. Supervised fine-tuning (SFT) on curated demonstrations helps (the model learns to follow instructions), but it doesn’t solve the problem. SFT teaches format and style; it doesn’t teach the model which outputs humans genuinely prefer when there are multiple plausible completions.
RLHF closes that gap. Instead of hand-crafting a loss function for “helpfulness” or “honesty,” RLHF learns a reward function from comparative human judgments and then uses reinforcement learning, specifically PPO, to optimize the model against that learned reward.
Why reinforcement learning and not just more supervised training? Supervised learning needs a correct label for every input. But for open-ended generation, there’s no single right answer to a prompt like “explain quantum computing to a five-year-old.” There are many good responses and many bad ones, and the quality depends on nuance, context, and human taste. RL is built for exactly this kind of problem: an agent takes actions in an environment, gets a scalar reward signal telling it how well it did, and gradually improves its policy. In RLHF, the language model is the agent, each generated token is an action, and the reward model’s score is the feedback signal at the end of the response. We’ll make this mapping concrete in Section 4.
Post road map: We start with the alignment objective and the three-stage RLHF pipeline. Then we unpack reward model training, the PPO-RLHF training loop (with a full sequence diagram), the KL penalty that keeps optimization stable, and the failure mode it guards against: reward hacking. We close with Constitutional AI, post-PPO methods (DPO, GRPO), and a brief bridge to grounding and reasoning via RAG and ReAct.
1. The Alignment Objective: Helpful, Honest, Harmless
Anthropic framed alignment around three properties: helpful, honest, and harmless (HHH). This has become a widely used shorthand in the field (Askell et al., 2021). These properties aren’t a formal specification; they’re high-level design goals that shape how alignment data is collected and how reward models are trained.
- Helpful: The model should attempt to assist the user, follow instructions accurately, and provide useful information.
- Honest: The model should not fabricate facts, should express uncertainty when appropriate, and should not present itself as having capabilities it lacks.
- Harmless: The model should refuse requests for dangerous content, avoid reinforcing biases, and decline to assist with harmful activities.
The challenge is that these goals pull in different directions. A perfectly helpful model that answers every question without filtering isn’t harmless. A model that refuses everything to be safe isn’t helpful. Alignment is the engineering discipline of navigating these trade-offs so they reflect human judgment, and RLHF is how that judgment gets into the training loop.
2. The Three-Stage RLHF Pipeline
The RLHF pipeline most teams use for language models goes back to Ziegler et al. (2019) and Stiennon et al. (2020), and was scaled up by Ouyang et al. (2022) for InstructGPT. It has three stages:
Stage 1: Supervised Fine-Tuning (SFT). Start from a pre-trained language model and fine-tune it on a dataset of human-written demonstrations. These demonstrations show the model what good instruction-following looks like. The result is a model that can produce reasonable responses but hasn’t been optimized for preference alignment yet.
Stage 2: Reward Model Training. Collect a dataset of human preference comparisons: given a prompt, show annotators two (or more) model responses and ask which one is better. Train a reward model that takes a prompt and response and outputs a scalar score predicting human preference. This reward model becomes the objective for the next stage.
Stage 3: PPO Fine-Tuning. Use the reward model as the environment’s reward signal and optimize the SFT model with PPO. The policy (the language model) generates responses, the reward model scores them, and PPO updates the policy to produce higher-scoring responses. A KL penalty prevents the model from drifting too far from its SFT starting point.
This three-stage description is a simplification. Production systems often add iterative reward model refreshes, rejection sampling stages, and multi-task mixing. But the core idea (demonstrate, compare, optimize) still holds.
3. Reward Model Training
The reward model is the bridge between human judgment and the RL training loop. Its quality directly determines what the policy learns to optimize.
How Preference Data is Collected
Human annotators are given a prompt and two (or more) candidate responses generated by the SFT model. They indicate which response they prefer, sometimes with additional labels for quality dimensions (helpfulness, factual accuracy, safety). We use ranking instead of absolute scoring because humans are much more consistent at comparing two things than at assigning numbers on an absolute scale.
The Bradley-Terry Model
The standard approach models the probability that response $y_w$ (the one annotators preferred) is chosen over response $y_l$ (the rejected one), given prompt $x$, as:

$$P(y_w \succ y_l \mid x) = \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$$

where $\sigma$ is the logistic function and $r_\phi(x, y)$ is the reward model's scalar output. The reward model is trained to maximize the log-likelihood of the observed human preferences, i.e., to minimize:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
This is a pairwise ranking loss. The reward model learns to assign higher scores to responses that humans prefer, without needing to calibrate the absolute scale of those scores.
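To see the ranking behavior concretely, here is a quick numeric sanity check. It restates the `reward_model_loss` helper from the top of the post in the numerically stable `logsigmoid` form; the specific reward values are illustrative:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigma(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# A wider margin between chosen and rejected rewards drives the loss toward 0,
# without ever fixing an absolute scale for the scores themselves.
small_margin = bradley_terry_loss(torch.tensor([1.0]), torch.tensor([0.5]))
large_margin = bradley_terry_loss(torch.tensor([4.0]), torch.tensor([0.5]))
# small_margin ≈ 0.474, large_margin ≈ 0.030
```

Only the difference between the two rewards matters, which is exactly why the absolute scale of the reward model never needs calibrating.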
Architecture
In practice, the reward model is typically initialized from the same pre-trained model family as the policy (sometimes a smaller variant). The final unembedding layer is replaced with a linear projection to a single scalar output. For InstructGPT, OpenAI used a single 6B parameter reward model to train policies at 1.3B, 6B, and 175B parameter scales. This showed the RM doesn’t need to match the policy in size (Ouyang et al., 2022).
4. PPO for RLHF: The Full Training Loop
In Part 1, PPO optimized a policy against environment rewards in classic RL tasks. In RLHF, the same RL structure applies, but the pieces look different. If you’ve followed the series from tabular methods through DQN to REINFORCE, you’ll recognize every component here, just dressed up in language-model clothing:
| RL Concept | RLHF Equivalent |
|---|---|
| Agent | The language model (policy $\pi_\theta$) |
| Environment | The prompt database (provides observations) and reward model (provides feedback) |
| State | The prompt plus all tokens generated so far |
| Action | Choosing the next token from the vocabulary |
| Reward | The reward model score $r_\phi(x, y)$, delivered after the full response |
| Episode | One complete prompt → response generation |
Two things make this different from, say, CartPole. First, the action space is enormous: the vocabulary can have 30,000+ tokens, so the agent is choosing from tens of thousands of possible actions at every step. Second, the reward is sparse. The model generates an entire response token by token, but only gets a single reward score at the end. PPO handles both of these challenges through the value function baseline and advantage estimation we covered in Part 1.
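To make the sparse-reward point concrete, here is a minimal sketch of one common convention (used by implementations such as TRL): every generated token pays a KL penalty, and the reward model score is added only at the final token. The function name and shapes are illustrative assumptions:

```python
import torch

def per_token_rewards(
    rm_score: float,
    active_log_probs: torch.Tensor,  # shape (seq_len,)
    ref_log_probs: torch.Tensor,     # shape (seq_len,)
    beta: float = 0.1,
) -> torch.Tensor:
    """Dense per-token rewards from a sparse sequence-level score:
    every token pays a KL penalty; the RM score lands on the last token."""
    rewards = -beta * (active_log_probs - ref_log_probs)
    rewards[-1] = rewards[-1] + rm_score
    return rewards

# Identical policies: the KL term vanishes and only the final token is rewarded.
r = per_token_rewards(2.0, torch.zeros(5), torch.zeros(5))
# r == tensor([0., 0., 0., 0., 2.])
```

GAE then propagates credit from that final token backward through the sequence.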
The loop involves four models working together:
- Active LLM (Policy $\pi_\theta$): The model being trained. It generates responses and gets updated by PPO.
- Reference LLM (Frozen $\pi_{\text{ref}}$): A frozen copy of the SFT model. It provides the baseline log-probabilities used to compute the KL penalty.
- Reward Model ($r_\phi$): Scores prompt-response pairs on alignment quality. Frozen during PPO training.
- Value Model ($V_\psi$): The critic network that estimates expected return for advantage computation, just as in standard PPO.
The complete RLHF–PPO training loop:

[Figure: sequence diagram of the four models interacting across rollout, reward scoring, advantage estimation, and parameter update]
Step-by-Step Walkthrough
Step 1: Rollout. A prompt is sampled from the prompt database. The active policy generates a complete response autoregressively, one token at a time. During generation, the log-probabilities for each generated token are recorded.
Step 2: Evaluation and KL penalty. The same prompt and generated response go through the frozen reference model to get reference log-probabilities $\log \pi_{\text{ref}}(y_t \mid x, y_{<t})$. The reward model scores the full (prompt, response) pair with a scalar reward $r_\phi(x, y)$. The per-token KL divergence between the active and reference policies is computed and used to construct the total reward (Section 5).
Step 3: Advantage estimation. The value model estimates the expected return. The advantage is computed using GAE (covered in Part 1), representing how much better or worse the actual reward was compared to the value baseline.
Step 4: Parameter updates. PPO's clipped surrogate objective (Part 1, Section 4) updates the active policy parameters $\theta$. The value model parameters $\psi$ are updated to minimize MSE between predicted and actual returns. The reference model and reward model stay frozen throughout.
```python
def compute_rlhf_reward(
    reward_model_score: float,
    active_log_probs: torch.Tensor,
    ref_log_probs: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """
    Compute the KL-penalized reward used in RLHF-PPO.
    reward_model_score: scalar from the reward model r_phi(x, y)
    active_log_probs: per-token log-probs from the active policy
    ref_log_probs: per-token log-probs from the frozen reference model
    beta: KL penalty coefficient
    """
    # Per-token KL approximation: log(pi_active/pi_ref) = log_pi_active - log_pi_ref
    per_token_kl = active_log_probs - ref_log_probs
    kl_penalty = per_token_kl.sum()
    total_reward = reward_model_score - beta * kl_penalty
    return total_reward
```
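For completeness, the clipped surrogate objective from Part 1 looks like this at the token level. This is a sketch; a production implementation adds attention masking, minibatching, and an entropy bonus:

```python
import torch

def ppo_clipped_loss(
    new_log_probs: torch.Tensor,  # per-token log-probs under the policy being updated
    old_log_probs: torch.Tensor,  # per-token log-probs recorded during rollout
    advantages: torch.Tensor,     # per-token advantages (e.g., from GAE)
    clip_eps: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (min) surrogate, negated so it can be minimized
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Sanity check: with an unchanged policy the ratio is 1 everywhere,
# so the loss reduces to minus the mean advantage.
adv = torch.tensor([1.0, -1.0, 0.5, 2.0])
loss = ppo_clipped_loss(torch.zeros(4), torch.zeros(4), adv)
# loss == -adv.mean() == -0.625
```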
5. The KL Penalty: Keeping the Policy Grounded
The KL divergence between the active policy and the reference model is the single most important stability mechanism in RLHF. It plays a different role than PPO’s clipping. Clipping constrains how far the policy moves in a single update step; the KL penalty constrains how far the policy drifts from its SFT starting point across the entire training run.
The shaped reward used in InstructGPT is:

$$R(x, y) = r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

This decomposes into two forces:
- $r_\phi(x, y)$ pulls the model toward outputs the reward model scores highly.
- $-\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ pushes back whenever the active policy diverges from the reference model.

The coefficient $\beta$ controls the trade-off. Too low, and the model is free to exploit the reward model aggressively. Too high, and the model barely moves from SFT behavior. In practice, $\beta$ is often tuned adaptively. Some implementations adjust it to target a specific KL budget per batch.
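That adaptive tuning can be sketched as a proportional controller in the style of Ziegler et al. (2019); TRL ships a similar `AdaptiveKLController`. The constants here are illustrative:

```python
class AdaptiveKLController:
    """Proportional controller for the KL coefficient: raise beta when the
    observed KL overshoots the target, lower it when it undershoots."""

    def __init__(self, init_beta: float, target_kl: float, horizon: int = 10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error, clipped to [-0.2, 0.2] so one noisy batch
        # can't swing beta wildly
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

ctl = AdaptiveKLController(init_beta=0.1, target_kl=6.0)
beta_up = ctl.update(observed_kl=12.0, n_steps=100)   # KL too high -> beta grows
beta_down = ctl.update(observed_kl=3.0, n_steps=100)  # KL too low  -> beta shrinks
```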
Why Not Just Use PPO Clipping?
PPO clipping limits the size of a single parameter update. But RLHF runs many PPO updates over time. Even with conservative per-step clipping, the policy can gradually drift far from its SFT origin over thousands of steps. The KL penalty acts as a global regularizer that accumulates across all updates, preventing the slow drift that per-step clipping can’t catch.
Practical KL Computation
In practice, the KL divergence is computed at the token level for a generated response $y = (y_1, \dots, y_T)$:

$$\widehat{\mathrm{KL}}(x, y) = \sum_{t=1}^{T} \left[\log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t})\right]$$

This is the exact log-ratio for the generated sequence, used as a Monte Carlo estimate of the full KL divergence (which would require averaging over all possible responses). This per-sample approximation is the standard approach in RLHF implementations such as TRL and OpenRLHF.
```python
def compute_per_token_kl(
    active_log_probs: torch.Tensor,
    ref_log_probs: torch.Tensor,
) -> torch.Tensor:
    """
    Token-level log-ratio for a single response.
    active_log_probs: shape (seq_len,), log pi_theta(y_t | x, y_<t)
    ref_log_probs: shape (seq_len,), log pi_ref(y_t | x, y_<t)
    Returns the per-token log-ratio; sum gives the sequence-level estimate.
    """
    return active_log_probs - ref_log_probs


def kl_shaped_reward(
    rm_score: float,
    active_log_probs: torch.Tensor,
    ref_log_probs: torch.Tensor,
    beta: float = 0.1,
) -> float:
    """
    Full KL-shaped reward as used in InstructGPT.
    """
    per_token_kl = compute_per_token_kl(active_log_probs, ref_log_probs)
    kl_total = per_token_kl.sum().item()
    return rm_score - beta * kl_total
```

6. Reward Hacking: When Optimization Outsmarts the Objective
Reward hacking (also called reward model overoptimization) is the defining failure mode of RLHF. It’s a direct instance of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
The reward model is a learned proxy for human preferences, not human preferences themselves. It was trained on a finite dataset of comparisons and has limited generalization. When PPO pushes hard against this proxy, it can find outputs that score well under the reward model but look obviously bad to a human.
What Reward Hacking Looks Like
Gao et al. (2022) studied this phenomenon systematically using a synthetic setup where a fixed “gold-standard” reward model stood in for human evaluators. They found that:
- As optimization pressure increases, the proxy reward model score rises continuously.
- The gold-standard score (actual quality) initially rises with the proxy score, then peaks and begins to decline.
- The divergence between proxy and gold scores follows predictable scaling laws that depend on reward model size and dataset size.
Concretely, a reward-hacking policy might produce outputs that:
- Repeat the user’s question back in elaborate paraphrases (rewarded for appearing thorough).
- Include excessive hedging and caveats (rewarded for appearing cautious, but uninformative).
- Generate fluent nonsense that has surface patterns the reward model associates with quality.
- Exploit formatting quirks (bullet points, numbered lists) that were correlated with high ratings in training data.
Defenses Against Reward Hacking
KL penalty (Section 5): The primary defense. By penalizing deviation from the reference model, the KL term limits how far the policy can go in exploiting the reward model.
Reward model ensembles: Training multiple reward models and using their agreement (or the minimum of their scores) reduces the chance that the policy finds a single model’s blind spots.
Iterative reward model updates: Periodically retraining the reward model on outputs from the current policy closes the distribution shift between the reward model’s training data and what it evaluates at deployment.
Reward model size: Gao et al. (2022) found that larger reward models are more resistant to overoptimization, with scaling coefficients that decrease as model size grows.
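The ensemble defense above can be sketched in a few lines. The mode names are illustrative, not from any particular library:

```python
from statistics import mean, stdev

def ensemble_reward(scores: list[float], mode: str = "min") -> float:
    """Combine scores from several reward models. Pessimistic aggregation
    makes it harder for the policy to exploit one model's blind spots."""
    if mode == "min":
        return min(scores)
    # "mean-std": penalize disagreement between ensemble members
    return mean(scores) - stdev(scores)

# Three RMs disagree on a suspicious response; both aggregations stay conservative.
scores = [3.0, 0.5, 0.7]
conservative = ensemble_reward(scores)                # 0.5
penalized = ensemble_reward(scores, mode="mean-std")  # well below the mean of 1.4
```

High disagreement is itself a signal: if one reward model loves an output the others dislike, the policy may have found that model's blind spot.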
Reward hacking isn’t hypothetical. It was observed in early InstructGPT iterations and shows up routinely in production RLHF pipelines. The KL penalty is necessary but not sufficient. Careful monitoring of reward model scores vs. actual human evaluation scores is essential.
7. Constitutional AI: Scaling Supervision with Self-Critique
Human annotation is expensive and slow. Constitutional AI (CAI), introduced by Bai et al. (2022) at Anthropic, tackles this by replacing some human labor with model self-critique guided by explicit principles, a “constitution.”
The Two-Phase CAI Process
Phase 1: Supervised Self-Critique (SL). The model generates a response to a potentially harmful prompt. It then critiques its own response according to constitutional principles (e.g., “Is this response helpful? Is it honest? Could it cause harm?”). The model revises its response based on that critique, and the revised responses become fine-tuning data.
Phase 2: RL from AI Feedback (RLAIF). Instead of human annotators ranking responses, a separate model evaluates which of two responses better satisfies the constitutional principles. This AI-generated preference data trains a reward model, which then feeds into the standard PPO-RLHF loop.
Why CAI Matters
- Scale: AI feedback is cheaper and faster than human feedback, enabling much larger preference datasets.
- Transparency: The constitutional principles are explicit and auditable, unlike implicit guidelines given to human annotators.
- Chain-of-thought supervision: The critique-and-revision process produces reasoning traces that can improve the model’s ability to explain its decisions.
- Harmlessness without evasiveness: CAI models turned out to be non-evasive. They engage with harmful queries by explaining their objections rather than just refusing to respond (Bai et al., 2022).
```python
# Conceptual illustration of the CAI self-critique loop.
# In practice, this uses full LLM generation with constitutional prompts.
CONSTITUTION = [
    "Is this response helpful to the user?",
    "Does this response contain any factual inaccuracies?",
    "Could this response cause harm if followed?",
    "Does this response respect the user's autonomy?",
]


def self_critique_prompt(response: str, principle: str) -> str:
    """
    Build a critique prompt from a constitutional principle.
    """
    return (
        f"Consider the following response:\n\n"
        f'"{response}"\n\n'
        f"Critique this response based on the following principle: {principle}\n"
        f"Then provide a revised response that better satisfies the principle."
    )
```

8. Beyond PPO: DPO and GRPO
PPO dominated the RLHF landscape from InstructGPT (2022) through the initial ChatGPT era. By 2023–2025, researchers began exploring simpler alternatives that avoid the complexity of the full PPO training loop.
Direct Preference Optimization (DPO)
Rafailov et al. (2023) showed that the RLHF objective has a closed-form solution: the optimal policy under a KL-constrained reward maximization objective can be expressed directly in terms of the policy’s own log-probabilities, without needing a separate reward model.
The DPO loss is:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
DPO eliminates three components of the PPO pipeline:
- No separate reward model training.
- No RL optimization loop (no rollouts, no advantage estimation).
- No value function / critic network.
The result is a simple classification-style loss that can be optimized with standard supervised learning infrastructure. Experiments show that DPO exceeds PPO-based RLHF on sentiment control tasks and matches or improves it on summarization and single-turn dialogue, while being substantially simpler to implement and tune (Rafailov et al., 2023).
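A minimal sketch of that classification-style loss, assuming sequence-level log-probabilities (summed over tokens) have already been computed for both policies:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x) per pair
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards are beta * log(pi_theta / pi_ref); the loss is a
    # logistic loss on the chosen-vs-rejected reward margin.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Before any training (policy == reference) the margin is 0 and the loss is log 2.
z = torch.zeros(3)
loss = dpo_loss(z, z, z, z)
# loss ≈ 0.6931
```

Note that no sampling happens anywhere: the loss is computed directly on a static preference dataset, which is exactly what makes DPO fit standard supervised infrastructure.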
Group Relative Policy Optimization (GRPO)
GRPO, introduced in DeepSeekMath (Shao et al., 2024), takes a different approach to simplifying PPO. Instead of eliminating the RL loop entirely (as DPO does), GRPO modifies PPO to remove the critic network. The advantage is estimated by sampling a group of responses to the same prompt and using the relative reward scores within that group:

$$A_i = \frac{r_i - \mathrm{mean}\left(\{r_1, \dots, r_G\}\right)}{\mathrm{std}\left(\{r_1, \dots, r_G\}\right)}$$

where $G$ is the group size and $r_i$ is the reward for the $i$-th response. This eliminates the need for a separately trained value network while keeping the online RL structure of PPO. The model still generates responses and receives reward feedback in a loop.
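The group-relative advantage is a one-liner; the epsilon below is an illustrative guard against zero variance:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within a group of G
    responses to the same prompt. No learned value network required."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Binary correctness rewards for G=4 sampled answers to one math prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
adv = group_relative_advantages(rewards)
# Correct answers get positive advantage, incorrect ones negative; mean ≈ 0.
```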
GRPO is particularly effective for tasks with verifiable rewards (such as math, where correctness can be checked automatically) and was a key component of DeepSeek-R1’s training pipeline (Guo et al., 2025).
When to Use What?
| Method | Reward Model | Critic/Value Network | RL Loop | Best For |
|---|---|---|---|---|
| PPO-RLHF | Yes | Yes | Yes | General alignment, large-scale production |
| DPO | No | No | No | Preference alignment with limited compute |
| GRPO | Rule-based or learned | No | Yes | Tasks with verifiable rewards (math, code) |
These methods aren’t mutually exclusive. Some production pipelines use DPO for initial alignment and then refine with PPO. Others use GRPO for reasoning tasks and PPO for general instruction-following. The field is still converging on best practices.
9. From Alignment to Reasoning: RAG and ReAct
This section is a conceptual bridge to upcoming posts on grounding and agentic LLM architectures. The RLHF pipeline taught the model how to respond; these techniques address what to respond with and what actions to take.
RLHF aligns a model’s behavior with human values, but alignment alone doesn’t solve factual grounding or multi-step reasoning. Two complementary techniques fill this gap.
Retrieval-Augmented Generation (RAG)
RAG (Lewis et al., 2020) addresses the hallucination problem by giving the model access to external knowledge at inference time. Instead of relying solely on parametric knowledge (what the model memorized during training), RAG retrieves relevant documents from a knowledge base and includes them in the model’s context:
- The user query is embedded into a vector representation.
- A retrieval system finds the most relevant documents from an indexed corpus.
- The retrieved documents are prepended to the prompt.
- The LLM generates a response grounded in the retrieved context.
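A toy end-to-end sketch of those four steps, where bag-of-words cosine similarity stands in for a real embedding model and vector index (both assumptions for illustration):

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_rag_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    # Steps 1-3: embed the query, rank documents, prepend the top-k as context.
    ranked = sorted(corpus, key=lambda d: cosine(embed(query), embed(d)), reverse=True)
    context = "\n".join(ranked[:k])
    # Step 4 would hand this prompt to the LLM for grounded generation.
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Qubits store quantum information as superpositions of basis states.",
    "Sourdough bread needs flour, water, salt, and a ripe starter.",
]
prompt = build_rag_prompt("how do qubits store information", corpus, k=1)
```

A real pipeline swaps the bag-of-words scorer for dense embeddings and an approximate nearest-neighbor index, but the prompt-assembly shape is the same.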
RAG is complementary to RLHF: alignment teaches the model how to respond (helpful, honest, harmless), while RAG gives it what to respond with (grounded facts). An aligned model without RAG may produce confidently wrong answers; RAG without alignment may retrieve correct information but present it in unhelpful or harmful ways.
ReAct: Reasoning and Acting
ReAct (Yao et al., 2022) goes further by letting the model take actions (calling tools, searching the web, executing code) interleaved with explicit reasoning steps. The model alternates between:
- Thought: Reasoning about what information is needed or what step comes next.
- Action: Invoking an external tool (search, calculator, API call).
- Observation: Processing the tool’s output and incorporating it into subsequent reasoning.
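The alternation above can be sketched as a toy ReAct loop, with a scripted stand-in for the model and a single calculator tool (the `Action: tool[input]` format and all names are illustrative assumptions):

```python
def calculator(expr: str) -> str:
    # Toy tool: evaluate simple arithmetic only; never eval untrusted input.
    assert set(expr) <= set("0123456789+-*/(). ")
    return str(eval(expr))

TOOLS = {"calculator": calculator}

def scripted_model(transcript: str) -> str:
    # Stands in for an LLM: emits Thought/Action lines, then a Finish line.
    if "Observation:" not in transcript:
        return "Thought: I need to compute this.\nAction: calculator[12 * 7]"
    return "Finish: 84"

def react_loop(question: str, model=scripted_model, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        if step.startswith("Finish:"):
            return step.removeprefix("Finish:").strip()
        if "Action:" in step:
            tool_call = step.split("Action:")[1].strip()  # e.g. "calculator[12 * 7]"
            name, arg = tool_call.split("[", 1)
            transcript += f"Observation: {TOOLS[name](arg.rstrip(']'))}\n"
    return transcript

answer = react_loop("What is 12 * 7?")  # -> "84"
```

The transcript accumulates Thought/Action/Observation turns, so each model call conditions on everything observed so far, which is the core of the ReAct pattern.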
This interleaving of reasoning and acting turns the LLM from a text generator into an agent that can solve multi-step problems. RLHF-aligned models make better ReAct agents because alignment teaches them to follow instructions precisely, admit uncertainty, and avoid confabulation. These are exactly the properties you need for reliable tool use.
10. The Alignment Stack in 2026
A lot has changed since InstructGPT. A modern alignment pipeline might look like this:
- Pre-training on curated, filtered data (reducing toxic content at the source).
- Supervised fine-tuning on high-quality instruction-response pairs.
- Preference optimization via DPO, GRPO, or PPO-RLHF depending on the use case.
- Constitutional AI for scalable safety evaluation.
- RAG for factual grounding.
- Tool use and ReAct for multi-step reasoning.
- Red teaming and evaluation using both human and automated adversarial testing.
PPO is still the theoretical foundation. Understanding how policy optimization, reward shaping, KL constraints, and advantage estimation fit together is essential for reasoning about any of these methods, even the ones that simplify or replace PPO in practice.
What Comes Next
We covered the RLHF pipeline from reward modeling through PPO training, the KL penalty and reward hacking, Constitutional AI, post-PPO methods, and the bridge to grounding and reasoning. Together with Part 1, that wraps up the journey from vanilla REINFORCE to production-scale alignment.
For implementation, the key open-source tools to explore are:
- TRL (Hugging Face): PPO and DPO training for language models with Transformers integration.
- OpenRLHF: Distributed RLHF training framework.
- Gymnasium: The standard RL environment interface (useful for understanding PPO before applying it to LLMs).
References
- Askell, A., Bai, Y., Chen, A., et al. (2021). A General Language Assistant as a Laboratory for Alignment. https://arxiv.org/abs/2112.00861
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. https://arxiv.org/abs/2212.08073
- Christiano, P., Leike, J., Brown, T. B., et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS 2017. https://arxiv.org/abs/1706.03741
- Gao, L., Schulman, J., & Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. https://arxiv.org/abs/2210.10760
- DeepSeek-AI, Guo, D., Yang, D., et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 645, 633-638. https://arxiv.org/abs/2501.12948
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155
- Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. https://arxiv.org/abs/2305.18290
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347
- Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. https://arxiv.org/abs/2402.03300
- Stiennon, N., Ouyang, L., Wu, J., et al. (2020). Learning to summarize from human feedback. NeurIPS 2020. https://arxiv.org/abs/2009.01325
- Yao, S., Zhao, J., Yu, D., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629
- Ziegler, D. M., Stiennon, N., Wu, J., et al. (2019). Fine-Tuning Language Models from Human Preferences. https://arxiv.org/abs/1909.08593