Scaling Stability: Proximal Policy Optimization from Theory to LLM Alignment

As a refresher, the vanilla REINFORCE loss we are starting from:

import torch

def compute_reinforce_loss(log_probs, returns):
    """
    Vanilla policy gradient loss.

    log_probs: log(pi(a|s)) for sampled actions
    returns: discounted returns
    """
    # as_tensor preserves autograd history when the inputs are already tensors
    log_probs = torch.as_tensor(log_probs, dtype=torch.float32)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(log_probs * returns).mean()
Why PPO After REINFORCE?
In REINFORCE: Direct Policy Optimization After Deep Q-Learning, we built a policy-gradient agent from scratch and watched it learn by sampling full trajectories. It worked. It also wobbled. One unlucky episode could undo several useful updates, and the reward curve could look more like noise than progress.
That wobble is not a quirk. It is a structural issue in vanilla policy gradients. This post follows the evolution from that fragile REINFORCE update to Proximal Policy Optimization (PPO), the method that made policy updates far more stable with a surprisingly simple clipping rule.
Why this matters beyond classic control: PPO was the algorithm used in InstructGPT’s RLHF stage (Ouyang et al., 2022). In that setup, the “environment” is human preference data, and policy updates happen on very large language models where unstable updates are expensive and risky.
Post road map: We move through REINFORCE’s variance problem, TRPO’s constrained optimization fix, and PPO’s clipped surrogate objective. Then we connect GAE, actor-critic training, and the RLHF workflow that made PPO historically central. This is Part 1 (concepts and evolution). Part 2 will shift to an implementation focused on how PPO is used in grounding LLMs with RLHF.
1. The Stability Crisis in REINFORCE
REINFORCE updates policy parameters by increasing the log-probability of actions that produced high returns:

\nabla_\theta J(\theta) = \mathbb{E}_\tau\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]
The issue is variance. The return signal is trajectory-level Monte Carlo feedback, so gradient magnitude can swing heavily from episode to episode. Large unlucky updates can push the policy into poor regions of parameter space, hurting exploration and sometimes causing collapse.
On toy grids this is noisy and inconvenient. On very large language models, this can be a severe optimization and alignment failure mode.
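The variance claim is easy to see numerically. The toy sketch below (all numbers invented) scores one fixed policy against Monte Carlo returns from different episodes; the per-episode loss values, and therefore the gradient estimates, swing far more than their mean:

```python
import torch

torch.manual_seed(0)

# Toy sketch: a fixed policy evaluated against noisy per-episode
# Monte Carlo returns gives high-variance REINFORCE loss estimates.
log_probs = torch.full((100, 20), -1.5)        # log pi(a|s), fixed per step
returns = torch.randn(100, 20) * 30.0 + 5.0    # noisy trajectory-level returns

per_episode_loss = -(log_probs * returns).mean(dim=1)
print("mean loss:", per_episode_loss.mean().item())
print("loss std :", per_episode_loss.std().item())
```

With these made-up numbers the standard deviation across episodes exceeds the mean, which is exactly the regime where a single unlucky batch can dominate an update.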
2. TRPO: Trust Regions With Higher Complexity
Trust Region Policy Optimization (TRPO) reframed policy improvement as a constrained problem:

\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\!\left[\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\right] \le \delta

This trust region made updates much more reliable. But the optimizer is heavier: it relies on second-order structure, including the Fisher Information Matrix (FIM), which equals the Hessian of the KL divergence with respect to \theta, evaluated at \theta = \theta_{\text{old}}. In practice, TRPO uses conjugate gradient and line search to enforce the KL constraint.
That worked well for many continuous-control tasks, but it is operationally complex at large scale. PPO was designed to keep the stability intuition while using first-order optimization.
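The quantity TRPO constrains is easy to compute even though enforcing the constraint is not. A small sketch (logits are made-up illustrative values) using torch.distributions:

```python
import torch
from torch.distributions import Categorical, kl_divergence

# Illustrative sketch: the mean KL between old and new action
# distributions over visited states is what TRPO bounds by delta.
old_logits = torch.tensor([[2.0, 0.5, 0.1],
                           [0.3, 1.2, 0.0]])
new_logits = old_logits + torch.tensor([[0.1, -0.05, 0.0],
                                        [0.0,  0.1, -0.1]])

old_pi = Categorical(logits=old_logits)
new_pi = Categorical(logits=new_logits)

mean_kl = kl_divergence(old_pi, new_pi).mean()  # TRPO requires this <= delta
print(float(mean_kl))
```

A small logit perturbation yields a tiny KL; TRPO's machinery exists to keep every update in that regime.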
3. PPO Surrogate Objective and Importance Ratios
The code fragments below are illustrative component-level snippets. Part 2 will show how these PPO ideas are used in practical RLHF grounding pipelines for LLMs.
Look at the TRPO objective from Section 2 again:

\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right]
PPO starts from the same objective — it does not replace it. Both algorithms optimize the same ratio-weighted advantage. The difference is entirely in how the constraint is enforced. TRPO uses second-order KL machinery; PPO will use clipping (Section 4).
That ratio-weighted objective is called a surrogate because we are not optimizing the true expected return J(\theta) directly. We cannot: evaluating J(\theta) would require rolling out new trajectories under the current policy after every parameter change. Instead, we build a local approximation using data already collected under \pi_{\theta_{\text{old}}}. The probability ratio

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

corrects for the distributional mismatch via importance sampling, giving us a function we can evaluate and differentiate without new rollouts.
The surrogate objective is then:

L^{\mathrm{CPI}}(\theta) = \mathbb{E}_t\!\left[ r_t(\theta)\, \hat{A}_t \right]

When \theta = \theta_{\text{old}}, every ratio is exactly 1, and L^{\mathrm{CPI}} reduces to the ordinary policy gradient. As \theta moves away from \theta_{\text{old}}, the surrogate stays accurate only as long as the two policies remain close, which is exactly what the trust region (TRPO) or clipping (PPO) is there to enforce.
The practical payoff of the ratio formulation is significant: because the old policy’s rollout data is fixed, PPO can run multiple minibatch gradient steps over the same batch. REINFORCE uses each trajectory once and discards it. PPO reuses data across several epochs, extracting more learning signal per environment interaction.
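A minimal sketch of that reuse pattern, with a toy stand-in for policy parameters (everything here is invented for illustration, not a real policy network): several optimization epochs run over one frozen rollout batch, using ratios against the stored old log-probs.

```python
import torch

torch.manual_seed(0)

# Toy sketch of PPO-style data reuse. "theta" directly shifts per-sample
# log-probs, standing in for real policy parameters.
old_log_probs = torch.randn(64)              # frozen rollout log-probs
advantages = torch.randn(64)
theta = torch.zeros(64, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.01)

for epoch in range(4):                       # REINFORCE would allow one pass
    for idx in torch.randperm(64).split(16): # minibatches from the same batch
        new_log_probs = old_log_probs[idx] + theta[idx]
        ratio = torch.exp(new_log_probs - old_log_probs[idx])
        loss = -(ratio * advantages[idx]).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

print(theta.abs().mean().item())             # parameters moved after reuse
```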
def compute_ratio(new_log_probs, old_log_probs):
    """
    r_t(theta) = pi_new(a|s) / pi_old(a|s),
    computed stably from log-probs.
    """
    new_log_probs = torch.as_tensor(new_log_probs, dtype=torch.float32)
    old_log_probs = torch.as_tensor(old_log_probs, dtype=torch.float32)
    return torch.exp(new_log_probs - old_log_probs)

4. Clipping: PPO’s Core Stabilizer
Without constraints, maximizing L^{\mathrm{CPI}} can still produce over-large updates. PPO-Clip constrains the effective ratio:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right]

The \min term creates a pessimistic bound that limits gains from moving too far from the old policy. A typical \epsilon is 0.2.
The original PPO paper also presents an adaptive KL-penalty variant, but PPO-Clip is the dominant practical variant and the one used in this series.
def compute_ppo_clipping_loss(ratio, advantages, eps=0.2):
    ratio = torch.as_tensor(ratio, dtype=torch.float32)
    advantages = torch.as_tensor(advantages, dtype=torch.float32)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # pessimistic bound: take the worse (smaller) of the two surrogates
    return -torch.min(surr1, surr2).mean()
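A quick numeric check of the clip behavior, with assumed values and eps = 0.2: a ratio of 1.5 on a positive advantage is capped at 1.2 times the advantage, removing the incentive to drift further, while ratios already inside the clip range pass through unchanged.

```python
import torch

# Assumed illustrative values, eps = 0.2 (clip range [0.8, 1.2]).
ratio = torch.tensor([1.5, 0.5, 1.0])
advantages = torch.tensor([1.0, 1.0, -1.0])

surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 0.8, 1.2) * advantages
clipped = torch.min(surr1, surr2)
print(clipped)  # [1.2, 0.5, -1.0]
```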
5. Advantage Estimation With GAE
Generalized Advantage Estimation (GAE) provides a practical bias-variance trade-off:

\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

When \lambda = 0, behavior is TD-like (lower variance, higher bias). When \lambda = 1, it approaches Monte Carlo behavior (lower bias, higher variance). Values like \lambda = 0.95 often work well in practice.
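The lambda = 0 endpoint can be checked by hand on a tiny made-up rollout: with lambda = 0 the recursion keeps only the one-step TD error delta_t.

```python
import torch

# Made-up 3-step rollout; with lam = 0 the GAE recursion
# collapses to the one-step TD error delta_t.
rewards = torch.tensor([1.0, 1.0, 1.0])
values = torch.tensor([0.5, 0.6, 0.7])
next_values = torch.tensor([0.6, 0.7, 0.0])
dones = torch.tensor([0.0, 0.0, 1.0])
gamma, lam = 0.99, 0.0

delta = rewards + gamma * next_values * (1.0 - dones) - values

adv = torch.zeros(3)
last = torch.tensor(0.0)
for t in reversed(range(3)):
    last = delta[t] + gamma * lam * (1.0 - dones[t]) * last
    adv[t] = last

print(torch.allclose(adv, delta))  # True
```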
def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    values = torch.as_tensor(values, dtype=torch.float32)
    next_values = torch.as_tensor(next_values, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)
    advantages = torch.zeros_like(rewards)
    last_gae = torch.tensor(0.0)
    # walk backwards so each step folds in the discounted future advantage
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] * (1.0 - dones[t]) - values[t]
        last_gae = delta + gamma * lam * (1.0 - dones[t]) * last_gae
        advantages[t] = last_gae
    return advantages

6. Integrated Actor-Critic Objective
So far we have two separate learning problems sitting side by side. The policy (Sections 3–4) decides which actions to take. The value function (Section 5, inside GAE) estimates how good a state is so we can compute advantages. In REINFORCE we only had the policy: advantages came from raw returns or a simple running baseline. PPO needs something better, because GAE explicitly depends on the value estimate V(s_t) at every timestep.
Actor-critic architecture solves this by training both components together:
- The actor is the policy network \pi_\theta(a \mid s). It outputs action probabilities (or distribution parameters in continuous action spaces) and is updated via the clipped surrogate objective.
- The critic is the value network V_\phi(s). It outputs a scalar estimate of expected return from a given state and is updated to minimize prediction error against observed returns.
Why not keep them separate? You could, but a common approach in deep RL is to share early layers between them: a single trunk extracts state features, and actor and critic heads branch from it. Joint training through a single objective is simpler to implement and lets gradient information from the value loss improve the shared representations that the policy also relies on.
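A minimal shared-trunk sketch in PyTorch (layer sizes and the tanh trunk are assumptions for illustration, not a prescribed architecture):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with separate actor and critic heads (toy sizes)."""

    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.actor_head = nn.Linear(hidden, n_actions)  # policy logits
        self.critic_head = nn.Linear(hidden, 1)         # state-value estimate

    def forward(self, obs):
        h = self.trunk(obs)                             # shared features
        return self.actor_head(h), self.critic_head(h).squeeze(-1)

model = ActorCritic()
logits, value = model(torch.randn(8, 4))
print(logits.shape, value.shape)  # torch.Size([8, 2]) torch.Size([8])
```

Because both heads branch from one trunk, a single backward pass through the combined objective updates the shared features with signal from both the policy and the value loss.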
The combined PPO objective captures all three learning signals in one expression:

L(\theta) = \mathbb{E}_t\!\left[ L^{\mathrm{CLIP}}_t(\theta) - c_1\, L^{\mathrm{VF}}_t(\theta) + c_2\, S[\pi_\theta](s_t) \right]

This objective is maximized. Here:
- L^{\mathrm{CLIP}} improves policy behavior while constraining updates.
- L^{\mathrm{VF}} is the value-function mean-squared error, so subtracting it corresponds to minimizing value error.
- S[\pi_\theta] is entropy regularization to avoid premature policy collapse.
def ppo_objective_components(
    new_log_probs,
    old_log_probs,
    advantages,
    values,
    returns,
    entropy,
    eps=0.2,
    c1=0.5,
    c2=0.01,
):
    new_lp = torch.as_tensor(new_log_probs, dtype=torch.float32)
    old_lp = torch.as_tensor(old_log_probs, dtype=torch.float32)
    adv = torch.as_tensor(advantages, dtype=torch.float32)
    vals = torch.as_tensor(values, dtype=torch.float32)
    rets = torch.as_tensor(returns, dtype=torch.float32)

    ratio = torch.exp(new_lp - old_lp)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    l_clip = torch.min(surr1, surr2).mean()
    l_vf = ((vals - rets) ** 2).mean()
    s_ent = torch.as_tensor(entropy, dtype=torch.float32).mean()
    objective = l_clip - c1 * l_vf + c2 * s_ent
    # scalar components returned for logging/diagnostics
    return {
        "L_clip": l_clip.item(),
        "L_vf": l_vf.item(),
        "S_entropy": s_ent.item(),
        "objective": objective.item(),
    }

7. Why PPO Mattered for RLHF
In InstructGPT-style RLHF, a simplified pipeline is:
- Supervised fine-tuning (SFT) on curated demonstrations.
- Reward model training on human preference comparisons.
- PPO optimization against that learned reward, typically with a KL anchor to the SFT policy.
(Simplified view: many production pipelines add iterative rounds, rejection-sampling stages, and reward-model refreshes.)
InstructGPT and similar production pipelines used both PPO clipping and an explicit KL penalty in the shaped reward:

R(x, y) = r_\phi(x, y) - \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)}
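A common implementation pattern distributes that penalty per token. The sketch below uses entirely made-up numbers (beta, the reward-model score, and the log-probs are all illustrative assumptions): each token is penalized by beta times the log-prob gap to the SFT policy, and the sequence-level reward-model score is added at the final token.

```python
import torch

# Illustrative shaped reward: per-token KL penalty against the SFT policy,
# with the reward-model score credited to the last token. All values are
# invented for this sketch.
beta = 0.02
reward_model_score = torch.tensor(1.3)           # sequence-level RM score
policy_logp = torch.tensor([-1.1, -0.8, -2.0])   # per-token log pi_theta
sft_logp = torch.tensor([-1.0, -0.9, -1.5])      # per-token log pi_SFT

kl_per_token = policy_logp - sft_logp            # sample-based KL estimate
shaped = -beta * kl_per_token
shaped[-1] = shaped[-1] + reward_model_score

print(shaped)
```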
Historically, PPO dominated the InstructGPT era because it offered first-order simplicity with trust-region-like stability. In the years since, methods like DPO and GRPO have reduced PPO’s dominance in many settings, but PPO remains a foundational reference point for understanding alignment-time policy optimization.
What Comes Next
This part focused on evolution and concepts: REINFORCE instability, TRPO trust regions, PPO clipping, GAE, and RLHF context. Part 2 will focus on how PPO is used to ground LLMs with RLHF, with a PyTorch-based workflow and practical alignment-oriented training considerations.
References
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.
- Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML 2015. https://arxiv.org/abs/1502.05477
- Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016. https://arxiv.org/abs/1506.02438
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf