Scaling Stability: Proximal Policy Optimization from Theory to LLM Alignment

Reinforcement Learning
PPO
RLHF
Policy Gradients
How PPO evolved from REINFORCE through TRPO, why clipped surrogate objectives stabilize training, and how PPO powers RLHF in modern LLM alignment pipelines.
Author

Ravi Sankar Krothapalli

Published

April 6, 2026

Why PPO After REINFORCE?

In REINFORCE: Direct Policy Optimization After Deep Q-Learning, we built a policy-gradient agent from scratch and watched it learn by sampling full trajectories. It worked. It also wobbled. One unlucky episode could undo several useful updates, and the reward curve could look more like noise than progress.

That wobble is not a quirk. It is a structural issue in vanilla policy gradients. This post follows the evolution from that fragile REINFORCE update to Proximal Policy Optimization (PPO), the method that made policy updates far more stable with a surprisingly simple clipping rule.

Why this matters beyond classic control: PPO was the algorithm used in InstructGPT’s RLHF stage (Ouyang et al., 2022). In that setup, the “environment” is human preference data, and policy updates happen on very large language models where unstable updates are expensive and risky.

Post road map: We move through REINFORCE’s variance problem, TRPO’s constrained optimization fix, and PPO’s clipped surrogate objective. Then we connect GAE, actor-critic training, and the RLHF workflow that made PPO historically central. This is Part 1 (concepts and evolution). Part 2 will shift to an implementation focused on how PPO is used in grounding LLMs with RLHF.


1. The Stability Crisis in REINFORCE

REINFORCE updates policy parameters by increasing log-probability of actions that produced high return:

\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \hat{R}_t \right]

The issue is variance. The return signal \hat{R}_t is trajectory-level Monte Carlo feedback, so gradient magnitude can swing heavily from episode to episode. Large unlucky updates can push the policy into poor regions of parameter space, hurting exploration and sometimes causing collapse.

On toy grids this is noisy and inconvenient. On very large language models, this can be a severe optimization and alignment failure mode.

import torch


def compute_reinforce_loss(log_probs, returns):
    """
    Vanilla policy gradient loss.
    log_probs: log(pi(a|s)) for sampled actions
    returns: discounted returns
    """
    log_probs = torch.as_tensor(log_probs, dtype=torch.float32)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(log_probs * returns).mean()
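The variance can be made visible with a tiny synthetic experiment (everything here is illustrative, not part of any real training loop): for a fixed one-parameter Bernoulli policy, single-episode gradient estimates scatter widely when the return signal is noisy.

```python
import torch

torch.manual_seed(0)
theta = torch.tensor(0.0, requires_grad=True)

grads = []
for _ in range(200):
    p = torch.sigmoid(theta)                 # P(a = 1)
    a = torch.bernoulli(p).detach()          # sample one action
    log_prob = a * torch.log(p) + (1 - a) * torch.log(1 - p)
    noisy_return = (5.0 * torch.randn(()) + a).detach()  # high-variance return
    loss = -log_prob * noisy_return          # single-episode REINFORCE loss
    (grad,) = torch.autograd.grad(loss, theta)
    grads.append(grad.item())

grads = torch.tensor(grads)
print(f"std of single-episode gradient estimates: {grads.std().item():.2f}")
```

A well-chosen baseline or advantage estimate shrinks exactly this scatter, which is where the rest of the post is headed.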

2. TRPO: Trust Regions With Higher Complexity

Trust Region Policy Optimization (TRPO) reframed policy improvement as a constrained problem:

\max_{\theta}\; \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\hat{A}_t\right] \quad \text{subject to} \quad \mathbb{E}_t\left[D_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)\right] \leq \delta

This trust region made updates much more reliable. But the optimizer is heavier: it relies on second-order structure, including the Fisher Information Matrix (FIM), which equals the Hessian of the KL divergence with respect to \theta, evaluated at \theta = \theta_{\text{old}}. In practice TRPO uses conjugate gradient and line search to enforce the KL constraint.

That worked well for many continuous-control tasks, but it is operationally complex at large scale. PPO was designed to keep the stability intuition while using first-order optimization.
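The FIM/KL-Hessian identity can be checked numerically for a small categorical softmax policy. This is a standalone sketch of the identity itself (function names are mine), not TRPO's actual conjugate-gradient solver:

```python
import torch

def kl_and_fim(theta_old):
    """For a categorical softmax policy, the Hessian of
    KL(pi_old || pi_theta) at theta = theta_old equals the
    Fisher Information Matrix diag(p) - p p^T."""
    p = torch.softmax(theta_old, dim=0).detach()

    def kl(theta):
        log_q = torch.log_softmax(theta, dim=0)
        return (p * (torch.log(p) - log_q)).sum()

    hessian = torch.autograd.functional.hessian(kl, theta_old.clone())
    fim = torch.diag(p) - torch.outer(p, p)
    return hessian, fim

hessian, fim = kl_and_fim(torch.tensor([0.5, -1.0, 2.0]))
print(torch.allclose(hessian, fim, atol=1e-5))  # True
```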


3. PPO Surrogate Objective and Importance Ratios

Note

The code fragments below are illustrative component-level snippets. Part 2 will show how these PPO ideas are used in practical RLHF grounding pipelines for LLMs.

Look at the TRPO objective from Section 2 again:

\max_{\theta}\; \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\hat{A}_t\right] \quad \text{subject to} \quad \mathbb{E}_t\left[D_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)\right] \leq \delta

PPO starts from the same objective — it does not replace it. Both algorithms optimize the same ratio-weighted advantage. The difference is entirely in how the constraint is enforced. TRPO uses second-order KL machinery; PPO will use clipping (Section 4).

That ratio-weighted objective is called a surrogate because we are not optimizing the true expected return J(\theta) directly. We cannot. J(\theta) would require rolling out new trajectories under the current policy after every parameter change. Instead, we build a local approximation using data already collected under \pi_{\theta_{\text{old}}}. The probability ratio r_t(\theta) corrects for the distributional mismatch via importance sampling, giving us a function we can evaluate and differentiate without new rollouts:

r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

The surrogate objective is then:

L^{CPI}(\theta) = \hat{\mathbb{E}}_t[r_t(\theta)\hat{A}_t]

When \theta = \theta_{\text{old}}, every ratio is exactly 1, and L^{CPI} reduces to the ordinary policy gradient. As \theta moves away from \theta_{\text{old}}, the surrogate stays accurate only as long as the two policies remain close — which is exactly what the trust region (TRPO) or clipping (PPO) is there to enforce.

The practical payoff of the ratio formulation is significant: because the old policy’s rollout data is fixed, PPO can run multiple minibatch gradient steps over the same batch. REINFORCE uses each trajectory once and discards it. PPO reuses data across several epochs, extracting more learning signal per environment interaction.

def compute_ratio(new_log_probs, old_log_probs):
    """
    r_t(theta) = pi_new(a|s) / pi_old(a|s)
    computed stably from log-probs.
    """
    new_log_probs = torch.as_tensor(new_log_probs, dtype=torch.float32)
    old_log_probs = torch.as_tensor(old_log_probs, dtype=torch.float32)
    return torch.exp(new_log_probs - old_log_probs)
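The reuse pattern looks like the following sketch. `TinyPolicy`, the batch keys, and all hyperparameters here are illustrative assumptions, not a fixed API:

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Minimal categorical policy, for illustration only."""
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Linear(obs_dim, n_actions)

    def log_prob(self, obs, actions):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits).log_prob(actions)

def ppo_update(policy, optimizer, batch, num_epochs=4, minibatch_size=32, eps=0.2):
    """Run several clipped-surrogate gradient epochs over one fixed
    rollout batch collected under pi_old."""
    n = batch["obs"].shape[0]
    for _ in range(num_epochs):
        for idx in torch.randperm(n).split(minibatch_size):
            new_lp = policy.log_prob(batch["obs"][idx], batch["actions"][idx])
            ratio = torch.exp(new_lp - batch["old_log_probs"][idx])
            adv = batch["advantages"][idx]
            surrogate = torch.min(ratio * adv,
                                  torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
            loss = -surrogate.mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

REINFORCE would consume this batch once; here the same rollouts drive `num_epochs` passes of minibatch updates.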

4. Clipping: PPO’s Core Stabilizer

Without constraints, maximizing r_t(\theta)\hat{A}_t can still produce over-large updates. PPO-Clip constrains the effective ratio:

L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\;\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

The \min term creates a pessimistic bound that limits gains from moving too far from the old policy. A typical \epsilon is 0.2.

The original PPO paper also presents an adaptive KL-penalty variant, but PPO-Clip is the dominant practical variant and the one used in this series.

def compute_ppo_clipping_loss(ratio, advantages, eps=0.2):
    """PPO-Clip loss: negative of the clipped surrogate objective."""
    ratio = torch.as_tensor(ratio, dtype=torch.float32)
    advantages = torch.as_tensor(advantages, dtype=torch.float32)

    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    return -torch.min(surr1, surr2).mean()
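A quick standalone numeric check of the pessimistic min (values chosen purely for illustration):

```python
import torch

# eps = 0.2, so the ratio is clipped to [0.8, 1.2]
ratio = torch.tensor([1.5, 0.5])
advantages = torch.tensor([1.0, -1.0])

surr1 = ratio * advantages                          # [ 1.5, -0.5]
surr2 = torch.clamp(ratio, 0.8, 1.2) * advantages   # [ 1.2, -0.8]
objective = torch.min(surr1, surr2)

# Positive advantage: the gain is capped at 1.2 instead of 1.5.
# Negative advantage: the bound keeps the worse value -0.8, not -0.5.
print(objective)  # tensor([ 1.2000, -0.8000])
```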



5. Advantage Estimation With GAE

Generalized Advantage Estimation (GAE) provides a practical bias-variance trade-off:

\hat{A}_t^{GAE} = \sum_{l=0}^{\infty}(\gamma\lambda)^l\delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

When \lambda = 0, behavior is TD-like (lower variance, higher bias). When \lambda = 1, it approaches Monte Carlo behavior (lower bias, higher variance). Values like \lambda = 0.95 often work well in practice.
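Expanding the two endpoints makes the trade-off explicit: at \lambda = 0 only the one-step TD error survives, and at \lambda = 1 the discounted sum of TD errors telescopes into the Monte Carlo advantage:

\hat{A}_t^{\lambda=0} = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

\hat{A}_t^{\lambda=1} = \sum_{l=0}^{\infty}\gamma^l\delta_{t+l} = \sum_{l=0}^{\infty}\gamma^l r_{t+l} - V(s_t)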

def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    rewards = torch.tensor(rewards, dtype=torch.float32)
    values = torch.tensor(values, dtype=torch.float32)
    next_values = torch.tensor(next_values, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    advantages = torch.zeros_like(rewards)
    last_gae = torch.tensor(0.0)

    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] * (1.0 - dones[t]) - values[t]
        last_gae = delta + gamma * lam * (1.0 - dones[t]) * last_gae
        advantages[t] = last_gae

    return advantages

6. Integrated Actor-Critic Objective

So far we have two separate learning problems sitting side by side. The policy (Sections 3–4) decides which actions to take. The value function (Section 5, inside GAE) estimates how good a state is so we can compute advantages. In REINFORCE we only had the policy — advantages came from raw returns or a simple running baseline. PPO needs something better, because GAE explicitly depends on V(s_t) at every timestep.

Actor-critic architecture solves this by training both components together:

  • The actor is the policy network \pi_\theta(a \mid s). It outputs action probabilities (or distribution parameters in continuous action spaces) and is updated via the clipped surrogate objective.
  • The critic is the value network V_\phi(s). It outputs a scalar estimate of expected return from a given state and is updated to minimize prediction error against observed returns.

Why not keep them separate? You could, but a common approach in deep RL is to share early layers between them: a single trunk extracts state features, and actor and critic heads branch from it. Joint training through a single objective is simpler to implement and lets gradient information from the value loss improve the shared representations that the policy also relies on.
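A minimal shared-trunk sketch in PyTorch (layer sizes and activations here are illustrative choices, not a canonical architecture):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """One shared feature trunk, two heads: action logits and state value."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor_head = nn.Linear(hidden, n_actions)  # -> pi_theta logits
        self.critic_head = nn.Linear(hidden, 1)         # -> V_phi(s)

    def forward(self, obs):
        features = self.trunk(obs)
        return self.actor_head(features), self.critic_head(features).squeeze(-1)
```

Gradients from both the clipped surrogate and the value loss flow into `trunk`, which is the representation-sharing benefit just described.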

The combined PPO objective captures all three learning signals in one expression:

L(\theta,\phi) = L^{CLIP}(\theta) - c_1 L^{VF}(\phi) + c_2 S[\pi_\theta]

This objective is maximized. Here:

  • L^{CLIP} improves policy behavior while constraining updates.
  • L^{VF} is the value-function mean-squared error, so subtracting it corresponds to minimizing value error.
  • S[\pi_\theta] is entropy regularization to avoid premature policy collapse.

def ppo_objective_components(
    new_log_probs,
    old_log_probs,
    advantages,
    values,
    returns,
    entropy,
    eps=0.2,
    c1=0.5,
    c2=0.01,
):
    new_lp = torch.as_tensor(new_log_probs, dtype=torch.float32)
    old_lp = torch.as_tensor(old_log_probs, dtype=torch.float32)
    adv = torch.as_tensor(advantages, dtype=torch.float32)
    ratio = torch.exp(new_lp - old_lp)

    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    l_clip = torch.min(surr1, surr2).mean()

    values = torch.as_tensor(values, dtype=torch.float32)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    l_vf = ((values - returns) ** 2).mean()
    s_ent = torch.as_tensor(entropy, dtype=torch.float32).mean()

    objective = l_clip - c1 * l_vf + c2 * s_ent
    return {
        "L_clip": l_clip.item(),
        "L_vf": l_vf.item(),
        "S_entropy": s_ent.item(),
        "objective": objective.item(),
    }

7. Why PPO Mattered for RLHF

In InstructGPT-style RLHF, a simplified pipeline is:

  1. Supervised fine-tuning (SFT) on curated demonstrations.
  2. Reward model training on human preference comparisons.
  3. PPO optimization against that learned reward, typically with a KL anchor to the SFT policy.

(Simplified view: many production pipelines add iterative rounds, rejection-sampling stages, and reward-model refreshes.)

InstructGPT and similar production pipelines used both PPO clipping and an explicit KL penalty in shaped reward:

r(x,y) = r_\phi(x,y) - \beta \log \frac{\pi_{RL}(y\mid x)}{\pi_{SFT}(y\mid x)}
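In code, the shaped reward is essentially a one-liner (a minimal sketch with summed sequence log-probs; real pipelines typically apply the KL penalty per token of the sampled response):

```python
import torch

def shaped_reward(r_phi, logp_rl, logp_sft, beta=0.1):
    """Reward-model score minus a KL penalty anchoring the RL policy
    to the SFT policy (beta controls the anchor strength)."""
    return r_phi - beta * (logp_rl - logp_sft)

# No drift from the SFT policy means no penalty:
r = shaped_reward(torch.tensor(1.0), torch.tensor(-42.0), torch.tensor(-42.0))
print(r.item())  # 1.0
```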

Historically, PPO dominated the InstructGPT era because it offered first-order simplicity with trust-region-like stability. Since around 2024, methods such as DPO and GRPO have reduced PPO's dominance in many settings, but PPO remains a foundational reference point for understanding alignment-time policy optimization.


What Comes Next

This part focused on evolution and concepts: REINFORCE instability, TRPO trust regions, PPO clipping, GAE, and RLHF context. Part 2 will focus on how PPO is used to ground LLMs with RLHF, with a PyTorch-based workflow and practical alignment-oriented training considerations.


References

  • Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.
  • Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML 2015. https://arxiv.org/abs/1502.05477
  • Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016. https://arxiv.org/abs/1506.02438
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347
  • Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155
  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf