Expected SARSA: Bridging On-Policy and Off-Policy TD Control

Author

Ravi Sankar Krothapalli

Published

March 19, 2026

Why Expected SARSA?

In a previous post, we placed SARSA and Q-Learning side by side and observed the clearest expression of the on-policy vs off-policy divide. That comparison surfaced a deeper tension: SARSA is honest about the cost of exploration, but it carries sampling noise into every update; Q-Learning bootstraps from the imagined best action, gaining speed at the cost of accuracy about what the agent will actually do. Both compromises are measurable.

Expected SARSA occupies the precise middle ground. It eliminates SARSA’s per-step variance without abandoning the on-policy guarantee, and it subsumes Q-Learning as a degenerate special case when the target policy becomes greedy. The entire change amounts to exactly one line of code — replacing a sampled Q-lookup with a probability-weighted sum. The reasoning behind that single substitution spans the full MDP framework.

This post builds that reasoning from the equations, then implements all three algorithms on the same 4×4 grid, and lets the reward curves and learned policy maps verify what the math predicts.


The RL Landscape: MDP Framework

Reinforcement learning control algorithms operate within the mathematical framework of a Markov Decision Process, defined by the tuple ⟨𝒮, 𝒜, 𝒫, ℛ, γ⟩ (some formulations use a 4-tuple without γ, treating the discount factor as a separate learning parameter rather than part of the environment definition). For model-free methods, the transition dynamics 𝒫(s′ | s, a) and reward function ℛ(s, a) are strictly unknown — the agent must estimate the optimal action-value function q*(s, a) purely from experience.

Temporal Difference (TD) learning updates value estimates using bootstrapped targets. Two dominant approaches define the classical landscape:

  • SARSA (on-policy): The TD target relies on the actual action A_{t+1} sampled from the current behavior policy — exploration noise enters the update directly.
  • Q-Learning (off-policy): The TD target assumes the greedy policy for the next state, max_a Q(S_{t+1}, a) — optimistic and decoupled from behavior.

While Q-Learning suffers from maximization bias and off-policy stability risks, standard SARSA injects the full stochasticity of the behavior policy into every TD target through the sampled A_{t+1}. Expected SARSA closes this gap analytically.
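Before formalizing anything, the three bootstrap targets can be compared on a single hypothetical Q-row. All numbers below (q_next, r, gamma, epsilon, the sampled action) are made up for illustration; this is not the implementation used later in the post.

```python
import numpy as np

# Hypothetical Q-values for the four actions available in the next state
q_next = np.array([0.10, 0.50, 0.20, 0.20])
r, gamma, epsilon = -0.01, 0.95, 0.2

# SARSA: bootstrap from one *sampled* next action (suppose ε-greedy drew a=2)
sampled_a = 2
sarsa_target = r + gamma * q_next[sampled_a]

# Q-Learning: bootstrap from the greedy maximum, regardless of behavior
q_target = r + gamma * q_next.max()

# Expected SARSA: probability-weighted average under the ε-greedy distribution
probs = np.full(4, epsilon / 4)
probs[q_next.argmax()] += 1.0 - epsilon
expected_target = r + gamma * float(probs @ q_next)

print(f"SARSA target (sampled a=2): {sarsa_target:.4f}")
print(f"Q-Learning target         : {q_target:.4f}")
print(f"Expected SARSA target     : {expected_target:.4f}")
```

The SARSA target depends on which action happened to be drawn; the other two targets are deterministic functions of the Q-row.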


The Expected SARSA Update Rule

Instead of a sampled next action, Expected SARSA updates toward the expected value of the next state under the current policy π:

Q(St,At)Q(St,At)+α[Rt+1+γa𝒜π(a|St+1)Q(St+1,a)Q(St,At)]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a \in \mathcal{A}} \pi(a|S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \right]

Variable decomposition:

  • Q(S_t, A_t): current state-action value estimate at time t
  • α ∈ (0, 1]: learning rate (step size)
  • R_{t+1}: immediate scalar reward after taking A_t in state S_t
  • γ ∈ [0, 1]: discount factor, the present value of future rewards
  • π(a | S_{t+1}): probability of selecting action a in next state S_{t+1} under policy π
  • Σ_a π(a | S_{t+1}) Q(S_{t+1}, a): expected value of the next state under π — replaces the sampled Q(S_{t+1}, A_{t+1})

By integrating out the next action a, Expected SARSA eliminates the sampling variance in the TD target introduced by the random selection of A_{t+1} — while reward stochasticity and transition noise remain. The expected update direction is preserved, matching standard SARSA when run on-policy.
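A quick Monte Carlo sanity check, using a made-up Q-row, shows both halves of this claim: the sampled target agrees with the analytical sum in expectation, but carries a non-trivial per-update spread.

```python
import numpy as np

rng = np.random.default_rng(0)

q_next = np.array([0.10, 0.50, 0.20, 0.20])   # hypothetical Q-row for the next state
epsilon, n_actions = 0.2, 4

# Analytic ε-greedy distribution and its expectation over the Q-row
probs = np.full(n_actions, epsilon / n_actions)
probs[q_next.argmax()] += 1.0 - epsilon
analytic = float(probs @ q_next)

# Sampled bootstrap values Q(S', A'), as standard SARSA would use them
samples = q_next[rng.choice(n_actions, size=100_000, p=probs)]

print(f"analytic expectation : {analytic:.4f}")
print(f"sample mean          : {samples.mean():.4f}")   # agrees in expectation
print(f"sample std per update: {samples.std():.4f}")    # spread SARSA carries
```

The standard deviation printed last is exactly the noise term that the analytical expectation removes from each update.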


Bridging SARSA and Q-Learning

Expected SARSA is a strict generalization of both its predecessors.

Q-Learning is a special case. Assume the evaluation policy π is greedy with respect to the current Q-values. The probability mass function collapses to a Kronecker delta:

π(a | S_{t+1}) = 1 if a = argmax_{a′} Q(S_{t+1}, a′), and 0 otherwise

Substituting into the expectation term:

Σ_{a∈𝒜} π(a | S_{t+1}) Q(S_{t+1}, a) = 1 · max_a Q(S_{t+1}, a) = max_a Q(S_{t+1}, a)

The rule collapses exactly into the standard Q-Learning update. Q-Learning is Expected SARSA with a greedy target policy.
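This collapse is easy to verify numerically. The helper below is a standalone sketch of the ε-greedy expectation (it mirrors the expected_q function defined in Section 3), applied to an arbitrary Q-row:

```python
import numpy as np

def expected_q_row(q_row, epsilon):
    """ε-greedy expectation over one Q-row (standalone sketch)."""
    n = q_row.shape[0]
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_row))] += 1.0 - epsilon
    return float(probs @ q_row)

q_row = np.array([0.10, 0.50, 0.20, 0.20])   # arbitrary Q(S', ·)

# With a greedy target policy (ε = 0) the expectation collapses to the max,
# i.e. the Q-Learning bootstrap exactly
print(expected_q_row(q_row, 0.0) == q_row.max())   # prints True
```

With any ε > 0 the two quantities differ, because probability mass leaks to the non-greedy actions.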

SARSA is a special case in expectation. When the evaluation policy equals the behavior policy, the expected value of a single sampled Q(S_{t+1}, A_{t+1}) is identical to the analytical sum — but Expected SARSA achieves the same expected update with lower variance because integration is noise-free relative to sampling.

Furthermore, Expected SARSA can be used off-policy safely: when the data-generating behavior policy differs from the target policy π inside the expectation, learning remains stable without the divergence risks that afflict off-policy standard SARSA.
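As a minimal sketch of what that looks like, the target below integrates over a less exploratory target policy than the one that would collect the data. The policy_probs helper and both ε values are illustrative choices, not a full training loop:

```python
import numpy as np

def policy_probs(q_row, epsilon):
    """ε-greedy action distribution for one Q-row."""
    n = q_row.shape[0]
    p = np.full(n, epsilon / n)
    p[int(np.argmax(q_row))] += 1.0 - epsilon
    return p

q_next = np.array([0.10, 0.50, 0.20, 0.20])   # hypothetical Q(S_{t+1}, ·)
r, gamma = -0.01, 0.95
behavior_eps, target_eps = 0.3, 0.05          # explore more than you evaluate

# Actions in the environment would be sampled with behavior_eps, but the TD
# target integrates over the *target* policy's probabilities; no importance
# weights are needed because the sampled A_{t+1} never appears in the target.
target = r + gamma * float(policy_probs(q_next, target_eps) @ q_next)
print(f"off-policy Expected SARSA target: {target:.4f}")
```

The absence of importance-sampling ratios is the practical payoff: the target is a deterministic function of S_{t+1}, so any behavior policy that visits the state can feed it.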


Section 1: Setup and Reproducibility

Before developing any learning algorithm, a stable reproducible baseline is essential. If the environment parameters drift between runs, it becomes impossible to tell whether a difference in the curves comes from the algorithm or from the world it is operating in. All three agents will train on the same 4×4 grid under identical hyperparameters — locked before the first learning rule is written.

import random

import numpy as np
import pandas as pd
import plotly.graph_objects as go

np.random.seed(42)
random.seed(42)

GRID_SIZE = 4
N_ACTIONS = 4
N_STATES  = GRID_SIZE * GRID_SIZE

EPISODES  = 500
ALPHA     = 0.1
GAMMA     = 0.95
EPSILON   = 0.2    # elevated ε amplifies variance differences between algorithms
MAX_STEPS = 100

The 4×4 grid has 16 states and 64 state-action pairs per Q-table. Five hundred episodes give the learning curves enough room to show both early exploration behavior and late-stage convergence. Raising ε to 0.2 (versus 0.1 in the previous post) directly enlarges the variance of SARSA’s sampled target — making Expected SARSA’s analytical expectation more visibly beneficial. Plain takeaway: lock every knob before the algorithms change so that any difference traces back to the update rule alone.

Section 2: A 4×4 Grid Environment

Now that the configuration is fixed, the environment needs to be defined before any learning rule touches it. The world is deliberately simple — a deterministic grid where the agent moves up, down, left, or right — so the complexity budget stays in the algorithms, not in the dynamics.

ACTIONS = {
    0: (-1,  0),   # up
    1: ( 1,  0),   # down
    2: ( 0, -1),   # left
    3: ( 0,  1),   # right
}


class GridEnv:
    """Deterministic 4×4 grid world for a three-algorithm comparison.

    States are integers 0-15 encoded row-major. Reward is +1.0 at the goal
    (state 15), -0.01 step penalty elsewhere. Episodes terminate on goal reach
    or when the step budget is exhausted.
    """

    def __init__(self, grid_size=GRID_SIZE, max_steps=MAX_STEPS):
        self.grid_size = grid_size
        self.max_steps = max_steps
        self.goal      = grid_size * grid_size - 1
        self.agent_pos = (0, 0)
        self.steps     = 0

    def state_to_pos(self, state):
        return divmod(state, self.grid_size)

    def pos_to_state(self, pos):
        return pos[0] * self.grid_size + pos[1]

    def reset(self):
        self.agent_pos = (0, 0)
        self.steps     = 0
        return 0

    def step(self, action):
        dr, dc = ACTIONS[action]
        r = min(max(self.agent_pos[0] + dr, 0), self.grid_size - 1)
        c = min(max(self.agent_pos[1] + dc, 0), self.grid_size - 1)
        self.agent_pos = (r, c)
        self.steps    += 1

        at_goal = self.pos_to_state(self.agent_pos) == self.goal
        reward  = 1.0 if at_goal else -0.01
        done    = at_goal or self.steps >= self.max_steps
        return self.pos_to_state(self.agent_pos), reward, done


env   = GridEnv()
state = env.reset()
for action in [3, 1, 3, 1, 3, 1]:
    next_state, reward, done = env.step(action)
    print(f"s={state:2d}  a={action}  s'={next_state:2d}  r={reward:5.2f}  done={done}")
    state = next_state
    if done:
        break
s= 0  a=3  s'= 1  r=-0.01  done=False
s= 1  a=1  s'= 5  r=-0.01  done=False
s= 5  a=3  s'= 6  r=-0.01  done=False
s= 6  a=1  s'=10  r=-0.01  done=False
s=10  a=3  s'=11  r=-0.01  done=False
s=11  a=1  s'=15  r= 1.00  done=True

The test walk traces a right-then-down corridor from state 0 toward state 15, confirming that the goal reward and step penalty are wired correctly. A 4×4 grid gives 16 states and an optimal path of 6 steps — large enough that the algorithms must actually generalize across intermediate states rather than memorizing a two-step shortcut, while staying small enough to inspect every Q-value in the policy map.
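A further sanity check, using standalone copies of the two converters (mirroring GridEnv.state_to_pos and GridEnv.pos_to_state above), confirms that the row-major encoding round-trips for all 16 states:

```python
# Standalone copies of the two converters used by GridEnv above
GRID_SIZE = 4

def state_to_pos(state):
    return divmod(state, GRID_SIZE)

def pos_to_state(pos):
    return pos[0] * GRID_SIZE + pos[1]

# Every state survives a round trip, and the corners land where the
# row-major layout predicts
assert all(pos_to_state(state_to_pos(s)) == s for s in range(16))
assert state_to_pos(0) == (0, 0) and state_to_pos(15) == (3, 3)
print("encoding round-trip OK")
```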

Section 2.5: Visualizing the Grid

Before training any agent, it helps to see the full 4×4 world. Blue marks the start, green marks the goal, and dark slate covers the intermediate states. Hover over any cell to inspect its state ID and role.

State 0 (top-left) begins every episode; state 15 (bottom-right) is the only rewarding cell. The optimal path from corner to corner takes exactly 6 steps — 3 right and 3 down in any order — and every algorithmic difference between the three methods emerges from how each one values this corridor under noisy ε-greedy exploration.
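For readers viewing a static copy without the interactive figure, a few lines of Python can print the same layout in text form (S for start, G for goal). This is only a stand-in for the plot described above:

```python
GRID_SIZE = 4
GOAL = GRID_SIZE * GRID_SIZE - 1   # state 15, bottom-right

rows = []
for r in range(GRID_SIZE):
    cells = []
    for c in range(GRID_SIZE):
        s = r * GRID_SIZE + c
        cells.append("S" if s == 0 else "G" if s == GOAL else str(s))
    rows.append("  ".join(f"{cell:>2}" for cell in cells))

print("\n".join(rows))
```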

Section 3: Expected SARSA Agent

The environment is in place. Now the core algorithm. Standard SARSA calls epsilon_greedy to sample A_{t+1} and looks up Q(S_{t+1}, A_{t+1}). Expected SARSA replaces that noisy lookup with expected_q — a probability-weighted sum over the entire action row for the next state. The helper constructs the ε-greedy distribution explicitly and computes the inner product with the Q-value row.

def epsilon_greedy(q, state, epsilon):
    """ε-greedy action selection."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(q[state]))


def expected_q(q, state, epsilon):
    """Expected value of a state under the ε-greedy policy.

    Under ε-greedy with |A| actions, the greedy action receives probability
    (1 - ε + ε/|A|) and each non-greedy action receives ε/|A|.
    This closed-form avoids sampling entirely.
    """
    n_actions       = q[state].shape[0]
    greedy_a        = int(np.argmax(q[state]))
    probs           = np.full(n_actions, epsilon / n_actions)
    probs[greedy_a] += 1.0 - epsilon
    return float(np.dot(probs, q[state]))


def expected_sarsa_train(env, episodes=EPISODES, alpha=ALPHA,
                         gamma=GAMMA, epsilon=EPSILON):
    q       = np.zeros((N_STATES, N_ACTIONS))
    rewards = []

    for _ in range(episodes):
        s     = env.reset()
        total = 0.0

        while True:
            a               = epsilon_greedy(q, s, epsilon)
            s_next, r, done = env.step(a)

            # Expected SARSA target: analytical expectation — no sampled A_{t+1}
            target   = r + gamma * expected_q(q, s_next, epsilon) * (not done)
            q[s, a] += alpha * (target - q[s, a])

            s      = s_next
            total += r
            if done:
                break

        rewards.append(total)

    return np.array(rewards), q

With ε = 0.2 and |𝒜| = 4, expected_q assigns probability 0.85 to the greedy action and 0.05 to each of the three exploratory alternatives. The inner product weights every Q-value by its likelihood under the current policy. This deterministic sum removes the variance term introduced by sampling — meaningful when ε is non-trivial. The added cost is one dot product per step, O(|𝒜|), which is negligible in the tabular setting.
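The two probabilities are easy to confirm directly. This standalone check rebuilds the same ε-greedy distribution for an arbitrary Q-row with a unique argmax:

```python
import numpy as np

epsilon, n_actions = 0.2, 4
q_row = np.array([0.0, 0.3, 0.1, 0.2])   # any row with a unique argmax

probs = np.full(n_actions, epsilon / n_actions)
probs[int(np.argmax(q_row))] += 1.0 - epsilon

print(probs)                   # greedy action gets 0.85, each other 0.05
print(f"{probs.sum():.10f}")   # a proper distribution: sums to 1
```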

esarsa_rewards, esarsa_q = expected_sarsa_train(GridEnv())

print(f"Expected SARSA | mean reward (all eps):     {esarsa_rewards.mean():.3f}")
print(f"Expected SARSA | mean reward (last 50 eps): {esarsa_rewards[-50:].mean():.3f}")
Expected SARSA | mean reward (all eps):     0.934
Expected SARSA | mean reward (last 50 eps): 0.938

The gap between the all-episode mean and the final-50 mean measures how much Expected SARSA improved over the course of training. Its absolute value is less informative here than how it compares to the same gap for SARSA and Q-Learning in the next section.

Section 4: SARSA and Q-Learning Baselines

Both baselines share the same epsilon_greedy selector and the same environment constructor. The only difference is the bootstrap target — which maps directly to the taxonomy from the theory sections above.

def sarsa_train(env, episodes=EPISODES, alpha=ALPHA,
                gamma=GAMMA, epsilon=EPSILON):
    q       = np.zeros((N_STATES, N_ACTIONS))
    rewards = []

    for _ in range(episodes):
        s     = env.reset()
        a     = epsilon_greedy(q, s, epsilon)
        total = 0.0

        while True:
            s_next, r, done = env.step(a)
            a_next          = epsilon_greedy(q, s_next, epsilon)

            # On-policy target: Q(s', a') — value of the action ε-greedy actually picks
            q[s, a] += alpha * (r + gamma * q[s_next, a_next] * (not done) - q[s, a])

            s, a   = s_next, a_next
            total += r
            if done:
                break

        rewards.append(total)

    return np.array(rewards), q


def qlearning_train(env, episodes=EPISODES, alpha=ALPHA,
                    gamma=GAMMA, epsilon=EPSILON):
    q       = np.zeros((N_STATES, N_ACTIONS))
    rewards = []

    for _ in range(episodes):
        s     = env.reset()
        total = 0.0

        while True:
            a               = epsilon_greedy(q, s, epsilon)
            s_next, r, done = env.step(a)

            # Off-policy target: max Q(s', ·) — greedy best regardless of behavior
            q[s, a] += alpha * (r + gamma * np.max(q[s_next]) * (not done) - q[s, a])

            s      = s_next
            total += r
            if done:
                break

        rewards.append(total)

    return np.array(rewards), q

Three update lines capture the full taxonomy:

  • SARSA: q[s_next, a_next] — value of the action ε-greedy chose; sampling noise is baked into the target.
  • Expected SARSA: expected_q(q, s_next, epsilon) — weighted average over all actions; no noise from sampling.
  • Q-Learning: np.max(q[s_next]) — value of the best available action; deterministic but optimistically biased.

Expected SARSA’s target has the same expected value as SARSA’s but achieves it with reduced variance — without Q-Learning’s upward bias from the maximum operator.
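The upward bias of the max operator can be isolated in a toy experiment: give every action a true value of zero, add zero-mean noise to the estimates, and the maximum is still systematically positive. This is a standalone illustration with made-up noise, separate from the grid experiment:

```python
import numpy as np

rng = np.random.default_rng(7)

# Four actions, all with true value 0; each "update" observes noisy estimates
noise = rng.normal(0.0, 0.1, size=(100_000, 4))

max_target  = noise.max(axis=1).mean()    # Q-Learning-style greedy bootstrap
mean_target = noise.mean(axis=1).mean()   # unbiased reference

print(f"mean of max over noisy Q : {max_target:+.4f}")   # systematically > 0
print(f"mean of noisy Q itself   : {mean_target:+.4f}")  # near 0
```

The gap between the two numbers is the maximization bias: it grows with the number of actions and with the noise level of the estimates.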

sarsa_rewards, sarsa_q = sarsa_train(GridEnv())
ql_rewards,    ql_q    = qlearning_train(GridEnv())

print(f"SARSA      | mean reward (all eps):     {sarsa_rewards.mean():.3f}")
print(f"SARSA      | mean reward (last 50 eps): {sarsa_rewards[-50:].mean():.3f}")
print()
print(f"Q-Learning | mean reward (all eps):     {ql_rewards.mean():.3f}")
print(f"Q-Learning | mean reward (last 50 eps): {ql_rewards[-50:].mean():.3f}")
SARSA      | mean reward (all eps):     0.930
SARSA      | mean reward (last 50 eps): 0.936

Q-Learning | mean reward (all eps):     0.933
Q-Learning | mean reward (last 50 eps): 0.935

Section 5: Three-Way Reward and Variance Curves

All three agents have completed 500 episodes under identical conditions. The reward curves below answer two separate questions: do the algorithms converge at comparable rates (top panel), and does Expected SARSA actually exhibit lower update variance (bottom panel)?

Focus on the bottom panel first: Expected SARSA’s standard deviation line tracks below SARSA’s for most of the training run. This is the variance reduction the theory promises — not an artifact of convergence speed but a structural consequence of replacing a stochastic sample with a deterministic sum. In the top panel, all three algorithms reach comparable late-episode rewards, confirming that the variance reduction does not come at a convergence cost on this environment. Q-Learning’s early optimism may produce a faster initial climb; SARSA’s curve tends to be the noisiest; Expected SARSA typically sits between the two on both dimensions.
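The exact smoothing behind the plotted curves is not shown here; one common choice is a rolling window over the per-episode reward arrays, sketched below on a synthetic stand-in series (the real inputs would be esarsa_rewards, sarsa_rewards, and ql_rewards):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic stand-in for one agent's per-episode reward array
rewards = 0.9 + rng.normal(0.0, 0.05, size=500)

window = 50
series       = pd.Series(rewards)
rolling_mean = series.rolling(window).mean()   # top-panel style curve
rolling_std  = series.rolling(window).std()    # bottom-panel style curve

print(f"final rolling mean: {rolling_mean.iloc[-1]:.3f}")
print(f"final rolling std : {rolling_std.iloc[-1]:.3f}")
```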

Section 6: Three-Way Policy Visualization

The reward curves show how fast each algorithm improved — but not what they learned about individual states. The policy maps below display the final greedy action in every cell for all three algorithms simultaneously. Hover over any cell to inspect all four Q-values and the maximum Q. Where all three arrows agree, the algorithms converged to the same decision despite their different update rules. Where they diverge, the algorithmic distinction left a visible mark on the learned values.

On a deterministic 4×4 grid with 500 episodes, all three policies typically agree on the greedy path to the goal. The more informative signal sits in the hover Q-values: Q-Learning tends to overestimate state values because its greedy bootstrap applies the maximum at every step; standard SARSA tends to underestimate them because exploratory actions — which carry the step penalty — are fully reflected in Q(S_{t+1}, A_{t+1}); Expected SARSA typically sits between the two, weighting greedy and exploratory actions by their actual probabilities under π.
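The greedy-arrow layer of such a map reduces to one argmax per state. The sketch below uses a random stand-in Q-table (the real ones are esarsa_q, sarsa_q, and ql_q from training):

```python
import numpy as np

GRID_SIZE = 4
ARROWS = {0: "↑", 1: "↓", 2: "←", 3: "→"}   # matches the ACTIONS mapping

# Random stand-in Q-table with shape (n_states, n_actions)
rng = np.random.default_rng(5)
q = rng.normal(size=(GRID_SIZE * GRID_SIZE, 4))

greedy = np.argmax(q, axis=1)   # greedy action index per state
policy_grid = np.array([ARROWS[a] for a in greedy]).reshape(GRID_SIZE, GRID_SIZE)

print("\n".join(" ".join(row) for row in policy_grid))
```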


When to Use Expected SARSA

Use Expected SARSA when:

  • Your behavior policy is stochastic (ε-greedy or softmax) and sample variance from A_{t+1} is slowing or destabilizing learning — this is the primary motivation for the algorithm.
  • You want off-policy behavior (a target policy different from your data-collection policy) without the divergence risks of off-policy standard SARSA.
  • The action space |𝒜| is small enough that computing the expectation per step is inexpensive — each update costs O(|𝒜|) rather than O(1), but the constant is small in tabular settings.
  • You want a single algorithm that spans the whole SARSA–Q-Learning spectrum: set the evaluation policy to ε-greedy for SARSA-equivalent behavior, or purely greedy (ε → 0) to recover Q-Learning exactly.

Consider Q-Learning instead when:

  • You are building a deep RL system (DQN and variants) where computing the full expectation over a large output layer is expensive — most deep RL practice uses sample-based targets or distributional methods.
  • Fast value propagation is more important than variance control and the environment allows cheap exploration.

Consider standard SARSA instead when:

  • Safety matters and you want Q-values that honestly reflect the cost of the deployed exploration policy — SARSA’s values directly encode the expected return under ε-greedy at runtime, not the hypothetical greedy strategy.
  • Interpretability of the value function during training is a system requirement.

The real-world positioning:

Expected SARSA is the theoretically cleanest choice for classical model-free TD control in tabular and small discrete settings. It removes a source of variance that standard SARSA retains and a source of bias that Q-Learning introduces, and it does so with a single additional dot product per step. In environments where stochastic policies are non-trivial and action spaces remain manageable, it is the right default.

The 4×4 grid was deterministic and small enough that all three algorithms converged to the same greedy policy. In larger environments with stochastic transitions, delayed rewards, or higher exploration rates, the variance reduction compounds over thousands of steps and the choice between these three update rules becomes one of the most consequential design decisions in the system.


References

[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning/

[2] van Seijen, H., van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) (pp. 177–184). IEEE. https://ieeexplore.ieee.org/document/4927542

[3] Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292. https://doi.org/10.1007/BF00992698

[4] Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems (Technical Report CUED/F-INFENG/TR 166). Cambridge University Engineering Department.