import random
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# Fixed seeds make SARSA vs Q-learning runs reproducible.
np.random.seed(42)
random.seed(42)
GRID_SIZE = 2
N_ACTIONS = 4
EPISODES = 300
ALPHA = 0.1
GAMMA = 0.95
EPSILON = 0.1
MAX_STEPS = 20
SARSA vs Q-Learning: Understanding On-Policy and Off-Policy RL
Why This Distinction Matters
Reinforcement Learning (RL) focuses on how an agent learns from its interactions with the environment to develop a strategy for future actions. It offers two fundamental approaches for learning action-value functions:
- On-policy methods like SARSA, which learn from the actions the agent actually takes.
- Off-policy methods like Q-learning, which learn from the estimated-best next action, regardless of what the agent actually does.
The difference between these two approaches is crucial: it changes not only the value-function update itself but also the agent's behavior, including its willingness to take risks, how it navigates the environment, and how consistent its learning is over time.
As agents transition from controlled environments to real-world situations, like self-driving cars or industrial automation, it is essential to understand how they balance exploration and exploitation.
The Role of Temporal Difference Learning
Both SARSA and Q‑learning belong to a broader family of methods called Temporal Difference (TD) learning—the core mechanism that enables agents to update their value estimates step‑by‑step while interacting with the environment.
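As a minimal sketch of the mechanism (on a hypothetical two-state chain, not the grid world used below), a TD(0) prediction update looks like this:

```python
# TD(0) prediction on a two-state chain: s0 -> s1 -> terminal.
# Reward 0 on the first transition, 1 on the second; each value estimate
# is updated immediately after the corresponding transition is observed.
alpha, gamma = 0.1, 1.0
V = [0.0, 0.0]  # value estimates for s0 and s1

for _ in range(1000):
    # s0 -> s1, reward 0: bootstrap from the current estimate of V[s1].
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])
    # s1 -> terminal, reward 1: a terminal state's value is 0 by definition.
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])

print(V)  # both estimates approach the true returns of 1.0
```

Each update nudges an estimate toward "reward plus discounted estimate of the next state", which is exactly the structure both SARSA and Q-learning reuse for action values.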
TD learning is what makes real‑time reinforcement learning possible: instead of waiting for an episode to finish (as in Monte Carlo methods), an agent can immediately adjust its predictions based on partial experience. This incremental update process is exactly where the paths of on‑policy and off‑policy methods diverge, and it determines how each algorithm responds to exploration, uncertainty, and risk.
Core Update Rules
SARSA (On-Policy)
Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
Q-Learning (Off-Policy)
Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
SARSA updates with the next action the policy actually selected, while Q-learning updates with the greedy best next action.
Intuition
SARSA is cautious because it learns from what really happened under exploration.
Q-learning is optimistic because it bootstraps from the best estimated future action.
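To make the intuition concrete, here is a sketch comparing the two targets for one next-state row of a Q-table (the Q-values and the explored action are made up for illustration):

```python
import numpy as np

# Hypothetical Q-values for the next state s'.
q_next = np.array([0.2, 0.8, -0.1, 0.5])
r, gamma = -0.01, 0.95  # the step penalty and discount used later in this post

# Suppose epsilon-greedy exploration happened to pick the poor action 2 as a'.
a_next = 2
sarsa_target = r + gamma * q_next[a_next]   # values the action actually taken
qlearn_target = r + gamma * q_next.max()    # values the greedy best action

print(f"SARSA target:      {sarsa_target:+.3f}")
print(f"Q-learning target: {qlearn_target:+.3f}")
```

SARSA's target absorbs the cost of the exploratory misstep; Q-learning's target ignores it entirely.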
Quick Comparison
| Aspect | SARSA (On-Policy) | Q-Learning (Off-Policy) |
|---|---|---|
| Next-step target | Q(s', a') | max_a Q(s', a) |
| Learns from | Behavior policy | Target greedy policy |
| Exploration effect | Fully reflected in updates | Decoupled from target update |
| Typical behavior | Safer, smoother | More aggressive, often faster |
Section 1: Setup and Reproducibility
Before we code SARSA and Q-learning, we need a tiny controlled experiment we can trust. If the setup keeps changing, it becomes hard to tell whether behavior differences come from the algorithm or from the environment. So both methods will use the same environment and the same knobs, and we start by locking that in.
The goal of this section is to create one stable baseline for the entire comparison. In the next section, we run SARSA and Q-learning under identical conditions so the results are directly comparable. Plain takeaway: keep everything else fixed first, then any difference you see is really about the learning rule.
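A quick sanity check of the seeding (a sketch; `sample_draws` is an illustrative helper, not part of the experiment):

```python
import random
import numpy as np

def sample_draws(seed):
    # Re-seeding both generators before sampling makes runs repeatable.
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), float(np.random.rand())

run1 = sample_draws(42)
run2 = sample_draws(42)
print(run1 == run2)  # True: identical seeds give identical draws
```

Both `random` and `numpy.random` are seeded because the code below uses `random` for action selection and NumPy for the Q-tables.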
Section 2: Define a Minimal 2x2 Grid Environment
Now that the setup values are fixed, we define a tiny environment before implementing either learning rule. This section keeps the dynamics simple enough to inspect step by step, so debugging does not get mixed with algorithm behavior. A minimal environment also makes it easier to see whether SARSA and Q-learning differ because of their update targets, not because of complex world mechanics.
ACTIONS = {
0: (-1, 0), # up
1: (1, 0), # down
2: (0, -1), # left
3: (0, 1), # right
}
class GridEnv:
"""Tiny deterministic 2x2 grid world for SARSA vs Q-learning comparison.
States are grid cells encoded as integers, and actions are up/down/left/right
moves. The agent gets +1.0 at the goal cell, a small -0.01 step penalty
otherwise, and an episode ends on goal reach or max step limit.
"""
def __init__(self, grid_size=GRID_SIZE, max_steps=MAX_STEPS):
"""Initialize environment size, goal location, and episode counters."""
self.grid_size = grid_size
self.max_steps = max_steps
self.goal_pos = (grid_size - 1, grid_size - 1)
self.agent_pos = (0, 0)
self.steps = 0
def state_to_pos(self, state):
"""Convert an integer state id to a (row, col) grid position."""
return divmod(state, self.grid_size)
def pos_to_state(self, pos):
"""Convert a (row, col) grid position to its integer state id."""
return pos[0] * self.grid_size + pos[1]
def reset(self):
"""Reset to the start state and return the starting state id."""
self.agent_pos = (0, 0)
self.steps = 0
return self.pos_to_state(self.agent_pos)
def step(self, action):
"""Apply one action and return (next_state, reward, done).
Returns the next state id, scalar reward, and episode-termination flag.
"""
dr, dc = ACTIONS[action]
# Clip movement at grid boundaries so the agent cannot leave the board.
r = min(max(self.agent_pos[0] + dr, 0), self.grid_size - 1)
c = min(max(self.agent_pos[1] + dc, 0), self.grid_size - 1)
self.agent_pos = (r, c)
self.steps += 1
at_goal = self.agent_pos == self.goal_pos
# Reward is sparse: +1 at goal, small living penalty otherwise.
reward = 1.0 if at_goal else -0.01
# Episode ends when goal is reached or max step budget is exhausted.
done = at_goal or self.steps >= self.max_steps
return self.pos_to_state(self.agent_pos), reward, done
env = GridEnv()
state = env.reset()
for action in [3, 1, 3]:
next_state, reward, done = env.step(action)
print(f"s={state}, a={action}, s'={next_state}, r={reward}, done={done}")
state = next_state
    if done:
        break
s=0, a=3, s'=1, r=-0.01, done=False
s=1, a=1, s'=3, r=1.0, done=True
We use a 2x2 grid so every state transition is easy to see and debug. Four actions (up, down, left, right) are enough to capture movement without extra complexity. A goal reward, small step penalty, and step limit keep both algorithms under the same pressure, so the comparison stays fair.
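The boundary behavior is worth checking in isolation; this sketch restates the clipping logic from `GridEnv.step` as a standalone helper so it can be verified without the class:

```python
GRID_SIZE = 2
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def clipped_move(pos, action):
    # Clamp each coordinate so a move can never leave the board.
    dr, dc = ACTIONS[action]
    r = min(max(pos[0] + dr, 0), GRID_SIZE - 1)
    c = min(max(pos[1] + dc, 0), GRID_SIZE - 1)
    return (r, c)

print(clipped_move((0, 0), 0))  # up against the wall: stays at (0, 0)
print(clipped_move((0, 0), 1))  # down: reaches (1, 0)
print(clipped_move((1, 1), 3))  # right from the goal corner: stays at (1, 1)
```

Because bumping a wall still costs the -0.01 step penalty, wasted moves are mildly punished rather than forbidden.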
Section 2.5: Visualizing the Grid
Before training either agent, it helps to see the world they navigate. The 2×2 grid has four cells — one start, one goal, and two intermediates. Hover over any cell to inspect its state ID, coordinates, and reward. Toggle between All Actions to see every valid transition, and Optimal Path to see the two-step route both algorithms are trying to learn.
State 0 (top-left) is where every episode begins; state 3 (bottom-right) is the only rewarding cell. The optimal path covers 2 steps — every algorithmic difference between SARSA and Q-learning stems from how each one values these transitions under exploration.
Section 3: SARSA Agent
Now that the environment is in place and we can see the grid, the first learning rule to implement is SARSA. Its defining characteristic is the update target: it uses Q(s', a') — the Q-value of the action the agent actually selects next under its exploration policy. This makes SARSA fully on-policy: what happens during exploration directly shapes what the agent learns.
def epsilon_greedy(q, state, epsilon):
    """Return a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(q[state]))
def sarsa_train(env, episodes=EPISODES, alpha=ALPHA, gamma=GAMMA, epsilon=EPSILON):
q = np.zeros((GRID_SIZE * GRID_SIZE, N_ACTIONS))
rewards = []
for _ in range(episodes):
s = env.reset()
a = epsilon_greedy(q, s, epsilon)
total = 0.0
while True:
s_next, r, done = env.step(a)
a_next = epsilon_greedy(q, s_next, epsilon)
# On-policy target: Q(s', a') — the action we will actually take
q[s, a] += alpha * (r + gamma * q[s_next, a_next] * (not done) - q[s, a])
s, a = s_next, a_next
total += r
if done:
break
rewards.append(total)
    return np.array(rewards), q
The one line that defines SARSA is the update: q[s_next, a_next] uses the action the agent will actually take next — chosen by the same ε-greedy policy that might explore randomly. The Q-table learns to account for that randomness, which tends to make SARSA’s learned values more conservative than Q-learning’s.
sarsa_rewards, sarsa_q = sarsa_train(GridEnv())
print(f"SARSA | mean reward (all eps): {sarsa_rewards.mean():.3f}")
print(f"SARSA | mean reward (last 50 eps): {sarsa_rewards[-50:].mean():.3f}")
SARSA | mean reward (all eps): 0.988
SARSA | mean reward (last 50 eps): 0.988
The gap between the all-episode mean and the last-50 mean tells us how quickly SARSA found a reliable path; here the two are essentially identical, so on this tiny grid SARSA converged almost immediately. We will compare this trajectory directly against Q-learning in Section 5 — for now, sarsa_rewards and sarsa_q are stored for that comparison.
Section 4: Q-Learning Agent
Q-learning uses the same ε-greedy exploration as SARSA but its update target is different in one crucial way: instead of Q(s', a_next) — the value of the action the agent will actually take — it uses max_a Q(s', a) — the value of the best action available, regardless of what the policy would choose. This decouples learning from behavior, making Q-learning off-policy.
def qlearning_train(env, episodes=EPISODES, alpha=ALPHA, gamma=GAMMA, epsilon=EPSILON):
q = np.zeros((GRID_SIZE * GRID_SIZE, N_ACTIONS))
rewards = []
for _ in range(episodes):
s = env.reset()
total = 0.0
while True:
a = epsilon_greedy(q, s, epsilon)
s_next, r, done = env.step(a)
# Off-policy target: max Q(s', ·) — the greedy best, not the action we'll take
q[s, a] += alpha * (r + gamma * np.max(q[s_next]) * (not done) - q[s, a])
s = s_next
total += r
if done:
break
rewards.append(total)
    return np.array(rewards), q
Compare the two update lines side by side:
- SARSA: q[s_next, a_next] — value of the action ε-greedy actually picked
- Q-learning: np.max(q[s_next]) — value of the best action available
Notice also that Q-learning doesn’t need to carry a_next across the loop boundary — it has no use for the next action until the next step.
ql_rewards, ql_q = qlearning_train(GridEnv())
print(f"Q-Learning | mean reward (all eps): {ql_rewards.mean():.3f}")
print(f"Q-Learning | mean reward (last 50 eps): {ql_rewards[-50:].mean():.3f}")
Q-Learning | mean reward (all eps): 0.988
Q-Learning | mean reward (last 50 eps): 0.989
Because Q-learning always bootstraps from the best possible next action, its Q-values tend to be higher — and its learned policy more aggressive — than SARSA’s. Whether that translates to faster convergence or just overconfidence becomes clear in Section 5.
Section 5: Training & Reward Curves
Both agents have now run 300 episodes on the same grid under identical hyperparameters. Putting their reward curves side by side reveals whether the algorithmic difference — one line of code — produces a measurable difference in learning speed or stability. Raw per-episode reward is noisy, so a 20-episode rolling average overlays each curve to surface the underlying trend.
If Q-learning’s rolling average climbs faster early, it reflects its optimistic target driving quicker value propagation. If SARSA’s variance stays lower in later episodes, it reflects the stabilizing effect of learning from actual rather than hypothetical behavior. Section 6 will show what each algorithm learned about individual states.
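The smoothing itself is a one-liner; here is a sketch on synthetic per-episode rewards (stand-ins for `sarsa_rewards` and `ql_rewards`):

```python
import numpy as np
import pandas as pd

# Noisy synthetic rewards playing the role of one agent's 300-episode run.
rng = np.random.default_rng(0)
rewards = 0.9 + 0.1 * rng.standard_normal(300)

# 20-episode rolling mean: the same smoothing overlaid on the reward curves.
rolling = pd.Series(rewards).rolling(window=20).mean()
print(f"raw std:     {rewards.std():.3f}")
print(f"rolling std: {rolling.dropna().std():.3f}")  # smoothing shrinks the noise
```

The first 19 entries of the rolling series are NaN because a full 20-episode window is not yet available, which is why the smoothed curve starts slightly later than the raw one.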
Section 6: Q-Value & Policy Visualization
The reward curves show how fast each algorithm improved — but not what they actually learned about individual states. This section inspects the final Q-tables directly: cell color encodes the base role of each state, and the white arrow shows the greedy policy — what each algorithm would do if exploration stopped. Hover over any cell to see all four action Q-values. Where the two policies show the same arrows, both algorithms converged on the same answer. Where they differ, the on-policy vs off-policy distinction left a visible mark.
If both grids show the same arrows, SARSA and Q-learning converged to the same greedy strategy despite their different update rules. Differences in hover Q-values reveal where the algorithms disagree on how much a state is worth — even when they agree on what to do there.
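Extracting the greedy arrows from a Q-table is a per-row argmax; a sketch on a hypothetical 4×4 table (made-up values, not the trained `sarsa_q` or `ql_q`):

```python
import numpy as np

ARROWS = {0: "↑", 1: "↓", 2: "←", 3: "→"}

# Hypothetical final Q-table: one row per state, one column per action.
q = np.array([
    [0.1, 0.2, 0.0, 0.9],  # state 0: right is best
    [0.0, 0.8, 0.1, 0.2],  # state 1: down is best
    [0.1, 0.0, 0.0, 0.7],  # state 2: right is best
    [0.0, 0.0, 0.0, 0.0],  # state 3: goal, values unused
])
greedy = [ARROWS[int(np.argmax(row))] for row in q]
print(greedy)  # the goal shows '↑' only because argmax of an all-zero row is 0
```

The terminal state's arrow is meaningless since no action is ever taken there; visualizations usually mask it out.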
When To Use Which
The two-line summary at the top of this post still holds, but the visualizations give it substance.
Use SARSA when:
- The environment is dangerous or costly to explore — SARSA learns that exploratory actions carry risk and incorporates that into its value estimates, nudging the policy toward safer paths.
- Your agent will be deployed with the same exploration policy it trained with — because SARSA’s Q-values directly reflect the ε-greedy policy’s behavior, they are more honest about what the agent will actually experience at runtime.
- Training stability matters more than speed — SARSA’s on-policy updates tend to produce smoother reward curves with lower variance, as seen in Section 5.
Use Q-learning when:
- You want to learn the optimal policy as fast as possible and exploration risk is acceptable — Q-learning’s optimistic bootstrapping drives faster early value propagation.
- You plan to separate the data-collection policy from the final deployment policy — off-policy learning makes this natural, and it underpins more powerful algorithms like DQN that learn from replay buffers of past experience.
- The environment is stationary and deterministic — Q-learning’s aggressive targets converge cleanly when the world does not change under the agent.
The real-world divide:
In practice, the on-policy vs off-policy distinction becomes critical as environments grow more complex. SARSA’s caution makes it a better fit for safety-sensitive domains — robotics, medical dosing, financial execution — where a catastrophic exploratory action cannot be undone. Q-learning and its deep counterparts dominate in simulation-first domains — game playing, recommendation systems, logistics — where cheap exploration and large replay buffers are available.
The 2×2 grid was small enough that both algorithms converged to the same greedy policy. In larger environments with stochastic transitions or delayed rewards, the difference in update targets compounds over thousands of steps — and the choice between on-policy and off-policy becomes one of the most consequential design decisions in the system.
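The compounding effect can be previewed analytically. This sketch computes the expectation of SARSA's sampled target under an ε-greedy policy (the Expected SARSA form) against Q-learning's max target, for a hypothetical next state where one action is catastrophic:

```python
import numpy as np

# Hypothetical next-state Q-values: one good action, one catastrophic one.
q_next = np.array([1.0, 0.0, 0.0, -10.0])

def expected_sarsa_target(q, epsilon):
    # epsilon-greedy distribution: uniform mass epsilon over all actions,
    # with the remaining 1 - epsilon placed on the greedy action.
    probs = np.full(len(q), epsilon / len(q))
    probs[int(np.argmax(q))] += 1.0 - epsilon
    return float(probs @ q)

for eps in (0.0, 0.1, 0.3):
    print(f"eps={eps}: expected SARSA target {expected_sarsa_target(q_next, eps):+.3f} "
          f"vs Q-learning target {float(np.max(q_next)):+.3f}")
```

As ε grows, the risk of the catastrophic action drags SARSA's expected target down, while Q-learning's target never moves; this is the mechanism that steers SARSA away from cliff edges in larger environments.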
References
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning/
[2] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44. https://link.springer.com/article/10.1007/BF00992816
[3] Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems (Technical Report CUED/F-INFENG/TR 166). Cambridge University Engineering Department.
[4] Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292. https://doi.org/10.1007/BF00992698