Expected SARSA: Bridging On-Policy and Off-Policy TD Control
Why Expected SARSA?
In a previous post, we placed SARSA and Q-Learning side by side and observed the clearest expression of the on-policy vs. off-policy divide. That comparison surfaced a deeper tension: SARSA is honest about the cost of exploration, but it carries sampling noise into every update; Q-Learning bootstraps from the imagined best action, gaining speed at the cost of accuracy about what the agent will actually do. Both compromises show up empirically: SARSA's updates are higher-variance, while Q-Learning's value estimates describe a greedy policy the exploring agent never quite follows.
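To make the contrast concrete, here are the standard one-step TD updates in question, with learning rate \(\alpha\), discount \(\gamma\), and behavior policy \(\pi\); Expected SARSA's rule is previewed alongside them for comparison:

```latex
% SARSA: bootstrap from the action a' actually sampled from the policy
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s,a) \right]

% Q-Learning: bootstrap from the greedy action, regardless of what the agent will do
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]

% Expected SARSA: bootstrap from the expectation over the policy's action distribution
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s', a') - Q(s,a) \right]
```

SARSA's target depends on the single sampled \(a'\), which is where its update noise comes from; Q-Learning's target ignores the policy entirely; Expected SARSA averages over the policy, removing the sampling step while still accounting for exploration.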