Deep Q-Networks: From Tables to Neural Function Approximators

Author

Ravi Sankar Krothapalli

Published

March 28, 2026

Why Deep Q-Networks?

The posts so far have been honest about a hidden assumption: the state space is small enough to fit in a table. On a 4×4 grid with 16 integer states, Q-tables are the right tool. But the real problems that made reinforcement learning famous — Atari games, robotic locomotion, navigation in continuous space — have state spaces that are either continuous or so vast that enumeration is physically impossible. A Q-table for a 210×160 pixel Atari screen (the native ALE resolution) would need a row for every distinct pixel configuration: $256^{210 \times 160}$ rows — and DQN actually operates on 84×84 grayscale preprocessed frames, meaning $256^{84 \times 84}$ entries. The table approach is not merely inefficient; it is categorically ruled out.

Deep Q-Networks (DQN), introduced by Mnih et al. at DeepMind, resolved this by replacing the Q-table with a neural network — a universal function approximator that generalizes across states. The core Q-Learning update rule is unchanged. The decisive contribution came in two stages: the 2013 NIPS workshop paper (arXiv:1312.5602) introduced experience replay; the 2015 Nature paper added the target network, together achieving reliable convergence from raw high-dimensional pixel observations at scale for the first time.

This post builds both stabilizers from scratch using NumPy only — the same computations that automatic differentiation frameworks perform under the hood, here made fully transparent. Implementing backpropagation explicitly is the most reliable way to understand what loss.backward() actually does.

Post road map: The theory sections establish why tables break (The Scaling Wall), then develop each DQN component in isolation — neural approximation, experience replay, target network. The implementation sections wire these components into a working agent on a continuous-state variant of the same grid world, finishing with a reward/loss diagnostic and an interactive Q-value landscape heatmap.


The Scaling Wall: When Q-Tables Break

Tabular Q-Learning stores exactly one value per $(s, a)$ pair. This works when:

  • The state space $\mathcal{S}$ is finite and small enough to fully enumerate.
  • Every meaningful $(s, a)$ pair receives enough visits for its Q-value to converge.

Both conditions fail together as state space grows. Consider the same 4×4 grid, but with the agent’s position reported as a 2D float vector $\mathbf{o} = \bigl[r/(N-1),\ c/(N-1)\bigr] \in [0,1]^2$ instead of an integer index. A hash-map approach would need a distinct entry for the exact float pair at each visit; states seen once get one update and are never revisited. In continuous control problems — velocity, joint angle, gravity — the table never reaches useful density regardless of episode count.

The requirement shifts from memorization to generalization: a reliable value estimate for state $(0.34, 0.67)$ should inform the (unvisited) neighbouring state $(0.35, 0.67)$. Neural networks provide this interpolation by design.
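The memorization failure is easy to demonstrate. The sketch below is illustrative (not code from this post): a dict-backed Q-table keyed on continuous observations, where essentially every float state hashes to a fresh entry that is never revisited.

```python
import numpy as np

# Hypothetical dict-backed "Q-table" over continuous observations.
rng = np.random.default_rng(0)
table = {}
for _ in range(10_000):
    obs = tuple(np.round(rng.random(2), 6))   # a continuous 2D state
    table[obs] = table.get(obs, 0) + 1        # count visits per exact state

print(len(table))           # ~10,000 distinct keys: the table just grows
print(max(table.values()))  # almost every state is visited exactly once
```

With roughly 10¹² representable keys and only 10⁴ visits, virtually no state is ever seen twice, so no tabular Q-value can converge; a function approximator sidesteps this by sharing information between nearby states.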


Neural Function Approximation

Replace the Q-table with a parameterized function $Q(s, a;\,\theta)$, where $\theta$ are the learnable weights of a neural network. The network takes a state observation as input and returns a Q-value for every action in one forward pass.

The training target follows directly from the Bellman optimality equation. For a transition $(s, a, r, s')$, the TD error is:

$$\delta = r + \gamma \max_{a'} Q(s', a';\,\theta^{-}) - Q(s, a;\,\theta)$$

Training minimizes the Huber loss of the TD error, averaged over a random minibatch $\mathcal{B}$:

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s') \in \mathcal{B}} \mathcal{L}_\delta\!\left(r + \gamma \max_{a'} Q(s', a';\,\theta^{-}) - Q(s,a;\,\theta)\right)$$

where $\mathcal{L}_\delta(x) = \begin{cases} \tfrac{1}{2}x^2 & |x| \le 1 \\ |x| - \tfrac{1}{2} & \text{otherwise}\end{cases}$ is the Huber loss, and $\theta^{-}$ are the target network weights — a slowly drifting copy of $\theta$ used only for computing targets.
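To make the piecewise definition concrete, here is a small numeric check of $\mathcal{L}_\delta$ and its derivative (the clipped TD error) on a few hypothetical error values:

```python
import numpy as np

def huber(x):
    # Piecewise Huber loss with threshold 1, as defined in the text.
    return np.where(np.abs(x) <= 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def huber_grad(x):
    # Its derivative is simply the TD error clipped to [-1, 1].
    return np.clip(x, -1.0, 1.0)

td_errors = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(huber(td_errors))       # [2.5, 0.125, 0.0, 0.125, 2.5]
print(huber_grad(td_errors))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

The bounded gradient is the point: a single huge TD error can move the weights no more than an error of magnitude 1, which protects training from outlier targets.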

Variable decomposition:

| Symbol | Meaning |
| --- | --- |
| $\theta$ | Policy network weights — updated every optimization step |
| $\theta^{-}$ | Target network weights — soft-updated toward $\theta$ |
| $Q(s, a;\,\theta)$ | Neural Q-value: predicted return from state $s$, action $a$ |
| $\gamma \in [0,1]$ | Discount factor: relative value of future rewards |
| $\tau \in (0,1]$ | Soft update rate: $\theta^{-} \leftarrow \tau\theta + (1-\tau)\theta^{-}$ |
| $\mathcal{B}$ | Minibatch drawn uniformly from the replay buffer |

Experience Replay

Naive online Q-Learning with a neural network fails because consecutive transitions $(s_t, a_t, r_t, s_{t+1}),\ (s_{t+1}, \ldots)$ are highly correlated — each gradient update steers the weights toward the current episode’s trajectory rather than the general value function.

Experience replay breaks this correlation:

  1. Store every transition $(s, a, r, s', \text{done})$ in a fixed-capacity ring buffer.
  2. At each optimization step, draw a uniformly random minibatch from the buffer.

The resulting gradient is estimated over a decorrelated mix of past experience. The buffer also converts each transition into a reusable training sample — the network sees each experience multiple times, substantially improving data efficiency.


The Target Network

A second failure mode: if both the prediction $Q(s, a;\,\theta)$ and the target $r + \gamma \max_{a'} Q(s', a';\,\theta)$ are computed with the same weights $\theta$, every gradient step simultaneously moves both. The agent chases a moving reference — a recipe for the oscillation and divergence observed in practice.

The target network fixes this. The original 2015 Nature DQN paper used a periodic hard copy of the policy weights every $C$ steps (e.g., $C = 10{,}000$). A smoother modern alternative — adopted from DDPG (Lillicrap et al., 2015) — is the soft Polyak update applied at every step:

$$\theta^{-} \leftarrow \tau\,\theta + (1 - \tau)\,\theta^{-}$$

With $\tau = 0.005$, the target network is a rolling exponential average of the policy network, changing slowly enough that TD targets remain stable across thousands of steps. This implementation uses the soft update; both variants accomplish the same goal of decoupling the “what I predict” network from the “what I aim for” network.
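The exponential-average behaviour can be checked with a scalar toy model. As an illustration only (the real $\theta$ keeps moving during training), hold the policy weight fixed at 1.0 and watch the target weight close the gap with time constant roughly $1/\tau = 200$ steps:

```python
tau, theta, theta_minus = 0.005, 1.0, 0.0

for step in range(1, 1001):
    theta_minus = tau * theta + (1 - tau) * theta_minus
    if step in (200, 1000):
        # The gap shrinks geometrically: theta_minus = 1 - (1 - tau)**step
        print(step, round(theta_minus, 3))   # 200 -> 0.633, 1000 -> 0.993
```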


Section 1: Setup and Reproducibility

The tabular posts used NumPy throughout. DQN continues in that spirit — the network forward pass, backpropagation, and Adam weight update are all NumPy operations. Before writing any environment or network code, seeds and hyperparameters are fixed so training curves are deterministic and reproducible across machines.

import math
import random
from collections import deque

import numpy as np
import plotly.graph_objects as go

np.random.seed(42)
random.seed(42)

GRID_SIZE = 4 # same 4×4 world as the tabular posts
N_ACTIONS = 4 # up / down / left / right
OBS_DIM = 2 # continuous state: [row/(N-1), col/(N-1)]

EPISODES = 400
GAMMA = 0.95 # matches the tabular series
EPS_START = 1.0 # full random exploration at episode 0
EPS_END = 0.05 # 5 % random floor at convergence
EPS_DECAY = 100 # episode scale of exponential ε-decay
LR = 1e-3 # Adam learning rate
BATCH_SIZE = 32
BUFFER_SIZE = 5_000
TAU = 0.005 # Polyak soft-update rate
MAX_STEPS = 100
HIDDEN = 64 # hidden layer width

The setup mirrors the tabular posts in every dimension except OBS_DIM = 2: the agent now receives normalized float coordinates rather than a discrete state integer, making the Q-table strictly inapplicable. With EPS_DECAY = 100, $\varepsilon$ falls to roughly 0.1 by episode 300 and near its floor by 400, giving the agent ample exploitation time in the final quarter of training.
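The decay claim can be verified directly from the schedule (the same formula used in the training loop of Section 6, reproduced standalone here):

```python
import math

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 100

def epsilon(ep):
    # Exponential decay from EPS_START toward the EPS_END floor.
    return EPS_END + (EPS_START - EPS_END) * math.exp(-ep / EPS_DECAY)

for ep in (0, 100, 300, 399):
    print(ep, round(epsilon(ep), 3))  # 1.0, 0.399, 0.097, 0.068
```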

Section 2: A Continuous-State Grid Environment

The rules are identical to the tabular posts: start at (0,0), reach goal (3,3), reward +1 at the goal and −0.01 at every other step. The observation changes from integer state index to a 2D float32 vector $[r/3,\ c/3]$. Both components lie in $[0, 1]$. This single change is what requires a neural network — a Q-table cannot use arbitrary floats as row keys.

ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


class ContinuousGridEnv:
    """4×4 grid world with a continuous (normalized) float observation.

    Structurally identical to the GridEnv in the tabular posts. The only
    difference: _obs() returns [row/(N-1), col/(N-1)] in [0,1]^2 instead of
    an integer state index — a representation a Q-table cannot index into.
    """

    def __init__(self, grid_size=GRID_SIZE, max_steps=MAX_STEPS):
        self.grid_size = grid_size
        self.max_steps = max_steps
        self.n = grid_size - 1 # normalisation denominator
        self.goal = (grid_size - 1, grid_size - 1)
        self.pos = (0, 0)
        self.steps = 0

    def _obs(self):
        return np.array([self.pos[0] / self.n,
                         self.pos[1] / self.n], dtype=np.float32)

    def reset(self):
        self.pos = (0, 0)
        self.steps = 0
        return self._obs()

    def step(self, action):
        dr, dc = ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.grid_size - 1)
        c = min(max(self.pos[1] + dc, 0), self.grid_size - 1)
        self.pos = (r, c)
        self.steps += 1
        at_goal = self.pos == self.goal
        reward = 1.0 if at_goal else -0.01
        done = at_goal or self.steps >= self.max_steps
        return self._obs(), reward, done


env = ContinuousGridEnv()
obs = env.reset()
print(f"Start observation (float32): {obs} — shape {obs.shape}")
for a in [3, 1, 3, 1, 3, 1]:
    obs, r, done = env.step(a)
    print(f" obs={obs} r={r:.2f} done={done}")
    if done:
        break
Start observation (float32): [0. 0.] — shape (2,)
 obs=[0.         0.33333334] r=-0.01 done=False
 obs=[0.33333334 0.33333334] r=-0.01 done=False
 obs=[0.33333334 0.6666667 ] r=-0.01 done=False
 obs=[0.6666667 0.6666667] r=-0.01 done=False
 obs=[0.6666667 1.       ] r=-0.01 done=False
 obs=[1. 1.] r=1.00 done=True

The test trace follows the same right-then-down corridor — the optimal 6-step path. Goal state (3,3) now maps to [1.0, 1.0] rather than integer 15. The neural network will learn to assign high Q-values to observations near [1.0, 1.0] and lower values proportional to grid distance, generalizing smoothly to states it never visited during training.

Section 3: The Q-Network with Explicit Backpropagation

The neural Q-network maps a 2D float observation to Q-values for all four actions simultaneously. The architecture is a two-hidden-layer ReLU MLP; all forward, backward, and weight-update steps are written explicitly in NumPy. Seeing the chain rule applied layer-by-layer is the clearest path to understanding what loss.backward() computes in any automatic differentiation library.

class QNetwork:
    """Two-layer ReLU Q-network with explicit backpropagation and Adam.

    Architecture: obs_dim -> HIDDEN -> HIDDEN -> n_actions (linear output)
    Optimizer: Adam with gradient-norm clipping (max norm 10).

    Calling forward() caches activations for the subsequent backward pass.
    Use predict() for inference — it skips the cache to avoid side effects.
    """

    def __init__(self, obs_dim=OBS_DIM, n_actions=N_ACTIONS,
                 hidden=HIDDEN, lr=LR, seed=0):
        self.n_actions = n_actions
        rng = np.random.default_rng(seed)

        # He (Kaiming) initialisation — correct scale for ReLU activations
        self.W1 = rng.standard_normal((hidden, obs_dim)) * np.sqrt(2.0 / obs_dim)
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, hidden)) * np.sqrt(2.0 / hidden)
        self.b2 = np.zeros(hidden)
        self.W3 = rng.standard_normal((n_actions, hidden)) * np.sqrt(2.0 / hidden)
        self.b3 = np.zeros(n_actions)

        # Adam optimiser state (one m, v pair per parameter tensor)
        self._lr = lr
        self._t = 0
        self._beta1, self._beta2, self._eps = 0.9, 0.999, 1e-8
        shapes = [self.W1.shape, self.b1.shape,
                  self.W2.shape, self.b2.shape,
                  self.W3.shape, self.b3.shape]
        self._m = [np.zeros(s) for s in shapes]
        self._v = [np.zeros(s) for s in shapes]

    # ------------------------------------------------------------------
    def forward(self, x):
        """Batch forward pass; caches pre-activations for backward."""
        self._x = x
        self._z1 = x @ self.W1.T + self.b1 # (B, hidden)
        self._h1 = np.maximum(0.0, self._z1)
        self._z2 = self._h1 @ self.W2.T + self.b2
        self._h2 = np.maximum(0.0, self._z2)
        return self._h2 @ self.W3.T + self.b3 # (B, n_actions)

    def predict(self, x):
        """Inference-only forward pass — does not overwrite the cache."""
        h1 = np.maximum(0.0, x @ self.W1.T + self.b1)
        h2 = np.maximum(0.0, h1 @ self.W2.T + self.b2)
        return h2 @ self.W3.T + self.b3

    # ------------------------------------------------------------------
    def backward_and_update(self, dq):
        """Backprop + Adam update given output gradient dq (B, n_actions)."""
        B = dq.shape[0]

        # Output layer
        dW3 = (dq.T @ self._h2) / B
        db3 = dq.mean(axis=0)

        # Hidden layer 2 — chain rule through ReLU
        dh2 = dq @ self.W3
        dz2 = dh2 * (self._z2 > 0)
        dW2 = (dz2.T @ self._h1) / B
        db2 = dz2.mean(axis=0)

        # Hidden layer 1
        dh1 = dz2 @ self.W2
        dz1 = dh1 * (self._z1 > 0)
        dW1 = (dz1.T @ self._x) / B
        db1 = dz1.mean(axis=0)

        grads = [dW1, db1, dW2, db2, dW3, db3]
        params = [self.W1, self.b1, self.W2, self.b2, self.W3, self.b3]

        # Global gradient-norm clipping (cap at 10)
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        clip_coef = min(1.0, 10.0 / (total_norm + 1e-6))

        # Adam update with bias correction
        self._t += 1
        bc1 = 1.0 - self._beta1 ** self._t
        bc2 = 1.0 - self._beta2 ** self._t
        for i, (p, g) in enumerate(zip(params, grads)):
            g = g * clip_coef
            self._m[i] = self._beta1 * self._m[i] + (1.0 - self._beta1) * g
            self._v[i] = self._beta2 * self._v[i] + (1.0 - self._beta2) * g ** 2
            p -= self._lr * (self._m[i] / bc1) / (
                np.sqrt(self._v[i] / bc2) + self._eps)

    # ------------------------------------------------------------------
    def soft_update_from(self, source, tau):
        """Polyak update: self.weights <- tau * source + (1-tau) * self."""
        for s_arr, t_arr in [
            (source.W1, self.W1), (source.b1, self.b1),
            (source.W2, self.W2), (source.b2, self.b2),
            (source.W3, self.W3), (source.b3, self.b3),
        ]:
            t_arr[:] = tau * s_arr + (1.0 - tau) * t_arr


net = QNetwork()
dummy = np.zeros((1, OBS_DIM), dtype=np.float32)
out = net.predict(dummy)
n_params = (net.W1.size + net.b1.size + net.W2.size +
            net.b2.size + net.W3.size + net.b3.size)
print(f"Input: {list(dummy.shape)}")
print(f"Output: {list(out.shape)} (one Q-value per action)")
print(f"Total parameters: {n_params:,}")
Input: [1, 2]
Output: [1, 4] (one Q-value per action)
Total parameters: 4,612

The backward pass follows the chain rule exactly as taught: start from the output gradient dq, propagate back through the linear layer, multiply by the ReLU derivative (a binary mask of where $z > 0$), accumulate parameter gradients, and repeat for each layer. The Adam update then applies bias-corrected first- and second-moment estimates — the same algorithm that torch.optim.Adam runs internally, with gradient-norm clipping added on top.
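A finite-difference check is the standard way to validate a hand-written backward pass. The sketch below uses a smaller single-hidden-layer net rather than the QNetwork class itself, but applies the identical chain-rule pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny one-hidden-layer ReLU net, same structure as one stage of QNetwork.
W1 = rng.standard_normal((8, 2)); b1 = np.zeros(8)
W2 = rng.standard_normal((4, 8)); b2 = np.zeros(4)
x = rng.standard_normal((5, 2))        # batch of 5 fake observations
t = rng.standard_normal((5, 4))        # arbitrary regression targets

def loss(W1_):
    h = np.maximum(0.0, x @ W1_.T + b1)
    y = h @ W2.T + b2
    return 0.5 * np.mean((y - t) ** 2)

# Analytic gradient of the loss w.r.t. W1, via the chain rule
h = np.maximum(0.0, x @ W1.T + b1)
y = h @ W2.T + b2
dy = (y - t) / y.size                  # dL/dy for a mean over all elements
dh = dy @ W2                           # back through the output layer
dz = dh * (h > 0)                      # ReLU mask
dW1 = dz.T @ x                         # parameter gradient, shape (8, 2)

# Central finite difference on one weight entry
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
Wm = W1.copy(); Wm[0, 0] -= eps
num = (loss(Wp) - loss(Wm)) / (2 * eps)
print(abs(num - dW1[0, 0]) < 1e-6)     # True: backprop matches finite differences
```

Running the same style of check against QNetwork.backward_and_update (with the Adam step temporarily disabled) is a worthwhile exercise before trusting any training curve.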

Section 4: The Replay Buffer

The replay buffer is a fixed-capacity ring that stores transitions. Its .sample() method draws a uniformly random minibatch and returns five NumPy arrays ready for the optimization step. When the buffer is at capacity, incoming transitions silently overwrite the oldest entries.

class ReplayBuffer:
    """Fixed-capacity ring buffer for (s, a, r, s', done) transitions.

    Random minibatch sampling breaks the temporal correlation that destabilises
    online gradient descent on sequential RL data.
    """

    def __init__(self, capacity=BUFFER_SIZE):
        self.buf = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buf.append((state, action, float(reward),
                         next_state, float(done)))

    def sample(self, batch_size=BATCH_SIZE):
        batch = random.sample(self.buf, batch_size)
        s, a, r, ns, d = zip(*batch)
        return (
            np.array(s, dtype=np.float32),
            np.array(a, dtype=np.int32),
            np.array(r, dtype=np.float32),
            np.array(ns, dtype=np.float32),
            np.array(d, dtype=np.float32),
        )

    def __len__(self):
        return len(self.buf)


buf = ReplayBuffer()
for _ in range(100):
    o = np.random.rand(OBS_DIM).astype(np.float32)
    o_ = np.random.rand(OBS_DIM).astype(np.float32)
    buf.push(o, random.randrange(N_ACTIONS), -0.01, o_, False)

s_b, a_b, r_b, ns_b, d_b = buf.sample(16)
print(f"Buffer size: {len(buf)}")
print(f"Batch — s:{list(s_b.shape)} a:{list(a_b.shape)} "
      f"r:{list(r_b.shape)} s':{list(ns_b.shape)} d:{list(d_b.shape)}")
Buffer size: 100
Batch — s:[16, 2] a:[16] r:[16] s':[16, 2] d:[16]

The buffer capacity of 5,000 transitions spans several hundred episodes once the policy has converged to short paths, providing gradient estimates over a broad temporal window. A buffer that is too small causes the network to overfit the agent’s most recent trajectory; too large, and early random-walk transitions distort learning long after the policy has improved. The sweet spot trades off recency against diversity.
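The silent-overwrite behaviour needs no custom code; it is exactly what collections.deque with a maxlen provides:

```python
from collections import deque

buf = deque(maxlen=3)     # capacity-3 ring buffer
for t in range(5):
    buf.append(t)         # appends 0..4; the oldest entries 0 and 1 are evicted

print(list(buf))          # [2, 3, 4]
```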

Section 5: The DQN Agent

The DQN agent wires together the policy network, target network, and replay buffer. select_action implements ε-greedy selection. optimize runs one minibatch update — computes the Huber loss gradient analytically, passes it backward through the policy network via explicit backprop, advances the soft Polyak target-network update, and returns the scalar loss.

class DQNAgent:
    """Minimal DQN agent implementing the four core DQN techniques.

    1. Neural Q-function — generalises across continuous state observations.
    2. Experience replay — random minibatches break temporal correlation.
    3. Target network — frozen theta_minus keeps TD targets stable.
    4. Soft Polyak update — theta_minus <- tau*theta + (1-tau)*theta_minus.
    """

    def __init__(self, seed=0):
        self.policy_net = QNetwork(seed=seed)
        self.target_net = QNetwork(seed=seed)
        # Initialise target net with identical weights
        self.target_net.soft_update_from(self.policy_net, tau=1.0)
        self.buffer = ReplayBuffer()

    def select_action(self, obs, epsilon):
        if random.random() < epsilon:
            return random.randrange(N_ACTIONS)
        q_vals = self.policy_net.predict(obs[np.newaxis, :]).squeeze(0)
        return int(q_vals.argmax())

    def optimize(self):
        if len(self.buffer) < BATCH_SIZE:
            return None

        states, actions, rewards, next_states, dones = self.buffer.sample()

        # Forward: Q(s, a; theta) — full output matrix for backprop
        q_all = self.policy_net.forward(states)  # (B, n_actions)
        q_pred = q_all[np.arange(BATCH_SIZE), actions]  # (B,)

        # Target: r + gamma * max_a' Q(s', a'; theta_minus) — no gradient
        q_next = self.target_net.predict(next_states).max(axis=1)
        q_target = rewards + GAMMA * q_next * (1.0 - dones)

        # Huber loss gradient: clip(q_pred - target, -1, 1)
        delta = q_pred - q_target  # (B,)
        dq_pred = np.clip(delta, -1.0, 1.0) / BATCH_SIZE  # (B,)

        # Sparse gradient: only the taken action receives a gradient signal
        dq_all = np.zeros_like(q_all)
        dq_all[np.arange(BATCH_SIZE), actions] = dq_pred

        self.policy_net.backward_and_update(dq_all)

        # Soft Polyak update of the target network
        self.target_net.soft_update_from(self.policy_net, TAU)

        # Scalar Huber loss for diagnostics
        loss = float(np.mean(np.where(
            np.abs(delta) <= 1.0,
            0.5 * delta ** 2,
            np.abs(delta) - 0.5,
        )))
        return loss

Three implementation details deserve attention: q_all[np.arange(B), actions] selects the Q-value for the action actually taken — not the maximum — because the Bellman equation applies only to the chosen action. The gradient is placed back into dq_all at the same index: actions not selected in this batch receive zero gradient, which is correct since their Q-values were not part of the loss computation. The target network’s predict() call (not forward()) is important — it leaves the policy network’s cached activations untouched, so the subsequent backward pass operates on the correct cache.
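The gather/scatter indexing is easiest to see in isolation. A toy sketch with a hypothetical 2-sample batch:

```python
import numpy as np

q_all = np.array([[1.0, 2.0, 3.0, 4.0],    # Q-values for sample 0
                  [5.0, 6.0, 7.0, 8.0]])   # Q-values for sample 1
actions = np.array([2, 0])                 # actions actually taken

# Gather: one Q-value per row, at the taken action's column
q_pred = q_all[np.arange(2), actions]
print(q_pred)                              # [3. 5.]

# Scatter: place gradients back at the same (row, action) positions;
# every other entry stays zero and receives no gradient signal.
dq_pred = np.array([0.1, -0.2])
dq_all = np.zeros_like(q_all)
dq_all[np.arange(2), actions] = dq_pred
print(dq_all)                              # non-zero only at (0, 2) and (1, 0)
```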

Section 6: Training

All components are assembled. The training loop runs EPISODES episodes, decaying $\varepsilon$ exponentially from 1.0 toward 0.05. Each step stores a transition in the buffer and calls optimize. Per-episode losses are averaged for the convergence diagnostic.

def run_dqn(seed=42):
    """Train DQN for EPISODES; return (episode_rewards, episode_losses, agent)."""
    np.random.seed(seed)
    random.seed(seed)

    env = ContinuousGridEnv()
    agent = DQNAgent(seed=seed)

    episode_rewards = []
    episode_losses = []

    for ep in range(EPISODES):
        eps = EPS_END + (EPS_START - EPS_END) * math.exp(-ep / EPS_DECAY)
        obs = env.reset()
        total = 0.0
        ep_losses = []

        while True:
            action = agent.select_action(obs, eps)
            obs_next, r, done = env.step(action)
            agent.buffer.push(obs, action, r, obs_next, done)

            loss = agent.optimize()
            if loss is not None:
                ep_losses.append(loss)

            obs = obs_next
            total += r
            if done:
                break

        episode_rewards.append(total)
        episode_losses.append(
            float(np.mean(ep_losses)) if ep_losses else 0.0
        )

    return np.array(episode_rewards), np.array(episode_losses), agent


rewards, losses, agent = run_dqn()

print(f"DQN | mean reward (all eps): {rewards.mean():.3f}")
print(f"DQN | mean reward (last 50 eps): {rewards[-50:].mean():.3f}")
print(f"DQN | mean loss (last 50 eps): {losses[-50:].mean():.4f}")
DQN | mean reward (all eps): 0.905
DQN | mean reward (last 50 eps): 0.948
DQN | mean loss (last 50 eps): 0.0005

The gap between the all-episode mean and the last-50 mean measures how much the agent improved over training. The loss figure should decrease from a high initial value (large TD errors from an uninitialised network) toward a stable low floor (residual Bellman approximation error). A loss that plateaus early without reward improvement suggests the learning rate needs tuning; a loss that diverges late indicates the buffer is being depleted of diverse experience.

Section 7: Reward and Loss Curves

The top panel traces the learning trajectory: early episodes accumulate near-zero or negative reward as the agent explores randomly. As the replay buffer populates and $\varepsilon$ decays, the rolling average climbs toward positive territory, reflecting reliable goal-reaching. The bottom panel shows the Huber loss: high early due to the uninitialised network’s large prediction errors, declining as the Q-estimates converge toward their Bellman targets. A residual loss plateau is expected — perfect Bellman consistency would require an infinite-capacity network trained on infinite data.

Section 8: Q-Value Landscape

The tabular posts compared greedy arrows across integer states. With a neural network, every location in the unit square $[0,1]^2$ has a well-defined Q-value. For each grid cell $(r, c)$, the observation $[r/3,\ c/3]$ is fed to the trained policy network; the maximum over actions gives the state value under the greedy policy. Cells closer to the goal should carry higher values — the network has learned that [1.0, 1.0] is the most rewarding state.

Cells on the direct path to the goal (bottom-right corner) should carry the highest Q-values. The arrows trace the greedy policy: cells on the optimal south-east corridor should point south or east. The start cell (top-left, S) carries the lowest value — the agent stands farthest from the goal and must traverse the entire grid. This continuous value landscape is the payoff of neural function approximation: the network interpolates Q-values at every floating-point position in $[0,1]^2$, not just the 16 discrete cells it was trained on.
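The heatmap itself reduces to a small loop over the 16 cell centres. The sketch below substitutes a hypothetical greedy_value stand-in for the trained network (with the real agent from Section 6 you would instead use agent.policy_net.predict(obs[None, :]).max()):

```python
import numpy as np

def greedy_value(obs):
    # Hypothetical stand-in: value falls off with Manhattan distance from
    # the goal observation [1.0, 1.0], mimicking the shape the network learns.
    r, c = obs
    return 1.0 - 0.01 * 3 * (abs(r - 1.0) + abs(c - 1.0))

values = np.zeros((4, 4))
for r in range(4):
    for c in range(4):
        values[r, c] = greedy_value(np.array([r / 3, c / 3]))

print(np.round(values, 2))   # peaks at 1.0 in the goal corner (3, 3)
```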


When to Use Deep Q-Networks

Use DQN when:

  • The state space is continuous, high-dimensional, or too large to enumerate — the fundamental requirement that rules out tabular methods.
  • You need generalization: nearby unvisited states should receive reasonable Q-value estimates derived from prior experience in neighbouring states.
  • The action space is discrete and finite — DQN outputs one unit per action; the optimal action is the argmax. Spaces up to a few hundred actions are tractable; continuous actions require a different architecture.
  • Your task fits within episodic boundaries and has a reward signal dense enough for standard TD updates.

Consider alternatives when:

  • The action space is continuous (joint torques, steering angle, thrust): use actor-critic methods — DDPG, TD3, or SAC — which learn a separate policy network and do not rely on a discrete argmax.
  • Sample efficiency is critical: model-based methods (Dreamer, MuZero) or off-policy actor-critics achieve the same performance with far fewer environment interactions.
  • The state is partially observable: vanilla DQN assumes Markovian observations; recurrent DQN (DRQN) or explicit state-estimation layers are needed when the current observation does not capture sufficient history.

References

[1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop 2013. https://arxiv.org/abs/1312.5602

[2] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236

[3] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations (ICLR 2016). https://arxiv.org/abs/1509.02971

[4] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015). https://arxiv.org/abs/1412.6980

[5] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV 2015). https://arxiv.org/abs/1502.01852

[6] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning/

Troubleshooting note: if you observe training instability despite correct hyperparameters, try Double DQN (select the action with the policy network, evaluate it with the target network) to reduce Q-value overestimation from the max operator, or Prioritized Experience Replay to focus gradient updates on the highest-error transitions.