Helpful context:


Most of the machine learning you encounter in a first course is supervised: you have data, you have labels, you minimize a loss. The answers are given to you - your job is to generalize them.

Reinforcement learning is different. There are no labels. An agent makes decisions, the world responds, and the agent receives a reward signal - a scalar that says how well it did. The goal is to learn a strategy that maximizes long-run reward, entirely from this feedback loop. This is how you train a robot to walk, a game-playing AI to beat humans at Go, or a language model to follow instructions.

The mathematical framework that makes all of this precise is the Markov Decision Process.


Markov Decision Processes

A Markov Decision Process (MDP) is a tuple $(S, A, P, R, \gamma)$:

  • $S$ - the state space. The set of all situations the agent can find itself in. Could be finite (the squares of a chessboard), countably infinite (integers), or continuous ($\mathbb{R}^d$).
  • $A$ - the action space. The set of decisions available to the agent. Could also be discrete (left/right/up/down) or continuous (a torque applied to a joint).
  • $P: S \times A \times S \to [0,1]$ - the transition function. $P(s' \mid s, a)$ is the probability of moving to state $s'$ when the agent takes action $a$ in state $s$. These are conditional probabilities: for any fixed $(s, a)$, $\sum_{s'} P(s' \mid s, a) = 1$.
  • $R: S \times A \to \mathbb{R}$ - the reward function. $R(s, a)$ is the immediate reward received when the agent takes action $a$ in state $s$. Sometimes written $R(s, a, s')$ if the reward also depends on where you land.
  • $\gamma \in [0, 1)$ - the discount factor. Controls how much future rewards are worth relative to immediate ones.

The Markov property: the next state depends only on the current state and action, not on the history. $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$. This is what makes the math tractable - the future is independent of the past given the present.

An MDP is a generalization of a Markov chain: if there are no actions, just states and transitions, you recover a Markov chain. Adding actions and rewards turns it into a decision problem.


Policies

A policy is the agent’s decision rule - a mapping from states to actions (or distributions over actions).

Deterministic policy: $\pi: S \to A$. In state $s$, always take action $\pi(s)$. No randomness in the decision itself.

Stochastic policy: $\pi: S \times A \to [0,1]$. $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$. For each $s$, $\sum_a \pi(a \mid s) = 1$. Stochastic policies are useful during learning (they encourage exploration) and in game-theoretic settings where determinism can be exploited.

A policy, combined with the MDP transition function, turns the MDP into a Markov chain: fix $\pi$, and the dynamics $P^\pi(s' \mid s) = \sum_a \pi(a \mid s) P(s' \mid s, a)$ describe a chain over states alone. This is why the Markov chain theory carries over directly.

History-based policies. More generally, a policy could map the entire history of interactions to an action: $\pi: (S \times A)^* \times S \to \Delta(A)$. In the infinite-horizon discounted MDP setting, this generality is unnecessary. Because of the Markov property - the state $s_t$ already summarizes everything about the past that matters for future rewards - there always exists a stationary Markov policy (one that maps only the current state to an action) that is optimal among all policies, including history-based ones. This is a non-trivial theorem, but it justifies searching only over $\pi: S \to A$. In settings where the state is partially observed (POMDPs), or in finite-horizon games where how many steps remain matters, history-based policies can be strictly better.


Trajectories and Probability Over Trajectories

A trajectory (or rollout) is a sequence of states and actions generated by running a policy in an MDP:

$$\tau = (s_0, a_0, s_1, a_1, s_2, a_2, \ldots)$$

Starting from an initial state $s_0 \sim \rho_0$ (the initial state distribution), the trajectory is generated by alternating: sample $a_t \sim \pi(\cdot \mid s_t)$, then sample $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.

The probability of a finite trajectory $(s_0, a_0, s_1, \ldots, s_T)$ under policy $\pi$ is:

$$P(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t).$$

This is just the chain rule of probability applied to the sequential structure. Each factor is either the probability of choosing an action given the current state (the policy), or the probability of the environment transitioning to the next state (the dynamics). The trajectory probability is their product.

This distribution over trajectories is the central object of RL theory. The agent wants to choose $\pi$ to make high-reward trajectories likely.


Returns and Discounting

The return of a trajectory is the total accumulated reward:

$$G_t = R(s_t, a_t) + \gamma R(s_{t+1}, a_{t+1}) + \gamma^2 R(s_{t+2}, a_{t+2}) + \cdots = \sum_{k=0}^\infty \gamma^k R(s_{t+k}, a_{t+k}).$$

The discount factor $\gamma \in [0,1)$ downweights future rewards. A reward received $k$ steps in the future is worth only $\gamma^k$ times a reward received now.

Why discount? Several reasons:

  1. Mathematical convenience. With $\gamma < 1$ and bounded rewards, the infinite sum converges: $|G_t| \leq R_{\max}/(1-\gamma)$.
  2. Economic interpretation. A dollar today is worth more than a dollar tomorrow. $\gamma = 1/(1+r)$ for interest rate $r$.
  3. Uncertainty. In a continuing task, there’s always a chance the episode ends. Discounting models the probability $1 - \gamma$ that each step is the last.

Effective horizon. The discount factor implies an effective planning horizon. Since $\gamma^k \approx 0$ for $k \gg 1/(1-\gamma)$, the agent effectively plans over roughly $H_{\text{eff}} = \frac{1}{1-\gamma}$ steps into the future. With $\gamma = 0.99$, the effective horizon is about 100 steps. With $\gamma = 0.9$, it’s 10 steps. Setting $\gamma$ is a design choice: higher $\gamma$ means the agent cares about longer-term consequences; lower $\gamma$ means it’s more myopic.


Finite Horizon MDPs and Time-Dependent Policies

An infinite-horizon MDP with discounting is one formulation. The other major setting is a finite-horizon MDP: the agent acts for exactly $T$ steps, and the return is the undiscounted sum:

$$G_0 = \sum_{t=0}^{T-1} R(s_t, a_t).$$

No $\gamma$ needed - the sum is finite, so it converges automatically.

In finite-horizon MDPs, the optimal policy is time-dependent: $\pi_t: S \to A$, where $t$ is the current timestep. This is because the value of being in a state changes as fewer steps remain - near the end, you should exploit; earlier, you might explore or build up position. A single stationary policy cannot capture this.

The optimal value functions carry a time index: $V_t^*(s)$ is the optimal expected return with $T - t$ steps remaining. They satisfy the finite-horizon Bellman equations, solved by backward induction:

$$V_T^(s) = 0 \quad \text{(no reward after the last step)}$$ $$V_t^(s) = \max_a \left[R(s,a) + \sum_{s'} P(s' \mid s, a) V_{t+1}^*(s')\right], \quad t = T-1, \ldots, 0.$$

This is pure dynamic programming: solve from the last step backwards to the first. The optimal policy at each step is $\pi_t^(s) = \arg\max_a [R(s,a) + \sum_{s'} P(s'\mid s,a) V_{t+1}^(s')]$. Total complexity: $O(T \cdot |S|^2 \cdot |A|)$.

The connection between discounting and finite horizons: an infinite-horizon MDP with discount $\gamma$ behaves similarly to a finite-horizon MDP with $T \approx 1/(1-\gamma)$ steps. The discount factor imposes an effective horizon even when the problem is formally infinite.


State Distribution

As a policy $\pi$ runs in an MDP starting from $s_0 \sim \rho_0$, it induces a distribution over states at each timestep:

$$d_t^\pi(s) = P(s_t = s \mid \pi, \rho_0).$$

The discounted state occupancy measure (or state visitation distribution) averages these over time:

$$d^\pi(s) = (1-\gamma)\sum_{t=0}^\infty \gamma^t d_t^\pi(s).$$

This is a proper probability distribution over $S$ (it sums/integrates to 1). It tells you how often the agent visits each state under policy $\pi$, with earlier visits weighted more heavily.

The state occupancy measure matters because the policy’s performance can be written as an expectation under it:

$$J(\pi) = \mathbb{E}\pi[G_0] = \frac{1}{1-\gamma} \mathbb{E}{s \sim d^\pi, a \sim \pi(\cdot|s)}[R(s,a)].$$

To improve the policy, you want to shift $d^\pi$ toward high-reward states.


Value Functions

The state-value function under policy $\pi$ is the expected return starting from state $s$:

$$V^\pi(s) = \mathbb{E}\pi\left[\sum{k=0}^\infty \gamma^k R(s_{t+k}, a_{t+k}) \Big| s_t = s\right].$$

The action-value function (Q-function) is the expected return starting from state $s$, taking action $a$ first, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}\pi\left[\sum{k=0}^\infty \gamma^k R(s_{t+k}, a_{t+k}) \Big| s_t = s, a_t = a\right].$$

The relationship: $V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)$.

The optimal value functions are obtained by maximizing over all policies:

$$V^(s) = \max_\pi V^\pi(s), \qquad Q^(s,a) = \max_\pi Q^\pi(s,a).$$

A policy $\pi^$ is optimal if $V^{\pi^}(s) = V^(s)$ for all $s$. Crucially, for finite MDPs, an optimal policy always exists and is always deterministic: $\pi^(s) = \arg\max_a Q^*(s, a)$.


Bellman Equations

The value functions satisfy recursive equations - the Bellman equations - that express the value at a state in terms of the values at successor states.

Bellman expectation equation (for a fixed policy $\pi$):

$$V^\pi(s) = \sum_a \pi(a \mid s)\left[R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s')\right].$$

Read this as: the value of $s$ under $\pi$ is the expected immediate reward plus the discounted expected value of the next state, where the expectation is over both the policy’s action choice and the environment’s transition.

Bellman optimality equation (for the optimal value function):

$$V^(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V^(s')\right].$$

This is a fixed-point equation: $V^*$ is the unique function satisfying it (by the contraction mapping theorem). Solving it gives the optimal value function, from which the optimal policy follows immediately.


Planning: Solving MDPs

Planning means computing the optimal policy given full knowledge of the MDP $(S, A, P, R, \gamma)$. This is distinct from learning, where the agent doesn’t know $P$ or $R$ and must estimate them from interaction.

Value Iteration. Repeatedly apply the Bellman optimality operator:

$$V_{k+1}(s) \leftarrow \max_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V_k(s')\right].$$

Start with $V_0 \equiv 0$. The Bellman operator is a $\gamma$-contraction in $L^\infty$, so $V_k \to V^$ as $k \to \infty$. After convergence, extract $\pi^(s) = \arg\max_a [R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)V^*(s')]$.

Why $\mathcal{T}$ is a contraction. The Bellman optimality operator $(\mathcal{T}V)(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'\mid s,a) V(s')]$ satisfies: for any $V, V'$,

$$|\mathcal{T}V - \mathcal{T}V'|\infty \leq \gamma |V - V'|\infty.$$

Proof: $\lvert(\mathcal{T}V)(s) - (\mathcal{T}V')(s)\rvert \leq \gamma\max_{s'}\lvert V(s') - V'(s')\rvert = \gamma|V-V'|\infty$, using $\lvert\max f - \max g\rvert \leq \max\lvert f - g\rvert$. By the Banach fixed-point theorem, $\mathcal{T}$ has a unique fixed point ($V^$) and $|V_k - V^|\infty \leq \gamma^k |V_0 - V^*|_\infty$. To reach $\epsilon$-accuracy with $V_0 = 0$:

$$k \geq \frac{1}{1-\gamma}\log\left(\frac{R_{\max}}{(1-\gamma)\epsilon}\right) \text{ iterations.}$$

Complexity per iteration: $O(|S|^2 |A|)$ for a tabular MDP with finite state and action spaces. Total: $O\left(\frac{|S|^2|A|}{1-\gamma}\log\frac{R_{\max}}{(1-\gamma)\epsilon}\right)$.

Policy Iteration. Alternate between two steps:

  1. Policy evaluation: given $\pi_k$, solve for $V^{\pi_k}$ exactly (by solving the linear system defined by the Bellman expectation equation).
  2. Policy improvement: set $\pi_{k+1}(s) = \arg\max_a [R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)V^{\pi_k}(s')]$.

Policy iteration converges in finitely many steps (since there are finitely many deterministic policies for a finite MDP), and each iteration is guaranteed not to make the policy worse. It typically converges in far fewer iterations than value iteration, though each iteration costs more.

Both methods are exact - they find the true optimum - but only feasible for small MDPs where $|S|$ and $|A|$ are manageable. Real-world MDPs (robot control, game playing, language modeling) have state spaces that are continuous or astronomically large. This is where function approximation (neural networks as value or policy approximators) enters, and exact planning gives way to approximate methods like Q-learning, actor-critic algorithms, and policy gradient methods.


Multi-Armed Bandits: MDPs Without State

A multi-armed bandit is the simplest sequential decision problem: a single state, $K$ actions (“arms”), and a stochastic reward $r \sim \mathcal{D}_a$ for each action $a$. You pull an arm, observe the reward, pull again. No state, no transitions, no long-term consequences - just immediate rewards.

It is an MDP with $|S| = 1$ and $\gamma = 0$. The optimal action is simply $a^* = \arg\max_a \mathbb{E}[r_a]$, but the distributions $\mathcal{D}_a$ are unknown.

The bandit problem isolates the exploration-exploitation tradeoff: you must balance exploring arms you know little about (to learn) and exploiting the arm that currently looks best (to earn). In a full MDP, this is compounded with the credit assignment problem - which action in a long trajectory caused a later reward?

Key bandit algorithms:

  • $\epsilon$-greedy: With probability $\epsilon$ pick a random arm; otherwise pick the current best estimate. Simple but not optimal.
  • UCB (Upper Confidence Bound): Pick the arm that maximizes $\hat{\mu}_a + c\sqrt{\log t / N_a}$, where $\hat{\mu}_a$ is the empirical mean, $N_a$ is the number of pulls, and $t$ is the total number of steps. The bonus $c\sqrt{\log t / N_a}$ is an optimistic upper bound on the true mean. UCB is optimistic in the face of uncertainty.
  • Thompson Sampling: Maintain a posterior over each arm’s mean, sample a mean for each arm, and pull the arm with the highest sample. Bayesian and empirically competitive with UCB.

UCB and Thompson Sampling achieve logarithmic regret: the gap between the reward of the best arm and what you actually collect grows as $O(\log T)$, which is optimal (no algorithm can do better in general).


Policy Gradient Methods

All planning methods above require full knowledge of $P$ and $R$. Policy gradient methods work in the model-free setting: directly optimize $J(\pi_\theta) = \mathbb{E}{\tau \sim \pi\theta}[G_0]$ over the parameters $\theta$ of a parameterized policy $\pi_\theta$, using only sampled trajectories.

The challenge: $J(\pi_\theta)$ is an expectation over trajectories whose distribution depends on $\theta$. How do we compute $\nabla_\theta J$?

The Policy Gradient Theorem

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}{s \sim d^{\pi\theta}, a \sim \pi_\theta(\cdot|s)}\left[Q^{\pi_\theta}(s,a) \nabla_\theta \log \pi_\theta(a \mid s)\right].$$

The key trick (the log-derivative trick): $\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)$. This converts the gradient of the probability into an expectation. The proof uses the Bellman equation for $Q^{\pi_\theta}$ to show that the gradient of the state occupancy measure $d^{\pi_\theta}$ with respect to $\theta$ does not appear in the final expression - the terms cancel, leaving only the per-step policy gradient weighted by $Q$.

The result is remarkable: to compute the gradient of the total return, you only need to know the per-step gradient of the log-policy and the Q-values - you don’t need to differentiate through the environment’s transition function $P$.

REINFORCE

REINFORCE (Williams, 1992) uses Monte Carlo to estimate the policy gradient. Roll out a full trajectory, compute the return $G_t$ from each timestep, and use it as a proxy for $Q^{\pi_\theta}(s_t, a_t)$:

$$\nabla_\theta J(\pi_\theta) \approx \frac{1}{N}\sum_{n=1}^N \sum_{t=0}^T G_t^{(n)} \nabla_\theta \log \pi_\theta(a_t^{(n)} \mid s_t^{(n)}).$$

Algorithm:

  1. Initialize $\theta$ randomly.
  2. Roll out trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$ under $\pi_\theta$.
  3. Compute returns: $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$.
  4. Update: $\theta \leftarrow \theta + \alpha \sum_t G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
  5. Repeat.

Why it works. The update direction is an unbiased estimate of $\nabla_\theta J$ by the policy gradient theorem. Under the Robbins-Monro conditions on the step size $\alpha_k$ ($\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$), stochastic gradient ascent converges to a local optimum.

High Variance of REINFORCE

The estimate $G_t \nabla_\theta \log \pi_\theta(a_t|s_t)$ uses the full return $G_t$ to credit action $a_t$. But $G_t$ includes all rewards from step $t$ onward - including noise that has nothing to do with whether $a_t$ was a good choice. If a good action early in an episode is followed by random bad rewards, REINFORCE penalizes the good action. This is high variance.

Baselines. Subtract a state-dependent baseline $b(s_t)$ from the return:

$$\nabla_\theta J \approx \sum_t (G_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t).$$

This is still unbiased: since $\mathbb{E}{a \sim \pi\theta}[b(s)\nabla_\theta \log \pi_\theta(a \mid s)] = b(s)\nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$, the baseline cancels in expectation. But it reduces variance by centering the signal. The optimal baseline is $b(s_t) = V^{\pi_\theta}(s_t)$, and $G_t - V^{\pi_\theta}(s_t)$ is an estimate of the advantage function $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$.

Importance Sampling and Off-Policy Learning

After updating $\theta$, trajectories collected under $\pi_{\mathrm{old}}$ are no longer samples from $\pi_\theta$. Collecting fresh trajectories for every gradient step is expensive. Importance sampling corrects for this:

$$\mathbb{E}{\tau \sim \pi\theta}[f(\tau)] = \mathbb{E}{\tau \sim \pi{\mathrm{old}}}\left[\frac{P(\tau \mid \pi_\theta)}{P(\tau \mid \pi_{\mathrm{old}})} f(\tau)\right].$$

The importance weight for a trajectory is the ratio of trajectory probabilities. Since the transition probabilities $P(s_{t+1} \mid s_t, a_t)$ cancel, it reduces to the product of policy ratios:

$$w(\tau) = \prod_{t=0}^T \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}.$$

This allows reusing old data. The problem: when $\pi_\theta$ and $\pi_{\mathrm{old}}$ differ significantly, the product of ratios can be exponentially large or small, causing extreme variance. This motivates keeping the policies close.

Trust Regions and TRPO

Trust Region Policy Optimization (TRPO) constrains the KL divergence between old and new policy to keep importance weights well-behaved:

$$\max_\theta; \mathbb{E}{s \sim d^{\pi{\mathrm{old}}}, a \sim \pi_{\mathrm{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\mathrm{old}}(a \mid s)} A^{\pi_{\mathrm{old}}}(s,a)\right]$$

$$\text{subject to}\quad \mathbb{E}{s \sim d^{\pi{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\mathrm{old}}(\cdot \mid s) ;|; \pi_\theta(\cdot \mid s)\right)\right] \leq \delta.$$

The objective is the importance-weighted advantage - how much better the new policy does on the states and actions the old policy visited. The KL constraint ensures the policies don’t diverge.

TRPO comes with a monotonic improvement guarantee: if the KL constraint is satisfied, the true objective $J(\pi_\theta)$ is guaranteed not to decrease. This makes it far more stable than vanilla REINFORCE. The constraint is enforced via conjugate gradient and a line search. PPO (Proximal Policy Optimization) achieves similar stability by clipping the ratio $\pi_\theta / \pi_{\mathrm{old}}$ directly, avoiding the expensive second-order optimization.


Which Algorithm Should You Use?

The theory above describes the landscape. In practice, you need to pick an algorithm. The choice follows from a handful of questions about your problem.

Question 1: Do you know (or can you learn) the environment dynamics?

If you have access to a simulator or can learn a reliable model of $P(s' \mid s, a)$, you can plan inside the model rather than purely from interactions. Model-based RL is dramatically more sample-efficient - you can generate synthetic rollouts cheaply. Algorithms: Dreamer (learns a world model in latent space), MBPO (model-based policy optimization with short model rollouts), PETS (probabilistic ensembles with trajectory sampling).

If the dynamics are unknown and hard to model (real-world robotics, markets, biological systems), you’re in the model-free setting. Most of what follows applies here.

Question 2: Is your action space discrete or continuous?

Discrete actions (games, navigation on a grid, choosing from a menu): the Q-function $Q(s, a)$ can be computed for all actions simultaneously - you don’t need a policy network.

  • DQN (Deep Q-Network): off-policy, value-based. Stores experience in a replay buffer and trains a neural net to approximate $Q^*$. Simple and effective. Variants: Double DQN (fixes overestimation bias), Dueling DQN (separate value/advantage streams), Rainbow (combines all improvements).

Continuous actions (robot joints, throttle, a vector of forces): you can’t enumerate all actions, so you need a policy that outputs the action directly. This is where actor-critic methods dominate.

Question 3: On-policy or off-policy?

On-policy algorithms train on data collected by the current policy. They are stable and theoretically well-understood, but sample-inefficient - you discard data after each update.

  • PPO (Proximal Policy Optimization): the workhorse of modern RL. Clips the importance ratio to prevent large updates. Works for both discrete and continuous actions. First choice for most problems where sample efficiency is not critical.
  • TRPO: like PPO but with a hard KL constraint instead of clipping. Stronger theoretical guarantees, higher computational cost. Rarely worth it over PPO in practice.
  • A2C / A3C: on-policy actor-critic; A3C runs parallel workers asynchronously. Useful when you have many CPU cores and a fast simulator.

Off-policy algorithms reuse past experience from a replay buffer. More sample-efficient but harder to stabilize.

  • SAC (Soft Actor-Critic): entropy-regularized, stochastic policy. Adds an entropy bonus $\mathcal{H}(\pi(\cdot \mid s))$ to the objective, encouraging exploration automatically. The go-to for continuous action spaces. Stable, sample-efficient, robust to hyperparameters.
  • TD3 (Twin Delayed DDPG): deterministic policy, two Q-networks to reduce overestimation, delayed policy updates. Slightly better peak performance than SAC in some benchmarks; less robust. Use when you want a deterministic policy.
  • DDPG: the predecessor to TD3. Notoriously unstable; prefer TD3 or SAC.

Question 4: Do you have a sparse reward or a hard exploration problem?

Standard policy gradient and Q-learning fail badly when rewards are sparse - the agent never stumbles onto the good signal by chance.

  • HER (Hindsight Experience Replay): relabels failed trajectories with the goals actually achieved, manufacturing a dense reward signal from sparse feedback. Works on top of any off-policy algorithm (usually SAC or TD3).
  • Curiosity-driven exploration: adds an intrinsic reward for visiting novel states (RND, ICM). Useful for hard exploration problems like Montezuma’s Revenge.
  • Go-Explore: stores a frontier of “promising states” and explicitly plans to return to and explore from them.

Question 5: Do you need massive scale or distributed training?

  • IMPALA: off-policy actor-critic designed for distributed training. Separates acting (many workers) from learning (one learner), using V-trace to correct for the off-policy gap. Scales to thousands of cores.
  • Ape-X: distributed DQN/DDPG with prioritized experience replay. Each actor generates experience with a slightly stale policy; a central learner trains on the prioritized buffer.

The Quick-Reference Table

Setting Recommended algorithm
Discrete actions, limited compute DQN / Double DQN
Discrete or continuous, want simplicity and stability PPO
Continuous actions, sample efficiency matters SAC
Continuous actions, deterministic policy needed TD3
Sparse rewards, goal-conditioned SAC + HER
Known or learnable dynamics MBPO / Dreamer
Distributed / massive scale IMPALA / Ape-X
Strong KL guarantees needed TRPO

A useful reference: RL Picker walks through these questions interactively and covers ~80 algorithms across the full taxonomy.


Summary

Concept Definition
MDP $(S, A, P, R, \gamma)$: states, actions, transitions, rewards, discount
History-based policy Maps full history to action; Markov policies suffice for infinite-horizon MDPs
Finite horizon MDP $T$ steps, no discount; optimal policy is time-dependent $\pi_t: S \to A$
Policy $\pi: S \to A$ (deterministic) or $\pi(a \mid s)$ (stochastic)
Trajectory $(s_0,a_0,s_1,a_1,\ldots)$; probability is product of policy and transition factors
Return $G_t = \sum_{k=0}^\infty \gamma^k R(s_{t+k}, a_{t+k})$
Effective horizon $\approx 1/(1-\gamma)$; how far ahead the agent effectively plans
State occupancy $d^\pi(s) = (1-\gamma)\sum_t \gamma^t d_t^\pi(s)$; discounted state visitation
$V^\pi(s)$ Expected return from $s$ following $\pi$
$Q^\pi(s,a)$ Expected return from $(s,a)$, then following $\pi$
Advantage $A^\pi$ $Q^\pi(s,a) - V^\pi(s)$; how much better $a$ is than average under $\pi$
Bellman expectation $V^\pi(s) = \sum_a \pi(a \mid s)\bigl[R(s,a) + \gamma \sum_{s'} P(s'\mid s,a) V^\pi(s')\bigr]$
Bellman optimality $V^(s) = \max_a\bigl[R(s,a) + \gamma \sum_{s'} P(s'\mid s,a) V^(s')\bigr]$; unique fixed point
Contraction Bellman op is $\gamma$-contraction; $O\bigl(\tfrac{1}{1-\gamma}\log\tfrac{1}{\epsilon}\bigr)$ iterations to $\epsilon$-accuracy
Value iteration Repeatedly apply Bellman optimality operator
Policy iteration Alternate exact policy evaluation and greedy improvement; finitely many steps
Multi-armed bandit Single-state MDP with $\gamma=0$; pure exploration-exploitation
Policy gradient theorem $\nabla_\theta J = \mathbb{E}{d^{\pi\theta}, \pi_\theta}\bigl[Q^{\pi_\theta} \nabla_\theta \log \pi_\theta\bigr]$
REINFORCE Monte Carlo policy gradient; unbiased but high variance
Baseline / advantage Subtract $V^\pi(s_t)$ from return; reduces variance without adding bias
Importance sampling Reweight old-policy samples by product of per-step policy ratios
TRPO Maximize importance-weighted advantage subject to KL constraint; monotonic improvement

The MDP is the foundation on which all modern RL is built. Deep RL methods - Q-networks, policy gradients, actor-critic - are all solving or approximately solving MDP optimization problems, just in settings where the state space is too large for exact methods.