June 17th, 2025
In the previous chapter we discussed methods of approximating value functions \[Q_{\theta}(s,a) \approx Q^{\pi}(s,a)\]and generating a policy directly from the value function (e.g. \(\epsilon\)-greedy). In this lecture, we discuss methods that directly parametrize the policy itself \[\pi_{\theta}(s,a)= \mathbb{P}[a|s; \theta]\]Note here that the policy outputs a probability distribution over actions, whereas previously we output a single action \(a\). This is because we are now learning the policy itself rather than just picking the best action from the action-value function. As a reminder, we continue in the model-free setting here.
Advantages
Disadvantages
Since we're learning a parametrized policy (e.g. a neural network), we start by defining an objective function that we will optimize. Various objective functions are possible, but here we define it to be the expected return and maximize it: \[J(\theta) = \sum_{s}d_{\pi_{\theta}}(s)V_{\pi_{\theta}}(s) = \sum_{s}\left(d_{\pi_{\theta}}(s)\sum_{a} \pi(a|s; \theta)Q_{\pi_{\theta}}(s,a)\right)\] where \(d_{\pi_{\theta}}\) is the stationary distribution of the Markov chain induced by \(\pi_{\theta}\).
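To make the objective concrete, here is a tiny tabular illustration (with made-up numbers for \(d_{\pi_{\theta}}\), \(\pi\), and \(Q_{\pi_{\theta}}\)) of computing \(J(\theta)\) as the sum above:

```python
import numpy as np

# Made-up tabular example: 2 states, 2 actions.
d = np.array([0.6, 0.4])                 # stationary distribution d_pi(s)
pi = np.array([[0.7, 0.3],               # pi(a|s; theta), one row per state
               [0.2, 0.8]])
Q = np.array([[1.0, 0.0],                # Q_pi(s, a)
              [0.5, 2.0]])

J = np.sum(d * np.sum(pi * Q, axis=1))   # sum_s d(s) sum_a pi(a|s) Q(s,a)
print(J)                                 # 0.6*0.7 + 0.4*1.7 = 1.1
```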
We can compute the gradient numerically by perturbing \(\theta\) by a small amount \(\epsilon\) in the k-th dimension, where \(u_{k}\) is the unit vector with 1 in the k-th component and 0 elsewhere. This method works even when \(J(\theta)\) is not differentiable, but it requires one policy evaluation per dimension and is extremely slow. \[\frac{\partial J(\theta)}{\partial \theta_{k}} \approx \frac{J(\theta + \epsilon u_{k}) - J(\theta)}{\epsilon}\]
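As a rough sketch, assuming some routine `J` that estimates the expected return of the policy with parameters `theta` (a hypothetical helper, e.g. averaging returns over a few rollouts; it is not defined here):

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-2):
    """Estimate dJ/dtheta by perturbing theta one dimension at a time.

    J     : callable returning an estimate of the expected return J(theta)
            (hypothetical helper, e.g. averaging returns over a few rollouts)
    theta : current policy parameters, shape (n,)
    eps   : size of the perturbation in each dimension
    """
    n = theta.size
    grad = np.zeros(n)
    J_theta = J(theta)                 # baseline evaluation J(theta)
    for k in range(n):                 # one extra evaluation per dimension
        u_k = np.zeros(n)
        u_k[k] = 1.0                   # unit vector in the k-th dimension
        grad[k] = (J(theta + eps * u_k) - J_theta) / eps
    return grad
```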
We can also analytically compute the gradient using the policy gradient theorem, which states that for any differentiable policy \(\pi(a|s;\theta)\), the policy gradient is given as follows: \[\begin{align} J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a|s; \theta) Q_\pi(s, a) \\ \nabla_{\theta} J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s; \theta) Q_\pi(s, a) \\ &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a|s; \theta) \textcolor{green}{\nabla_{\theta} \log \pi(a|s; \theta)} Q_\pi(s, a) \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) Q_\pi(s, a)] \end{align}\]
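To make the score function \(\nabla_{\theta} \log \pi(a|s; \theta)\) concrete, consider a linear-softmax policy \(\pi(a|s;\theta) \propto \exp(\phi(s,a)^{\top}\theta)\), whose score is \(\phi(s,a) - \sum_{b}\pi(b|s;\theta)\phi(s,b)\). A minimal sketch, where the feature matrix `phi_s` (one row \(\phi(s,a)\) per action) is an assumed input:

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi(.|s; theta) for a linear-softmax policy; phi_s has shape (num_actions, num_features)."""
    prefs = phi_s @ theta              # preferences phi(s,a)^T theta, one per action
    prefs -= prefs.max()               # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def score(theta, phi_s, a):
    """grad_theta log pi(a|s; theta) for the linear-softmax policy."""
    probs = softmax_policy(theta, phi_s)
    return phi_s[a] - probs @ phi_s    # phi(s,a) - E_{b~pi}[phi(s,b)]
```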
You'll notice that we used a trick in the gradient equation. This is referred to as the log-likelihood (or likelihood-ratio) trick and comes from the calculus chain rule: \[\begin{align}\frac{d}{dx} \log f(x) = \frac{1}{f(x)}\frac{df}{dx} \\ \frac{df}{dx} = f(x) \frac{d}{dx} \log f(x) \end{align}\] We apply this trick because it turns the sums over states and actions into an expectation under \(\pi_{\theta}\), which we can estimate from sampled experience without knowing the stationary distribution \(d(s)\).
Since we're learning the policy now, you may be wondering where we will get the \(Q_{\pi_{\theta}}(s,a)\) that we need in the policy gradient. In the REINFORCE algorithm we use the return \(v_t\) as an unbiased sample of \(Q_{\pi_{\theta}}(s_t,a_t)\). We use the policy gradient theorem and update parameters using stochastic gradient ascent: \[\Delta \theta_t = \alpha \nabla_{\theta}\log \pi_{\theta}(s_t, a_t)v_t\] REINFORCE Algorithm
We generate a full episode (e.g. running a game from start to finish), since Monte-Carlo estimation requires waiting to see the complete outcome. Then for each time-step we update the parameters using gradient ascent.
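A minimal sketch of one REINFORCE episode, reusing the `softmax_policy` and `score` helpers sketched above and assuming a hypothetical environment interface `env.reset()` / `env.step(a)` and a `features(s)` routine returning the per-action feature matrix:

```python
import numpy as np

def reinforce_episode(env, theta, features, alpha=0.01, gamma=0.99):
    """Monte-Carlo policy gradient update over one full episode."""
    trajectory, s, done = [], env.reset(), False
    while not done:                                      # sample an episode with pi_theta
        phi_s = features(s)
        a = np.random.choice(len(phi_s), p=softmax_policy(theta, phi_s))
        s, r, done = env.step(a)                         # assumed interface: (s', r, done)
        trajectory.append((phi_s, a, r))

    v_t = 0.0
    for phi_s, a, r in reversed(trajectory):             # accumulate returns backwards
        v_t = r + gamma * v_t                            # return v_t from time t onwards
        theta = theta + alpha * score(theta, phi_s, a) * v_t   # gradient ascent step
    return theta
```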
Instead of using Monte-Carlo estimation to approximate the action-value function, we can use [[7 Value Function Approximation]] to estimate it: \[Q_{w}(s,a) \approx Q_{\pi_{\theta}}(s,a)\]We'll call the policy the actor and the estimated action-value function the critic. So we maintain two sets of parameters: the critic updates the action-value function parameters \(w\), and the actor updates the policy parameters \(\theta\) in the direction suggested by the critic.
Actor-critic algorithms follow an approximate policy gradient: \[\begin{align} \nabla_{\theta}J(\theta) &\approx \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) Q_w(s, a)] \\ \Delta \theta_t &= \alpha \nabla_{\theta}\log \pi_{\theta}(s_t, a_t)Q_{w}(s_t,a_t) \end{align}\] Before we move on, take a moment to pause and recall the learning process from [[7 Value Function Approximation]]. We learn \(Q_w(s,a)\) by performing gradient descent on the mean-squared error \[J(w) = \mathbb{E}_{\pi}[(q_{\pi}(S,A) - \hat{q}(S,A,w))^2]\]where \(q_{\pi}(S,A)\) is the true action-value, which in practice we substitute with a target (e.g. the return for Monte-Carlo, or the TD target \(r + \gamma \hat{q}(S',A',w)\) for TD(0)).
We now apply the actor-critic paradigm to action-values, using a linear function to approximate the action-value function: \[Q_w(s,a) = \phi (s,a)^Tw\]where \(\phi(s,a)\) is a feature vector containing relevant features of the environment for the state-action pair \((s,a)\). We will assume that the critic updates \(w\) using TD(0).
QAC:
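A minimal sketch of a single QAC update under this setup, assuming the linear-softmax actor (and its `score` helper) from earlier, a `features(s)` routine returning the per-action feature matrix whose rows are \(\phi(s,\cdot)\), and a TD(0) critic:

```python
import numpy as np

def qac_step(theta, w, s, a, r, s_next, a_next, features,
             alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
    """One QAC update: linear critic Q_w(s,a) = phi(s,a)^T w, softmax actor."""
    phi_s, phi_s_next = features(s), features(s_next)
    q_sa = phi_s[a] @ w                                    # critic estimate Q_w(s,a)

    # Critic: TD(0) update of w towards the target r + gamma * Q_w(s', a')
    td_error = r + gamma * (phi_s_next[a_next] @ w) - q_sa
    w = w + alpha_w * td_error * phi_s[a]

    # Actor: approximate policy gradient step, Delta theta = alpha * score * Q_w(s,a)
    theta = theta + alpha_theta * score(theta, phi_s, a) * q_sa
    return theta, w
```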
Monte-Carlo policy gradient still has high variance, for the same reasons discussed in [[5 Model-free Prediction]]. Temporal-Difference learning helps with this, but we can still benefit from further variance reduction.
The idea is that we can subtract a baseline function \(B(s)\) from the policy gradient, and this can reduce variance without changing the expectation: \[\begin{align} \mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi_\theta(s, a)B(s)\right] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s, a)B(s) \\ &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s, a) \\ &=0 \end{align}\]The last step holds because the probabilities sum to 1, so we take the gradient of the constant 1 and get 0.
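A quick numerical check of this identity using the linear-softmax policy sketched earlier (with arbitrary made-up parameters and features): the expected score \(\sum_a \pi(a|s)\nabla_\theta \log \pi(a|s)\) comes out as the zero vector, so multiplying it by any \(B(s)\) leaves the expectation at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
phi_s = rng.normal(size=(3, 4))          # 3 actions, 4 features (made-up state)

probs = softmax_policy(theta, phi_s)
expected_score = sum(p * score(theta, phi_s, a) for a, p in enumerate(probs))
print(np.allclose(expected_score, 0.0))  # True: the baseline term vanishes in expectation
```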
A good baseline is the state-value function \(B(s) = V_{\pi_{\theta}}(s)\), which represents the expected future return if we start from state \(s\) and follow policy \(\pi_\theta\). Subtracting it gives us the advantage function: \[\begin{align} A_{\pi_{\theta}}(s,a) &= Q_{\pi_{\theta}}(s,a) - V_{\pi_{\theta}}(s) \\ \nabla_{\theta}J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) \textcolor{red}{A_{\pi_{\theta}}(s, a)}] \end{align}\]
This is much more informative than using raw returns \(Q^{\pi_\theta}(s,a)\), because actions might have high absolute returns simply because the state itself is valuable, not because the action choice was particularly good. By subtracting the state value, we focus on the relative quality of actions rather than their absolute returns.
Additionally, this baseline reduces variance because we're essentially "centering" our estimates around zero instead of having them spread across a wide range of return values.
Since the advantage function can significantly reduce the variance of the policy gradient, the critic should really estimate the advantage function. An efficient way to do that uses the TD error, as follows:
For the true value function \(V_{\pi_{\theta}}(s)\), the TD error \(\delta_{\pi_{\theta}}\) \[\delta_{\pi_{\theta}} = r + \gamma V_{\pi_{\theta}}(s') - V_{\pi_{\theta}}(s)\] is an unbiased estimate of the advantage function: \[\begin{align} \mathbb{E}_{\pi_\theta} \left[\delta_{\pi_\theta} | s, a\right] &= \mathbb{E}_{\pi_\theta} \left[r + \gamma V_{\pi_\theta}(s') | s, a\right] - V_{\pi_\theta}(s) \\ &= Q_{\pi_\theta}(s, a) - V_{\pi_\theta}(s) \\ &= A_{\pi_\theta}(s, a) \end{align}\]This is because \(\mathbb{E}_{\pi_\theta} \left[r + \gamma V_{\pi_\theta}(s') | s, a\right]\) answers, "given that we're in state \(s\) and take action \(a\), what is the expected value of the immediate reward plus the discounted future value, averaged over all the rewards and next states that could happen?", and this is exactly \(Q_{\pi_{\theta}}(s,a)\)!
In practice we don't have the true value function, so we use an estimated value function \(V_v\) and get an approximate TD error, which is an approximately unbiased estimate of the advantage function: \[\begin{align} \delta_v &= r + \gamma V_v(s') - V_v(s) \\ \nabla_{\theta}J(\theta) &\approx \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) \textcolor{red}{\delta_v}] \end{align}\] This approach only requires one set of critic parameters \(v\), as opposed to having to learn both the state- and action-value functions with two sets of critic parameters.
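A minimal sketch of one such TD actor-critic update, assuming the linear-softmax actor (and `score` helper) from earlier, a linear state-value critic \(V_v(s) = \phi(s)^T v\) as in the setup below, and hypothetical feature helpers `features(s)` (per-action matrix for the actor) and `state_features(s)` (vector for the critic):

```python
import numpy as np

def td_actor_critic_step(theta, v, s, a, r, s_next, features, state_features,
                         alpha_theta=0.01, alpha_v=0.05, gamma=0.99):
    """One advantage actor-critic update using the TD error as the advantage estimate."""
    psi_s, psi_s_next = state_features(s), state_features(s_next)

    # Approximate TD error: delta_v = r + gamma * V_v(s') - V_v(s)
    delta = r + gamma * (psi_s_next @ v) - (psi_s @ v)

    # Critic: semi-gradient TD(0) update of the state-value parameters v
    v = v + alpha_v * delta * psi_s

    # Actor: policy gradient step with the TD error standing in for the advantage
    theta = theta + alpha_theta * score(theta, features(s), a) * delta
    return theta, v
```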
Setup: We use linear function approximation for the state-value function: \[V_v(s) = \phi(s)^T v\] where \(\phi(s)\) is a feature vector representing state \(s\), and \(v\) are the learned parameters.
Objective: Minimize the mean-squared error between predicted and target values: \[J(v) = \mathbb{E}_{\pi}[(\text{target} - V_v(s))^2]\]
The critic can estimate the value function \(V_v(s)\), similar to how we did in [[7 Value Function Approximation]], from targets at many different time-scales: e.g. the return (MC), the TD(0) target \(r + \gamma V_v(s')\), or the \(\lambda\)-return (TD(\(\lambda\))).
Setup: We directly parametrize the policy: \[\pi_{\theta}(a|s) = \mathbb{P}[a|s; \theta]\] Objective: Maximize expected return:\[J(\theta) = \sum_{s}d_{\pi_{\theta}}(s)V_{\pi_{\theta}}(s)\] Policy Gradient:\[\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{A_{\pi_\theta}(s, a)}]\] Advantage Function: Measures how much better an action is compared to average: \[A_{\pi_\theta}(s,a) = Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s)\] The policy gradient can also be estimated at many time-scales:
The policy gradient has many equivalent forms: \[\begin{align} \nabla_{\theta}J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{v_t}] & \text{REINFORCE} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{Q^w(s, a)}] & \text{Q Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{A^w(s, a)}] & \text{Advantage Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{\delta}] & \text{TD Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{\delta e}] & \text{TD($\lambda$) Actor-Critic} \end{align}\]
Each leads to a stochastic gradient ascent algorithm. The critic uses policy evaluation (e.g. MC or TD learning) to estimate \(Q^{\pi}(s,a)\), \(A^{\pi}(s,a)\), or \(V^{\pi}(s)\).