Policy Gradient Methods

June 17th, 2025

In the previous chapter we discussed methods of approximating value functions \[Q_{\theta}(s,a) \approx Q^{\pi}(s,a)\] and generating a policy directly from the value function (e.g. \(\epsilon\)-greedy). In this chapter, we discuss methods that directly parametrize the policy itself: \[\pi_{\theta}(a|s)= \mathbb{P}[a|s; \theta]\] Note that the policy now outputs a probability distribution over actions, whereas previously it output a single action \(a\). This is because we are now learning the policy rather than just picking the best action from the action-value function. As a reminder, we continue in the model-free setting here.

Advantages

Disadvantages

Setup

Since we're learning the policy with a parametrized function (e.g. a neural network), we start by defining an objective function to optimize. Various objective functions are possible, but here we define it to be the expected return under the policy, which we maximize: \[J(\theta) = \sum_{s}d_{\pi_{\theta}}(s)V_{\pi_{\theta}}(s) = \sum_{s}\left(d_{\pi_{\theta}}(s)\sum_{a} \pi(a|s; \theta)Q_{\pi_{\theta}}(s,a)\right)\] where \(d_{\pi_{\theta}}\) is the stationary distribution of the Markov chain induced by \(\pi_{\theta}\).

Policy Optimization

Finite Difference Policy Gradient

We can compute the gradient numerically by perturbing \(\theta\) by a small amount \(\epsilon\) in the k-th dimension: \[\frac{\partial J(\theta)}{\partial \theta_{k}} \approx \frac{J(\theta + \epsilon u_{k}) - J(\theta)}{\epsilon}\] where \(u_k\) is the unit vector with a 1 in the k-th component and 0 elsewhere. This method works even when \(J(\theta)\) is not differentiable, but it requires \(n+1\) evaluations of \(J\) for an \(n\)-dimensional \(\theta\), so it is noisy and extremely slow.
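To make this concrete, here is a minimal sketch of the finite-difference estimate. It assumes a hypothetical `evaluate_J(theta)` that estimates \(J(\theta)\), e.g. by averaging the returns of several rollouts under \(\pi_\theta\); the function name and interface are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def finite_difference_gradient(evaluate_J, theta, eps=1e-4):
    """Numerically estimate dJ/dtheta, one dimension at a time.

    evaluate_J: assumed callable mapping a parameter vector to an estimate
                of J(theta) (e.g. the average return over several rollouts).
    theta:      current policy parameters as a 1-D float array.
    eps:        size of the perturbation applied to each dimension.
    """
    grad = np.zeros_like(theta)
    J_base = evaluate_J(theta)            # J(theta), reused for every dimension
    for k in range(len(theta)):
        unit = np.zeros_like(theta)
        unit[k] = 1.0                     # unit vector u_k
        grad[k] = (evaluate_J(theta + eps * unit) - J_base) / eps
    return grad
```

Note that each gradient estimate costs \(n+1\) evaluations of \(J\), and each evaluation may itself require many rollouts, which is why this approach scales so poorly.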

Policy Gradient Theorem

We can also analytically compute the gradient using the policy gradient theorem, which states that for any differentiable policy \(\pi(a|s;\theta)\), the policy gradient is given as follows: \[\begin{align} J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a|s; \theta) Q_\pi(s, a) \\ \nabla_{\theta} J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s; \theta) Q_\pi(s, a) \\ &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a|s; \theta) \textcolor{green}{\nabla_{\theta} \log \pi(a|s; \theta)} Q_\pi(s, a) \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) Q_\pi(s, a)] \end{align}\]

You'll notice that we used a trick in the gradient equation; this is referred to as the log-likelihood (or likelihood-ratio) trick, and it comes from the calculus chain rule: \[\begin{align}\frac{d}{dx} \log f(x) &= \frac{1}{f(x)}\frac{df}{dx} \\ \frac{df}{dx} &= f(x) \frac{d}{dx} \log f(x) \end{align}\] We apply this trick for two reasons (a short sketch follows the list):

  1. We cannot sample from \(\nabla_{\theta} \pi(a|s;\theta)\) directly because it's a vector of partial derivatives, not a probability distribution. After the rewrite, the gradient is an expectation of \(\nabla_{\theta} \log \pi(a|s; \theta)\, Q_\pi(s,a)\) over actions drawn from \(\pi(a|s; \theta)\), which we can estimate simply by sampling actions from the policy.
  2. Mathematical convenience: for common policy classes such as softmax or Gaussian policies, \(\log \pi(a|s;\theta)\) has a simple closed-form gradient (the score function).
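As a concrete illustration (not from the original notes), here is a minimal sketch of the resulting score-function estimator for a linear-softmax policy \(\pi(a|s;\theta) \propto \exp(\phi(s,a)^{T}\theta)\) over a discrete action set. The feature-matrix layout and the way the \(Q\) estimates are obtained are assumptions for illustration.

```python
import numpy as np

def softmax_probs(theta, phi_s):
    """pi(.|s; theta) for a linear-softmax policy.

    phi_s: (num_actions, d) matrix whose row a is phi(s, a); theta: (d,).
    """
    logits = phi_s @ theta
    logits = logits - logits.max()        # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def score(theta, phi_s, a):
    """grad_theta log pi(a|s; theta) = phi(s, a) - E_pi[phi(s, .)]."""
    p = softmax_probs(theta, phi_s)
    return phi_s[a] - p @ phi_s

def policy_gradient_estimate(theta, samples):
    """Monte-Carlo estimate of E_pi[grad log pi(a|s; theta) * Q_pi(s, a)].

    samples: list of (phi_s, a, q_hat) tuples collected by acting with
             pi_theta, where q_hat is some estimate of Q_pi(s, a).
    """
    grads = [score(theta, phi_s, a) * q_hat for phi_s, a, q_hat in samples]
    return np.mean(grads, axis=0)
```

Because the expectation is over actions sampled from \(\pi_\theta\) itself, averaging \(\nabla_\theta \log \pi \cdot \hat{Q}\) over on-policy samples gives an unbiased estimate of the gradient.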

Monte-Carlo Policy Gradient (REINFORCE)

Since we're learning the policy now, you may be wondering where we will get the \(Q_{\pi_{\theta}}(s,a)\) that we need in the policy gradient. In the REINFORCE algorithm we use the return \(v_t\) as an unbiased sample of \(Q_{\pi_{\theta}}(s_t,a_t)\). We use the policy gradient theorem and update the parameters with stochastic gradient ascent: \[\Delta \theta_t = \alpha \nabla_{\theta}\log \pi_{\theta}(s_t, a_t)v_t\]

REINFORCE Algorithm

We generate a full episode (e.g. running a game from start to finish), since Monte-Carlo estimation requires waiting for the complete outcome. Then, for each time-step, we update the parameters using gradient ascent.
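Here is a minimal sketch of one REINFORCE episode, reusing the `softmax_probs` and `score` helpers sketched above. It assumes a simplified environment interface where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`, and a feature function `phi(s)` returning the `(num_actions, d)` matrix for the softmax policy; all of these interfaces are assumptions.

```python
import numpy as np

def reinforce_episode(env, theta, phi, alpha=0.01, gamma=0.99):
    """Run one episode, then apply the REINFORCE update at every time-step."""
    trajectory = []                                   # (phi_s, a, r) tuples
    s, done = env.reset(), False
    while not done:
        phi_s = phi(s)
        a = np.random.choice(len(phi_s), p=softmax_probs(theta, phi_s))
        s, r, done = env.step(a)                      # assumed interface
        trajectory.append((phi_s, a, r))

    # Compute the returns v_t backwards from the end of the episode.
    returns, v = [], 0.0
    for _, _, r in reversed(trajectory):
        v = r + gamma * v
        returns.append(v)
    returns.reverse()

    # Gradient-ascent update per time-step: theta += alpha * grad log pi * v_t
    for (phi_s, a, _), v_t in zip(trajectory, returns):
        theta = theta + alpha * score(theta, phi_s, a) * v_t
    return theta
```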

Actor-Critic Policy Gradient

Instead of using Monte-Carlo estimation to approximate the action-value function, we can use [[7 Value Function Approximation]] to estimate it: \[Q_{w}(s,a) \approx Q_{\pi_{\theta}}(s,a)\] We'll call the policy the actor and the estimated action-value function the critic, so we maintain two sets of parameters: the critic's parameters \(w\), which update the action-value estimate, and the actor's parameters \(\theta\), which update the policy in the direction suggested by the critic.

Actor-critic algorithms follow an approximate policy gradient: \[\begin{align} \nabla_{\theta}J(\theta) &\approx \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) Q_w(s, a)] \\ \Delta \theta_t &= \alpha \nabla_{\theta}\log \pi_{\theta}(s_t, a_t)Q_{w}(s_t,a_t) \end{align}\] Before we move on, take a moment to recall the learning process from [[7 Value Function Approximation]]: we learn \(Q_w(s,a)\) by performing gradient descent on the mean-squared error \[J(w) = \mathbb{E}_{\pi}[(q_{\pi}(S,A) - \hat{q}(S,A,w))^2]\] where \(q_{\pi}(S,A)\) is the true (unknown) action-value, which in practice is replaced by a target such as the Monte-Carlo return or the TD target.

Action-Value Actor-Critic Algorithm

We apply the actor-critic paradigm to action-values, using a linear function to approximate the action-value function: \[Q_w(s,a) = \phi (s,a)^Tw\] where \(\phi(s,a)\) is a feature vector containing relevant features of the environment for the state-action pair \((s,a)\). We will assume that the critic updates \(w\) using TD(0) with step-size \(\beta\); a code sketch follows the pseudocode below.

QAC:

  1. Initialize \(s\), \(\theta\), \(w\) arbitrarily
  2. Sample \(a \sim \pi(s;\theta)\)
  3. While \(s\) is not terminal:
     - Take action \(a\); observe reward \(r\) and next state \(s'\); sample \(a' \sim \pi(s'; \theta)\)
     - Actor update: \(\theta \leftarrow \theta + \alpha \nabla_{\theta}\log\pi_{\theta}(s,a)\,Q_w(s,a)\)
     - Critic update (TD(0)): \(w \leftarrow w + \beta\,[r + \gamma Q_w(s',a') - Q_w(s,a)]\,\phi(s,a)\)
     - Set \(s \leftarrow s'\), \(a \leftarrow a'\)
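Under the same illustrative assumptions as the earlier sketches (a linear-softmax actor with the `score` helper, and a shared `(num_actions, d)` feature matrix used both by the actor and by the linear critic, which is a simplification), a single QAC update could look like this:

```python
import numpy as np

def qac_step(theta, w, phi_s, a, r, phi_s_next, a_next, done,
             alpha=0.01, beta=0.01, gamma=0.99):
    """One QAC update with a linear critic Q_w(s, a) = phi(s, a)^T w.

    phi_s / phi_s_next: (num_actions, d) feature matrices for s and s',
    so that row a is phi(s, a). `score` is the softmax score from earlier.
    """
    q_sa = phi_s[a] @ w                                    # Q_w(s, a)
    q_next = 0.0 if done else phi_s_next[a_next] @ w       # Q_w(s', a')
    delta = r + gamma * q_next - q_sa                      # TD(0) error

    theta = theta + alpha * score(theta, phi_s, a) * q_sa  # actor (ascent)
    w = w + beta * delta * phi_s[a]                        # critic (TD(0))
    return theta, w
```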

Reducing Variance Using a Baseline

Monte-Carlo policy gradient still has high variance for all the same reasons discussed in [[5 Model-free Prediction]]. Temporal-Difference learning helps with this, but we can still benefit from reduced variance.

The idea is that we can subtract a baseline function \(B(s)\) from \(Q_{\pi_\theta}(s,a)\) inside the policy gradient; this can reduce variance without changing the expectation, because the baseline term has zero mean: \[\begin{align} \mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi_\theta(s, a)B(s)\right] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s, a)B(s) \\ &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s, a) \\ &=0 \end{align}\] The last step holds because the probabilities sum to 1, so we are taking the gradient of the constant 1, which is 0.

A good baseline is the state-value function \(B(s) = V_{\pi_{\theta}}(s)\), which represents the expected future return if we start from state \(s\) and follow policy \(\pi_\theta\). Subtracting it gives us the advantage function: \[\begin{align} A_{\pi_{\theta}}(s,a) &= Q_{\pi_{\theta}}(s,a) - V_{\pi_{\theta}}(s) \\ \nabla_{\theta}J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) \textcolor{red}{A_{\pi_{\theta}}(s, a)}] \end{align}\]

This is much more informative than using raw returns \(Q^{\pi_\theta}(s,a)\), because actions might have high absolute returns simply because the state itself is valuable, not because the action choice was particularly good. By subtracting the state value, we focus on the relative quality of actions rather than their absolute returns.

Additionally, this baseline reduces variance because we're essentially "centering" our estimates around zero instead of having them spread across a wide range of return values.
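Here is a toy numerical check of that claim (all numbers are made up for illustration): a single state with two actions whose returns are large and nearly equal. Both estimators have the same expectation, but subtracting the state-value baseline removes the large common offset and collapses the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

probs = np.array([0.5, 0.5])       # pi(a|s) under the current policy
q = np.array([100.0, 101.0])       # Q(s, a): large, nearly equal returns
v = probs @ q                      # baseline B(s) = V(s) = 100.5

def grad_log_pi(a):
    # Score of a two-action softmax with one scalar parameter theta
    # (logits 0 and theta): d/dtheta log pi(1) = 1 - pi(1), and -pi(1) for a=0.
    return 1 - probs[1] if a == 1 else -probs[1]

raw, baselined = [], []
for _ in range(10_000):
    a = rng.choice(2, p=probs)
    raw.append(grad_log_pi(a) * q[a])
    baselined.append(grad_log_pi(a) * (q[a] - v))

print(np.mean(raw), np.var(raw))              # same mean, huge variance
print(np.mean(baselined), np.var(baselined))  # same mean, tiny variance
```

With the exact state-value baseline in this symmetric example the variance nearly vanishes; in general it is simply much smaller.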

Estimating the Advantage Function

Since the advantage function can significantly reduce the variance of the policy gradient, the critic should really estimate the advantage function. There is an efficient way to do this using the TD error:

For the true value function \(V_{\pi_{\theta}}(s)\), the TD error \(\delta_{\pi_{\theta}}\) \[\delta_{\pi_{\theta}} = r + \gamma V_{\pi_{\theta}}(s') - V_{\pi_{\theta}}(s)\] is an unbiased estimate of the advantage function: \[\begin{align} \mathbb{E}_{\pi_\theta} \left[\delta_{\pi_\theta} | s, a\right] &= \mathbb{E}_{\pi_\theta} \left[r + \gamma V_{\pi_\theta}(s') | s, a\right] - V_{\pi_\theta}(s) \\ &= Q_{\pi_\theta}(s, a) - V_{\pi_\theta}(s) \\ &= A_{\pi_\theta}(s, a) \end{align}\] This works because \(\mathbb{E}_{\pi_\theta} \left[r + \gamma V_{\pi_\theta}(s') | s, a\right]\) asks, "given that we're in state \(s\) and take action \(a\), what is the expected value of the immediate reward plus discounted future value, averaged over all possible rewards and next states?", and that is exactly \(Q_{\pi_{\theta}}(s,a)\)!

In practice we don't have the true value function, so we use an estimated value function \(V_v\) and the approximate TD error \[\delta_v = r + \gamma V_v(s') - V_v(s)\] which is (approximately) an unbiased estimate of the advantage, giving the approximate policy gradient \[\nabla_{\theta}J(\theta) \approx \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta)\, \textcolor{red}{\delta_v}]\] This approach only requires one set of critic parameters \(v\), as opposed to learning both the state- and action-value functions with two sets of critic parameters.
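A minimal sketch of this one-critic update, under the same illustrative assumptions as before (a linear critic \(V_v(s) = \phi(s)^T v\) over state features `x_s`, and the linear-softmax `score` helper for the actor):

```python
import numpy as np

def td_actor_critic_step(theta, v_params, phi_s, x_s, x_s_next, a, r, done,
                         alpha=0.01, beta=0.01, gamma=0.99):
    """One update that uses the TD error as the advantage estimate.

    x_s / x_s_next: state feature vectors for the critic V_v(s) = phi(s)^T v.
    phi_s:          (num_actions, d) action-feature matrix for the actor.
    """
    v_s = x_s @ v_params
    v_next = 0.0 if done else x_s_next @ v_params
    delta = r + gamma * v_next - v_s                        # approx. advantage

    theta = theta + alpha * score(theta, phi_s, a) * delta  # actor (ascent)
    v_params = v_params + beta * delta * x_s                # critic TD(0)
    return theta, v_params
```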

Critics at Different Time-Scales

Setup: We use linear function approximation for the state-value function: \[V_v(s) = \phi(s)^T v\] where \(\phi(s)\) is a feature vector representing state \(s\), and \(v\) are the learned parameters.

Objective: Minimize mean-squared error between predicted and target values: \[J(v) = \mathbb{E}_{\pi}[(\text{target} - V_v(s))^2]\]

The critic can estimate the value function \(V_v(s)\), as we did in [[7 Value Function Approximation]], from targets at different time-scales: the Monte-Carlo target (the full return \(v_t\)), the TD(0) target \(r + \gamma V_v(s')\), or the TD(\(\lambda\)) target (the \(\lambda\)-return \(v_t^{\lambda}\)).
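The update rule is the same in every case; only the target changes. A minimal sketch of the generic linear critic step (names are illustrative assumptions):

```python
import numpy as np

def critic_update(v_params, x_s, target, beta=0.01):
    """Move V_v(s) = phi(s)^T v toward `target` by one SGD step.

    The gradient of (target - V_v(s))^2 w.r.t. v is proportional to
    -(target - V_v(s)) * phi(s), so with constants folded into beta:
    """
    error = target - x_s @ v_params
    return v_params + beta * error * x_s

# Only the target differs across time-scales:
#   Monte-Carlo: target = v_t                                (the full return)
#   TD(0):       target = r + gamma * (x_s_next @ v_params)  (the TD target)
#   TD(lambda):  target = the lambda-return v_t^lambda
```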

Actors at Different Time-Scales

Setup: We directly parametrize the policy: \[\pi_{\theta}(a|s) = \mathbb{P}[a|s; \theta]\]

Objective: Maximize the expected return: \[J(\theta) = \sum_{s}d_{\pi_{\theta}}(s)V_{\pi_{\theta}}(s)\]

Policy Gradient: \[\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{A_{\pi_\theta}(s, a)}]\]

Advantage Function: Measures how much better an action is compared to the average in that state: \[A_{\pi_\theta}(s,a) = Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s)\]

The policy gradient can also be estimated at many time-scales: using the full return \(v_t\) (Monte-Carlo / REINFORCE), the TD error \(\delta\), or TD(\(\lambda\)) with eligibility traces, as summarized in the next section.

Summary of Policy Gradient Algorithms

The policy gradient has many equivalent forms: \[\begin{align} \nabla_{\theta}J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{v_t}] & \text{REINFORCE} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{Q^w(s, a)}] & \text{Q Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{A^w(s, a)}] & \text{Advantage Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{\delta}] & \text{TD Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{\delta e}] & \text{TD($\lambda$) Actor-Critic} \end{align}\]

Each leads to a stochastic gradient ascent algorithm. The critic uses policy evaluation (e.g. MC or TD learning) to estimate \(Q^{\pi}(s,a)\), \(A^{\pi}(s,a)\), or \(V^{\pi}(s)\).