June 17th, 2025
In the previous chapter we discussed methods of approximating value functions \[Q_{\theta}(s,a) \approx Q^{\pi}(s,a)\]and generating a policy directly from the value function (e.g. \(\epsilon\)-greedy). In this lecture, we discuss methods that directly parametrize the policy itself \[\pi_{\theta}(s,a)= \mathbb{P}[a|s; \theta]\]Note here that the policy outputs a probability distribution over actions, whereas previously we output a single action \(a\). This is because we are now learning the policy itself rather than just picking the best action from the action-value function. As a reminder, we continue in the model-free setting here.
Advantages
Disadvantages
Since we're learning a parametrized policy (e.g. a neural network), we start by defining an objective function that we will optimize. Various objective functions are possible, but here we define it to be the expected return and maximize it: \[J(\theta) = \sum_{s}d_{\pi_{\theta}}(s)V_{\pi_{\theta}}(s) = \sum_{s}\left(d_{\pi_{\theta}}(s)\sum_{a} \pi(a|s; \theta)Q_{\pi_{\theta}}(s,a)\right)\] where \(d_{\pi_{\theta}}\) is the stationary distribution of the Markov chain induced by \(\pi_{\theta}\).
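To make the objective concrete, here is a tiny tabular illustration (with made-up numbers for \(d_{\pi_{\theta}}\), \(\pi\), and \(Q_{\pi_{\theta}}\)) of computing \(J(\theta)\) as the sum above:

```python
import numpy as np

# Made-up tabular example: 2 states, 2 actions.
d = np.array([0.6, 0.4])                 # stationary distribution d_pi(s)
pi = np.array([[0.7, 0.3],               # pi(a|s; theta), one row per state
               [0.2, 0.8]])
Q = np.array([[1.0, 0.0],                # Q_pi(s, a)
              [0.5, 2.0]])

J = np.sum(d * np.sum(pi * Q, axis=1))   # sum_s d(s) sum_a pi(a|s) Q(s,a)
print(J)                                 # 0.6*0.7 + 0.4*1.7 = 1.1
```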
We can compute the gradient numerically by perturbing \(\theta\) by a small amount \(\epsilon\) in the k-th dimension, where \(u_{k}\) is the unit vector with 1 in the k-th component and 0 elsewhere. This method works even when \(J(\theta)\) is not differentiable, but it requires one policy evaluation per dimension and is extremely slow. \[\frac{\partial J(\theta)}{\partial \theta_{k}} \approx \frac{J(\theta + \epsilon u_{k}) - J(\theta)}{\epsilon}\]
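As a rough sketch, assuming some routine `J` that estimates the expected return of the policy with parameters `theta` (a hypothetical helper, e.g. averaging returns over a few rollouts; it is not defined here):

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-2):
    """Estimate dJ/dtheta by perturbing theta one dimension at a time.

    J     : callable returning an estimate of the expected return J(theta)
            (hypothetical helper, e.g. averaging returns over a few rollouts)
    theta : current policy parameters, shape (n,)
    eps   : size of the perturbation in each dimension
    """
    n = theta.size
    grad = np.zeros(n)
    J_theta = J(theta)                 # baseline evaluation J(theta)
    for k in range(n):                 # one extra evaluation per dimension
        u_k = np.zeros(n)
        u_k[k] = 1.0                   # unit vector in the k-th dimension
        grad[k] = (J(theta + eps * u_k) - J_theta) / eps
    return grad
```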
We can also analytically compute the gradient using the policy gradient theorem, which states that for any differentiable policy \(\pi(a|s;\theta)\), the policy gradient is given as follows: \[\begin{align} J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a|s; \theta) Q_\pi(s, a) \\ \nabla_{\theta} J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s; \theta) Q_\pi(s, a) \\ &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a|s; \theta) \textcolor{green}{\nabla_{\theta} \log \pi(a|s; \theta)} Q_\pi(s, a) \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) Q_\pi(s, a)] \end{align}\]
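To make the score function \(\nabla_{\theta} \log \pi(a|s; \theta)\) concrete, consider a linear-softmax policy \(\pi(a|s;\theta) \propto \exp(\phi(s,a)^{\top}\theta)\), whose score is \(\phi(s,a) - \sum_{b}\pi(b|s;\theta)\phi(s,b)\). A minimal sketch, where the feature matrix `phi_s` (one row \(\phi(s,a)\) per action) is an assumed input:

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi(.|s; theta) for a linear-softmax policy; phi_s has shape (num_actions, num_features)."""
    prefs = phi_s @ theta              # preferences phi(s,a)^T theta, one per action
    prefs -= prefs.max()               # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def score(theta, phi_s, a):
    """grad_theta log pi(a|s; theta) for the linear-softmax policy."""
    probs = softmax_policy(theta, phi_s)
    return phi_s[a] - probs @ phi_s    # phi(s,a) - E_{b~pi}[phi(s,b)]
```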
You'll notice that we used a trick in the gradient equation. This is referred to as the log-likelihood (or likelihood-ratio) trick and comes from the calculus chain rule: \[\begin{align}\frac{d}{dx} \log f(x) = \frac{1}{f(x)}\frac{df}{dx} \\ \frac{df}{dx} = f(x) \frac{d}{dx} \log f(x) \end{align}\] We apply this trick because it turns the sums over states and actions into an expectation under \(\pi_{\theta}\), which we can estimate from sampled experience without knowing the stationary distribution \(d(s)\).
Since we're learning the policy now, you may be wondering where we will get the \(Q_{\pi_{\theta}}(s,a)\) that we need in the policy gradient. In the REINFORCE algorithm we use the return \(v_t\) as an unbiased sample of \(Q_{\pi_{\theta}}(s_t,a_t)\). We use the policy gradient theorem and update parameters using stochastic gradient ascent: \[\Delta \theta_t = \alpha \nabla_{\theta}\log \pi_{\theta}(s_t, a_t)v_t\] REINFORCE Algorithm
We generate a full episode (e.g. running a game from start to finish), since Monte-Carlo estimation requires waiting to see the complete outcome. Then for each time-step we update the parameters using gradient ascent.
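A minimal sketch of one REINFORCE episode, reusing the `softmax_policy` and `score` helpers sketched above and assuming a hypothetical environment interface `env.reset()` / `env.step(a)` and a `features(s)` routine returning the per-action feature matrix:

```python
import numpy as np

def reinforce_episode(env, theta, features, alpha=0.01, gamma=0.99):
    """Monte-Carlo policy gradient update over one full episode."""
    trajectory, s, done = [], env.reset(), False
    while not done:                                      # sample an episode with pi_theta
        phi_s = features(s)
        a = np.random.choice(len(phi_s), p=softmax_policy(theta, phi_s))
        s, r, done = env.step(a)                         # assumed interface: (s', r, done)
        trajectory.append((phi_s, a, r))

    v_t = 0.0
    for phi_s, a, r in reversed(trajectory):             # accumulate returns backwards
        v_t = r + gamma * v_t                            # return v_t from time t onwards
        theta = theta + alpha * score(theta, phi_s, a) * v_t   # gradient ascent step
    return theta
```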
Instead of using Monte-Carlo estimation to approximate the action-value function, we can use [[7 Value Function Approximation]] to estimate it: \[Q_{w}(s,a) \approx Q_{\pi_{\theta}}(s,a)\]We'll call the policy the actor and the estimated action-value function the critic. So we maintain two sets of parameters: the critic updates the action-value function parameters \(w\), and the actor updates the policy parameters \(\theta\) in the direction suggested by the critic.
Actor-critic algorithms follow an approximate policy gradient: \[\begin{align} \nabla_{\theta}J(\theta) &\approx \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) Q_w(s, a)] \\ \Delta \theta_t &= \alpha \nabla_{\theta}\log \pi_{\theta}(s_t, a_t)Q_{w}(s_t,a_t) \end{align}\] Before we move on, take a moment to pause and recall the learning process from [[7 Value Function Approximation]]. We learn \(Q_w(s,a)\) by performing gradient descent on the mean-squared error \[J(w) = \mathbb{E}_{\pi}[(q_{\pi}(S,A) - \hat{q}(S,A,w))^2]\]where \(q_{\pi}(S,A)\) is the true action-value, which in practice we substitute with a target (e.g. the return for Monte-Carlo, or the TD target \(r + \gamma \hat{q}(S',A',w)\) for TD(0)).
We now apply the actor-critic paradigm to action-values, using a linear function to approximate the action-value function: \[Q_w(s,a) = \phi (s,a)^Tw\]where \(\phi(s,a)\) is a feature vector containing relevant features of the environment for the state-action pair \((s,a)\). We will assume that the critic updates \(w\) using TD(0).
QAC:
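A minimal sketch of a single QAC update under this setup, assuming the linear-softmax actor (and its `score` helper) from earlier, a `features(s)` routine returning the per-action feature matrix whose rows are \(\phi(s,\cdot)\), and a TD(0) critic:

```python
import numpy as np

def qac_step(theta, w, s, a, r, s_next, a_next, features,
             alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
    """One QAC update: linear critic Q_w(s,a) = phi(s,a)^T w, softmax actor."""
    phi_s, phi_s_next = features(s), features(s_next)
    q_sa = phi_s[a] @ w                                    # critic estimate Q_w(s,a)

    # Critic: TD(0) update of w towards the target r + gamma * Q_w(s', a')
    td_error = r + gamma * (phi_s_next[a_next] @ w) - q_sa
    w = w + alpha_w * td_error * phi_s[a]

    # Actor: approximate policy gradient step, Delta theta = alpha * score * Q_w(s,a)
    theta = theta + alpha_theta * score(theta, phi_s, a) * q_sa
    return theta, w
```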
Monte-Carlo policy gradient still has high variance, for the same reasons discussed in [[5 Model-free Prediction]]. Temporal-Difference learning helps with this, but we can still benefit from further variance reduction.
The idea is that we can subtract a baseline function \(B(s)\) from the policy gradient, and this can reduce variance without changing the expectation: \[\begin{align} \mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi_\theta(s, a)B(s)\right] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s, a)B(s) \\ &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s, a) \\ &=0 \end{align}\]The last step holds because the probabilities sum to 1, so we take the gradient of the constant 1 and get 0.
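A quick numerical check of this identity using the linear-softmax policy sketched earlier (with arbitrary made-up parameters and features): the expected score \(\sum_a \pi(a|s)\nabla_\theta \log \pi(a|s)\) comes out as the zero vector, so multiplying it by any \(B(s)\) leaves the expectation at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
phi_s = rng.normal(size=(3, 4))          # 3 actions, 4 features (made-up state)

probs = softmax_policy(theta, phi_s)
expected_score = sum(p * score(theta, phi_s, a) for a, p in enumerate(probs))
print(np.allclose(expected_score, 0.0))  # True: the baseline term vanishes in expectation
```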
A good baseline is the state-value function \(B(s) = V_{\pi_{\theta}}(s)\), which represents the expected future return if we start from state \(s\) and follow policy \(\pi_\theta\). Subtracting it gives us the advantage function: \[\begin{align} A_{\pi_{\theta}}(s,a) &= Q_{\pi_{\theta}}(s,a) - V_{\pi_{\theta}}(s) \\ \nabla_{\theta}J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) \textcolor{red}{A_{\pi_{\theta}}(s, a)}] \end{align}\]
This is much more informative than using raw returns \(Q^{\pi_\theta}(s,a)\), because actions might have high absolute returns simply because the state itself is valuable, not because the action choice was particularly good. By subtracting the state value, we focus on the relative quality of actions rather than their absolute returns.
Additionally, this baseline reduces variance because we're essentially "centering" our estimates around zero instead of having them spread across a wide range of return values.
Since the advantage function can significantly reduce the variance of the policy gradient, the critic should really estimate the advantage function. An efficient way to do that uses the TD error, as follows:
For the true value function \(V_{\pi_{\theta}}(s)\), the TD error \(\delta_{\pi_{\theta}}\) \[\delta_{\pi_{\theta}} = r + \gamma V_{\pi_{\theta}}(s') - V_{\pi_{\theta}}(s)\] is an unbiased estimate of the advantage function: \[\begin{align} \mathbb{E}_{\pi_\theta} \left[\delta_{\pi_\theta} | s, a\right] &= \mathbb{E}_{\pi_\theta} \left[r + \gamma V_{\pi_\theta}(s') | s, a\right] - V_{\pi_\theta}(s) \\ &= Q_{\pi_\theta}(s, a) - V_{\pi_\theta}(s) \\ &= A_{\pi_\theta}(s, a) \end{align}\]This is because \(\mathbb{E}_{\pi_\theta} \left[r + \gamma V_{\pi_\theta}(s') | s, a\right]\) answers, "given that we're in state \(s\) and take action \(a\), what is the expected value of the immediate reward plus the discounted future value, averaged over all the rewards and next states that could happen?", and this is exactly \(Q_{\pi_{\theta}}(s,a)\)!
In practice we don't have the true value function, so we use an estimated value function \(V_v\) and get an approximate TD error, which is an approximately unbiased estimate of the advantage function: \[\begin{align} \delta_v &= r + \gamma V_v(s') - V_v(s) \\ \nabla_{\theta}J(\theta) &\approx \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi(a|s; \theta) \textcolor{red}{\delta_v}] \end{align}\] This approach only requires one set of critic parameters \(v\), as opposed to having to learn both the state- and action-value functions with two sets of critic parameters.
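A minimal sketch of one such TD actor-critic update, assuming the linear-softmax actor (and `score` helper) from earlier, a linear state-value critic \(V_v(s) = \phi(s)^T v\) as in the setup below, and hypothetical feature helpers `features(s)` (per-action matrix for the actor) and `state_features(s)` (vector for the critic):

```python
import numpy as np

def td_actor_critic_step(theta, v, s, a, r, s_next, features, state_features,
                         alpha_theta=0.01, alpha_v=0.05, gamma=0.99):
    """One advantage actor-critic update using the TD error as the advantage estimate."""
    psi_s, psi_s_next = state_features(s), state_features(s_next)

    # Approximate TD error: delta_v = r + gamma * V_v(s') - V_v(s)
    delta = r + gamma * (psi_s_next @ v) - (psi_s @ v)

    # Critic: semi-gradient TD(0) update of the state-value parameters v
    v = v + alpha_v * delta * psi_s

    # Actor: policy gradient step with the TD error standing in for the advantage
    theta = theta + alpha_theta * score(theta, features(s), a) * delta
    return theta, v
```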
Setup: We use linear function approximation for the state-value function: \[V_v(s) = \phi(s)^T v\] where \(\phi(s)\) is a feature vector representing state \(s\), and \(v\) are the learned parameters.
Objective: Minimize the mean-squared error between predicted and target values: \[J(v) = \mathbb{E}_{\pi}[(\text{target} - V_v(s))^2]\]
The critic can estimate the value function \(V_v(s)\), similar to how we did in [[7 Value Function Approximation]], from targets at many different time-scales: e.g. the return (MC), the TD(0) target \(r + \gamma V_v(s')\), or the \(\lambda\)-return (TD(\(\lambda\))).
Setup: We directly parametrize the policy: \[\pi_{\theta}(a|s) = \mathbb{P}[a|s; \theta]\] Objective: Maximize expected return:\[J(\theta) = \sum_{s}d_{\pi_{\theta}}(s)V_{\pi_{\theta}}(s)\] Policy Gradient:\[\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{A_{\pi_\theta}(s, a)}]\] Advantage Function: Measures how much better an action is compared to average: \[A_{\pi_\theta}(s,a) = Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s)\] The policy gradient can also be estimated at many time-scales:
The policy gradient has many equivalent forms: \[\begin{align} \nabla_{\theta}J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{v_t}] & \text{REINFORCE} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{Q^w(s, a)}] & \text{Q Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{A^w(s, a)}] & \text{Advantage Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{\delta}] & \text{TD Actor-Critic} \\ &= \mathbb{E}_{\pi_\theta}[\nabla_{\theta} \log \pi_{\theta}(s, a) \textcolor{red}{\delta e}] & \text{TD($\lambda$) Actor-Critic} \end{align}\]
Each leads to a stochastic gradient ascent algorithm. The critic uses policy evaluation (e.g. MC or TD learning) to estimate \(Q^{\pi}(s,a)\), \(A^{\pi}(s,a)\), or \(V^{\pi}(s)\).