CS285 Assignment 2: Policy Gradients
2.1 Policy gradient
Recall that the reinforcement learning objective is to learn a θ∗ that maximizes the objective function:
J(θ) = Eτ∼πθ(τ) [r(τ)] (1)
where each rollout τ is of length T, as follows:
πθ(τ) = p(s0, a0, ..., sT−1, aT−1) = p(s0) πθ(a0|s0) Π_{t=1}^{T−1} p(st|st−1, at−1) πθ(at|st)
and
r(τ) = r(s0, a0, ..., sT−1, aT−1) = Σ_{t=0}^{T−1} r(st, at).
The policy gradient approach is to directly take the gradient of this objective:
∇θJ(θ) = ∇θ ∫ πθ(τ) r(τ) dτ (2)
= ∫ πθ(τ) ∇θ log πθ(τ) r(τ) dτ (3)
= Eτ∼πθ(τ) [∇θ log πθ(τ) r(τ)]. (4)
In practice, the expectation over trajectories τ can be approximated from a batch of N sampled trajectories:
∇θJ(θ) ≈ (1/N) Σ_{i=1}^{N} ∇θ log πθ(τi) r(τi) (6)
= (1/N) Σ_{i=1}^{N} (Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit)) (Σ_{t=0}^{T−1} r(sit, ait)). (7)
Here we see that the policy πθ is a probability distribution over the action space, conditioned on the state. In the agent-environment loop, the agent samples an action at from πθ(·|st) and the environment responds with a reward r(st,at).
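To make the sample estimate in Equation (4) concrete, here is a minimal NumPy sketch for a toy softmax policy over two actions, where the score ∇θ log πθ(a) is computed analytically. The trajectories, returns, and helper names are made up for illustration, not part of the homework code.

```python
import numpy as np

# Toy softmax policy over 2 actions with logits theta.
# For a softmax policy, the score is ∇θ log πθ(a) = onehot(a) − softmax(theta).
theta = np.zeros(2)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(a):  # ∇θ log πθ(a)
    g = -softmax(theta)
    g[a] += 1.0
    return g

# Pretend we sampled N = 3 one-step trajectories as (action, return) pairs.
trajs = [(0, 1.0), (1, 0.0), (0, 2.0)]

# Equation (4), approximated by a sample mean over trajectories:
grad_estimate = np.mean([score(a) * r for a, r in trajs], axis=0)
print(grad_estimate)  # [ 0.5 -0.5]
```

The gradient pushes probability mass toward action 0, which received higher returns in the sampled batch.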
2.2 Variance Reduction
2.2.1 Reward-to-go
One way to reduce the variance of the policy gradient is to exploit causality: the notion that the policy cannot affect rewards in the past. This yields the following modified objective, where the sum of rewards here does not include the rewards achieved prior to the time step at which the policy is being queried. This sum of rewards is a sample estimate of the Q function, and is referred to as the “reward-to-go.”
∇θJ(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit) (Σ_{t′=t}^{T−1} r(sit′, ait′)). (8)
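The undiscounted reward-to-go is just a reversed cumulative sum. A minimal NumPy sketch (illustrative, not the homework's helper):

```python
import numpy as np

# Undiscounted reward-to-go: rtg[t] = sum_{t'=t}^{T-1} r[t'].
# Computed as a cumulative sum over the reversed reward sequence.
def reward_to_go(rewards):
    return np.cumsum(rewards[::-1])[::-1]

print(reward_to_go(np.array([1.0, 2.0, 3.0])))  # [6. 5. 3.]
```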
2.2.2 Discounting
Multiplying the rewards by a discount factor γ can be interpreted as encouraging the agent to focus more on rewards that are closer in time, and less on rewards that are further in the future. This can also be thought of as a means of reducing variance (because there is more variance possible in futures that are further away). We saw in lecture that the discount factor can be incorporated in two ways, as shown below.
The first way applies the discount to the rewards of the full trajectory:
∇θJ(θ) ≈ (1/N) Σ_{i=1}^{N} (Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit)) (Σ_{t′=0}^{T−1} γ^{t′} r(sit′, ait′)) (9)
and the second way applies the discount to the “reward-to-go:”
∇θJ(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit) (Σ_{t′=t}^{T−1} γ^{t′−t} r(sit′, ait′)). (10)
2.2.3 Baseline
Another variance reduction method is to subtract a baseline (that is a constant with respect to τ) from the sum of rewards:
∇θJ(θ) = ∇θEτ∼πθ(τ) [r(τ) − b]. (11)
This leaves the policy gradient unbiased because
∇θEτ∼πθ(τ) [b] = Eτ∼πθ(τ) [∇θ logπθ(τ) · b] = 0.
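Spelling out why this term vanishes, using the score-function identity and the fact that πθ integrates to 1:

```latex
\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[b]
  = \int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\, b \, d\tau
  = b \int \nabla_\theta \pi_\theta(\tau)\, d\tau
  = b\,\nabla_\theta \int \pi_\theta(\tau)\, d\tau
  = b\,\nabla_\theta (1) = 0.
```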
In this assignment, we will implement a value function Vφπ which acts as a state-dependent baseline. This value function will be trained to approximate the sum of future rewards starting from a particular state:
Vφπ(st) ≈ Σ_{t′=t}^{T−1} Eπθ [r(st′, at′) | st],
so the approximate policy gradient now looks like this:
∇θJ(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit) (Σ_{t′=t}^{T−1} γ^{t′−t} r(sit′, ait′) − Vφπ(sit)). (13)
2.2.4 Generalized Advantage Estimation
The quantity Σ_{t′=t}^{T−1} γ^{t′−t} r(st′, at′) − Vφπ(st) from the previous policy gradient expression (removing the i index for clarity) can be interpreted as an estimate of the advantage function:
Aπ(st,at) = Qπ(st,at) − V π(st), (14)
where Qπ(st,at) is estimated using Monte Carlo returns and V π(st) is estimated using the learned value function Vφπ. We can further reduce variance by also using Vφπ in place of the Monte Carlo returns to estimate the advantage function as:
Aπ(st, at) ≈ δt = r(st, at) + γVφπ(st+1) − Vφπ(st), (15)
with the edge case δT−1 = r(sT−1,aT−1) − Vφπ(sT−1). However, this comes at the cost of introducing bias to our policy gradient estimate, due to modeling errors in Vφπ. We can instead use a combination of n-step Monte Carlo returns and Vφπ to estimate the advantage function as:
Aπn(st, at) = Σ_{t′=t}^{t+n} γ^{t′−t} r(st′, at′) + γ^{n+1} Vφπ(st+n+1) − Vφπ(st). (16)
Increasing n incorporates the Monte Carlo returns more heavily in the advantage estimate, which lowers bias and increases variance, while decreasing n does the opposite. Note that n = T − t − 1 recovers the unbiased but higher variance Monte Carlo advantage estimate used in (13), while n = 0 recovers the lower variance but higher bias advantage estimate δt.
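A small sketch of the n-step estimate in Equation (16), under the assumption that states past the end of the episode have value zero; all names are illustrative, not the homework's API. The two printed cases demonstrate the endpoints described above (n = 0 gives δt; n = T − t − 1 gives the Monte Carlo advantage):

```python
# n-step advantage estimate for a single trajectory, with V(s) = 0
# for any state index past the end of the episode.
def n_step_advantage(rewards, values, t, n, gamma):
    T = len(rewards)
    end = min(t + n + 1, T)                  # one past the last reward index
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n + 1 < T:                        # bootstrap only if s_{t+n+1} exists
        ret += gamma ** (n + 1) * values[t + n + 1]
    return ret - values[t]

rewards = [1.0, 1.0, 1.0]
values  = [0.5, 0.5, 0.5]
# n = 0 recovers the TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t):
print(n_step_advantage(rewards, values, t=0, n=0, gamma=1.0))  # 1.0
# n = T - t - 1 recovers the Monte Carlo advantage:
print(n_step_advantage(rewards, values, t=0, n=2, gamma=1.0))  # 2.5
```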
We can combine multiple n-step advantage estimates as an exponentially weighted sum, which is known as the generalized advantage estimator (GAE). Let λ ∈ [0,1]. Then we define:
AπGAE(st, at) = (1 / Σ_{n=0}^{∞} λ^n) Σ_{n=0}^{∞} λ^n Aπn(st, at), (17)
where 1 / Σ_{n=0}^{∞} λ^n is a normalizing constant. Note that a higher λ emphasizes advantage estimates with higher values of n, and a lower λ does the opposite. Thus, λ serves as a control for the bias-variance tradeoff, where increasing λ decreases bias and increases variance. In the infinite horizon case (T = ∞), we can show:
AπGAE(st, at) = Σ_{t′=t}^{∞} (γλ)^{t′−t} δt′, (19)
where we have omitted the derivation for brevity (see the GAE paper https://arxiv.org/pdf/1506.02438.pdf for details). In the finite horizon case, we can write:
AπGAE(st, at) = Σ_{t′=t}^{T−1} (γλ)^{t′−t} δt′, (20)
which we can implement efficiently using the recursion
AπGAE(st, at) = δt + γλ AπGAE(st+1, at+1). (21)
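A sketch of this backward recursion for a single finite trajectory, assuming V(sT) = 0 at the episode boundary; the function name and array layout are illustrative, not the homework's API:

```python
import numpy as np

# GAE advantages in one backward pass over a single trajectory,
# taking V(s_T) = 0 at the episode boundary.
def gae(rewards, values, gamma, lam):
    T = len(rewards)
    values = np.append(values, 0.0)          # append V(s_T) = 0
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last    # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = last
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values  = np.array([0.0, 0.0, 0.0])
print(gae(rewards, values, gamma=1.0, lam=1.0))  # [3. 2. 1.]  (Monte Carlo)
print(gae(rewards, values, gamma=1.0, lam=0.0))  # [1. 1. 1.]  (one-step TD)
```

The two extremes recover the endpoints discussed above: λ = 1 gives the Monte Carlo advantage, λ = 0 gives the one-step TD error δt.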
3 Overview of Implementation
3.1 Files
To implement policy gradients, we will be building up the code that we started in homework 1. All files needed to run your code are in the hw2 folder, but there will be some blanks you will fill with your solutions from homework 1. These locations are marked with # TODO: get this from hw1 and are found in the following files:
• infrastructure/rl_trainer.py
• infrastructure/utils.py
• policies/MLP_policy.py
After bringing in the required components from the previous homework, you can begin work on the new policy gradient code. These placeholders are marked with TODO, located in the following files:
• agents/pg_agent.py
• policies/MLP_policy.py
The script to run the experiments is found in scripts/run_hw2.py (for the local option) or scripts/run_hw2.ipynb (for the Colab option).
3.2 Overview
As in the previous homework, the main training loop is implemented in infrastructure/rl_trainer.py.
The policy gradient algorithm uses the following 3 steps:
1. Sample trajectories by generating rollouts under your current policy.
2. Estimate returns and compute advantages. This is executed in the train function of pg_agent.py.
3. Train/Update parameters. The computational graph for the policy and the baseline, as well as the update functions, are implemented in policies/MLP_policy.py.
4 Implementing Policy Gradients
You will be implementing two different return estimators within pg_agent.py. The first (“Case 1” within calculate_q_vals) uses the discounted cumulative return of the full trajectory and corresponds to the “vanilla” form of the policy gradient (Equation 9):
r(τi) = Σ_{t′=0}^{T−1} γ^{t′} r(sit′, ait′). (22)
The second (“Case 2”) uses the “reward-to-go” formulation from Equation 10:
r(τi) = Σ_{t′=t}^{T−1} γ^{t′−t} r(sit′, ait′). (23)
Note that these differ only by the starting point of the summation.
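The two estimators can be sketched as follows; the helper names are hypothetical, and the homework's calculate_q_vals may be organized differently:

```python
import numpy as np

def discounted_return(rewards, gamma):
    # Case 1: every timestep is assigned the same full-trajectory
    # discounted return, sum_{t'=0}^{T-1} gamma^{t'} r_{t'}.
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return np.full(len(rewards), total)

def discounted_cumsum(rewards, gamma):
    # Case 2: discounted reward-to-go, sum_{t'=t}^{T-1} gamma^{t'-t} r_{t'},
    # computed with a backward recursion.
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

r = [1.0, 2.0, 3.0]
print(discounted_return(r, 0.5))  # [2.75 2.75 2.75]
print(discounted_cumsum(r, 0.5))  # [2.75 3.5  3.  ]
```

Note that the two agree at t = 0, as they must: the reward-to-go from the first timestep is the full-trajectory return.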
Implement these return estimators, as well as the remaining sections marked TODO in the code. For the small-scale experiments, you may skip those sections that are run only if nn_baseline is True; we will return to baselines in Section 6. (These sections are in MLPPolicyPG:update and PGAgent:estimate_advantage.)
5 Small-Scale Experiments
After you have implemented all non-baseline code from Section 4, you will run two small-scale experiments to get a feel for how different settings impact the performance of policy gradient methods.
Experiment 1 (CartPole). Run multiple experiments with the PG algorithm on the discrete CartPole-v0 environment, using the following commands:
python cs285/scripts/run_hw2.py --env_name CartPole-v0 -n 100 -b 1000 \
-dsa --exp_name q1_sb_no_rtg_dsa
python cs285/scripts/run_hw2.py --env_name CartPole-v0 -n 100 -b 1000 \
-rtg -dsa --exp_name q1_sb_rtg_dsa
python cs285/scripts/run_hw2.py --env_name CartPole-v0 -n 100 -b 1000 \
-rtg --exp_name q1_sb_rtg_na
python cs285/scripts/run_hw2.py --env_name CartPole-v0 -n 100 -b 5000 \
-dsa --exp_name q1_lb_no_rtg_dsa
python cs285/scripts/run_hw2.py --env_name CartPole-v0 -n 100 -b 5000 \
-rtg -dsa --exp_name q1_lb_rtg_dsa
python cs285/scripts/run_hw2.py --env_name CartPole-v0 -n 100 -b 5000 \
-rtg --exp_name q1_lb_rtg_na
What’s happening here:
• -n : Number of iterations.
• -b : Batch size (number of state-action pairs sampled while acting according to the current policy at each iteration).
• -dsa : Flag: if present, sets standardize_advantages to False. Otherwise, by default, standardizes advantages to have a mean of zero and standard deviation of one.
• -rtg : Flag: if present, sets reward_to_go=True. Otherwise, reward_to_go=False by default.
• --exp_name : Name for experiment, which goes into the name for the data logging directory.
Various other command line arguments will allow you to set batch size, learning rate, network architecture, and more. You can change these as well, but keep them fixed between the 6 experiments mentioned above.
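For reference, the advantage standardization that -dsa disables can be sketched in a few lines (a small epsilon guards against division by zero; the helper name is illustrative):

```python
import numpy as np

# Normalize advantages to mean 0 and standard deviation 1.
def standardize(adv, eps=1e-8):
    return (adv - adv.mean()) / (adv.std() + eps)

a = np.array([1.0, 2.0, 3.0])
print(standardize(a))  # approx [-1.2247  0.  1.2247]
```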
6 Implementing Neural Network Baselines
You will now implement a value function as a state-dependent neural network baseline. This will require filling in some TODO sections skipped in Section 4. In particular:
• This neural network will be trained in the update method of MLPPolicyPG along with the policy gradient update.
• In pg_agent.py:estimate_advantage, the predictions of this network will be subtracted from the reward-to-go to yield an estimate of the advantage. This implements the baseline-adjusted estimate Aπ(sit, ait) ≈ Σ_{t′=t}^{T−1} γ^{t′−t} r(sit′, ait′) − Vφπ(sit) from Equation (13).
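At its core this is an elementwise subtraction; a minimal sketch with made-up stand-in arrays (depending on the implementation, the value network's predictions may first need rescaling to the q-values' mean and standard deviation if the baseline was trained on normalized targets):

```python
import numpy as np

# Made-up stand-in arrays for one batch.
q_values = np.array([2.75, 3.5, 3.0])   # discounted reward-to-go
v_pred   = np.array([2.5, 3.0, 3.5])    # baseline predictions V_phi(s_t)

# Subtract the baseline from the reward-to-go to estimate the advantage.
advantages = q_values - v_pred
print(advantages)  # [ 0.25  0.5  -0.5 ]
```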
7 More Complex Experiments
Note: The following tasks take quite a bit of time to train. Please start early! For all remaining experiments, use the reward-to-go estimator.
Experiment 3 (LunarLander). You will now use your policy gradient implementation to learn a controller for LunarLanderContinuous-v2. The purpose of this problem is to test and help you debug your baseline implementation from Section 6.
Run the following command:
python cs285/scripts/run_hw2.py \
--env_name LunarLanderContinuous-v2 --ep_len 1000 \
--discount 0.99 -n 100 -l 2 -s 64 -b 40000 -lr 0.005 \
--reward_to_go --nn_baseline --exp_name q3_b40000_r0.005
8 Implementing Generalized Advantage Estimation
You will now use the value function you previously implemented to implement a simplified version of GAE-λ.
This will require filling in the remaining TODO section in pg_agent.py:estimate_advantage.
Experiment 5 (HopperV2). You will now use your implementation of policy gradient with generalized advantage estimation to learn a controller for a version of Hopper-v2 with noisy actions. Search over λ ∈ [0,0.95,0.99,1] to replace <λ> below. Note that with a correct implementation, λ = 1 is equivalent to the vanilla neural network baseline estimator. Do not change any of the other hyperparameters (e.g. batch size, learning rate).
python cs285/scripts/run_hw2.py \
--env_name Hopper-v2 --ep_len 1000 \
--discount 0.99 -n 300 -l 2 -s 32 -b 2000 -lr 0.001 \
--reward_to_go --nn_baseline --action_noise_std 0.5 --gae_lambda <λ> \
--exp_name q5_b2000_r0.001_lambda<λ>