CS285 Deep Reinforcement Learning HW4: Model-Based RL
In general, model-based reinforcement learning consists of two main parts: learning a dynamics function to model observed state transitions, and then using predictions from that model in some way to decide what to do (e.g., use model predictions to learn a policy, or use model predictions directly in an optimization setup to maximize predicted rewards).
In this assignment, you will do the latter. You will implement both the process of learning a dynamics model, as well as the process of creating a controller to perform action selection through the use of these model predictions. For references to this type of approach, see this paper and this paper.
2 Model-Based Reinforcement Learning
We will now provide a brief overview of model-based reinforcement learning (MBRL), and the specific type of MBRL you will be implementing in this homework. Please see Lecture 11: Model-Based Reinforcement Learning (with specific emphasis on the slides near page 9) for additional details.
MBRL consists primarily of two aspects: (1) learning a dynamics model and (2) using the learned dynamics models to plan and execute actions that minimize a cost function (or maximize a reward function).
2.1 Dynamics Model
In this assignment, you will learn a neural network dynamics model fθ of the form
\hat{\Delta}_{t+1} = f_\theta(s_t, a_t)    (1)
which predicts the change in state given the current state and action. So, given the prediction \hat{\Delta}_{t+1}, you can generate the next state prediction with
\hat{s}_{t+1} = s_t + \hat{\Delta}_{t+1}.    (2)
See the previously referenced paper for intuition on why we might want our network to predict state differences, instead of directly predicting next state.
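To make the delta-prediction idea concrete, below is a minimal PyTorch sketch of such a dynamics model. This is not the assignment's starter code; the class name, layer sizes, and helper method are illustrative choices.

import torch
import torch.nn as nn

class DeltaDynamicsModel(nn.Module):
    """Illustrative dynamics model that predicts the change in state,
    delta_hat_{t+1} = f_theta(s_t, a_t), as in Eqn. 1."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Predict the state difference from the concatenated (state, action) input.
        return self.net(torch.cat([state, action], dim=-1))

    def predict_next_state(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Eqn. 2: s_hat_{t+1} = s_t + delta_hat_{t+1}.
        return state + self.forward(state, action)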
You will train fθ in a standard supervised learning setup, by performing gradient descent on the following objective:
L(\theta) = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| (s_{t+1} - s_t) - f_\theta(s_t, a_t) \right\|_2^2    (3)
          = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| \Delta_{t+1} - \hat{\Delta}_{t+1} \right\|_2^2    (4)
In practice, it’s helpful to normalize the target of a neural network. So in the code, we’ll train the network to predict a normalized version of the change in state, as in
L(\theta) = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| \text{Normalize}(s_{t+1} - s_t) - f_\theta(s_t, a_t) \right\|_2^2    (5)
Since fθ is trained to predict the normalized state difference, you generate the next prediction with
\hat{s}_{t+1} = s_t + \text{Unnormalize}(f_\theta(s_t, a_t)).    (6)
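As a rough illustration of Eqns. 5 and 6, here is a hedged sketch of one training step and one prediction step, assuming the DeltaDynamicsModel sketched earlier and precomputed statistics delta_mean and delta_std of the state differences in the dataset; the exact normalization convention in the starter code may differ.

import torch
import torch.nn.functional as F

def dynamics_training_step(model, optimizer, states, actions, next_states,
                           delta_mean, delta_std, eps=1e-8):
    # Eqn. 5: regress the normalized state difference.
    target = (next_states - states - delta_mean) / (delta_std + eps)
    pred = model(states, actions)            # f_theta(s_t, a_t)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def predict_next_state(model, states, actions, delta_mean, delta_std, eps=1e-8):
    # Eqn. 6: s_hat_{t+1} = s_t + Unnormalize(f_theta(s_t, a_t)).
    with torch.no_grad():
        pred = model(states, actions)
    return states + pred * (delta_std + eps) + delta_mean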
2.2 Action Selection
Given the learned dynamics model, we now want to select and execute actions that minimize a known cost function (or maximize a known reward function). Ideally, you would calculate these actions by solving the following optimization:
a_t^* = \arg\min_{a_{t:\infty}} \sum_{t'=t}^{\infty} c(\hat{s}_{t'}, a_{t'}) \quad \text{where} \quad \hat{s}_{t'+1} = \hat{s}_{t'} + f_\theta(\hat{s}_{t'}, a_{t'}).    (7)
However, solving Eqn. 7 is impractical for two reasons: (1) planning over an infinite sequence of actions is impossible and (2) the learned dynamics model is imperfect, so using it to plan in such an open-loop manner will lead to accumulating errors over time and planning far into the future will become very inaccurate.
Instead, one alternative is to solve the following gradient-free optimization problem:
A^* = \arg\min_{\{A^{(0)}, \ldots, A^{(K-1)}\}} \sum_{t'=t}^{t+H-1} c(\hat{s}_{t'}, a_{t'}) \quad \text{s.t.} \quad \hat{s}_{t'+1} = \hat{s}_{t'} + f_\theta(\hat{s}_{t'}, a_{t'}),    (8)
in which each A^{(k)} = (a_t^{(k)}, \ldots, a_{t+H-1}^{(k)}) is a random action sequence of length H. What Eqn. 8 says is to consider K random action sequences of length H, predict the result (i.e., future states) of taking each of these action sequences using the learned dynamics model fθ, evaluate the cost/reward associated with each candidate action sequence, and select the best action sequence. Note that this approach only plans H steps into the future, which is desirable because it prevents accumulating model error, but is also limited because it may not be sufficient for solving long-horizon tasks.
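The following numpy sketch illustrates this random-shooting procedure, phrased as reward maximization (equivalent to minimizing the cost c = -r). The helpers predict_next_state and reward_fn are assumed to operate on batches, and all names and shapes are illustrative rather than taken from the starter code.

import numpy as np

def random_shooting(state, predict_next_state, reward_fn,
                    action_dim, action_low, action_high, K=1000, H=10):
    # Sample K candidate action sequences of length H: shape (K, H, action_dim).
    actions = np.random.uniform(action_low, action_high, size=(K, H, action_dim))
    states = np.tile(state, (K, 1))              # replicate the current state K times
    total_reward = np.zeros(K)
    for h in range(H):
        total_reward += reward_fn(states, actions[:, h])
        states = predict_next_state(states, actions[:, h])   # roll out the learned model
    best = np.argmax(total_reward)               # pick the best candidate sequence
    return actions[best]                         # shape (H, action_dim)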
A better alternative to this random-shooting optimization approach is the cross-entropy method (CEM), which is similar to random shooting, but iteratively improves the distribution from which actions are sampled. We first randomly initialize a set of K action sequences A^{(0)}, ..., A^{(K-1)}, as in random shooting. Then, we choose the J sequences with the highest predicted sum of discounted rewards as the "elite" action sequences. We then fit a diagonal Gaussian with the same mean and variance as the "elite" action sequences, and use this as our action sampling distribution for the next iteration. After repeating this process M times, we take the final mean of the Gaussian as the optimized action sequence. See Section 3.3 in this paper for more details.
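Below is a hedged numpy sketch of CEM under the same conventions. Here evaluate_sequences is an assumed helper that rolls out a (K, H, action_dim) batch of candidate sequences with the learned model and returns their predicted returns, and the parameter names (K, H, J, M) mirror the description above.

import numpy as np

def cem(state, evaluate_sequences, action_dim, action_low, action_high,
        K=1000, H=10, J=100, M=4):
    # Start from a broad diagonal Gaussian over H-step action sequences.
    mean = np.broadcast_to((action_high + action_low) / 2.0, (H, action_dim)).copy()
    std = np.broadcast_to((action_high - action_low) / 2.0, (H, action_dim)).copy()
    for _ in range(M):
        candidates = np.random.normal(mean, std, size=(K, H, action_dim))
        candidates = np.clip(candidates, action_low, action_high)
        returns = evaluate_sequences(state, candidates)      # shape (K,)
        elites = candidates[np.argsort(returns)[-J:]]        # top-J "elite" sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0)  # refit the Gaussian
    return mean                                              # optimized action sequence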
Additionally, since our model is imperfect and things will never go perfectly according to plan, we adopt a model predictive control (MPC) approach, where at every time step we perform random-shooting or CEM to select the best H-step action sequence, but then we execute only the first action from that sequence before replanning again at the next time step using updated state information. This reduces the effect of compounding errors when using our approximate dynamics model to plan too far into the future.
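A minimal sketch of this MPC loop is shown below, assuming an older gym-style environment whose step() returns a 4-tuple and a plan_action_sequence(obs) callable that wraps either random shooting or CEM; both assumptions are illustrative, not the starter code's interface.

def mpc_rollout(env, plan_action_sequence, max_steps=500):
    # Model predictive control: replan an H-step sequence at every step,
    # but execute only its first action before replanning.
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action_sequence = plan_action_sequence(obs)   # random shooting or CEM
        action = action_sequence[0]                   # execute only the first action
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs))
        obs = next_obs
        if done:
            break
    return trajectory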
2.3 On-Policy Data Collection
Although MBRL is in theory off-policy (meaning it can learn from any data), in practice it will perform poorly if you don't have on-policy data. In other words, a model trained only on randomly collected data will (in most cases) be insufficient to describe the parts of the state space that we actually care about. We can therefore use on-policy data collection in an iterative algorithm to improve overall task performance; a sketch of this iterative procedure is given below.
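The loop below is a rough sketch of that procedure; collect_random_rollouts, train_dynamics_model, and collect_mpc_rollouts are assumed helpers standing in for the corresponding pieces of the assignment, not actual starter-code functions.

def model_based_rl_loop(env, collect_random_rollouts, train_dynamics_model,
                        collect_mpc_rollouts, n_iterations=10):
    # Iterative on-policy data collection: alternate between fitting the
    # dynamics model on all data gathered so far and gathering new data
    # with the MPC controller.
    dataset = collect_random_rollouts(env)           # initial random-exploration data
    model = None
    for _ in range(n_iterations):
        model = train_dynamics_model(dataset)        # fit f_theta on D
        new_data = collect_mpc_rollouts(env, model)  # on-policy data from MPC planning
        dataset = dataset + new_data                 # aggregate: D <- D U D_new
    return model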
2.4 Ensembles
A simple and effective way to improve predictions is to use an ensemble of models. The idea is simple: rather than training one network fθ to make predictions, we'll train N independently initialized networks and average their predictions to get the final prediction
\hat{\Delta}_{t+1} = \frac{1}{N} \sum_{i=1}^{N} f_{\theta_i}(s_t, a_t).    (9)
In this assignment, you'll train an ensemble of networks and compare how different values of N affect the model's performance.
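A one-function PyTorch sketch of Eqn. 9, assuming each ensemble member predicts the (unnormalized) state difference as in the earlier sketches; normalization bookkeeping is omitted for brevity.

import torch

def ensemble_predict_next_state(models, states, actions):
    # Eqn. 9: average the delta predictions of the N ensemble members,
    # then add the averaged delta to the current state.
    deltas = torch.stack([m(states, actions) for m in models], dim=0)  # (N, B, state_dim)
    return states + deltas.mean(dim=0)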
3 Code
You will implement the MBRL algorithm described in the previous section.