CS285 Assignment 1: Imitation Learning
1 Behavioral Cloning
1. The starter code provides an expert policy for each of the MuJoCo tasks in OpenAI Gym. Fill in the blanks in the code marked with TODO to implement behavioral cloning. A command for running behavioral cloning is given in the README file.
We recommend that you read the files in the following order. For some files, you will need to fill in blanks, labeled TODO.
• scripts/run_hw1.py
• infrastructure/rl_trainer.py
• agents/bc_agent.py (another read-only file)
• policies/MLP_policy.py
• infrastructure/replay_buffer.py
• infrastructure/utils.py
• infrastructure/pytorch_utils.py
See the homework pdf for more details.
2. Run behavioral cloning (BC) and report results on two tasks: the Ant environment, where a behavioral cloning agent should achieve at least 30% of the performance of the expert, and one environment of your choosing where it does not. Here is how you can run the Ant task:
python cs285/scripts/run_hw1.py \
--expert_policy_file cs285/policies/experts/Ant.pkl \
--env_name Ant-v2 --exp_name bc_ant --n_iter 1 \
--expert_data cs285/expert_data/expert_data_Ant-v2.pkl \
--video_log_freq -1
When providing results, report the mean and standard deviation of your policy’s return over multiple rollouts in a table, and state which task was used. When comparing one that is working versus one that is not working, be sure to set up a fair comparison in terms of network size, amount of data, and number of training iterations. Provide these details (and any others you feel are appropriate) in the table caption.
Note: “Report the mean and standard deviation” means that your eval_batch_size should be greater than ep_len, so that you collect multiple rollouts when evaluating the performance of your trained policy. For example, if ep_len is 1000 and eval_batch_size is 5000, then you will collect approximately 5 trajectories (possibly more if any of them terminate early), and the logged Eval_AverageReturn and Eval_StdReturn represent the mean and standard deviation of your policy's return over these 5 rollouts. Make sure you include these parameters in the table caption as well.
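The relationship between the evaluation batch size, the episode length, and the number of rollouts can be sketched as follows. This is a minimal illustration, not the starter code: `run_episode` is a hypothetical callable standing in for a single rollout, and the loop assumes that evaluation keeps sampling episodes until at least eval_batch_size environment steps have been collected. The returns in the usage example are made-up placeholder values.

```python
import statistics


def collect_returns(run_episode, ep_len, eval_batch_size):
    """Roll out episodes until at least eval_batch_size env steps are collected.

    run_episode(max_steps) is a hypothetical callable that returns
    (episode_return, episode_length) for one rollout of at most max_steps.
    """
    returns, steps = [], 0
    while steps < eval_batch_size:
        ep_return, ep_steps = run_episode(ep_len)
        returns.append(ep_return)
        steps += ep_steps  # early termination adds fewer steps, so more rollouts
    return returns


def summarize(returns):
    """Mean and population standard deviation (same convention as np.std)."""
    return statistics.fmean(returns), statistics.pstdev(returns)


if __name__ == "__main__":
    # Placeholder returns for illustration only; every episode runs full length.
    fake_returns = iter([100.0, 110.0, 90.0, 105.0, 95.0])

    def fake_episode(max_steps):
        return next(fake_returns), max_steps

    rets = collect_returns(fake_episode, ep_len=1000, eval_batch_size=5000)
    mean, std = summarize(rets)
    print(f"{len(rets)} rollouts, mean={mean:.1f}, std={std:.2f}")
```

With ep_len equal to eval_batch_size, the loop would stop after a single rollout and the standard deviation would be trivially zero, which is why the batch size needs to be several times the episode length.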
Tip: To generate videos of the policy, remove the flag --video_log_freq -1. However, video logging is slower, so you probably want to keep this flag on while debugging.
3. Experiment with one hyperparameter that affects the performance of the behavioral cloning agent, such as the number of training steps, the amount of expert data provided, or something that you come up with yourself. For one of the tasks used in the previous question, show a graph of how the BC agent's performance varies with the value of this hyperparameter. In the caption for the graph, state the hyperparameter and a brief rationale for why you chose it.
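One way to organize such a sweep is to generate one command per hyperparameter value, each with its own experiment name so the logs stay separate. The sketch below only builds the argument lists; it does not run anything. The base arguments mirror the Ant command from the previous question, written with underscore-separated flag names. The swept flag `--num_agent_train_steps_per_iter` and the values in the usage example are assumptions for illustration; check the argument parser in scripts/run_hw1.py for the flags your copy of the starter code actually accepts.

```python
import shlex

# Base arguments taken from the Ant behavioral cloning command.
BASE_ARGS = [
    "python", "cs285/scripts/run_hw1.py",
    "--expert_policy_file", "cs285/policies/experts/Ant.pkl",
    "--env_name", "Ant-v2",
    "--n_iter", "1",
    "--expert_data", "cs285/expert_data/expert_data_Ant-v2.pkl",
    "--video_log_freq", "-1",
]


def sweep_commands(flag, values, exp_prefix="bc_ant"):
    """Return one argument list per value of the swept hyperparameter."""
    commands = []
    for value in values:
        # Encode the flag and value in the experiment name to keep logs apart.
        exp_name = f"{exp_prefix}_{flag.lstrip('-')}_{value}"
        commands.append(BASE_ARGS + ["--exp_name", exp_name, flag, str(value)])
    return commands


if __name__ == "__main__":
    # Assumed flag name and illustrative values; adjust to your starter code.
    for cmd in sweep_commands("--num_agent_train_steps_per_iter", [250, 500, 1000, 2000]):
        print(shlex.join(cmd))
```

Keeping every other argument fixed across the sweep is what makes the resulting graph attributable to the single hyperparameter being varied.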
2 DAgger
1. Once you've filled in all of the TODOs, you should be able to run DAgger.
python cs285/scripts/run_hw1.py \
--expert_policy_file cs285/policies/experts/Ant.pkl \
--env_name Ant-v2 --exp_name dagger_ant --n_iter 10 \
--do_dagger --expert_data cs285/expert_data/expert_data_Ant-v2.pkl \
--video_log_freq -1
2. Run DAgger and report results on the two tasks you tested previously with behavioral cloning (i.e., Ant and one other environment). Report your results in the form of a learning curve, plotting the number of DAgger iterations against the policy's mean return, with error bars showing the standard deviation. Include the performance of the expert policy and of the behavioral cloning agent on the same plot, as horizontal lines that span the plot. In the caption, state which task you used and any details regarding network architecture, amount of data, etc. (as in the previous section).
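A plot of this shape could be produced along the following lines. This is a sketch under stated assumptions rather than part of the starter code: it assumes matplotlib is installed, that you have already extracted the per-rollout evaluation returns for each iteration from your logs, and the function names are hypothetical. The numbers in the usage example are placeholders, not real results.

```python
import statistics


def summarize_iterations(returns_per_iter):
    """Turn per-iteration lists of rollout returns into (means, stds)."""
    means = [statistics.fmean(r) for r in returns_per_iter]
    stds = [statistics.pstdev(r) for r in returns_per_iter]  # population std
    return means, stds


def plot_learning_curve(means, stds, expert_return, bc_return, out_path):
    """Save a learning curve with error bars and two horizontal baselines."""
    # Imported lazily so summarize_iterations works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    iterations = list(range(len(means)))
    fig, ax = plt.subplots()
    ax.errorbar(iterations, means, yerr=stds, marker="o", capsize=3, label="DAgger")
    ax.axhline(expert_return, linestyle="--", color="green", label="Expert")
    ax.axhline(bc_return, linestyle=":", color="red", label="Behavioral cloning")
    ax.set_xlabel("DAgger iteration")
    ax.set_ylabel("Eval return (mean, error bars = std)")
    ax.legend()
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)


if __name__ == "__main__":
    # Placeholder data for illustration only, not measured results.
    per_iter = [[10.0, 14.0], [20.0, 26.0], [30.0, 32.0]]
    means, stds = summarize_iterations(per_iter)
    plot_learning_curve(means, stds, expert_return=40.0, bc_return=12.0,
                        out_path="dagger_curve.png")
```

Drawing the expert and behavioral cloning returns with `axhline` keeps them visible across the whole x-axis, which makes it easy to see at which iteration DAgger overtakes the behavioral cloning baseline.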