Deep RL Bootcamp - Hackathon

Results from Pascal Sager

January 15, 2021

Content
  • Documentation
  • Environment
  • Procedure
  • Selection of the Algorithms
  • Brief Explanation of the Algorithms
  • Results
  • Conclusion and Outlook
Documentation

I have created a detailed documentation on Github Pages: https://sagerpascal.github.io/rl-bootcamp-hackathon/


This presentation is only a summary of the documentation.

Environment

LunarLander-v2

  • Land between the two flags
  • Reward depends on landing speed, landing position, and engine usage (see the interaction sketch below)
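
As a point of reference, a minimal interaction sketch, assuming the classic Gym step API (obs, reward, done, info) and a random policy rather than the trained agent:

import gym

env = gym.make("LunarLander-v2")
obs = env.reset()                        # 8-dimensional observation
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # 4 discrete actions: noop, left, main, right engine
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(f"Return of a random policy: {total_reward:.1f}")   # the task counts as solved at an average return of 200+
env.close()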
Procedure

Approach:

  1. Set up Docker on the GPU cluster
  2. Select the algorithm
  3. Implement a simple baseline version
  4. Compare with high-quality reference implementations and improve the algorithm
  5. Tuning and documentation
  6. Repeat from 2.

Principles:

  • Quality over quantity: achieve good performance rather than implementing many different algorithms
  • Work like hell :)
Selection of the Algorithms
  • Inverse RL was not considered because the reward function is well defined by the environment
  • Model-based vs. model-free: does the agent have access to (or learn) a model of the environment (a model is a function that predicts state transitions and rewards)?
    → model-free algorithms were used
  • What to learn (see the sketch after this list):
    • Policy Optimization (model-free, on-policy): directly optimize the policy, which makes the algorithm more stable and reliable (see spinningup.openai.com/)
    • Q-Learning (model-free, off-policy): learn an approximator for the optimal action-value function; updates can use data collected at any point during training, which is more sample-efficient
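
For reference, a textbook-style sketch of the two learning targets (standard forms, not taken from the hackathon code):

% Policy optimization: ascend the gradient of the expected return
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]

% Q-learning: regress Q_\theta towards the Bellman target, built from
% transitions collected at any point during training
y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'),
\qquad L(\theta) = \big(Q_\theta(s, a) - y\big)^2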
Brief Explanation of the Algorithms

Double Deep Q-Network (Double DQN)

  • Collect rollouts in a replay buffer and use them to approximate the optimal Q-value function (see the sketch below)
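
A minimal sketch of the Double DQN loss in PyTorch (variable and function names are illustrative, not taken from the hackathon code base):

import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch of transitions sampled from the replay buffer
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) of the online network for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the online network *selects* the next action ...
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network *evaluates* it
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)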
Brief Explanation of the Algorithms

Clip Proximal Policy Optimization (Clip PPO2)

  • Use an on-policy value function to figure out how to update the policy; clip the policy-probability ratio so the new policy stays close to the old one (see the sketch below)
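
A minimal sketch of the PPO-Clip policy loss in PyTorch (illustrative only; the value-function and entropy terms are omitted):

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # probability ratio between the current and the old policy
    ratio = torch.exp(log_probs - old_log_probs)
    # unclipped and clipped surrogate objectives
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # pessimistic (minimum) objective, negated for gradient descent
    return -torch.min(surr_unclipped, surr_clipped).mean()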
Brief Explanation of the Algorithms

Advantage Actor-Critic (A2C)

  • Similar to PPO, but without clipping and with a policy-entropy bonus (see the sketch below)
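
A minimal sketch of the A2C loss in PyTorch (illustrative only): a policy-gradient term weighted by the advantage, a value-function loss, and an entropy bonus instead of clipping:

import torch

def a2c_loss(log_probs, entropies, values, returns,
             value_coef=0.5, entropy_coef=0.01):
    advantages = returns - values
    policy_loss = -(log_probs * advantages.detach()).mean()  # no clipping, unlike PPO
    value_loss = advantages.pow(2).mean()                     # critic regression towards the returns
    entropy_bonus = entropies.mean()                          # encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus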
Brief Explanation of the Algorithms

Deep Deterministic Policy Gradient (DDPG)

  • Collect rollouts (as for DQN)
  • Two prediction networks: Actor and Critic (as for Q-Actor-Critic)
  • Two target networks: Target-Actor and Target-Critic (as for DQN)
  • Learns a Q-function and a policy concurrently
  • Uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy

Algorithm not finished yet!
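
For completeness, a minimal sketch of one DDPG update step in PyTorch (illustrative only, since this implementation was not finished):

import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    states, actions, rewards, next_states, dones = batch

    # Critic: regress Q(s, a) towards the Bellman target built with the target networks
    with torch.no_grad():
        next_actions = target_actor(next_states)
        targets = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions).squeeze(1)
    critic_loss = F.mse_loss(critic(states, actions).squeeze(1), targets)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximise Q(s, pi(s)) by minimising its negative
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak averaging of the target networks
    with torch.no_grad():
        for net, target_net in ((critic, target_critic), (actor, target_actor)):
            for p, tp in zip(net.parameters(), target_net.parameters()):
                tp.mul_(1.0 - tau).add_(tau * p)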

Results

https://sagerpascal.github.io/rl-bootcamp-hackathon/results

Conclusion and Outlook

Algorithm

  • DQN: Very sample-efficient but less stable, reached a pretty high average score
  • PPO: Very stable, took longer to win the game
  • A2C: The proof that Policy Optimization can also win fast (with a little luck)

Next Steps

  • Finish DDPG and win LunarLander with a continuous action space
  • Cleanup code
  • Run sweeps
  • ...

Discussion
