Deep RL Bootcamp - Hackathon

Results from Pascal Sager

January 15, 2021

Content
  • Documentation
  • Environment
  • Procedure
  • Selection of the Algorithms
  • Brief Explanation of the Algorithms
  • Results
  • Conclusion and Outlook
Documentation

I have created a detailed documentation on Github Pages: https://sagerpascal.github.io/rl-bootcamp-hackathon/


This presentation is only a summary of the documentation.

Environment

LunarLander-v2

  • Land between the two flags
  • Reward depends on landing speed, landing position, and engine usage (see the interaction sketch below)
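
As a point of reference, a minimal interaction sketch, assuming the classic Gym step API (obs, reward, done, info) and a random policy rather than the trained agent:

import gym

env = gym.make("LunarLander-v2")
obs = env.reset()                        # 8-dimensional observation
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # 4 discrete actions: noop, left, main, right engine
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(f"Return of a random policy: {total_reward:.1f}")   # the task counts as solved at an average return of 200+
env.close()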
Procedure

Approach:

  1. Set up Docker on the GPU cluster
  2. Select the algorithm
  3. Implement a simple baseline version
  4. Compare with high-quality reference implementations and improve the algorithm
  5. Tuning and documentation
  6. Repeat from 2.

Principles:

  • Quality over quantity: achieve good performance rather than implementing many different algorithms
  • Work like hell :)
Selection of the Algorithms
  • Inverse RL was not considered because the reward function is well defined by the environment
  • Model-based vs. model-free: does the agent have access to (or learn) a model of the environment (a model is a function that predicts state transitions and rewards)?
    → model-free algorithms were used
  • What to learn (see the sketch after this list):
    • Policy Optimization (model-free, on-policy): directly optimize the policy, which makes the algorithm more stable and reliable (see spinningup.openai.com/)
    • Q-Learning (model-free, off-policy): learn an approximator for the optimal action-value function; updates can use data collected at any point during training, which is more sample-efficient
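
For reference, a textbook-style sketch of the two learning targets (standard forms, not taken from the hackathon code):

% Policy optimization: ascend the gradient of the expected return
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]

% Q-learning: regress Q_\theta towards the Bellman target, built from
% transitions collected at any point during training
y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'),
\qquad L(\theta) = \big(Q_\theta(s, a) - y\big)^2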
Brief Explanation of the Algorithms

Double Deep Q-Network (Double DQN)

  • Collect rollouts in a replay buffer and use them to approximate the optimal Q-value function (see the sketch below)
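
A minimal sketch of the Double DQN loss in PyTorch (variable and function names are illustrative, not taken from the hackathon code base):

import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch of transitions sampled from the replay buffer
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) of the online network for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the online network *selects* the next action ...
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network *evaluates* it
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)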
Brief Explanation of the Algorithms

Clip Proximal Policy Optimization (Clip PPO2)

  • Use an on-policy value function to figure out how to update the policy; clip the policy-probability ratio so the new policy stays close to the old one (see the sketch below)
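
A minimal sketch of the PPO-Clip policy loss in PyTorch (illustrative only; the value-function and entropy terms are omitted):

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # probability ratio between the current and the old policy
    ratio = torch.exp(log_probs - old_log_probs)
    # unclipped and clipped surrogate objectives
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # pessimistic (minimum) objective, negated for gradient descent
    return -torch.min(surr_unclipped, surr_clipped).mean()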
Brief Explanation of the Algorithms

Advantage Actor-Critic (A2C)

  • Similar to PPO, but without clipping and with a policy-entropy bonus (see the sketch below)
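
A minimal sketch of the A2C loss in PyTorch (illustrative only): a policy-gradient term weighted by the advantage, a value-function loss, and an entropy bonus instead of clipping:

import torch

def a2c_loss(log_probs, entropies, values, returns,
             value_coef=0.5, entropy_coef=0.01):
    advantages = returns - values
    policy_loss = -(log_probs * advantages.detach()).mean()  # no clipping, unlike PPO
    value_loss = advantages.pow(2).mean()                     # critic regression towards the returns
    entropy_bonus = entropies.mean()                          # encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus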
Brief Explanation of the Algorithms

Deep Deterministic Policy Gradient (DDPG)

  • Collect rollouts (as for DQN)
  • Two prediction networks: Actor and Critic (as for Q-Actor-Critic)
  • Two target networks: Target-Actor and Target-Critic (as for DQN)
  • Learns a Q-function and a policy concurrently
  • Uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy

Algorithm not finished yet!
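
For completeness, a minimal sketch of one DDPG update step in PyTorch (illustrative only, since this implementation was not finished):

import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    states, actions, rewards, next_states, dones = batch

    # Critic: regress Q(s, a) towards the Bellman target built with the target networks
    with torch.no_grad():
        next_actions = target_actor(next_states)
        targets = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions).squeeze(1)
    critic_loss = F.mse_loss(critic(states, actions).squeeze(1), targets)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximise Q(s, pi(s)) by minimising its negative
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak averaging of the target networks
    with torch.no_grad():
        for net, target_net in ((critic, target_critic), (actor, target_actor)):
            for p, tp in zip(net.parameters(), target_net.parameters()):
                tp.mul_(1.0 - tau).add_(tau * p)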

Results

https://sagerpascal.github.io/rl-bootcamp-hackathon/results

Conclusion and Outlook

Algorithm

  • DQN: Very sample-efficient but less stable, reached a pretty high average score
  • PPO: Very stable, took longer to win the game
  • A2C: The proof that Policy Optimization can also win fast (with a little luck)

Next Steps

  • Finish DDPG and win LunarLander with a continuous action space
  • Cleanup code
  • Run sweeps
  • ...

Discussion
