Deep RL Bootcamp - Hackathon

Results from Pascal Sager


Results Advantage Actor Critic (A2C)

A2C was the third algorithm I implemented. The main goal was to implement a policy optimization method that learns as fast as DQN, so I focused more on speed than on stability.
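
To make the idea concrete, here is a minimal sketch of one A2C update step in TensorFlow/Keras. It is illustrative rather than my exact training code: the model is assumed to return policy logits and a value estimate for a batch of states, and the loss coefficients are placeholder values.

import tensorflow as tf

# Minimal sketch of one A2C update step (illustrative, not the exact training code).
# `model` is assumed to return (policy logits, value estimate) for a batch of states.
def a2c_update(model, optimizer, states, actions, returns,
               value_coef=0.5, entropy_coef=0.01):
    with tf.GradientTape() as tape:
        logits, values = model(states)
        values = tf.squeeze(values, axis=-1)

        # Advantage: how much better the observed return is than the critic's estimate.
        advantages = returns - values

        # Actor loss: log-probability of the taken actions, weighted by the advantage.
        neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=actions, logits=logits)
        policy_loss = tf.reduce_mean(neg_log_prob * tf.stop_gradient(advantages))

        # Critic loss: regress the value head towards the observed returns.
        value_loss = tf.reduce_mean(tf.square(advantages))

        # Entropy bonus keeps the policy from collapsing too early.
        probs = tf.nn.softmax(logits)
        entropy = -tf.reduce_mean(
            tf.reduce_sum(probs * tf.math.log(probs + 1e-8), axis=-1))

        loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss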

Network Architecture

I tried several relatively small networks. In the end, a network with two hidden layers of 64 units each achieved the best performance.

Figure: comparison of the tested architectures (different_architecture)

The final network looks as follows:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 8)]               0         
_________________________________________________________________
dense (Dense)                (None, 64)                576       
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
=================================================================
Total params: 4,736
Trainable params: 4,736
Non-trainable params: 0
_________________________________________________________________
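
This trunk can be reproduced in Keras roughly as below. The printed summary only contains the shared layers; the ReLU activations, the actor and critic heads, and the action dimension in this sketch are assumptions and add parameters that do not appear in the summary above.

import tensorflow as tf
from tensorflow.keras import layers

n_actions = 4  # illustrative, depends on the environment

# Shared trunk as printed above: 8-dimensional input, two dense layers of 64 units.
inputs = tf.keras.Input(shape=(8,))
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(64, activation="relu")(x)

# Assumed heads for A2C (not part of the printed summary):
# action logits for the actor and a scalar state value for the critic.
logits = layers.Dense(n_actions)(x)
value = layers.Dense(1)(x)

model = tf.keras.Model(inputs=inputs, outputs=[logits, value])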

Parallel Environment

Since I had already implemented parallel environments for PPO2 (see Results PPO), I reused them for A2C as well (a rough sketch of the setup follows the plots below):

FPS: different_envs_fps

Reward: different_envs_reward
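
As a rough sketch, several environment copies can be stepped in parallel with gym's vector API, assuming the classic gym reset/step interface. The environment id and the worker count below are placeholders, not necessarily what was used in the experiments.

import gym
from gym.vector import AsyncVectorEnv

num_envs = 8  # placeholder worker count
env_fns = [lambda: gym.make("LunarLander-v2") for _ in range(num_envs)]  # placeholder env id
envs = AsyncVectorEnv(env_fns)

obs = envs.reset()                        # batched observations, shape (num_envs, obs_dim)
for _ in range(5):
    actions = envs.action_space.sample()  # one random action per environment
    obs, rewards, dones, infos = envs.step(actions)
envs.close()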

Different Optimizers and Schedulers

I tried different combinations of optimizers and schedulers (only a few of them are plotted below):

Different Optimizers: different_optimizers

Different Schedulers: different_scheduler

In the end I was a little lucky and found a combination that is a bit unstable but very fast. Since I wanted a solution that wins the game as quickly as possible, I used the combination of a linear learning-rate scheduler and the RMSprop optimizer.
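
A sketch of this combination in Keras: PolynomialDecay with power=1.0 gives a linear learning-rate decay, which can be passed directly to RMSprop. The concrete rates and the decay horizon below are illustrative values, not the ones from my runs.

import tensorflow as tf

# Linear decay: PolynomialDecay with power=1.0 decays the learning rate linearly
# from the initial rate to the end rate over `decay_steps` updates.
# All numbers are illustrative placeholders.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=7e-4,
    decay_steps=100_000,
    end_learning_rate=1e-5,
    power=1.0,
)

optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule)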

Hyperparameters

I tried different hyperparameters but did not do any systematic tuning such as grid search or random search. Nevertheless, this was enough to win the game rather quickly. (A small sketch of where gamma enters the update follows the plots below.)

Different Gamma: different_gamma

Different Learning Rate: different_lr
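
For reference, a small sketch of where gamma enters the computation: the bootstrapped discounted returns that the critic is regressed towards. The helper below is purely illustrative and not taken from the actual rollout code.

import numpy as np

def discounted_returns(rewards, dones, last_value, gamma=0.99):
    # rewards, dones: arrays of shape (T,) for a single environment rollout.
    # last_value: critic's value estimate for the state after the rollout.
    # gamma: discount factor (the hyperparameter compared in the plot above).
    returns = np.zeros_like(rewards, dtype=np.float32)
    running = last_value
    for t in reversed(range(len(rewards))):
        # A terminal step cuts the bootstrap: no future reward flows back across it.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns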

Video of the Result