Results Advantage Actor Critic (A2C)
A2C was the third algorithm I implemented. The main goal was to implement a policy optimization method that learns as fast as DQN, so I put the focus more on speed than on stability.
Network Architecture
I tried several relatively small networks. In the end, a network with two hidden layers of 64 units each achieved the best performance.
The resulting network looks as follows:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 8)]               0
_________________________________________________________________
dense (Dense)                (None, 64)                576
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160
=================================================================
Total params: 4,736
Trainable params: 4,736
Non-trainable params: 0
_________________________________________________________________
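For reference, here is a minimal tf.keras sketch that reproduces this body. The ReLU activation is an assumption (it does not affect the parameter counts), and the actor and critic output heads, which are not shown in the summary, would sit on top of this shared body.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_body(obs_dim=8, hidden_units=64):
    """Shared 2x64 body matching the summary above (4,736 parameters).
    The ReLU activation is an assumption and does not change the counts."""
    inputs = layers.Input(shape=(obs_dim,))
    x = layers.Dense(hidden_units, activation="relu")(inputs)
    x = layers.Dense(hidden_units, activation="relu")(x)
    return tf.keras.Model(inputs=inputs, outputs=x)

build_body().summary()  # prints the table shown above
```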
Parallel Environment
Since I had already implemented parallel environments for PPO2 (see Results PPO), I used them for A2C as well:
[Plots: frames per second (FPS) and reward for single vs. parallel environments]
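As a rough illustration of the idea, here is a sketch using gym's built-in vectorized API and the classic gym step/reset interface; the repository ships its own implementation, and the environment name and number of workers below are assumptions.

```python
import gym

NUM_ENVS = 8  # assumed number of parallel workers

# Sketch only: the repository uses its own parallel-environment code.
# "LunarLander-v2" is assumed here because of the 8-dimensional observation.
envs = gym.vector.AsyncVectorEnv(
    [lambda: gym.make("LunarLander-v2") for _ in range(NUM_ENVS)]
)

obs = envs.reset()                    # batched observations, shape (NUM_ENVS, 8)
actions = envs.action_space.sample()  # one action per worker
obs, rewards, dones, infos = envs.step(actions)
envs.close()
```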
Different Optimizers and Schedulers
I tried different combinations of optimizers and schedulers (only a few of them are plotted below):
[Plots: training curves for different optimizers and for different schedulers]
In the end I was a little lucky and found a combination that is a bit unstable but very fast. Since I wanted a solution that wins the game as fast as possible, I used the combination of a linear learning-rate scheduler and the RMSprop optimizer.
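In tf.keras, one way to express this combination is a PolynomialDecay schedule with power 1.0 (i.e. a linear decay) passed to RMSprop; the concrete values below are placeholders, not the ones used in this repository.

```python
import tensorflow as tf

initial_lr = 7e-4        # assumed starting learning rate
total_updates = 100_000  # assumed length of the schedule

# power=1.0 makes the decay linear from initial_lr down to end_learning_rate.
linear_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=initial_lr,
    decay_steps=total_updates,
    end_learning_rate=0.0,
    power=1.0,
)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=linear_schedule)
```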
Hyperparameters
I tried different hyperparameters, but did not do any tuning with grid search or random search. Nevertheless, this was enough to win the game rather quickly.
[Plots: training curves for different gamma values and different learning rates]
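To make the role of gamma concrete, here is a small sketch of how the discount factor enters the bootstrapped n-step returns that the critic is regressed towards in A2C; the value 0.99 is only an example, not necessarily the one used here.

```python
import numpy as np

def discounted_returns(rewards, dones, last_value, gamma=0.99):
    """Bootstrapped discounted returns for one rollout segment.
    gamma=0.99 is an example value, not the tuned one."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = last_value
    for t in reversed(range(len(rewards))):
        # Reset the bootstrap at episode boundaries.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns

# Example: three steps, episode ends at the last step.
print(discounted_returns([1.0, 0.0, 1.0], [0, 0, 1], last_value=0.5))
```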