Deep RL Bootcamp - Hackathon

Results from Pascal Sager


Navigation

Results Deep Q-Net

Different Network Architectures

Activation Function

After implementing an initial version, I started testing different network architectures. I did this for two reasons:

  • The results were very poor (the game could not be won).
  • The reward was very unstable (strong fluctuations).

One of the biggest improvements came from using a different activation function. After switching from tanh to ReLU, the reward was much more stable (note: this plot was recorded only after an initial round of parameter tuning, so a reward of >200 was already reached):

different_activation

Overall, using ReLU has led to much more stable results!
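As a minimal illustration, this is roughly how the activation can be swapped in a small Q-network (a PyTorch sketch with hypothetical layer sizes, not the exact architecture used here):

```python
import torch.nn as nn

def make_q_net(obs_dim: int, n_actions: int, hidden: int = 64, activation=nn.ReLU):
    # Small MLP Q-network; swapping `activation` between nn.Tanh and nn.ReLU
    # reproduces the comparison described above.
    return nn.Sequential(
        nn.Linear(obs_dim, hidden),
        activation(),
        nn.Linear(hidden, hidden),
        activation(),
        nn.Linear(hidden, n_actions),
    )

q_net_tanh = make_q_net(8, 4, activation=nn.Tanh)  # original, less stable variant
q_net_relu = make_q_net(8, 4, activation=nn.ReLU)  # more stable variant
```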

Different Number of Layers and Neurons

I also tried different architectures, although only a few could be tested in the limited time. Starting from the basic version, I varied the number of layers and neurons. I compared the networks by measuring the reward over time (rather than per episode), because I wanted to favor networks that reach high rewards quickly. In the end, however, the result is almost the same for both metrics (per time and per episode).

different_architectures

The best result was achieved with a rather small network with 22,021 parameters.
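For comparing network sizes, the parameter count can be computed directly from the model; a short sketch (the example layer sizes are hypothetical and do not reproduce the 22,021-parameter network):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Number of trainable parameters, the size measure quoted above.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical example with an 8-dimensional observation and 4 actions:
net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
print(count_parameters(net))  # (8*64 + 64) + (64*64 + 64) + (64*4 + 4) = 4996
```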

Different Loss Functions

To further improve the result, I read through Grokking Deep Reinforcement Learning (https://www.manning.com/books/grokking-deep-reinforcement-learning) and also compared my implementation with the OpenAI Baselines. Both sources recommend the Huber loss function. I therefore compared it with the MSE loss used in the original DQN paper (https://arxiv.org/abs/1312.5602).
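In PyTorch, switching between the two losses is a one-line change; a minimal sketch, assuming the predicted and target Q-values are already available as batched tensors:

```python
import torch.nn.functional as F

def td_loss(q_pred, q_target, use_huber=True):
    # The Huber loss (SmoothL1 in PyTorch) grows only linearly for large TD
    # errors, so it is less sensitive to outliers than the squared (MSE) loss.
    if use_huber:
        return F.smooth_l1_loss(q_pred, q_target)
    return F.mse_loss(q_pred, q_target)
```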

square_loss_vs_huber_loss

Unfortunately, this only brought a slight improvement, but since I had already implemented the Huber loss function, I kept it.

Double Q-Learning

I also compared Q-Learning with Double Q-Learning.

Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as in current action selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called Double Q-learning was proposed to correct this. (source: https://en.wikipedia.org/wiki/Q-learning)
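A minimal sketch of the difference in the target computation, in the Double DQN style with a separate target network (tensor names and shapes are assumptions, not the exact implementation):

```python
import torch

def td_targets(reward, done, next_obs, gamma, q_net, target_net, double=True):
    # Bootstrapped TD targets; `reward` and `done` are 1-D batch tensors.
    with torch.no_grad():
        if double:
            # Double Q-learning: the online network selects the action,
            # the target network evaluates it.
            next_actions = q_net(next_obs).argmax(dim=1, keepdim=True)
            next_q = target_net(next_obs).gather(1, next_actions).squeeze(1)
        else:
            # Standard Q-learning: max over the target network's own values.
            next_q = target_net(next_obs).max(dim=1).values
        return reward + gamma * (1.0 - done) * next_q
```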

Q_Learning_vs_Double_Q_Learning

Double Q-learning mainly made the results more stable.

Tuning Hyper-Parameters

Many different parameters could be tried out. However, the most influential ones are:

  • Buffer size
  • Gamma
  • Learning rate
  • Epsilon decay

I therefore limited myself to these parameters. First, I ran tests with each parameter individually to make sure it has the expected influence and to get a feeling for it.

Different Buffer Size: different_buffer_size

Different Gamma: mean_reward_different_gamma

Different Learning Rate: different_lr

Epsilon Decay: eps_decay_reward eps_decay_exploration_time
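For reference, a sketch of where these hyperparameters typically enter a DQN setup, with an example linear epsilon schedule (all values are purely illustrative, not the tuned ones):

```python
# Hypothetical values for illustration only; not the tuned values used here.
config = {
    "buffer_size": 100_000,     # replay buffer capacity
    "gamma": 0.99,              # discount factor
    "lr": 1e-3,                 # learning rate of the optimizer
    "eps_start": 1.0,           # initial exploration rate
    "eps_end": 0.01,            # final exploration rate
    "eps_decay_steps": 50_000,  # steps over which epsilon is annealed
}

def epsilon(step: int, cfg=config) -> float:
    # Linear epsilon decay; an exponential schedule is a common alternative.
    frac = min(step / cfg["eps_decay_steps"], 1.0)
    return cfg["eps_start"] + frac * (cfg["eps_end"] - cfg["eps_start"])
```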

After this exploration I wanted to run sweeps, but did not have enough time. The performance could likely be optimized even further. Other hyperparameters, such as the batch size, should also be examined.

Stability

DQN achieves good results relatively quickly, but one problem is stability. After the game is won for the first time, the reward fluctuates strongly. I have not yet been able to find out what exactly causes this. Some modifications, such as the change of the loss function, have reduced these fluctuations, but they are still strong.

stability

Video of the Result