Question: Why does RLlib use a very small standard deviation (0.01) for weight initialization in the final fully connected layer of its default fully connected model?
In all the prior layers it uses a standard deviation of 1.0. In experimenting with this myself, I've found that if I use essentially the same fully connected architecture as the default model, but use Xavier initialization for all my layers, my training results are significantly worse. What is unique or important about the weight initialization scheme used by RLlib, particularly the standard deviations used for the final layers? Thanks in advance!
The default RLlib model I'm referring to is here: https://github.com/ray-project/ray/blob/master/rllib/models/tf/fcnet.py
I suppose it is because it has been observed that small weights in the last layer of the MLP are a good strategy: the policy's initial outputs stay near zero, so the initial action distribution is close to uniform and early exploration isn't dominated by an arbitrary random policy. See the paper "What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study", Section 3.2.
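
To illustrate the effect, here is a small NumPy sketch of a column-normalized Gaussian ("normc"-style) initializer of the kind RLlib's fcnet uses, with std=1.0 for hidden layers and std=0.01 for the output layer. The function and layer shapes here are illustrative, not RLlib's actual code; the point is just that a tiny final-layer std keeps the initial logits near zero:

```python
import numpy as np

def normc_initializer(std=1.0, rng=None):
    # Column-normalized Gaussian init: each output unit's incoming
    # weight vector is rescaled so its L2 norm equals `std`.
    rng = rng if rng is not None else np.random.default_rng(0)
    def init(shape):
        out = rng.standard_normal(shape).astype(np.float32)
        out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
        return out
    return init

# Hypothetical shapes: 64-unit hidden layer, 4 discrete actions.
hidden_w = normc_initializer(1.0)((64, 64))
final_w = normc_initializer(0.01)((64, 4))

x = np.random.default_rng(1).standard_normal((1, 64)).astype(np.float32)
logits = np.tanh(x @ hidden_w) @ final_w

# tanh activations are bounded by 1, and each logit weight column has
# norm 0.01, so every initial logit is tiny and the softmax policy
# starts out close to uniform.
print(np.abs(logits).max())
```

With std=1.0 in the final layer instead, the initial logits can be on the order of the hidden activations themselves, so the untrained policy already commits strongly to arbitrary actions.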