- In the image below, the white ones are activations and the arrows are parameters. In one layer of a neural net, a vector of activations is multiplied by a matrix of parameters. Because the parameters form a matrix, there are a lot of arrows in the image.
- In each mini-batch, we randomly throw away activations, not parameters, each with probability p.
- Batch Norm:
  - Works for continuous variables.
  - Makes the loss landscape less bumpy, which lets you increase the learning rate and reach good results quickly. If the loss landscape is bumpy, you want to keep the learning rate low lest you mistakenly jump off into some awful part of the weight space.
  - Both a regularization technique and a training helper, the latter probably because of the previous point.
- In neural networks, there are only two kinds of numbers: activations and parameters. The latter are learnable.
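
The notes above can be sketched in a few lines of plain Python. This is only an illustration, not how any particular library implements these layers. The function names (`linear`, `dropout`, `batch_norm`) and the toy numbers are made up for the example. Two details are assumptions that the notes do not state: surviving activations are rescaled by 1 / (1 - p) (the common "inverted" variant of dropout), and batch norm carries a learnable scale `gamma` and shift `beta`.

```python
import random


def linear(activations, weights):
    """Multiply a vector of activations by a matrix of parameters."""
    # weights[i][j] is the "arrow" from input activation i to output activation j.
    n_out = len(weights[0])
    return [sum(a * row[j] for a, row in zip(activations, weights))
            for j in range(n_out)]


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; parameters are never touched."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # Assumption: survivors are scaled by 1/keep so the expected value is unchanged.
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


def batch_norm(column, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation across a mini-batch, then scale and shift it."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in column]


rng = random.Random(0)                              # fixed seed for repeatability
weights = [[0.5, -1.0], [1.5, 2.0], [-0.5, 0.25]]   # 3x2 parameters (learnable)
x = [1.0, 2.0, 3.0]                                 # 3 input activations
h = linear(x, weights)                              # 2 output activations
h_dropped = dropout(h, p=0.5, rng=rng)              # some activations may become 0.0
normed = batch_norm([2.0, 4.0, 6.0, 8.0])           # one activation over a batch of 4
```

Note that `dropout` only ever changes the list of activations; `weights` is identical before and after the call. After `batch_norm`, the values for that activation have roughly zero mean and unit variance across the mini-batch, which is the sense in which it keeps continuous values on a well-behaved scale.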