# Bag of Tricks for Image Classification with Convolutional Neural Networks

• This paper collects tricks that have been proposed for training image-classification CNNs, adds a few of its own, and documents how each one affects training time and model accuracy.
• I found the following interesting:
• Increasing the batch size decreases the variance (and therefore the noise) of the stochastic gradient, so a larger learning rate can be used; the paper scales the learning rate linearly with batch size, which reduces training time.
• Learning rates should be kept low initially and ramped up over the first few epochs (warmup); starting at the full rate while the parameters are still far from a good solution can cause numerical instability.
• Apply weight decay only to weights; leave biases and the $\gamma$ and $\beta$ parameters of Batch Norm layers unregularized.
• You can use 16-bit floating point (FP16) arithmetic to decrease training time.
• Cosine decay for the learning rate: anneal the rate from its initial value down to zero along a cosine curve over training.
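The schedule tricks above (linear scaling, warmup, cosine decay) fit in a few lines. A minimal sketch, where the base rate of 0.1 per 256 images and the epoch counts are illustrative values, not necessarily the paper's exact settings:

```python
import math

def learning_rate(epoch, total_epochs=120, warmup_epochs=5,
                  base_lr=0.1, batch_size=256):
    """Linear-scaled base LR with linear warmup, then cosine decay to 0.

    base_lr=0.1 per 256 images and the epoch counts are illustrative.
    """
    lr = base_lr * batch_size / 256  # linear scaling rule
    if epoch < warmup_epochs:
        # warmup: ramp linearly from ~0 up to the target rate
        return lr * (epoch + 1) / warmup_epochs
    # cosine decay over the remaining epochs, ending near 0
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * lr * (1 + math.cos(math.pi * t))
```

The rate rises during warmup, peaks at the scaled base value, and then decays smoothly instead of dropping in steps.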
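The "no bias decay" trick amounts to partitioning parameters into two groups before building the optimizer. A sketch using a name-based heuristic, which is an assumption on my part; real frameworks usually filter on parameter shape or module type instead:

```python
def split_decay_params(param_names):
    """Partition parameters for 'no bias decay': weight decay is applied
    only to weight tensors, not to biases or Batch Norm gamma/beta.

    The suffix-matching heuristic here is illustrative, not standard API.
    """
    decay, no_decay = [], []
    for name in param_names:
        if name.endswith(("bias", "gamma", "beta")):
            no_decay.append(name)  # regularizing these tends to hurt
        else:
            decay.append(name)     # ordinary weights get L2 decay
    return decay, no_decay
```

The `decay` group gets the usual L2 penalty; the `no_decay` group is trained with decay set to zero.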
• I didn’t know about the teacher-student model (knowledge distillation), where the student is trained to imitate the outputs of a stronger teacher. This is different from transfer learning, where one model is used as the starting point for training another.
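The teacher-student idea can be sketched as a combined loss: ordinary cross-entropy against the true label plus a cross-entropy term pulling the student's temperature-softened predictions toward the teacher's. The temperature `T` and mixing weight `alpha` below are illustrative hyperparameters, not values from the paper:

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; larger T gives softer distributions."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      T=2.0, alpha=0.5):
    """Hard cross-entropy on the label, blended with a soft term that
    pushes the student toward the teacher (T and alpha are assumptions)."""
    hard = -math.log(softmax(student_logits)[true_label])
    ps = softmax(student_logits, T)
    pt = softmax(teacher_logits, T)
    soft = -sum(t * math.log(s) for t, s in zip(pt, ps))
    return alpha * hard + (1 - alpha) * soft
```

With `alpha=1.0` this reduces to plain cross-entropy training; lowering `alpha` shifts weight toward imitating the teacher.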
• Mixup training: an augmentation technique that randomly picks two samples and blends both the inputs and the labels with a weighted linear interpolation. I wouldn’t have imagined something like that would work.
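Mixup is only a few lines once samples are represented as flat vectors with one-hot labels. A minimal sketch; `alpha=0.2` for the Beta distribution is a commonly used value, assumed here rather than taken from the paper:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two samples: draw lam ~ Beta(alpha, alpha) and take the
    same convex combination of both the inputs and the one-hot labels.

    Inputs are flat lists of floats; alpha=0.2 is an assumed default.
    """
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

The mixed label is a soft target, e.g. "70% cat, 30% dog", so the model is trained to behave linearly between training examples.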
• I didn’t fully understand:
• Label smoothing.
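For what it’s worth, the mechanics of label smoothing are short to write down, even if the reasons it helps are the subtle part: the one-hot target is replaced by `1 - eps` on the true class with `eps` spread uniformly over the others, so the model is never pushed toward infinitely confident logits. `eps=0.1` is the commonly used value:

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Label smoothing: soften the one-hot target so the true class gets
    1 - eps and the remaining eps is split over the other classes.

    eps=0.1 is the value commonly used in practice.
    """
    off = eps / (num_classes - 1)
    return [1.0 - eps if i == true_class else off
            for i in range(num_classes)]
```

Training against this soft target (still with cross-entropy) keeps the gap between the largest logit and the rest bounded, which acts as a regularizer.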