# Bag of Tricks for Image Classification with Convolutional Neural Networks

- This paper reviews a few *tricks* that have been proposed in the past for image classification using CNNs, proposes a few of its own, and documents how those tricks improve training time or model accuracy.
- I found the following interesting:
- Increasing the batch size during training decreases the variance (and therefore the noise) of the stochastic gradient, so we can raise the learning rate and thereby reduce training time.
- Learning rates should be kept low initially, otherwise they can lead to numerical instability.
- Apply weight decay only to weights, not to biases or to $\gamma$ and $\beta$ in Batch Norm layers.
- You can use 16-bit floating point precision to decrease training time.
- Cosine decay for learning rate.
- I didn’t know about the teacher-student model, where we make the *student* imitate the *teacher*. This is different from transfer learning, where we use one model as the starting point to train another.
- Mixup training: an augmentation technique that randomly chooses two samples and blends them together with a weighted linear interpolation. I couldn’t imagine something like that would work.
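The batch-size/learning-rate relationship above is the linear scaling rule the paper adopts (its ResNet baseline uses lr 0.1 at batch size 256); a minimal sketch, with the helper name my own:

```python
def scaled_lr(base_lr: float, base_batch: int, batch_size: int) -> float:
    """Linear scaling rule: a k-times larger batch gets a k-times larger rate."""
    return base_lr * batch_size / base_batch

# e.g. moving from lr 0.1 at batch 256 to batch 1024
lr = scaled_lr(0.1, 256, 1024)  # 0.4
```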
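Keeping the learning rate low at the start is usually implemented as a linear warmup over the first few updates, then holding the target rate; a sketch (step counts and names are illustrative, not the paper's code):

```python
def warmup_lr(target_lr: float, step: int, warmup_steps: int) -> float:
    """Ramp linearly from near zero to target_lr, then hold at target_lr."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr
```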
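The "no weight decay on biases and Batch Norm parameters" trick amounts to splitting parameters into two optimizer groups. A sketch of one way to do the split; the name-matching heuristic below is an assumption, not the paper's implementation:

```python
def split_decay_groups(named_params):
    """Apply weight decay only to weights; skip biases and BatchNorm params."""
    decay, no_decay = [], []
    for name, param in named_params:
        if name.endswith(".bias") or ".bn" in name:
            no_decay.append((name, param))
        else:
            decay.append((name, param))
    return decay, no_decay

# toy (name, parameter) pairs standing in for a model's named parameters
params = [("conv1.weight", 1), ("conv1.bias", 2),
          ("layer1.bn1.weight", 3), ("fc.weight", 4)]
decay, no_decay = split_decay_groups(params)
```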
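The catch with 16-bit training is that small gradients underflow in FP16; the standard remedy (which the paper also discusses) is to scale the loss up, then unscale in FP32 before the weight update. The effect can be simulated with Python's half-precision `struct` format:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE half precision (simulates fp16 storage)."""
    return struct.unpack('e', struct.pack('e', x))[0]

g = 1e-8                          # a gradient too small for fp16
underflowed = to_fp16(g)          # rounds to 0.0 in half precision

scale = 1024.0                    # loss scaling: multiply before casting
g_scaled = to_fp16(g * scale)     # now representable in fp16
g_recovered = g_scaled / scale    # unscale in full precision before the update
```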
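The cosine decay schedule in the paper is $\eta_t = \frac{1}{2}\left(1 + \cos\frac{t\pi}{T}\right)\eta$, which translates directly:

```python
import math

def cosine_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Cosine decay: start at base_lr, reach ~0 at total_steps."""
    return 0.5 * (1 + math.cos(math.pi * step / total_steps)) * base_lr
```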
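Mixup on plain Python lists looks like this; the paper draws the blend weight $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ and interpolates both the inputs and the one-hot labels (the toy two-feature samples below are mine):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two (input, one-hot label) pairs with lambda ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# blend a sample of class 0 with a sample of class 1; the label becomes soft
x, y = mixup([0.0, 1.0], [1, 0], [1.0, 0.0], [0, 1])
```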

- I didn’t fully understand:
- Label smoothing.
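The mechanics, at least, are simple even if the motivation is subtle: the paper replaces the one-hot target with $1-\varepsilon$ on the true class and $\varepsilon/(K-1)$ on every other class. A sketch:

```python
def smooth_labels(true_class: int, num_classes: int, eps: float = 0.1):
    """Soft target: 1 - eps on the true class, eps/(K-1) spread over the rest."""
    return [1 - eps if i == true_class else eps / (num_classes - 1)
            for i in range(num_classes)]

q = smooth_labels(2, 5)  # class 2 of 5, with eps = 0.1
```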