I am one of those people who function better by writing things down. One day, I realized that most of my notes don’t have to be private, so here they are - my second brain. Be warned: if you stumble upon something here that doesn’t make sense to you, it isn’t meant to!
Today I learned
Fastai: When optimizing with, say, SGD, you calculate the derivative of the loss with respect to each parameter separately and use it to update that particular parameter. Loss is a function of the predictions and the observed targets. Predictions are a function of the model and the inputs. The model is composed of parameters. Therefore, loss is a function of the model’s parameters. When you call loss.backward(), PyTorch calculates the gradients - but where does it store them? Since gradients are derivatives of the loss with respect to the parameters, computed per parameter, it makes sense to expect them inside the parameter tensors, and indeed they are found in parameter.grad. One should zero out the gradients after each mini-batch - using parameter.grad.zero_() - so that gradients from one run don’t carry over into the next (see the sketch below).

Python: ...
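
A minimal sketch of that manual training loop in plain PyTorch, assuming a tiny linear model and mean-squared-error loss; the data, learning rate, and parameter names are purely illustrative:

```python
import torch

# Toy data: roughly y = 3x (illustrative values only).
xs = torch.tensor([[1.0], [2.0], [3.0]])
ys = torch.tensor([[3.0], [6.0], [9.0]])

# Model parameters; requires_grad=True tells PyTorch to track gradients for them.
weights = torch.randn(1, 1, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)
params = [weights, bias]
lr = 0.01

for epoch in range(10):
    preds = xs @ weights + bias        # prediction = f(model, inputs)
    loss = ((preds - ys) ** 2).mean()  # loss = f(prediction, observed)

    loss.backward()  # gradients land in each parameter's .grad attribute

    with torch.no_grad():
        for p in params:
            p -= lr * p.grad  # update each parameter using its own gradient
            p.grad.zero_()    # zero it so this mini-batch doesn't leak into the next
```
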