Today I learned


  1. While optimizing (using, say, SGD), you calculate the derivative of the loss with respect to each parameter and use that derivative to update that particular parameter.
  2. The loss is a function of the predicted and observed values. The prediction is a function of the model and its inputs. The model is composed of parameters. Therefore, the loss is a function of the model parameters.
  3. When you call loss.backward(), PyTorch calculates the gradients, but where does it store them? Since the gradients are derivatives of the loss with respect to the parameters, calculated per parameter, it makes sense to expect them on the parameter tensor itself. Gradients are found in parameter.grad.
  4. One should zero out the gradients after each update on a mini-batch, using parameter.grad.zero_(), because loss.backward() adds to parameter.grad rather than overwriting it. Without zeroing, gradients from one run would affect the next run.
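
The four points above can be seen together in a minimal sketch. The toy data (y = 2x), the single parameter `w`, the learning rate, and the helper `loss_fn` are all assumptions made up for illustration; only `loss.backward()`, `.grad`, and `.grad.zero_()` come from the notes.

```python
import torch

# Hypothetical toy data: y = 2 * x, so the ideal weight is 2.0.
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])

# A single model parameter; requires_grad=True tells PyTorch to track it.
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # assumed learning rate


def loss_fn(weight):
    """Mean squared error: the loss is a function of the parameter."""
    pred = weight * x
    return ((pred - y) ** 2).mean()


# backward() stores d(loss)/d(w) on the parameter itself, in w.grad.
loss_fn(w).backward()
first_grad = w.grad.item()

# A second backward() without zeroing adds to w.grad instead of replacing it.
loss_fn(w).backward()
accumulated_grad = w.grad.item()

# Reset before starting the real training loop.
w.grad.zero_()

for _ in range(50):
    loss = loss_fn(w)
    loss.backward()
    with torch.no_grad():      # the update itself should not be tracked
        w -= lr * w.grad       # plain gradient-descent step
    w.grad.zero_()             # so this step does not leak into the next

final_w = w.item()
```

Here `accumulated_grad` ends up at twice `first_grad`, which is the accumulation that zeroing guards against. In practice an optimizer object usually handles the update and the zeroing, but doing it by hand shows where the gradients live.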


  1. Tuple vs list:

    print(type([1, 2, 3, 4]))
    print(type((1, 2, 3, 4)))
    # Output:
    # <class 'list'>
    # <class 'tuple'>
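
Beyond the type names, the practical difference is mutability: a list can be changed in place, a tuple cannot. A small sketch, with variable names chosen only for illustration:

```python
numbers_list = [1, 2, 3, 4]
numbers_tuple = (1, 2, 3, 4)

# Lists are mutable: item assignment works.
numbers_list[0] = 10

# Tuples are immutable: item assignment raises TypeError.
try:
    numbers_tuple[0] = 10
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

# Because a tuple of hashable items is itself hashable,
# it can be used as a dictionary key; a list cannot.
lookup = {numbers_tuple: "found"}
```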


  1. Taipei, Taiwan, would probably be a good place to live if I were to start working remotely.