Today I learned

Deep Learning: Whenever you define a neural network, either as a vanilla Python function or as an nn.Module, it should take a mini-batch of training data as input. For example, for a net that works on images, the input will be a tensor X whose shape could be [256, 1, 28, 28]. Here, 256 is the number of items in the mini-batch, 1 is the number of channels (for grayscale images), and 28 x 28 is the size of each image. The first element of the input shape is always the batch size. The loss function should return a scalar (or a tensor of size 1), because PyTorch's backward function, called with no arguments, only works on scalars.
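
A minimal sketch of both points, assuming a hypothetical SimpleNet and random data in place of a real dataset:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical toy net for 28x28 grayscale images; names and sizes are illustrative.
    class SimpleNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(28 * 28, 10)

        def forward(self, x):             # x: [batch, 1, 28, 28]
            x = x.view(x.shape[0], -1)    # flatten everything except the batch dimension
            return self.fc(x)             # [batch, 10]

    net = SimpleNet()
    X = torch.randn(256, 1, 28, 28)       # a mini-batch of 256 images
    y = torch.randint(0, 10, (256,))      # fake labels
    loss = F.cross_entropy(net(X), y)     # reduces to a single scalar
    loss.backward()                       # works because loss is a scalar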

September 9, 2020

Today I learned

Statistics: While the following from statistics is too fundamental to be interesting, I struggled (and failed) to fully grasp it a few years ago. Hopefully, it will stick this time! Anyway, before I forget: Standard deviation is the square root of variance. Variance is the average of the squared differences between the observed values and the mean. Standard deviation is a measure of how spread out the observed values are. A normal or Gaussian distribution is simply a bell curve, something that is fairly common in the real world. For such a distribution, 68% of the values lie within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3. Standard normal distribution: a (special) normal distribution with a mean of 0 and a standard deviation of 1. The y-axis of such a graph is just numbers, but the x-axis is called the Z-score. You can convert a normal distribution to a standard normal distribution by calculating the Z-score as: (observed-value - mean) / standard-deviation. The area under the curve is 1. If you search online, you'll find a table that, for a given Z-score, gives the area of the curve to the left of that score. That answers questions like: what's the probability that a given value X is less than something? Resources: ...
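
A quick sketch of the conversion, using Python's built-in statistics.NormalDist and made-up numbers (a distribution with mean 170 and standard deviation 10, not anything from the post):

    from statistics import NormalDist

    mean, sd = 170, 10
    x = 185

    z = (x - mean) / sd              # Z-score: how many standard deviations above the mean
    p = NormalDist(mean, sd).cdf(x)  # area under the curve to the left of x
    print(z, p)                      # 1.5, ~0.933 -> P(X < 185) is about 93.3%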

August 30, 2020

Today I learned

Fastai: Initially, when you start training, accuracy keeps going up. However, after some time the model starts memorizing the training data and validation-set accuracy decreases. That is when the model starts to overfit. The model looks at every image exactly once in each epoch. An architecture (e.g., ResNet) is a template for a mathematical function. When trained, it turns into a model whose parameters, possibly millions of them, are tuned to tackle the problem at hand. Train = fit.

August 27, 2020

Today I learned

Fastai: Dense blocks use more memory but are a good choice for lower-resolution images. In CNNs, you usually double the number of channels whenever you use stride 2. Otherwise, with stride 1, you keep the number of channels the same. (I say usually because I don't know if this is a universal recommendation.)
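
A sketch of that pattern in PyTorch, with the channel counts and image size made up for illustration:

    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),   # stride 1: channels stay at 64
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # stride 2: spatial size halves, channels double
    )

    x = torch.randn(1, 64, 56, 56)
    print(block(x).shape)   # torch.Size([1, 128, 28, 28])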

July 4, 2020

Today I learned

Fastai: U-Nets are useful when the size of your output is similar to that of the input, so any kind of generative modeling such as segmentation or decrappification. They don't make sense for classification problems. U-Nets have a downsampling path and an upsampling path; these parts are also called the encoder and the decoder, respectively. A layer in a neural network is typically an affine function, such as a matrix multiplication, followed by a non-linearity, such as a ReLU.
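
A minimal sketch of the "affine function followed by a non-linearity" idea in PyTorch, with made-up layer sizes:

    import torch.nn as nn

    layer = nn.Sequential(
        nn.Linear(784, 100),  # affine: matrix multiplication plus a bias
        nn.ReLU(),            # non-linearity
    )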

June 24, 2020

Today I learned

On the AI/Fastai front: If x is a torch.Tensor and x.shape is torch.Size([4, 3, 5]), it means x contains 4 2D matrices, each with 3 rows and 5 columns. In other words, x is a 3D array, the technical term for which is a rank-3 tensor. When a single image is represented as a PyTorch tensor, the first dimension corresponds to the number of channels. (For example, an RGB image contains 3 channels and, therefore, the first element of the tensor's shape will be 3.) Matplotlib's imshow function expects the number of channels at the end, and you can rearrange a tensor to that layout with the permute function.
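
A small sketch of that, using a random tensor in place of a real image:

    import torch
    import matplotlib.pyplot as plt

    # A fake 3-channel "image": 3 channels, 5 rows, 4 columns.
    x = torch.rand(3, 5, 4)
    print(x.shape)                  # torch.Size([3, 5, 4]) -- a rank-3 tensor
    plt.imshow(x.permute(1, 2, 0))  # channels moved to the end: [5, 4, 3], as imshow expects
    plt.show()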

June 10, 2020

Today I learned

Pandas: If df2 is a DataFrame, df2['A'] returns a Series, whereas both df2[['A']] and df2[['A', 'B']] return a DataFrame. Fastai: Embeddings are indeed created for categorical variables. For example, in the Excel sheet on collaborative filtering, Jeremy had created embeddings for user-ids and movie-ids, both of which were categorical variables. Let's say we use stride 2 in a CNN. While the output height and width will halve, we still double the number of output channels, probably because the output channels detect various features of the images and we care more about capturing those features.
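
A quick check of the Pandas indexing behaviour, on a tiny made-up DataFrame:

    import pandas as pd

    df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

    print(type(df2['A']))          # <class 'pandas.core.series.Series'>
    print(type(df2[['A']]))        # <class 'pandas.core.frame.DataFrame'>
    print(type(df2[['A', 'B']]))   # <class 'pandas.core.frame.DataFrame'>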

May 30, 2020

Today I learned

Fastai: Validation vs test data: Validation set: used actively during training to calculate the loss and metrics on data the model isn't trained on, so you can tell when it starts overfitting. Test set: (at least in Fastai) it is used unlabeled, the assumption being that we need to predict on test data using our trained model. Force GC (i.e., garbage collection) in Python:

    learn = None
    import gc
    gc.collect()

May 27, 2020

Today I learned

Fastai: While optimizing (using, say, SGD), you calculate the derivative of the loss with respect to each parameter separately and thereby update that particular parameter. Loss is a function of the prediction and the observed value. The prediction is a function of the model and the inputs. The model is composed of parameters. Therefore, the loss is a function of the model's parameters. When you call loss.backward(), PyTorch calculates the gradients - but where does it store them? Since gradients are derivatives of the loss function with respect to the parameters and are calculated per parameter, it makes sense to expect them inside the parameter tensor; they are found in parameter.grad. One should zero out the gradients after each run on a mini-batch - using parameter.grad.zero_() - so that gradients from one run don't accumulate into the next. Python: ...
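
A bare-bones sketch of one SGD step on a single parameter tensor, with random data standing in for a real mini-batch:

    import torch

    params = torch.randn(3, requires_grad=True)   # the model's parameters (made up)
    X, y = torch.randn(10, 3), torch.randn(10)    # a fake mini-batch of inputs and observed values
    lr = 0.1

    preds = X @ params                  # prediction is a function of the model (params) and inputs
    loss = ((preds - y) ** 2).mean()    # loss is a function of prediction and observed values
    loss.backward()                     # gradients land in params.grad

    with torch.no_grad():
        params -= lr * params.grad      # update each parameter using its own gradient
    params.grad.zero_()                 # zero out so this mini-batch doesn't leak into the next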

May 25, 2020

Today I learned

Fastai: While updating weights, we multiply the learning rate by the derivative of the loss function with respect to the weights. The loss is a function of the independent variables, X, and the weights. The cross-entropy loss function is useful for classification problems where you don't care about how close your prediction was. Softmax is an activation function that squashes your outputs so that each lies between 0 and 1 (and they sum to 1), somewhat like the Sigmoid, which confines output to a range. Some discussion on when to use one or the other of these: Softmax vs Sigmoid function in Logistic classifier? You generally want cross-entropy loss and Softmax for single-label multi-class classification problems; they go well together. Regularization techniques allow you to avoid over-fitting: weight decay, dropout, batch norm, data augmentation. One way to avoid over-fitting is to use fewer parameters. However, Jeremy proposes that we instead use a lot of parameters but penalize complexity. Weight decay is one way to do the latter.
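
A small sketch of how softmax and cross-entropy fit together in PyTorch, with made-up logits:

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw model outputs for one item, 3 classes
    probs = torch.softmax(logits, dim=1)       # each between 0 and 1, summing to 1
    target = torch.tensor([0])                 # the correct class index

    loss = F.cross_entropy(logits, target)     # takes raw logits; applies log-softmax internally
    print(probs, loss)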

May 24, 2020