Bag of Tricks for Image Classification with Convolutional Neural Networks

Bag of Tricks for Image Classification with Convolutional Neural Networks: This paper reviews a few tricks that have been proposed in the past for image classification using CNNs and proposes a few of its own. It then documents how those tricks improve training time or model accuracy. I found the following interesting:

- Increasing the batch size during training decreases the variance (and, therefore, the noise) of the stochastic gradient, so we can increase the learning rate and thereby reduce training time.
- The learning rate should be kept low initially (warmup), otherwise it can lead to numerical instability.
- Apply weight decay only to weights, not to biases or to $\gamma$ and $\beta$ in Batch Norm layers (see the parameter-grouping sketch after this list).
- You can use 16-bit floating point precision to decrease training time.
- Cosine decay for the learning rate (a sketch of a warmup-plus-cosine schedule is also below).
- I didn't know about the teacher-student model (knowledge distillation), where we make the student imitate the teacher. This is different from transfer learning, where we use one model as the starting point to train another.
- Mixup training: an augmentation technique that randomly chooses two samples and blends them together, both inputs and labels, with a weighted linear interpolation (sketched below as well). I couldn't imagine something like that would work.
- I didn't fully understand: label smoothing.
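
As a note to myself, here's one common way to set up that weight-decay split in PyTorch, using the heuristic that biases and Batch Norm $\gamma$/$\beta$ are 1-D tensors; the heuristic and the hyperparameter values are my own illustration, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

def sgd_without_decay_on_norm_and_bias(model, lr=0.1, weight_decay=1e-4):
    """Apply weight decay only to weight tensors, not to biases or Batch Norm
    gamma/beta. lr and weight_decay are illustrative values, not the paper's."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and Batch Norm parameters are 1-D; convolution / linear
        # weights have two or more dimensions.
        if param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.SGD(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
        momentum=0.9,
    )

# Toy usage.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
optimizer = sgd_without_decay_on_norm_and_bias(model)
```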
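
And a minimal sketch of a learning-rate schedule with linear warmup followed by cosine decay; the base rate, warmup length, and epoch count are made-up values for illustration, not the paper's settings:

```python
import math

def lr_at_epoch(epoch, total_epochs, base_lr=0.1, warmup_epochs=5):
    """Linear warmup for the first few epochs, then cosine decay to zero.

    base_lr and warmup_epochs are illustrative values, not the paper's.
    """
    if epoch < warmup_epochs:
        # Ramp the learning rate up linearly from ~0 to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example: print the schedule for a 120-epoch run.
for e in range(0, 120, 10):
    print(e, round(lr_at_epoch(e, total_epochs=120), 4))
```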
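
Finally, a small NumPy sketch of mixup as I understood it: a blending weight is drawn from a Beta distribution and applied to both the inputs and their one-hot labels (the alpha value is an arbitrary choice here):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two samples (images and one-hot labels) with a Beta-sampled weight.

    alpha=0.2 is just an illustrative choice of the Beta parameter.
    """
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Toy usage with fake "images" and one-hot labels.
img_a, img_b = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
lab_a, lab_b = np.eye(10)[3], np.eye(10)[7]
mixed_img, mixed_lab = mixup(img_a, lab_a, img_b, lab_b)
```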

December 26, 2020

DeepAnT - A Deep Learning Approach for Unsupervised Anomaly Detection in Time Series

DeepAnT: A Deep Learning Approach for Unsupervised Anomaly Detection in Time Series: It's a long paper but, at its core, it presents a straightforward idea: use a CNN as a forecasting model to predict the value at a given timestamp, calculate an anomaly score by comparing the prediction against the observed value, and call the time series anomalous (at that timestamp) if the score is higher than some threshold. I found two problems with their proposal for calculating anomaly scores: It only uses the absolute difference and doesn't take variance into account; if that were fixed, shouldn't the anomaly score be inversely proportional to the variance? And how do I interpret that score in business terminology? They used Mean Absolute Error as the loss function for the model. Why not Mean Squared Error? The idea of choosing time-series-specific thresholds would be tricky, and the paper doesn't clarify how to automate that at scale.
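
To make the scoring step concrete for myself, here's a minimal sketch of it as described above, assuming a univariate series, an already-computed forecast, and a hand-picked threshold (the numbers are placeholders, not anything from the paper):

```python
import numpy as np

def anomaly_scores(predictions, observations):
    """Score each timestamp by the absolute difference between forecast and observation."""
    return np.abs(np.asarray(predictions) - np.asarray(observations))

def flag_anomalies(predictions, observations, threshold):
    """Mark a timestamp as anomalous when its score exceeds the threshold."""
    return anomaly_scores(predictions, observations) > threshold

# Toy example with a placeholder forecast and a hand-picked threshold.
observed = np.array([10.1, 10.3, 9.9, 25.0, 10.2])
forecast = np.array([10.0, 10.2, 10.1, 10.1, 10.0])
print(flag_anomalies(forecast, observed, threshold=3.0))
# -> [False False False  True False]
```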

December 26, 2020

Time Series Anomaly Detection Using Convolutional Neural Networks and Transfer Learning

Quick notes after reading Time Series Anomaly Detection Using Convolutional Neural Networks and Transfer Learning: When I read about their idea of using U-Nets for anomaly detection, I thought: well, that's cool, but where will you find all that labeled data? It was (pleasantly) surprising when they later mentioned that they generated synthetic data themselves, for all the types of time series and anomalies they wanted to train the model on. Instead of plain change-point detection (i.e. whether there was a short spike or dip) or change-of-trend detection, their U-Net-based approach allows them to go deeper: they can do multi-class detection (i.e. multiple types of anomalies) with both single and multiple labels per sub-sequence of a time series. Impressive! The paper lacked details around input normalization (based on the scale of the time series) and data augmentation. They also talked about up-sampling and down-sampling the data to fit the 1024-length sequences their model expects, but I didn't understand how they do that. I couldn't find their code on the internet, which is a pity, because it could have filled in the holes left in the text.
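
Since the paper doesn't spell out the resampling step, here's my guess at one plausible way to do it with linear interpolation; this is purely my own sketch of how a series could be stretched or squeezed to the fixed length of 1024, not the authors' method:

```python
import numpy as np

def resample_to_length(series, target_len=1024):
    """Linearly interpolate a 1-D series onto a fixed number of points.

    Handles both up-sampling (series shorter than target_len) and
    down-sampling (series longer than target_len). This is only my guess
    at the kind of resampling the paper might use.
    """
    series = np.asarray(series, dtype=float)
    old_positions = np.linspace(0.0, 1.0, num=len(series))
    new_positions = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_positions, old_positions, series)

# Example: a 300-point series stretched to 1024 points.
short_series = np.sin(np.linspace(0, 6 * np.pi, 300))
print(resample_to_length(short_series).shape)  # (1024,)
```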

December 25, 2020