Intuitive understanding of regularization in deep learning

In machine learning, regularization is a way to combat high variance: in other words, the problem of a model learning to reproduce the data rather than the underlying semantics of the problem. By analogy with human learning, the idea is to design homework questions that test and build knowledge rather than encourage rote learning: for example, memorizing multiplication tables as opposed to learning how to multiply.

This phenomenon is particularly common in neural network learning: the greater the learning capacity, the greater the likelihood of memorization, and it is up to us practitioners to guide deep learning models to absorb our problem, not our data. Many of you have come across these methods in the past and may have developed an intuitive understanding of how different regularization methods affect the results. For those of you who have not, this article provides an intuitive guide to how regularization shapes the parameters of a neural network. Visualizing these aspects is important because it’s easy to take many concepts for granted; the graphs in this article and their explanations will help you visualize what actually happens to the model parameters as you add regularization.

In this article, I’ll take L2 and dropout as the standard forms of regularization. I’m not going to discuss other methods that change the way the model works.

One of the core principles of deep learning is the ability of deep neural networks to act as universal function approximators. Whatever you are interested in (disease transmission, autonomous driving, astronomy, and so on) can be compressed and expressed through a self-learning model. What an amazing idea! Whether or not the problem you are interested in can actually be expressed by some analytic function f, when you tune a machine learning model through training, the parameters θ adopted by the model allow it to approximately learn f*.

For demonstration purposes, we’ll look at some relatively simple data: ideally, something in one dimension that is complex enough to make old-fashioned curve fitting painful, but not so complex that abstraction and understanding become difficult. I’m going to create a complex function that simulates a periodic signal, but with something interesting added. The function implements an equation in which a, b, and c are random numbers sampled from different Gaussian distributions. The effect of these values is to add lags between very similar functions, so that they add together randomly to produce very different values of f. We will also add white noise to the data to simulate the effect of collecting data.

Let’s visualize a randomly generated sample of the data. For the rest of this article, we’ll use a small neural network to reproduce this curve.

For model training, we will split the data into training and validation sets. To do this, I will use the very convenient train_test_split function from sklearn.model_selection. Let’s plot the training and validation sets.

As we can see in the figure, both sets do a fairly good job of representing the entire curve: if we removed either one of them, we could still recover more or less the same picture of the data. This is a very important aspect of cross-validation!

Now that we have a dataset, we need a relatively simple model to try to reproduce it. To this end, we will work with a four-layer neural network that takes a single input value and produces a single output value, with three hidden layers of 64 neurons each.

For convenience, each hidden layer has a LeakyReLU activation, and the output has a ReLU activation. In principle these choices should matter little, but during testing the model sometimes failed to learn some of the “complex” features, especially when using easily saturated activation functions such as tanh and sigmoid.
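
As a rough sketch of the network described above, the snippet below builds the same shape (one input, three hidden layers of 64 units with LeakyReLU, one ReLU output) using only the standard library. The helper names, the LeakyReLU slope of 0.01, and the uniform initialization range of [-0.5, 0.5] are assumptions for illustration; the article does not state these values, and the original model was presumably built with a deep learning framework rather than by hand.

```python
import random


def leaky_relu(x, slope=0.01):
    """LeakyReLU with an assumed negative slope of 0.01."""
    return x if x > 0.0 else slope * x


def relu(x):
    return x if x > 0.0 else 0.0


def init_layer(n_in, n_out, rng, limit=0.5):
    """Weights and biases drawn uniformly from [-limit, limit] (assumed range)."""
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-limit, limit) for _ in range(n_out)]
    return weights, biases


def dense(inputs, layer, activation):
    """Fully connected layer: activation(W @ inputs + b)."""
    weights, biases = layer
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def build_model(rng, hidden=64, n_hidden_layers=3):
    """Layer sizes 1 -> 64 -> 64 -> 64 -> 1."""
    sizes = [1] + [hidden] * n_hidden_layers + [1]
    return [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def predict(model, x):
    """Forward pass for a single scalar input."""
    values = [x]
    for layer in model[:-1]:
        values = dense(values, layer, leaky_relu)
    return dense(values, model[-1], relu)[0]


def count_parameters(model):
    """Total number of weights and biases, i.e. the size of theta."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in model)


model = build_model(random.Random(0))
print(count_parameters(model))  # 128 + 4160 + 4160 + 65 = 8513
print(predict(model, 0.5))
```

Even this small network has 8513 parameters, which is why looking at their distribution is more practical than inspecting individual values.
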
For this article, the details of the model are not important: what matters is that it is a fully connected neural network with the capacity to learn to approximate some function.

To show that the model works, I performed the usual training/validation cycle with a mean squared error loss and the Adam optimizer, without any form of regularization, and ended up with the following result:

Now, I can hear you asking: if the model works well, why should I do any regularization? For this demonstration, it doesn’t matter whether our model is overfitting: what I want to convey is how regularization affects a model; in our case, it can even have an adverse effect on a perfectly working model. In a sense, you can take this as a warning: deal with overfitting when you encounter it, but not before. In Donald Knuth’s words, “premature optimization is the root of all evil.”

Now that we have all the boilerplate out of the way, we can get to the core of the article! Our focus is to build an intuitive understanding of regularization, that is, how different regularization methods affect our simple model in three respects: (1) what happens to the training/validation loss, (2) what happens to the performance of our model, and (3) what the actual parameters look like.

Although the first two points are straightforward, many people may not be familiar with how to quantify the third. In this demonstration, I’m going to use kernel density estimation to measure the change in parameter values: those familiar with TensorBoard will have seen such graphs; for those who haven’t, they can be thought of as sophisticated histograms. The goal is to visualize how our model parameters change with regularization.
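
To make the kernel density idea concrete, here is a minimal sketch of a Gaussian kernel density estimate over a flat list of parameter values. The function name `kde_at`, the bandwidth of 0.05, and the uniformly sampled stand-in for θ are assumptions for illustration; the article does not say which KDE implementation or bandwidth was used.

```python
import math
import random


def kde_at(point, samples, bandwidth):
    """Gaussian kernel density estimate evaluated at a single point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((point - s) / bandwidth) ** 2) for s in samples)
    return norm * total


rng = random.Random(0)
# Stand-in for freshly initialized parameters: uniform on [-0.5, 0.5].
theta = [rng.uniform(-0.5, 0.5) for _ in range(2000)]

# Evaluate the density on a regular grid from -1 to 1.
step = 0.01
grid = [-1.0 + step * i for i in range(201)]
density = [kde_at(x, theta, bandwidth=0.05) for x in grid]

# Like a normalized histogram, the estimate integrates to roughly 1.
area = sum(d * step for d in density)
print(round(area, 2))
```

Plotting `density` against `grid` would give a smooth curve in place of histogram bars: flat in the middle and falling off at the edges for uniformly initialized values.
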
The figure below shows the difference in the distribution of θ before and after training:

The blue curve is labeled “uniform” because it represents the model parameters as initialized from a uniform distribution: you can see that this is basically a top-hat function, with equal probability across the center. This is in sharp contrast to the model parameters after training: once trained, the model needs non-uniform values of θ to express our function.

One of the most direct methods of regularization is the so-called L2 regularization: L2 refers to the L2 norm of the parameter matrix. From linear algebra, the norm of a matrix A with entries a_ij is:

$$\|A\|_{p,q} = \left( \sum_{j=1}^{n} \left( \sum_{i=1}^{m} |a_{ij}|^{p} \right)^{q/p} \right)^{1/q}$$

In machine learning before neural networks, parameters were usually expressed as vectors rather than matrices or tensors, in which case this is just the Euclidean norm. In deep learning, we usually deal with matrices and high-dimensional tensors, and the Euclidean norm does not extend well to them. The L2 norm is actually a special case of the above equation with p = q = 2, called the Frobenius or Hilbert-Schmidt norm, which can be generalized to infinite dimensions.

The cost function J is then defined as the MSE loss plus the L2 norm of the parameters. The contribution of the L2 norm to the cost is multiplied by a prefactor λ; this is known in many implementations as the “weight decay” hyperparameter, typically between 0 and 1. Because it controls the amount of regularization, we need to understand how it affects our model!

In a series of experiments, we will repeat the same training/validation/visualization cycle as before, but now over a range of λ values. First of all, how does it affect our training?

Let’s break this down. Deeper shades of red correspond to larger values of λ, and the traces show the training loss as the log of the MSE loss. Remember that in our unregularized model these curves decrease monotonically. Here, as we increase λ, the final training error increases considerably, and the early reduction in loss is also less pronounced.
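
Here is a minimal sketch of the regularized cost described above, assuming the common form J = MSE + λ · Σθ². The squared norm and the absence of a 1/2 factor are assumptions, since the article's exact equation is not available here, and the helper names `l2_penalized_cost` and `sgd_step_with_l2` are hypothetical. The gradient step hints at why the hyperparameter is often called weight decay: even when the data term contributes no gradient, every parameter is pulled toward zero, and more strongly for larger λ.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def l2_penalized_cost(predictions, targets, theta, lam):
    """J = MSE + lam * sum(theta_i ** 2), using the squared L2 norm (assumed form)."""
    return mse(predictions, targets) + lam * sum(w ** 2 for w in theta)


def sgd_step_with_l2(theta, data_grads, lam, lr):
    """One gradient step; the penalty adds 2 * lam * w to each parameter's gradient."""
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(theta, data_grads)]


theta = [0.8, -1.5, 0.3]
zero_grads = [0.0, 0.0, 0.0]  # pretend the data term is already at a minimum

for lam in (0.0, 0.01, 1.0):
    shrunk = sgd_step_with_l2(theta, zero_grads, lam=lam, lr=0.1)
    print(lam, [round(w, 3) for w in shrunk])
```

With λ = 0 the parameters stay where they are, while with λ = 1.0 a single step with this learning rate already scales each one by 0.8.
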
What happens when we try to use these models to predict our function?

We can see that for very small values of λ, the function is still expressed well. The turning point appears to be around λ = 0.01, where the qualitative shape of the curve is reproduced but not the actual data points. For λ > 0.01, the model simply predicts the mean of the entire dataset. If we interpret this in light of our training losses, it is no surprise that the losses stagnate.

We can see that the spread of parameter values is greatly suppressed as λ goes from low to high. At λ = 1.0, the distribution of θ looks like a Dirac δ function at 0. From this, we can conclude that L2 regularization acts to constrain the parameter space, forcing θ to be very sparse and close to zero.

Another popular and cost-efficient regularization method is to include dropout in the model. The idea is that on every pass through the model, a number of neurons are deactivated by setting their weights to 0 with probability P. In other words, we apply a Boolean mask to the parameters, so that a different set of units is active each time data passes through. The rationale behind this is to distribute what the model learns across the whole network, rather than in one or two specific layers or neurons.

In our experiments, we will add a dropout layer between each hidden layer and adjust the dropout probability P from 0 to 1. In the former case we should have an unregularized model, while in the latter case our learning capacity should be reduced, because every hidden layer is deactivated.

We see a very similar effect to L2 regularization: overall, the learning capacity of the model decreases, and the final loss grows as the dropout probability increases.

As shown in the figure, we gradually increase the dropout probability.
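
Before walking through the predictions, here is a minimal sketch of the Boolean mask described above, applied to the outputs of one hidden layer (which is how deep learning frameworks commonly implement dropout). The function name `dropout`, the parameter name `p_drop` for the dropout probability P, and the rescaling of surviving units by 1 / (1 - P) (the "inverted dropout" convention) are assumptions; the article does not say which variant was used.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop and rescale the survivors."""
    if not training or p_drop == 0.0:
        # No masking at evaluation time or when dropout is switched off.
        return list(activations)
    if p_drop >= 1.0:
        # Every unit is dropped, so no signal passes through the layer.
        return [0.0 for _ in activations]
    keep = 1.0 - p_drop
    # A fresh Boolean mask is drawn on every call, i.e. on every forward pass.
    mask = [rng.random() < keep for _ in activations]
    return [a / keep if m else 0.0 for a, m in zip(activations, mask)]


rng = random.Random(0)
hidden = [1.0] * 64  # stand-in for the outputs of a 64-unit hidden layer

for p_drop in (0.0, 0.2, 0.6, 1.0):
    out = dropout(hidden, p_drop, rng)
    active = sum(1 for v in out if v != 0.0)
    print(p_drop, active)
```

The number of active units shrinks as the probability grows, and at P = 1.0 nothing gets through, which matches the expectation that the model cannot learn in that limit.
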
Starting with P = 0.1, we can see that our model begins to become quite unreliable in its predictions: most interestingly, it seems to approximately track our data, including the noise!

At P = 0.2 and 0.3, this is more evident at x = 11; recall that our unregularized model had difficulty getting this region of the function right. We see that the predictions with dropout actually make this region incredibly blurry, almost as if the model is telling us that it is uncertain!

From P = 0.4 onward, the capacity of the model appears to be so restricted that it can barely reproduce any of the curve apart from the first part. At P = 0.6, the predictions appear to approach the mean of the dataset, which also seems to happen for large values of L2 regularization.

Compare this result with our L2 norm results: with dropout, our parameter distributions are broader, which increases our model’s capacity for expression. Except for P = 1.0, the actual value of the dropout probability has little effect, if any, on the distribution of the parameters. At P = 1.0, our model does not learn anything and simply resembles the uniform distribution. At lower values of P, the model still manages to learn, albeit at a reduced rate.

From our simple experiments, I hope you have developed some intuition for how these two regularization methods affect neural network models from the three angles we explored.

L2 regularization is very simple, with only one hyperparameter to tune. As we increase the weight of the L2 penalty, the model capacity decreases rapidly for large values because of the change in parameter space. For smaller values, you may not even see any change in the model predictions.