The demo program trains a first model using the back-propagation algorithm without L2 regularization. The L2 norm measures the distance of a vector from the origin of the vector space. I also personally believe that we don't have to stick to the logistic sigmoid or tanh as activation functions.
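As a minimal sketch of what that means in NumPy (the vector v here is made up purely for illustration):

```python
import numpy as np

v = np.array([3.0, -4.0])

# L2 norm: the Euclidean distance of the vector from the origin
l2_manual = np.sqrt(np.sum(v ** 2))     # sqrt(3^2 + (-4)^2) = 5.0
l2_numpy = np.linalg.norm(v, ord=2)     # same value via numpy.linalg.norm

# L1 norm for comparison: sum of absolute values
l1_numpy = np.linalg.norm(v, ord=1)     # |3| + |-4| = 7.0

print(l2_manual, l2_numpy, l1_numpy)
```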
I have always been interested in different kinds of cost functions and regularization techniques, so today I will implement different combinations of loss function and regularization to see which performs best. If you want to know why we need activation functions, please read my other blog post; think of the activation function used here as being just like tanh(), but with a wider range.

In L1 normalization we normalize each sample (row) so that the absolute values of its elements sum to 1. In Python, the NumPy library has a linear algebra module with a method named norm(), which takes two arguments: the input vector whose norm is to be calculated, and the declaration of which norm to use (the ord parameter). This function can return one of eight different matrix norms, or one of an infinite number of vector norms, depending on the value of ord.

So I won't add anything more; now let's take a look at the regularizers. Again, the red boxes from top to bottom represent L1 regularization and L2 regularization. The parts written in red marker are the places where we BREAK THE RULE of taking the derivative of the absolute function!

If you look at where the green star is located, you can see that the red regression line's accuracy falls dramatically. Also, one thing to note is where the blue star lies: most of the models fail to predict the right value of Y at the beginning of X, which was very interesting to me. See below for the exact difference.

As seen above, rather than following the strict rule of derivation, I just adjusted the derivative of the cost function to be (Layer_4_act - Y)/m. I think when it comes to deep learning, creativity sometimes gives better results; I am not sure, but Dr. Hinton did something with randomly decreasing weights in back propagation and still achieved good results. The number of hidden nodes is a free parameter and must be determined by trial and error.

Now, let's take a look at the absolute sum of the weights. Overall it becomes very clear that the models with regularization have much smaller weights.
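To make the red-box idea concrete, here is a small sketch of how the loosened cost derivative and the two penalty gradients could look in NumPy; the names layer_4_act, Y, W, alpha, and m are my own placeholders, not the exact variables from the original code.

```python
import numpy as np

def d_cost(layer_4_act, Y, m):
    # loosened derivative of the cost w.r.t. the final activation: (prediction - target) / m
    return (layer_4_act - Y) / m

def l2_reg_grad(W, alpha, m):
    # L2 regularization adds (alpha / m) * W to the weight gradient
    return (alpha / m) * W

def l1_reg_grad(W, alpha, m):
    # L1 regularization adds (alpha / m) * sign(W), which is where the
    # derivative of the absolute function (and the rule-breaking) comes in
    return (alpha / m) * np.sign(W)
```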
However, since I have to derive the derivative (for back propagation), I will touch on one thing. As seen above, the derivative of the absolute function has three different cases: when X > 0, X < 0, and X = 0. Since we can't just let the gradient be 'undefined', I BREAK THIS RULE. Above is the function that I will use to calculate the derivative at a value X. Simply put, rather than following the strict derivative of the absolute function, I loosen up the derivative a bit. And as seen above, I don't have the second option; we merged that into the first option.

Above is the code that creates the training data and declares some noise, as well as the learning rate and the alpha value (these are for the regularization). There is nothing special about the network architecture, simply put, and the weights have appropriate dimensions to perform the transformations between the layers. Neural Network L2 Regularization in Action: the demo program creates a neural network with 10 input nodes, 8 hidden processing nodes, and 4 output nodes.

In L2 normalization we normalize each sample (row) so that its squared elements sum to 1. numpy.linalg.norm(x, ord=None, axis=None, keepdims=False) computes a matrix or vector norm. Crystal clear, nothing for me to add.

As expected, the networks with regularization were the most robust to noise. However, the model with the pure L1 norm cost function changed the least, but there is a catch!
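Here is a minimal sketch of that loosened derivative, assuming the undefined case at X = 0 is simply merged into the positive branch (that merge is my reading of "we merged that into the first option"):

```python
import numpy as np

def d_abs(x):
    # strict derivative of |x|: +1 if x > 0, -1 if x < 0, undefined at x = 0;
    # we "break the rule" and treat x = 0 as if it were positive
    return np.where(x >= 0, 1.0, -1.0)

print(d_abs(np.array([-2.0, 0.0, 3.0])))   # [-1.  1.  1.]
```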
The ord argument can be set to 1 for the L1 norm, 2 for the L2 norm, and inf for the vector max norm. The result is a positive distance value. The L2 norm is also known as the Euclidean norm, since it is calculated as the Euclidean distance from the origin. Let's do another example for L1 normalization (where X is the same as above)!

We'll also take a look at the absolute sum of each model's weights to see how small the weights became. I think the above explanation is the simplest yet most effective explanation of both cost functions. Among them, the L1 cost function with L2 regularization had the smallest weight values. Where and how did I get the above result?
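As a small illustrative sketch of the two row-wise normalizations and the absolute weight sum (the matrix X and the weight matrix W below are made up, not the ones from the post):

```python
import numpy as np

X = np.array([[1.0, -2.0,  3.0],
              [4.0,  0.0, -4.0]])

# L1-normalize each sample (row): absolute values of each row sum to 1
X_l1 = X / np.abs(X).sum(axis=1, keepdims=True)

# L2-normalize each sample (row): squared elements of each row sum to 1
X_l2 = X / np.linalg.norm(X, ord=2, axis=1, keepdims=True)

# absolute sum of a weight matrix, used above to compare how small
# the weights of each trained model became
W = np.random.randn(10, 8)
print(X_l1, X_l2, np.abs(W).sum(), sep="\n")
```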