In my last article, I showed you that, how actually a neural network works. Let me give
you a quick recap so that you get familiar if you have forgotten something.
The
purpose of this article is to introduce you with the idea of gradient descent,
which tells how neural networks learn. Then after that, we are going to see how
this particular network performs and what those hidden layers of neurons are
actually doing.
So,
let’s start.
In the previous article, we have seen the example of handwritten digit recognition.
These
digits are provided on a 28 by 28-pixel grid each pixel with some grayscale value between 0 & 1 which determines the activations of all 784
neurons in the input layer of the network, which then determines the activation
for each neuron in the next layers based on a weighted sum of all the
activations in the previous layer and a special number called a bias,
then you apply that sum with some other activations function like the sigmoid
or ReLU, etc.
If we
arbitrary choose two hidden layers with 20 neurons each, the system has
about 16330 weights and biases togetherthat we can
adjust. These weights and biases values tell us how our network will perform.
Then
what we mean when we say that this network classifies a given digit?Is
that the brightest of those 10 neurons in the last layer corresponds to that
digit?
The
second layer which is a hidden layer could pick up on the edges and the third
layer could pick up on patterns like loops and lines.
And
the last layer could just piece together those patterns to recognize the given digits.
So here we learn how the network learns.
We
want to write an algorithm where we can apply training data to this system
which is in the form of different images of handwritten digits along with
labels for what they are to be recognized and It will adjust those 16,330 weights
and biases to improve its performance on the training data.
MNIST database has put together a
collection of thousands of handwritten digit images each one is labeled with
the numbers that they are supposed to be.
Basically,
we are trying to find the minimum of a certain function, with each neuron is connected
to all of the other neurons in the previous layer. The weights in the
weighted sum defining its activation is like the strength of these connections and
the bias is the indication of whether that neuron tends to be active or
inactive.
So
let’s take a simple example and try to understand what simply happening.
Imagine,
one day you and your friends went for trekking. All of you reached on the top
of a mountain. As you are tired and want some rest, you told your friends to
move forward and get down you will be joining them after taking some rest. While you trying to get down a mountain with
a blindfold on. It’s impossible to know which direction to go in, but there’s
one thing you can know: if you will be going down (making progress) or going up
(losing progress). Eventually, if you keep taking steps that lead you
downwards, you’ll reach the base.
Similarly,
it’s impossible for us to know what our model’s weights should be right from
start. But with some trial and error based on the loss function (whether you descending),
you can end up getting there eventually.
We
have to first initialize all of those weights and biases randomly.
Let’s
imagine, we feed in this image of a 5, but due to this random weights
and biases, the output layer just looks very different and complex.
So,
to overcome this problem we define a cost function which tells us that the
output should have activations which are 0 for most neurons and 1
for the neuron which is recognizing that digit...
So, mathematically
we can tell that we have to add up the squares of the differences between each of
error output activations and the value that we want them to have, this is
nothing but the cost of a single training.
Like
this we have multiple training examples and their cost function output values.
The
sum of these values is small when the system confidently recognizes the image
correctly but it's large when the system seems like it doesn't know what it's
recognizing.
So, in
this case we consider the average cost of all training examples available to us,
which is the measure of how noisy the system is. The system was a function
that takes 784 inputs, the pixel values and spits out ten numbers as its
output using all these weights and biases.
The
cost function takes as its input those 16,330 weights and biases and it produces
a single number as a result.
Now a
question arises how to change those weights and biases to get better results?
Let’s
consider a function that has one number as an input and one number as an output.
Here, we are going to find an input that minimizes the value of this function.
Technique
which we are going to use here is to start at any input and find out in which
direction you should take your step to make that output lower.
So
basically, we find out the slope of the function where you are currently at and
then shift to the left if that slope is positive or shift to the right if that
slope is negative.
By
repeatedly doing this at every point that is checking the new slope and taking
the proper step, we will get local minimum of the function.
Because
of starting with the random input, we cannot predict where the local minimum
should be.
One
thing to note down here is, step sizes are proportional to the slope and when
the slope is flattening out towards the minimum, steps get smaller and smaller.
Let
us consider a function with two inputs and one output, by taking input space in
the XY plane and the cost function as a surface above it, here we have to
consider in which direction to step-in this input space?
The
gradient of a function gives you the direction of steepest ascent.
Now
arises a new question, in which direction should we step to increase the
function most quickly? taking the negative of that gradient gives you the
direction to step that decreases the function most quickly.
It's
the basic idea for a function that has 16,330 inputs instead of two
inputs and organizing all 16,330 weights and biases of the system into a
big column vector.
The
negative gradient of the cost function is a vector. It's some direction inside
this input space that tells you which nudges to all of those numbers are going
to cause the rapid decrease to the cost function.
Changing
the weights and biases will make the output of the network on each example of
training.
This
cost function involves an average overall of training data. This is what about
the backpropagation.
The
cost function should have a nice smooth output so that we can find a local
minimum by taking little steps downhill.
The process of continuously putting an input of a function by some
multiple of the negative gradient is called gradient descent.
So,
let me summarize what actually gradient descent does?
It’s
a way to converge towards a global minimum of a cost function.
It tries to calculate what a small change in
each individual weight would do to the loss function (i.e. which direction
should we move)
Then it adjusts each individual weight-based
on its gradient (i.e. take a small step in the determined direction)
It keeps iterating step 1 and step 2 until the
loss function gets as low as possible and reaches global minima to get the best
performance model.
Each
component of the negative gradient gives an idea about two things first
the sign, of course, tells us whether the corresponding Component
of the input, the vector should be pushed up or down, but relative magnitudes of all
these components tell you which changes matter more.
An
adjustment to one of the weights in our network might have a much greater
impact on the cost function than the adjustment to some other weight.
For
example, if you have some function with two variables as an input and you
compute that its gradient at some particular point comes out as (3,1). You
can also infer it as changes to the first variable have three times the importance
as changes to the second variable that at least in the neighborhood of the
relevant input.
So,
when you initialize the network with random weights and biases and try to adjust
them many times based on this gradient descent process
How well does is actually perform on the images that it’s never
seen before?
With
the simple neural network what we have consider above, we observe that the
model performance is around 93-95%. But if we play around with the hidden layer
structure and try to make the network more complicated with couple of tweaks, we
can get the model performance up to 98%. That’s pretty good but not the best.
We
can get the better performance by creating more complex network than the simple
vanilla network.
That’s
how neural networks works and with the help of backpropagation algorithm and
gradient descent we get our best model by reducing the error term.
Finally,
let me take a small example so that you may get more strong foundation about
gradient descent.
Let’s
try to predict the price of a new house from the given housing data:
In this example our task is to predict the price of a new house
based on the size of the house.
Let’s plot the given historical
housing data:
From the plot , we can see it follows a linear trend as the
size of the house increases, the price is also increasing.
Next, let’s try to use a simple linear regression model,
where we try to fit a line on the given historical data and predict the price
of a new house (Ypred) based on its size (X).
From
the figure, we can see that the red line gives the predicted house price (i.e.,
Ypred) given house size(X).
So,
Ypred can be given as: Ypred = a+bX
Now, the blue line represents the actual prices from the
historical data i.e., Yactual.
The difference between Yactual and Ypred which is represented
by the yellow dashed lines is called prediction error (error) E.
So now, our aim is to find a line with optimal values of a, b
which is known as coefficients/weights that best fits the historical
data by reducing the prediction error/ error and improving prediction accuracy.
Here, our goal is to find optimal values of a, b that
can minimizes the error between actual and predicted values of house price.
So, Sum of Squared
Errors (SSE) = ½ Sum (Actual House Price – Predicted House Price)2
SSE = ½ Sum (Y – Ypred)2
(NOTE: 1/2 is used for mathematical convenience since it helps us
in calculating gradients in calculus easily.
There are other types of
measures of Error. Here, we used SSE which is just one of them).
This
is where the Gradient Descent comes into the picture. Gradient descent
is an optimization algorithm that finds the optimal values of weights a, b that
reduces the prediction error.
Let’s now try to understand the Gradient Descent
algorithm with an example:
Below are the steps involved in the gradient descent algorithm
are mentioned.
Step 1: First,
Initialize the weights (a & b) with random values and calculate Error (SSE).
Step 2: Calculate
the gradient i.e. change in SSE when the weights (a & b) are changed by a
very small value from their original randomly initialized value. This helps us
move the values of a & b in the direction in which SSE can be minimized.
Step 3: Now,
we adjust the weights with the gradients to reach the optimal values so that SSE
can be minimized.
Step 4: Use
these new weights for prediction and then calculate the new SSE.
Step 5: Finally,
we repeat steps 2 and 3 till further adjustments to weights doesn’t
significantly reduce the Error.
Let’s
now go through each of the steps in detail.
But
before that, we have to standardize the data as it will make the optimization
process faster and convenient.
Step 1: To fit a line Ypred = a + b X, we start
off with random values of a and b and then calculate the prediction error (SSE)
accordingly as shown below.
Step 2: In our second step we calculate the error gradient w.r.t
the weights as shown below.
∂SSE/∂a
= – (Y-Ypred)
∂SSE/∂b
= – (Y-Ypred) X
Here, SSE = ½ (Y-Ypred)2 = ½ (Y- (a+bX))2
Here, ∂SSE/∂a and ∂SSE/∂b are the gradients and
which give the direction of the movement of a, b w.r.t to SSE.
Step 3: Adjust the weights with the gradients to reach
the optimal values where SSE is minimum.
Now, we need to
update the random values of a, b so that we can move in the direction of
optimal a, b where total SEE is minimum.
Update
rules:
·
a – ∂SSE/∂a
·
b – ∂SSE/∂b
So, update rules:
1.
New a = a – lr * ∂SSE/∂a = 0.45-0.01*3.300
= 0.42
2.
New b = b – lr * ∂SSE/∂b= 0.75-0.01*1.545
= 0.73
Here, lr is the
learning rate = 0.01, which is the pace of adjustment to the weights. I already
explained you about learning rate earlier.
Note: These
values are randomly taken to explain you the concepts. Don’t consider them as a
fixed one.
Step 4: Now, we use these new a and b values for our prediction and then use it
to calculate the new Total SSE.
Now we can observe
from the above table, with the new prediction, the total SSE has gone down
(0.677 to 0.553). That implies our prediction accuracy has improved and model
will show better performance.
Step 5: Finally, in our last step werepeat
step 3 and 4 till the time further adjustments to a, b doesn’t show significantly
reduce in the error.
At
that time, we have arrived at the optimal value of a, b with the highest
prediction accuracy.
This is how the
Gradient Descent Algorithm works behind the scene. This optimization algorithm
and different variants form the core of many machine learning and Deep
Learning.
Note: We don’t
have to do all this manually. There are many built-in functions available which
performs all this for us. Only we have to call them.
I hope, finally you came to know the what is Gradient, how
its work and its importance. In next tutorials, I will try to up come with
a detailed explanation of some others topics.For more blogs/courses on
data science, machine learning, artificial intelligence and new technologies do
visit us atInsideAIML.