How well does is actually perform on the images that it’s never seen before?
Steps involved in the gradient descent algorithm
In my
last video, I showed you that, how actually a neural network works. Let me give
you a quick recap so that you get familiar if you have forgotten something.
The
purpose of this video is to introduce you with the idea of gradient descent,
which tells how neural networks learn. Then after that we are going to see how
this particular network performs and what those hidden layers of neurons are
actually doing.
So,
let’s start.
In my
previous videos, we have seen the example of handwritten digit recognition.
These
digits are provided on a 28 by 28 pixel grid each pixel with some
grayscale value between 0 & 1 which determines the activations of all 784
neurons in the input layer of the network, which then determines the activation
for each neuron in the next layers based on a weighted sum of all the
activations in the previous layer and a special number called a bias,
then you apply that sum with some other activations function like the sigmoid
or ReLU, etc.
If we
arbitrary choose two hidden layers with 20 neurons each, the system has
about 16330 weights and biases togetherthat we can
adjust. This weights and biases values tell us how our network will perform.
Then
what we mean when we say that this network classifies a given digit?Is
that the brightest of those 10 neurons in the last layer corresponds to that
digit?
The
second layer which is a hidden layer could pick up on the edges and the third
layer could pick up on patterns like loops and lines.
And
the last layer could just piece together those patterns to recognize the given digits.
So here we learn how the network learns.
We
want to write an algorithm where we can apply training data to this system
which is in the form of different images of handwritten digits along with
labels for what they are to be recognized and It will adjust those 16,330 weights
and biases to improve its performance on the training data.
MNIST database have put together
a collection of thousands of handwritten digit images each one is labeled with
the numbers that they are supposed to be.
Basically,
we are trying to find the minimum of a certain function, with each neuron is connected
to all of the other neurons in the previous layer. The weights in the
weighted sum defining its activation is like strength of this connections and
the bias is the indication of whether that neuron tends to be active or
inactive.
So let’s take a simple example and try to
understand what simply happening.
Imagine, one day you and your friends
went for trekking. All of you reached on the top of a mountain. As you are
tired and want some rest, you told your friends to move forward and get down
you will be joining them after taking some rest. While you trying to get down a mountain with
a blindfold on. It’s impossible to know which direction to go in, but there’s
one thing you can know: if you will be going down (making progress) or going up
(losing progress). Eventually, if you keep taking steps that lead you
downwards, you’ll reach the base.
Similarly, it’s impossible for us to know
what our model’s weights should be right from start. But with some trial and
error based on the loss function (whether you descending), you can end up
getting there eventually.
We
have to first initialize all of those weights and biases randomly.
Let’s
imagine, we feed in this image of a 5, but due to this random weights
and biases, the output layer just looks very different and complex.
So,
to overcome this problem we define a cost function which tells us that the
output should have activations which are 0 for most neurons and 1
for the neuron which is recognizing that digit...
So, mathematically
we can tell that we have to add up the squares of the differences between each
of error output activations and the value that we want them to have, this is
nothing but the cost of a single training.
Like
this we have multiple training examples and their cost function output values.
The
sum of these values is small when the system confidently recognizes the image
correctly but it's large when the system seems like it doesn't know what it's
recognizing.
So, in
this case we consider the average cost of all training examples available to us,
which is the measure for how noisy the system is. The system was a function
that takes 784 inputs, the pixel values and spits out ten numbers as its
output using all these weights and biases.
The
cost function takes as its input those 16,330 weights and biases and it produces
a single number as a result.
Now a
question arises how to change those weights and biases to get better results?
Let’s
consider a function that has one number as an input and one number as an output.
Here, we are going to find an input that minimizes the value of this function.
Technique
which we are going to use here is to start at any input and find out in which
direction you should take your step to make that output lower.
So
basically, we find out the slope of the function where you are currently at and
then shift to the left if that slope is positive or shift to the right if that
slope is negative.
By
repeatedly doing this at every point that is checking the new slope and taking
the proper step, we will get local minimum of the function.
Because
of starting with the random input, we cannot predict where the local minimum
should be.
One
thing to note down here is, step sizes are proportional to the slope and when
the slope is flattening out towards the minimum, steps get smaller and smaller.
Let
us consider a function with two inputs and one output, by taking input space in
the XY plane and the cost function as a surface above it, here we have to
consider in which direction to step-in this input space?
The
gradient of a function gives you the direction of steepest ascent.
Now arises
a new question, in which direction should we step to increase the function
most quickly? taking the negative of that gradient gives you the direction
to step that decreases the function most quickly.
It's
the basic idea for a function that has 16,330 inputs instead of two
inputs and organizing all 16,330 weights and biases of the system into a
big column vector.
The
negative gradient of the cost function is a vector. It's some direction inside
this input space that tells you which nudges to all of those numbers is going
to cause the rapid decrease to the cost function.
Changing
the weights and biases will make the output of the network on each example of
training.
This
cost function involves an average over all of training data. This is what about
the back propagation.
The
cost function should have a nice smooth output so that we can find a local
minimum by taking little steps downhill.
The
process of continuously putting an input of a function by some multiple of the
negative gradient is called gradient descent.
So,
let me summarize what actually gradient descent does?
It’s
a way to converge towards global minimum of a cost function.
· It
tries to calculate what a small change in each individual weight would do to
the loss function (i.e. which direction should we move)
· Then
it adjusts each individual weight based on its gradient (i.e. take a small step
in the determined direction)
· It
keeps iterating step 1 and step 2 until the loss function gets as low as
possible and reach global minima to get the best performance model.
Each
component of the negative gradient gives an idea about two things
first
the sign of course tells us whether the corresponding
Component
of the input vector should be pushed up or down, but relative magnitudes of all
these components tells you which changes matter more.
An
adjustment to one of the weights in our network might have a much greater
impact on the cost function than the adjustment to some other weight.
For
example, if you have some function with two variables as an input and you
compute that its gradient at some particular point comes out as (3,1).
You
can also infer it as, changes to first variable have three times the importance
as changes to the second variable that at least in the neighborhood of the
relevant input.
So,
when you initialize the network with random weights and biases and try to
adjust them many times based on this gradient descent process
How
well does is actually perform on the images that it’s never seen before?
With
the simple neural network what we have consider above, we observe that the
model performance is around 93-95%. But if we play around with the hidden layer
structure and try to make the network more complicated with couple of tweaks, we
can get the model performance up to 98%. That’s pretty good but not the best.
We
can get the better performance by creating more complex network than the simple
vanilla network.
That’s
how neural networks works and with the help of backpropagation algorithm and
gradient descent we get our best model by reducing the error term.
Finally,
let me take a small example so that you may get more strong foundation about
gradient descent.
Let’s
try to predict the price of a new house from the given housing data:
In this example our task is to predict the price of a new house
based on the size of the house.
Let’s plot the given historical housing data:
From the plot , we can see it
follows a linear trend as the size of the house increases, the price is also
increasing.
Next,
let’s try to use a simple linear regression model, where we try to fit a
line on the given historical data and predict the price of a new house (Ypred) based
on its size (X).
Figure: Regression line
From the figure, we can see that the red line gives the
predicted house price (i.e., Ypred) given house size(X).
So, Ypred can be given as: Ypred = a+bX
Now, the blue line represents the
actual prices from the historical data i.e., Yactual.
The difference between Yactual and
Ypred which is represented by the yellow dashed lines is called prediction
error (error) E.
So now, our aim is to find a line
with optimal values of a, b which is known as coefficients/weights that
best fits the historical data by reducing the prediction error/ error and
improving prediction accuracy.
Here, our goal is to find optimal values of a, b that
can minimizes the error between actual and predicted values of house price.
So, Sum of Squared
Errors (SSE) = ½ Sum (Actual House Price – Predicted House Price)2
SSE = ½ Sum (Y – Ypred)2
(NOTE: 1/2 is used for mathematical convenience since it helps us
in calculating gradients in calculus easily.
There are other types of
measures of Error. Here, we used SSE which is just one of them).
This
is where the Gradient Descent comes into the picture. Gradient descent
is an optimization algorithm that finds the optimal values of weights a, b that
reduces the prediction error.
Let’s now try to understand the Gradient Descent
algorithm with an example:
Below are the steps involved in the gradient descent algorithm
are mentioned.
Step 1: First, Initialize the
weights (a & b) with random values and calculate Error (SSE).
Step 2: Calculate the gradient
i.e. change in SSE when the weights (a & b) are changed by a very small
value from their original randomly initialized value. This helps us move the
values of a & b in the direction in which SSE can be minimized.
Step 3: Now, we adjust the weights
with the gradients to reach the optimal values so that SSE can be minimized.
Step 4: Use these new weights for
prediction and then calculate the new SSE.
Step 5: Finally, we repeat steps 2
and 3 till further adjustments to weights doesn’t significantly reduce the
Error.
Let’s
now go through each of the steps in detail.
But
before that, we have to standardize the data as it will make the optimization
process faster and convenient.
Step
1: To fit
a line Ypred = a + b X, we start off with random values of a and b and then calculate
the prediction error (SSE) accordingly as shown below.
Figure: Calculated SSE
Step
2: In our second step we calculate the error gradient w.r.t the weights
as shown below.
∂SSE/∂a
= – (Y-Ypred)
∂SSE/∂b
= – (Y-Ypred) X
Here, SSE = ½ (Y-Ypred)2 = ½ (Y- (a+bX))2
Here, ∂SSE/∂a and ∂SSE/∂b are the gradients and
which give the direction of the movement of a, b w.r.t to SSE.
Figure: Derivatives of a and
Step
3: Adjust the
weights with the gradients to reach the optimal values where SSE is minimum.
Like the Blog, then Share it with your friends and colleagues to make this AI community stronger.
To learn more about nuances of Artificial Intelligence, Python Programming, Deep Learning, Data Science and Machine Learning, visit our insideAIML blog page.