#### World's Best AI Learning Platform with profoundly Demanding Certification Programs

Designed by IITian's, only for AI Learners.

Download our e-book of Attention Mechanism

How garbage collection implemented in python? What is use of rank() function? How to know given a binary tree is a binary search tree or not? For loop giving incorrect answer How to Choose a classification algorithm for particular problem? Remove duplicate characters in a given string keeping only the first occurrences. For example, if the input is ‘tree traversal’ the output will be ‘tre avsl’. What is list and explain list methods? How to use datetime objects? Join Discussion

4 (4,001 Ratings)

220 Learners

Jun 24th (7:00 PM) 238 Registered

Neha Kumawat

10 months ago

In some of my previous articles, I have explained about the
activation functions and loss functions used in machine learning/deep learning.
And also written an article on optimizers and its types. I recommend you to
once go through it for better understanding.

In this article, I will give you an in-depth explanation of
Gradient descent optimizers and its different types.

So, let’s start…

In
previous articles, we saw how to deal with loss functions, which is a
mathematical way of measuring how wrong our predictions are.

During
the training process, we tweak and change the parameters (weights) of our model
to try and minimize that loss function, and make our predictions as correct and
optimized as possible.

But
you may be thinking that **how exactly do we do that?** **How do we change
the parameters of our model, by how much, and when?** This all questions are
very important which surely affects our model performance.

Now, where the optimizers come into the
picture. **Optimizers try to combine
together the loss function and model parameters by updating the model in
response to the output of the loss function. In simpler terms, we can say that
the optimizers shape and mould our model into its most accurate possible form
by dabbling with the weights. **The loss function act as a guide to the
terrain, telling the optimizer when it’s moving in the right or wrong
direction.

Let’s
take a simple example and try to understand what simply happening.

Imagine,
one day you and your friends went for trekking. All of you reached on the top
of a mountain. As you are tired and want some rest, you told your friends to
move forward and get down you will be joining them after taking some rest. While you trying to get down a mountain with
a blindfold on. It’s impossible to know which direction to go in, but there’s
one thing you can know: if you will be going down (making progress) or going up
(losing progress). Eventually, if you keep taking steps that lead you
downwards, you’ll reach the base.

Similarly,
it’s impossible for us to know what our model’s weights should be right from
start. But with some trial and error based on the loss function (whether you descending),
you can end up getting there eventually.

A hiker getting down from a mountain

Now as we know any discussion about
optimizers needs to begin with the most popular one, and which is known as **Gradient
Descent.** This algorithm is used across all types of Machine Learning and
Deep Learning problems which are to be optimized. It’s fast, robust, and
flexible and good performance.

Gradient
descent is one of the types of an optimization algorithm used to minimize some
loss function by iteratively moving in the direction of steepest descent as
defined by the negative of the gradient. In machine learning, we use gradient
descent to update the parameters of
our model. These parameters are nothing but they refer to coefficients in Linear
Regression in machine learning
and weights in
neural networks in deep learning.

1.
It tries to calculate what a small change in
each individual weight would do to the loss function (i.e. which direction
should the hiker walk-in)

2.
Then it adjusts each individual weight-based
on its gradient (i.e. take a small step in the determined direction)

3.
It keeps iterating step 1 and step 2 until the
loss function gets as low as possible and get the best model.

Local and Global minima

So, you might be thinking about what is Gradient and
descent is in gradient descent algorithm?

As of now, you may know, Gradients are nothing but partial
derivatives wrt weights and loss and are a measure of change. And Descent means
in which direction we should move to achieve global minima. They connect
the loss function and the weights; they tell us what specific operation we
should perform to our weights – add 6, subtract .06, or anything else which
helps us to lower the output of the loss function and thereby make our model
more accurate.

There are some other elements such as cost function, learning
rate, etc. which play an important role that makeup Gradient Descent and also
generalize to other optimizers.

The main point for learning neural networks is to define a
cost function which is also known as a loss function. Cost/Loss functions measures
how well the network predicts outputs on the test set. The goal is to then find
a set of weights and biases values that minimizes the cost/loss. One common
function that is often used is the mean squared error, which measures the difference between the actual value
of y and
the estimated value of y (the prediction).

The equation of the below regression line is **hθ(x)
= θ + θ1x**,
which has only two parameters: weight (θ1) and bias (θ0).

Cost Function

Learning rate is nothing but the size
of the steps. It plays a very important role in optimizing
our model. With a high value of learning rate, we can capture more ground in
each step, but we may risk overshooting the minima point as the slope of the
hill is constantly changing. On the other hand, with a very low learning rate,
we can move in the direction of the negative gradient as we are recalculating
it so frequently.

A low learning rate is more precise, but it’s a
time-consuming, so it will take us a very long time to achieve the global
minima point. (lowest point) and sometimes it also gets stuck at the local
minima.

So, choosing the correct value of the learning rate plays
an important role in our model performance.

Learning rate

Let’s
try to predict the price of a new how house from the given housing data:

Below
is the given historical data:

historical data

Let’s plot the given historical
housing data:

historical housing data

From the above figure, we can see it follows a linear trend
as **the size of the house increases, the price is also increasing.**

Next, let’s try to use a **simple linear regression model,**
where we try to fit a line on the given historical data and predict the price
of a new house (Ypred) based on its size (X).

Regression line

From
the above figure, we can see that the red line gives the predicted house price
(i.e., Ypred) given house size(X).

So,
Ypred can be given as:

Now, the blue line represents the actual prices from the
historical data i.e., Yactual.

The difference between Yactual and Ypred which is represented
by the yellow dashed lines is called **prediction error (error) E**.

So now, our aim is to find a line with optimal values of a, b
which is known as coefficients/weights that best fits the historical data by
reducing the prediction error/ error and improving prediction accuracy.

Here, our goal is to find optimal values of **a, b** that
can minimizes the error between actual and predicted values of house price

(NOTE: 1/2 is used for mathematical convenience since it helps us
in calculating gradients in calculus easily)

This
is where the Gradient Descent comes into the picture. Gradient descent is an
optimization algorithm that finds the optimal values of weights a, b that reduces
the prediction error.

Let’s now try to understand the **Gradient Descent algorithm
**with an example**:**

Below are the steps involved in the gradient descent algorithm
are mentioned.

But
before that, we have to standardize the data as it will make the optimization
process faster and convenient.

Normal and standardize data

Calculated SSE

∂SSE/∂a
= – (Y-Ypred)

∂SSE/∂b
= – (Y-Ypred) X

Here, SSE = ½ (Y-Ypred)2 = ½ (Y- (a+bX))2

Here, **∂SSE/∂a** and **∂SSE/∂b **are the **gradients** and
which give the direction of the movement of a, b w.r.t to SSE.

Derivatives of a and b

Total SSE

Now, we need to
update the random values of a, b so that we can move in the direction of
optimal a, b where total SEE is minimum.

·
a – ∂SSE/∂a

·
b – ∂SSE/∂b

So, update rules:

1.
New a = a – lr * **∂SSE/∂a = **0.45-0.01*3.300
= 0.42

2.
New b = b – lr * **∂SSE/∂b= **0.75-0.01*1.545
= 0.73

Here, lr is the
learning rate = 0.01, which is the pace of adjustment to the weights. I already
explained you about learning rate earlier.

Note: These
values are randomly taken to explain you the concepts. Don’t consider them as a
fixed one.

Total SSE with new weights

Now we can observe
from the above table, with the new prediction, the total SSE has gone down
(0.677 to 0.553). That implies our prediction accuracy has improved and model
will show better performance.

At
that time, we have arrived at the optimal value of a, b with the highest
prediction accuracy.

This is how the
Gradient Descent Algorithm works behind the scene. This optimization algorithm
and different variants form the core of many machine learning and Deep
Learning.

Note: We don’t
have to do all this manually. There are many built-in functions available which
performs all this for us. Only we have to call them.

There
are many different types but mainly these three variants of gradient descent
are common, which differ in how much data we use to compute the gradient of the
objective function. Depending on the amount of data, we make a trade-off
between the accuracy of the parameter update and the time, it takes to perform
an update.

Let’s
see how they differ from each other’s.

Because
this method calculates **the gradient for the entire data set in one
update, the calculation is very slow**, it will be very tricky to encounter a large number of data sets, and you cannot invest in new data to update the
model in real-time.

We
will define an iteration number epoch in advance, first calculate the gradient
vector params_grad, and then update the parameter params along the direction of
the gradient. The learning rate determines how big we take each step.

where x is a
parameter, dx is the gradient and learning rate is constant

For large
data sets, there may be similar samples, so BGD calculates the gradient. There will be
redundancy, and SGD is updated only once, there is no redundancy, it is faster,
and new samples can be added.

Fluctuations in SGD

When we
decrease the learning rate slightly, the convergence of SGD and BGD is the
same.

MBGD uses a
small batch of samples, that is n samples to calculate each time. In this way,
it can reduce the variance when the parameters are updated, and the convergence
is more stable. It can make full use of the highly optimized matrix operations
in the deep learning library for more efficient gradient calculations.

The
difference from SGD is that each cycle does not act on each sample, but a batch
with n samples.

Setting value of hyper-parameters: n Generally value is 50 ～ 256

- Mini-batch gradient descent does not guarantee good convergence,

- If the learning rate is too small, the convergence rate will be slow. If it is too large, the loss function will oscillate or even deviate at the minimum value. One measure is to set a larger learning rate. When the change between two iterations is lower than a certain threshold, the learning rate is reduced.

However,
the setting of this threshold needs to be written in advance adapt to the
characteristics of the data set.

In
addition, this method is to apply the **same learning rate** to all parameter updates. If our data is sparse, we would
prefer to update the features with lower frequency.

In
addition, for **non-convex functions,** it is also necessary to avoid trapping at the local
minimum or saddle point, because the error around the saddle point is the same,
the gradients of all dimensions are close to 0, and SGD is easily trapped here.

One
disadvantage of the SGD method is that its update direction depends entirely on
the current batch, so its update is very unstable. A simple way to solve this
problem is to introduce momentum.

Momentum is momentum, which simulates the inertia of an object when it is moving, that
is, the direction of the previous update is retained to a certain extent during
the update, while the current update gradient is used to fine-tune the final
update direction. In this way, you can increase the stability to a certain
extent, so that you can learn faster, and also have the ability to get rid of
local optimization.

SGD with and without momentum

I hope after reading this article, finally, you came to know **what is Gradient descent optimizers, how its work, different types, and
its importance**. In the next articles, I will come with a detailed explanation
of some other types of optimizers.** **For more blogs/courses on data
science, machine learning, artificial intelligence, and new technologies do
visit us at **InsideAIML.**

Thanks for reading…