
# Optimizers in Machine Learning and Deep Learning

Kajal Pawar

3 years ago

Table of Contents
• What is an optimizer in Machine Learning/Deep Learning?
• Learning rate
• Types of Gradient descent optimizers
• Other Types of Optimizers
• How do different optimizers work?
• How can we choose an optimizer?
• Evolutionary map of Optimizers
In some of my previous articles, I explained the activation functions and loss functions used in machine learning and deep learning. I recommend going through them first for a better understanding.
In this article, I will give a brief introduction to optimizers and their types; later I will try to write a detailed article on each type of optimizer.
So, let’s start…
Deep learning algorithms involve optimization in many contexts. For example, performing inference in models such as PCA involves solving an optimization problem. We also often use optimization to write proofs or design algorithms.
As most of us already know, it is quite common to invest days or even months of time on hundreds of machines to solve a single instance of the neural network training problem. Because this problem is so important and so expensive, a specialized set of optimization techniques has been developed for solving it. This article presents some of the optimization techniques used in machine learning and deep learning.

## What is an optimizer in Machine Learning/Deep Learning?

In previous articles, we saw how to deal with loss functions, which are a mathematical way of measuring how wrong our predictions are.
During training, we adjust the parameters (weights) of our model to try to minimize that loss function and make our predictions as accurate as possible.
But you may be wondering how exactly we do that. How do we change the parameters of our model, by how much, and when? All these questions are important, because the answers directly affect our model’s performance.
This is where optimizers come into the picture. Optimizers tie together the loss function and the model parameters by updating the model in response to the output of the loss function. In simpler terms, optimizers shape and mold our model into its most accurate possible form by adjusting the weights. The loss function acts as a guide to the terrain, telling the optimizer whether it is moving in the right or wrong direction.
Let’s take a simple example to understand what is happening.
Imagine that one day you and your friends go trekking and all of you reach the top of a mountain. You are tired and want some rest, so you tell your friends to go ahead and start the descent, and that you will join them later. Now imagine trying to get down the mountain with a blindfold on. It’s impossible to know which direction to go in, but there’s one thing you can know: whether you are going down (making progress) or going up (losing progress). Eventually, if you keep taking steps that lead you downwards, you’ll reach the base.
Similarly, it’s impossible for us to know what our model’s weights should be right from the start. But with some trial and error based on the loss function (whether you are descending), you can get there eventually.
Any discussion about optimizers needs to begin with the most popular one, which is known as Gradient Descent. This algorithm is used across all types of machine learning and deep learning problems that require optimization. It is fast, robust, and flexible, and it performs well.

Gradient descent is an optimization algorithm used to minimize a loss function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. These parameters are the coefficients in linear regression and the weights in neural networks.
Let’s see how it works:
1. It calculates what a small change in each individual weight would do to the loss function (i.e. which direction the hiker should walk in).
2. It adjusts each individual weight based on its gradient (i.e. takes a small step in the determined direction).
3. It keeps repeating steps 1 and 2 until the loss function gets as low as possible, which gives the best model.
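The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the toy `loss` function, the starting weights, the learning rate of 0.1, and the helper names `numerical_gradient` and `gradient_descent` are all assumptions made for this example. Real frameworks compute gradients analytically rather than by nudging each weight.

```python
def loss(weights):
    """Hypothetical toy loss: a bowl whose minimum is at (3, -2)."""
    w0, w1 = weights
    return (w0 - 3.0) ** 2 + (w1 + 2.0) ** 2


def numerical_gradient(f, weights, eps=1e-6):
    """Step 1: estimate how a small change in each weight changes the loss."""
    grad = []
    for i in range(len(weights)):
        bumped_up = list(weights)
        bumped_down = list(weights)
        bumped_up[i] += eps
        bumped_down[i] -= eps
        # Central difference approximation of the partial derivative.
        grad.append((f(bumped_up) - f(bumped_down)) / (2 * eps))
    return grad


def gradient_descent(f, weights, learning_rate=0.1, steps=200):
    """Steps 2 and 3: repeatedly move each weight against its gradient."""
    weights = list(weights)
    for _ in range(steps):
        grad = numerical_gradient(f, weights)
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


final_weights = gradient_descent(loss, [0.0, 0.0])
print(final_weights, loss(final_weights))
```

On this toy loss the weights should end up very close to (3, -2), the bottom of the bowl.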
Note: The ultimate aim of this algorithm is to reach the global minimum and not get stuck at a local minimum.
So, you might be wondering what “gradient” and “descent” mean in the gradient descent algorithm.
Gradients are the partial derivatives of the loss with respect to the weights, and they measure how the loss changes as each weight changes. Descent means moving in the direction that lowers the loss, toward the global minimum. Gradients connect the loss function and the weights; they tell us what specific operation we should perform on our weights (add 6, subtract 0.06, or anything else) to lower the output of the loss function and thereby make our model more accurate.
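In symbols, and assuming the notation $w$ for a weight, $L$ for the loss, and $\eta$ for the step size (none of which appear in the text above), one gradient descent update can be written as:

$$
w_{\text{new}} = w_{\text{old}} - \eta \, \frac{\partial L}{\partial w}
$$

For example, if the gradient for a weight is $0.6$ and $\eta = 0.1$, the update subtracts $0.1 \times 0.6 = 0.06$ from that weight.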
There are some other elements that make up Gradient Descent and that also generalize to other optimizers.

## Learning rate

The learning rate is the size of the steps. It plays a very important role in optimizing our model. With a high learning rate, we can cover more ground in each step, but we risk overshooting the minimum because the slope of the hill is constantly changing. With a very low learning rate, on the other hand, we can move confidently in the direction of the negative gradient because we are recalculating it so frequently.
A low learning rate is more precise, but recalculating the gradient so often is time-consuming, so it will take a very long time to reach the global minimum (the lowest point), and sometimes it can also get stuck at a local minimum.
So, choosing the correct value of the learning rate plays an important role in our model performance.
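To make the trade-off concrete, here is a small sketch that runs plain gradient descent on the toy loss f(w) = w², chosen only because its gradient 2w is easy to verify by hand. The function name `run_gradient_descent`, the starting point, and the three learning rates are assumptions made for this illustration.

```python
def run_gradient_descent(learning_rate, start=10.0, steps=20):
    """Minimise the toy loss f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient
    return w


# Too small, reasonable, and too large (all values are illustrative).
for lr in (0.001, 0.1, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {run_gradient_descent(lr):.4f}")
```

With the smallest rate, w has barely moved from its starting point after 20 steps; with the middle rate it is close to the minimum at 0; with the largest rate each step overshoots further than the last and w moves away from the minimum.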

## Types of Gradient descent optimizers

There are mainly three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
1. Batch gradient descent/ Vanilla gradient descent
2. SGD (Stochastic gradient descent)
3. Mini-batch gradient descent
In this article, I will not go into the details of the above optimizers. Later I will try to write a separate article on each topic.
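As a rough illustration only, the sketch below shows that the three variants can share the same update rule and differ only in how many samples feed each gradient computation. The toy data, the one-weight model y = w * x, and the names `gradient` and `train` are assumptions made for this example.

```python
import random

# Hypothetical toy data generated from y = 2 * x, so the ideal weight is 2.0.
DATA = [(float(x), 2.0 * x) for x in range(1, 9)]


def gradient(w, samples):
    """Gradient of the mean squared error of the model y = w * x over samples."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def train(batch_size, learning_rate=0.01, epochs=50, seed=0):
    """batch_size=len(DATA) is batch GD, 1 is SGD, anything between is mini-batch."""
    rng = random.Random(seed)
    data = list(DATA)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= learning_rate * gradient(w, batch)
    return w


w_batch = train(batch_size=len(DATA))  # one update per epoch
w_sgd = train(batch_size=1)            # one update per sample
w_mini = train(batch_size=4)           # one update per mini-batch
print(w_batch, w_sgd, w_mini)
```

Because this toy data has no noise, all three variants should settle near w = 2.0; on real data the smaller batches give noisier but cheaper updates.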

## Other Types of Optimizers

As we already know, the gradient descent algorithm is very popular and is used everywhere from classical machine learning up to complex neural networks in deep learning. (Backpropagation computes the gradients that gradient descent then uses to update a network.)
There are some other types of optimizers that are widely used. Some of them are listed below:
1. Adagrad
2. Adadelta
3. RMSProp
4. Adam
1. Adagrad
Adagrad is a gradient-based optimization algorithm that adapts the learning rate to the parameters, using low learning rates for parameters associated with frequently occurring features and high learning rates for parameters associated with infrequent features.
So, it is well suited to sparse data.
The motivation is that the same learning rate may not be suitable for all parameters. For example, some parameters may have reached the stage where only fine-tuning is needed, while others still need to be adjusted a lot because few samples correspond to them.
Adagrad addresses this problem by adaptively assigning a different learning rate to each parameter. The implication is that, for each parameter, the learning rate shrinks as the accumulated size of its past gradients grows.
GloVe word embeddings are trained with Adagrad, since infrequent words require larger updates and frequent words require smaller updates.
Adagrad largely removes the need to manually tune the learning rate during training.
Three main problems arise with the Adagrad algorithm:
• The learning rate is monotonically decreasing.
• The learning rate in the late training period is very small.
• It requires manually setting a global initial learning rate.
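The behaviour described for Adagrad can be sketched as follows. This is a minimal illustration under stated assumptions: the toy gradient function, the global rate of 0.5, and the names `adagrad` and `toy_gradient` are hypothetical, and the code records the effective per-parameter rate only to show that it never increases.

```python
import math


def adagrad(grad_fn, weights, learning_rate=0.5, steps=100, eps=1e-8):
    """Adagrad sketch: each weight divides the global rate by the square
    root of the sum of all of its own past squared gradients."""
    weights = list(weights)
    squared_sum = [0.0] * len(weights)  # only ever grows
    effective_rates = []
    for _ in range(steps):
        grads = grad_fn(weights)
        for i, g in enumerate(grads):
            squared_sum[i] += g * g
            rate = learning_rate / (math.sqrt(squared_sum[i]) + eps)
            weights[i] -= rate * g
        effective_rates.append(
            [learning_rate / (math.sqrt(s) + eps) for s in squared_sum]
        )
    return weights, effective_rates


def toy_gradient(weights):
    """Gradient of the hypothetical loss w0**2 + 10 * w1**2."""
    return [2.0 * weights[0], 20.0 * weights[1]]


final_weights, rates = adagrad(toy_gradient, [1.0, 1.0])
print(final_weights)
print("first step rates:", rates[0])
print("last step rates: ", rates[-1])
```

The weight with the larger gradients (the second one) receives the smaller effective rate, and both rates only shrink over time, which is the monotonic decrease listed above. The global `learning_rate` still has to be chosen by hand.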
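2. Adadelta
Adadelta is an extension of Adagrad that tries to reduce its aggressive, monotonically decreasing learning rate. It does this by restricting the window of past accumulated gradients to some fixed size w. Instead of storing w previous squared gradients, it keeps a running average, which at time t depends only on the previous average and the current gradient.
In Adadelta, we do not need to set a default learning rate, because the update uses the ratio of the running root mean square of previous parameter updates to the running root mean square of the gradients.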
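A minimal sketch of this update is shown below, following the commonly published form of Adadelta. The decay factor `rho`, the constant `eps`, the toy loss, and all function names are assumptions made for this illustration; note that no learning rate appears anywhere in the update.

```python
import math


def toy_loss(weights):
    """Hypothetical loss used only for this illustration."""
    return weights[0] ** 2 + 10.0 * weights[1] ** 2


def toy_gradient(weights):
    """Gradient of toy_loss."""
    return [2.0 * weights[0], 20.0 * weights[1]]


def adadelta(grad_fn, weights, rho=0.95, eps=1e-6, steps=1000):
    """Adadelta sketch: step = -(RMS of past updates / RMS of gradients) * g."""
    weights = list(weights)
    avg_sq_grad = [0.0] * len(weights)    # running average of g**2
    avg_sq_update = [0.0] * len(weights)  # running average of update**2
    for _ in range(steps):
        grads = grad_fn(weights)
        for i, g in enumerate(grads):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g * g
            rms_update = math.sqrt(avg_sq_update[i] + eps)
            rms_grad = math.sqrt(avg_sq_grad[i] + eps)
            update = -(rms_update / rms_grad) * g
            avg_sq_update[i] = rho * avg_sq_update[i] + (1 - rho) * update * update
            weights[i] += update
    return weights


start_weights = [1.0, 1.0]
final_weights = adadelta(toy_gradient, start_weights)
print(toy_loss(start_weights), toy_loss(final_weights))
```

The printed loss after training should be lower than the starting loss. The first steps are small because both running averages start at zero, so `eps` sets the initial step size.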
3. RMSProp
RMSProp stands for Root Mean Square Propagation. It is an adaptive learning rate optimization algorithm proposed by Geoff Hinton.
RMSProp tries to resolve Adagrad’s radically diminishing learning rates by using a moving average of the squared gradient. It uses the magnitude of recent gradients to normalize the current gradient.
Adagrad accumulates all previous squared gradients, whereas RMSProp only keeps a decaying average of them, which alleviates the problem of Adagrad’s learning rate dropping too quickly.
Because RMSProp works with an exponentially weighted average of the squared gradients, it damps the directions in which the updates swing widely, so the swing amplitude in each dimension becomes smaller. This also helps the network converge faster.
In RMSProp, the learning rate is adjusted automatically, and a different learning rate is chosen for each parameter.
RMSProp divides the learning rate by the square root of an exponentially decaying average of squared gradients.
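The RMSProp rule described above can be sketched like this. The decay factor of 0.9, the learning rate of 0.01, the toy gradient, and the function names are assumptions made for this illustration rather than values taken from the article.

```python
import math


def toy_gradient(weights):
    """Gradient of the hypothetical loss w0**2 + 10 * w1**2."""
    return [2.0 * weights[0], 20.0 * weights[1]]


def rmsprop(grad_fn, weights, learning_rate=0.01, decay=0.9, eps=1e-8, steps=500):
    """RMSProp sketch: scale each step by a decaying average of squared gradients."""
    weights = list(weights)
    avg_sq_grad = [0.0] * len(weights)
    for _ in range(steps):
        grads = grad_fn(weights)
        for i, g in enumerate(grads):
            # Old gradients fade out instead of piling up as in Adagrad.
            avg_sq_grad[i] = decay * avg_sq_grad[i] + (1 - decay) * g * g
            weights[i] -= learning_rate * g / (math.sqrt(avg_sq_grad[i]) + eps)
    return weights


final_weights = rmsprop(toy_gradient, [1.0, 1.0])
print(final_weights)
```

Both weights should end up close to 0, even though the second one sees gradients ten times larger, because each step is normalized by that weight’s own recent gradient magnitude.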
4. Adam
Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients, like Adadelta and RMSProp, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
Adam can be viewed as a combination of Adagrad, which works well on sparse gradients, and RMSProp, which works well in online and non-stationary settings.
Adam uses an exponential moving average of the squared gradients to scale the learning rate, instead of the cumulative sum used in Adagrad.
Adam is computationally efficient and has low memory requirements.
The Adam optimizer is one of the most popular gradient descent optimization algorithms.
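A minimal sketch of the Adam update, combining the two running averages described above with bias correction, is given below. The hyperparameter values, the toy loss, and the function names are assumptions made for this illustration.

```python
import math


def toy_loss(weights):
    """Hypothetical loss used only for this illustration."""
    return weights[0] ** 2 + 10.0 * weights[1] ** 2


def toy_gradient(weights):
    """Gradient of toy_loss."""
    return [2.0 * weights[0], 20.0 * weights[1]]


def adam(grad_fn, weights, learning_rate=0.05, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=1000):
    """Adam sketch: momentum-like average of gradients (m) plus an
    RMSProp-like average of squared gradients (v), both bias-corrected."""
    weights = list(weights)
    m = [0.0] * len(weights)
    v = [0.0] * len(weights)
    for t in range(1, steps + 1):
        grads = grad_fn(weights)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            # Both averages start at zero, so correct their early bias.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            weights[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return weights


start_weights = [1.0, 1.0]
final_weights = adam(toy_gradient, start_weights)
print(toy_loss(start_weights), toy_loss(final_weights))
```

Only two extra numbers are stored per weight (`m` and `v`), which is consistent with the low memory requirements mentioned above.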

## Let’s see visually how different optimizers work

### How can we choose an optimizer?

The choice of optimizer depends on many factors, such as the objective of the problem. The points below are some rules of thumb to consider when choosing one.
• RMSProp, Adadelta, and Adam have similar effects in many cases.
• Adam essentially adds bias-correction and momentum to RMSProp.
• As the gradients become sparser, Adam tends to perform better than RMSProp.
Note: Overall, Adam is often a good default choice.
• SGD, often without momentum, is used in many papers. Although SGD can reach a minimum, it may take longer than other algorithms and may get trapped at a saddle point.
• If faster convergence is needed, or if deeper and more complex neural networks are being trained, an adaptive algorithm is preferable.

## Evolutionary map of Optimizers

The figure below shows an evolutionary map of how these optimizers evolved from simple vanilla stochastic gradient descent (SGD) down to the variants of Adam.
SGD initially branched out into two main types of optimizers: those which act on (i) the gradient component, through momentum, and (ii) the learning rate component, through AdaGrad.
Further down the line, we see the birth of Adam, a combination of momentum and RMSProp, which is itself a successor of AdaGrad. You don’t have to agree with me, but this is how I see them.
After reading this article, you should now understand the importance of optimizers and their different types. In upcoming articles, I will give a detailed explanation of each optimizer. For more blogs and courses on data science, machine learning, artificial intelligence, and new technologies, do visit us at InsideAIML.