Neha Kumawat

a year ago

## Introduction

In my previous article, "Optimizers in Machine Learning and Deep Learning", I gave a brief introduction to the Adagrad and Adadelta optimizers. In this article, I will try to give an in-depth explanation of how these algorithms work.
If you have not read it yet, I recommend going through my previous articles on optimizers first and then coming back to this one for a better understanding.
Let’s start…

## Adagrad

Adagrad adapts the learning rate to each parameter: parameters tied to frequently occurring features get smaller updates, and parameters tied to infrequent features get larger ones. Therefore, it is well suited to sparse data.
Standard gradient descent uses one learning rate for every parameter, but the same rate may not fit all of them. For example, some parameters may have reached a stage where only fine adjustment is required, while others still need large adjustments because only a few training samples match them.
Adagrad addresses this issue by giving each parameter its own learning rate. The more a parameter has been updated in total, the smaller its learning rate becomes.
For example, GloVe word embeddings are trained with Adagrad, since rare words need large updates and common words need only small ones.
Adagrad also has some drawbacks:
• The learning rate is monotonically decreasing.
• The learning rate in the late training period is very small.
• An initial global learning rate still has to be set manually.
Let’s see how it works.
In this algorithm, the learning rate (alpha) is adapted at every update: it shrinks quickly for a weight that keeps receiving large gradients and only slowly for a weight that receives small ones.
First, each weight has its own cache value, which accumulates the squares of its gradients up to the current point:

$$\text{cache}_t = \text{cache}_{t-1} + (\nabla w_t)^2$$

The cache value keeps growing as training continues. The new update formula is then:

$$w_{t+1} = w_t - \frac{\alpha}{\sqrt{\text{cache}_t + \epsilon}} \, \nabla w_t$$

This formula is the same as the original gradient descent formula, except that the effective learning rate, alpha divided by the square root of the cache, changes throughout the training process. The ε in the denominator is a very small value that ensures division by zero does not occur.
Essentially, if a weight has been receiving very large updates, its cache value grows quickly. As a result, its learning rate becomes lower and the size of its updates decreases over time.
On the other hand, if a weight has not been receiving any significant updates, its cache value stays small, so its learning rate remains comparatively high and it keeps taking bigger steps. This is the basic principle of the Adagrad optimizer.
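To make this concrete, here is a minimal NumPy sketch of the cache and the resulting per-weight learning rate. The gradient history and the values of `eta` and `eps` are made up for illustration; the first weight sees large gradients at every step and the second sees almost none.

```python
import numpy as np

eta, eps = 0.1, 1e-8  # base learning rate and stabilizer (assumed values)

# Hypothetical gradient history, one row per update step.
# Column 0: frequently updated weight, column 1: rarely updated weight.
grad_history = np.array([
    [4.0, 0.0],
    [3.0, 0.1],
    [5.0, 0.0],
    [4.0, 0.0],
])

cache = np.zeros(2)
for g in grad_history:
    cache += g ** 2  # each weight accumulates its own squared gradients

effective_lr = eta / np.sqrt(cache + eps)
print("cache:", cache)
print("effective learning rate:", effective_lr)
```

The weight with the large gradients should end up with a much smaller effective learning rate than the weight that was barely updated.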
However, the disadvantage of this algorithm is that the cache can only grow, no matter how small the gradients are, because a squared gradient is never negative. Therefore the learning rate of every weight eventually drops so low that training practically stops.
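This decay is easy to see with a small sketch. Assuming a hypothetical weight whose gradient is always 1.0 (and assumed values for `eta` and `eps`), the effective learning rate falls roughly as `eta / sqrt(t)`:

```python
import numpy as np

eta, eps = 0.1, 1e-8  # assumed values
grad = 1.0            # hypothetical constant gradient
cache = 0.0
effective_lrs = []

for step in range(1, 1001):
    cache += grad ** 2  # the cache never shrinks
    effective_lrs.append(eta / np.sqrt(cache + eps))

for step in (1, 10, 100, 1000):
    print(step, effective_lrs[step - 1])
```

Even though the gradient never changes, each step is smaller than the one before it.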
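```python
import numpy as np

def adagrad(X, Y, grad_w, grad_b, init_w, init_b, max_epochs, eta=0.1, eps=1e-8):
    """Full-batch Adagrad for a model with a single weight and a single bias."""
    weights, bias = init_w, init_b
    v_w, v_b = 0.0, 0.0  # caches: running sums of squared gradients

    for _ in range(max_epochs):
        # Accumulate the gradients over the whole dataset
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):
            dw += grad_w(weights, bias, x, y)
            db += grad_b(weights, bias, x, y)

        # Grow each cache by the squared gradient
        v_w += dw ** 2
        v_b += db ** 2

        # Scale each step by the inverse square root of its cache
        weights -= (eta / np.sqrt(v_w + eps)) * dw
        bias -= (eta / np.sqrt(v_b + eps)) * db

    return weights, bias
```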
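As a usage illustration, the same update can be sketched in vectorized form and tried on a toy problem. Everything here is an assumption made for the example: the helper names `adagrad_step` and `loss_and_grad`, the data (points on the line y = 2x + 1), and the squared-error loss.

```python
import numpy as np

def adagrad_step(params, grads, cache, eta=0.1, eps=1e-8):
    """One Adagrad update for a vector of parameters (modifies the arrays in place)."""
    cache += grads ** 2
    params -= eta / np.sqrt(cache + eps) * grads
    return params, cache

# Toy data: points on the line y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0

def loss_and_grad(params):
    """Half mean squared error of the linear model w * x + b, and its gradient."""
    w, b = params
    err = w * X + b - Y
    loss = 0.5 * np.mean(err ** 2)
    grad = np.array([np.mean(err * X), np.mean(err)])
    return loss, grad

params = np.zeros(2)  # [w, b]
cache = np.zeros(2)

initial_loss, _ = loss_and_grad(params)
for _ in range(500):
    _, grads = loss_and_grad(params)
    params, cache = adagrad_step(params, grads, cache)
final_loss, _ = loss_and_grad(params)

print("loss before:", initial_loss, "loss after:", final_loss)
print("w, b:", params)
```

The loss should fall over the run, with the progress per step slowing down as the caches grow.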

## Adadelta

Adadelta is an extension of Adagrad that tries to fix the following issues:
• The learning rate is monotonically decreasing.
• The learning rate during the late training period is very low.
• An initial global learning rate has to be set manually.
It does this by restricting the window of accumulated past gradients to a fixed size w. The running average at time t then depends only on the previous average and the current gradient.
In Adadelta, we don't have to set a default learning rate, as we take the ratio of the running average of the previous time steps to the current gradient.
Let’s see how it works.
The Adadelta optimizer does not accumulate all past squared gradients. Instead, it restricts the window of accumulated past gradients to some fixed size (say w).
Rather than inefficiently storing the w previous squared gradients, it defines the sum of gradients recursively as a decaying average of all past squared gradients.
The running average $E[g^2]_t$ at time step t then depends (as a fraction γ, similar to the momentum term) only on the previous average and the current gradient:

$$E[g^2]_t = \gamma \, E[g^2]_{t-1} + (1 - \gamma) \, g_t^2$$
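The difference between the Adagrad cache and this decaying average can be sketched in a few lines. The gradient sequence and the decay factor `gamma` below are assumed values, chosen only to show how the running average forgets old gradients while the plain sum does not:

```python
import numpy as np

gamma = 0.9  # decay factor (assumed value)

# Hypothetical gradient sequence: ten large gradients, then a hundred small ones
grads = np.array([10.0] * 10 + [1.0] * 100)

adagrad_cache = 0.0  # Adagrad: plain sum of all squared gradients
running_avg = 0.0    # Adadelta: decaying average of squared gradients

for g in grads:
    adagrad_cache += g ** 2
    running_avg = gamma * running_avg + (1.0 - gamma) * g ** 2

print("Adagrad cache:", adagrad_cache)
print("decaying average:", running_avg)
```

After the small gradients have dominated for a while, the decaying average settles close to their squared value, whereas the Adagrad cache still carries the full weight of the early large gradients.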
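Next, we set γ to a similar value as the momentum term, say around 0.8. To be more specific, let's now rewrite the vanilla SGD update in terms of the parameter update vector $\Delta\theta_t$, where $g_t$ is the gradient and η is the learning rate (the alpha used above):

$$\Delta\theta_t = -\eta \, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t$$

The parameter update vector of Adagrad can be written in the same way, with $G_t$ denoting the diagonal matrix of accumulated squared gradients (the cache):

$$\Delta\theta_t = -\frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$

Now we can simply replace the diagonal matrix $G_t$ with the decaying average over past squared gradients $E[g^2]_t$:

$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$

As the denominator is just the root mean squared (RMS) error criterion of the gradient, we can replace it with the shorthand $RMS[g]_t$:

$$\Delta\theta_t = -\frac{\eta}{RMS[g]_t} \, g_t$$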
Note: The units in this update (as well as in SGD, Momentum, or Adagrad) do not match: the update should have the same hypothetical units as the parameter. To achieve this, we first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates:

$$E[\Delta\theta^2]_t = \gamma \, E[\Delta\theta^2]_{t-1} + (1 - \gamma) \, \Delta\theta_t^2$$

The root mean squared error of the parameter updates is then:

$$RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}$$

Since $RMS[\Delta\theta]_t$ is unknown, we approximate it with the RMS of parameter updates up to the previous time step. Replacing the learning rate η in the previous update rule with $RMS[\Delta\theta]_{t-1}$ finally gives the Adadelta update rule:

$$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} \, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t$$

Note: With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
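```python
# Adadelta
from mxnet import nd

def adadelta(params, sqrs, deltas, rho):
    """One Adadelta step over MXNet NDArrays whose gradients are already computed."""
    eps_stable = 1e-5
    for param, sqr, delta in zip(params, sqrs, deltas):
        # Assumes param.attach_grad() was called and backward() has run
        g = param.grad
        # Decaying average of squared gradients
        sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
        # Step: RMS of past updates over RMS of gradients, times the gradient
        cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
        # Decaying average of squared updates
        delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta

        # update weight
        param[:] -= cur_delta
```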
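The same update can also be tried end to end without MXNet. The following is a minimal NumPy sketch, not a reference implementation: the helper name `adadelta_step`, the toy objective f(θ) = Σθ², the starting point, and the values of `rho` and `eps` are all assumptions made for illustration.

```python
import numpy as np

def adadelta_step(param, grad, sqr, delta, rho=0.95, eps=1e-6):
    """Apply one Adadelta update in place and return the step that was taken."""
    sqr[:] = rho * sqr + (1.0 - rho) * grad ** 2              # average of squared gradients
    step = np.sqrt(delta + eps) / np.sqrt(sqr + eps) * grad   # RMS of updates / RMS of gradients
    delta[:] = rho * delta + (1.0 - rho) * step ** 2          # average of squared updates
    param -= step
    return step

# Toy objective (assumed): f(theta) = sum(theta^2), whose gradient is 2 * theta
theta = np.array([5.0, -3.0])
sqr = np.zeros_like(theta)
delta = np.zeros_like(theta)

initial_loss = float(np.sum(theta ** 2))
for _ in range(200):
    grad = 2.0 * theta
    adadelta_step(theta, grad, sqr, delta)
final_loss = float(np.sum(theta ** 2))

print("loss before:", initial_loss, "loss after:", final_loss)
```

Note that no learning rate appears anywhere in the sketch. Because both running averages start at zero, the first steps are very small, so in a short run like this the loss should fall only gradually.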