#### World's Best AI Learning Platform with profoundly Demanding Certification Programs

Designed by IITian's, only for AI Learners.

Download our e-book of Introduction To Python

about installing the softwere Exception Type: JSONDecodeError at /update/ Exception Value: Expecting value: line 1 column 1 (char 0) How to leave/exit/deactivate a Python virtualenvironment What is the difference between covariance and correlation? TypeError: 'int' object is not subscriptable Which are different modes to open a file ? How to handle imbalanced data and achieve good performance? What does it mean to cross-validate a machine learning model? Join Discussion

4 (4,001 Ratings)

218 Learners

Neha Kumawat

6 months ago

- Introduction
- Adagrad
- Adadelta

In
my previous article "Optimizers in Machine Learning and Deep Learning".
I gave a brief introduction about Adagrad optimizers. In this article, I will
try to give an in-depth explanation of the optimizer’s algorithm.

If
you didn’t read my previous article. I recommend you to first go through my
previous articles on optimizers mentioned below and then come back to this
article for better understanding:

Let’s
start…

Therefore, it is well suited when dealing with sparse data..

But the same update rate may not fit all parameters. For example, some parameters may have reached a stage where only fine adjustment is required, but some parameters need to be adjusted significantly due to the small number of matching samples.

Adagrad raised this issue, an algorithm that offers different learning rates at different parameters between them. What this means is that for each parameter, as its total distance updated increases, its learning rate is also slow.

GloVe word embedding uses **Adagrad **where rare words need major updates and common words need a little update.

Adagrad
removes the need to manually tune the learning rate.

There
are mainly three problems that arise with the Adagrad algorithm.

- The learning rate is monotonically decreasing.
- The learning rate in the late training period is very small.
- it needs to be set by doing the initial global learning rate.

Let’s
see how it works

In this algorithm, we try to change the learning rate (alpha) for each
update. The learning rate changes during each update as it will decrease if the weight is significantly updated in the short term again and it will increase if the weight is not significantly updated.

First,
each weight has its own cache value, which collects the squares of
the gradients up to the current point.

The cache value will continue to increase as training continues. Now a new update formula can be provided as mentioned below:

The above formula is the same as the original gradient descent formula except that
here the learning rate (alpha) constantly changes throughout the training
process. The E in the denominator which is shown in the above formula is a very small value which helps us to ensure that the division by zero does not occur.

Essentially
what’s happening here is that if a weight has been having **very huge
updates**, its cache value is also going to **increase**.
As a result, the learning rate will be lower and the size of the weight update will decrease over time.

On the other hand, if a weight has not been having any significant update, its
cache value is going to be very less, and hence its learning rate will increase,
forcing it to take bigger updates. This is the basic principle of the
Adagrad optimizer.

However, the disadvantage of this algorithm is that even if there are previous weight gradients, the cache will always increase by a certain amount because the square cannot be negative. Therefore the learning rate of all weights will eventually drop to a very low level until the training is less intense.

Adagrad
can be visualize as:

```
def adagrad():
weights, bais, eta = init_w, init_b, 0.1
v_w, v_b, eps = 0, 0, 1e-8
for i in range(max_epochs):
dw, db = 0, 0
for x,y in zip(X,Y):
dw += grad_w(weights, bais, x, y)
db += grad_b(weights, bais, x, y)
v_w = v_w + dw**2
v_b = v_b + db**2
weights_2 = weights - (eta / np.sqrt(v_w + eps)) * dw
bais_2 = bais - (eta / np.sqrt(v_b + eps)) * db
```

To overcome the problem of Adagrad there is many other optimizers algorithm available. One of
them is **Adadelta**.

In Adadelta, we do not need to set a default reading rate as we take the effective rate of past steps to the current gradient.

There are three major problems that arise with the Adagrad algorithm.

- The learning rate is monotonically decreasing.
- The learning rate during the late training period is very low.
- it needs to be set by doing the initial global learning rate.

Adadelta is an extension of Adagrad and also tries to reduce Adagrad's rate of learning, excessively.

It does this by limiting the gradient window that has been exceeded to a certain size *w*. Running average at time *t* then depends on the previous average and the current gradient.

In
Adadelta, we** **don't have to set the default learning rate as we take the ratio
of the running average of the previous time steps to the current gradient.

Let’s
see and understand how its work

In the Adadelta optimizer
algorithm, it will try not to accumulate all past squared gradients values. It
instead tries to restrict the window of accumulated past gradients to some
fixed size **(say w).**

Here, it Instead of
inefficiently storing w previous squared
gradients value, the sum of gradients is recursively defined as a decaying
average of all past squared gradients.

The running
average **E[g2]t** at time step** t** then depends (as a fraction** γ** similarly to the Momentum term) only on the
previous average and the current gradient value:

Next, we set **γ **to a similar value as the momentum term, say around
0.8. To be more specific, lets now rewrite our vanilla SGD update as shown in the
image below according to the parameter update vector Δθt:

The parameter update vector of
Adagrad that we derived previously can also be written as shown below:

Now we can simply
replace the diagonal matrix** Gt** with the decaying average over past squared
gradients **E[g2]t **as shown below:

Now as the denominator
is just the root mean squared (RMS) error of the gradient, we can replace it
with the criterion short-hand as shown below:

Note: The units in this update
(as well as in SGD, Momentum, or Adagrad) are incompatible, meaning that the update must have the same assumptions as of the parameter. To realize this, we have to first
define another exponentially decaying average, this time not of squared
gradients but of squared parameter updates. It is shown below:

The **root mean squared error**
of parameter updates can be given by as follows:

Since **RMS[Δθ]t **is unknown, we approximate it with the RMS of
parameter updates until the previous time step. Replacing the learning rate** η **in the previous update rule with **RMS[Δθ]t−1**. Finally, the Adadelta update rule can be
given as shown below:

Note: With Adadelta, we do not even need to
set a default learning rate, as it has been eliminated from the update rule.

```
# Adadalta
def adadelta(params, sqrs, deltas, rho, batch_size):
eps_stable = 1e-5
for param, sqr, delta in zip(params, sqrs, deltas):
g = param.grad / batch_size
sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
# update weight
param[:] -= cur_delta
```

I hope after reading this article, finally, you came to know **what is Adagrad and Adadelta optimizers algorithms and their importance**.
In the next articles, I will come with a detailed explanation of some other types of optimizers.

For more blogs/courses on data science, machine learning,
artificial intelligence and new technologies do visit us at **InsideAIML.**

Thanks for reading…