Adagrad and Adadelta Optimizer: In-Depth Explanation

Neha Kumawat

12 days ago

Global and Local Minima | insideaiml
Global and Local Minima | insideaiml
In my previous article "Optimizers in Machine Learning and Deep Learning". I gave a brief introduction about Adagrad optimizers. In this article, I will try to give an in-depth explanation of the optimizer’s algorithm.
If you didn’t read my previous article. I recommend you to first go through my previous articles on optimizers mentioned below and then come back to this article for  better understanding:
Let’s start…

Adagrad

Adagrad stands for an adaptive gradient. Adagrad is an effective algorithm for gradient-based optimization. It adapts the learning rate to the parameters, using low learning rates for parameters associated with frequently occurring features, and using high learning rates for parameters associated with rare features.
Therefore, it is well suited when dealing with sparse data..
But the same update rate may not fit all parameters. For example, some parameters may have reached a stage where only fine adjustment is required, but some parameters need to be adjusted significantly due to the small number of matching samples.
Adagrad raised this issue, an algorithm that offers different learning rates at different parameters between them. What this means is that for each parameter, as its total distance updated increases, its learning rate is also slow.
GloVe word embedding uses Adagrad where rare words need major updates and common words need a little update.
Adagrad removes the need to manually tune the learning rate.
There are mainly three problems that arise with the Adagrad algorithm.
  • The learning rate is monotonically decreasing.
  • The learning rate in the late training period is very small.
  • it needs to be set by doing the initial global learning rate.
Let’s see how it works
In this algorithm, we try to change the learning rate (alpha) for each update. The learning rate changes during each update as it will decrease if the weight is significantly updated in the short term again and it will increase if the weight is not significantly updated.
First, each weight has its own cache value, which collects the squares of the gradients up to the current point.
Cache updating for Adagrad | insideaiml
Cache updating for Adagrad | insideaiml
The cache value will continue to increase as training continues. Now a new update formula can be provided as mentioned below:
Adagrad Update Formula | insideaiml
Adagrad Update Formula | insideaiml
The above formula is the same as the original gradient descent formula except that here the learning rate (alpha) constantly changes throughout the training process. The E in the denominator which is shown in the above formula is a very small value which helps us to ensure that the division by zero does not occur.
Essentially what’s happening here is that if a weight has been having very huge updates, its cache value is also going to increase. As a result, the learning rate will be lower and the size of the weight update will decrease over time.
On the other hand, if a weight has not been having any significant update, its cache value is going to be very less, and hence its learning rate will increase, forcing it to take bigger updates. This is the basic principle of the Adagrad optimizer.
However, the disadvantage of this algorithm is that even if there are previous weight gradients, the cache will always increase by a certain amount because the square cannot be negative. Therefore the learning rate of all weights will eventually drop to a very low level until the training is less intense.
Adagrad can be visualize as:
Adagrad | insideaiml
Adagrad | insideaiml
Python code for Adagrad
def adagrad():

    weights, bais, eta = init_w, init_b, 0.1
    v_w, v_b, eps = 0, 0, 1e-8

    for i in range(max_epochs):

        dw, db = 0, 0
        for x,y in zip(X,Y):

            dw += grad_w(weights, bais, x, y)
            db += grad_b(weights, bais, x, y)

        v_w = v_w + dw**2
        v_b = v_b + db**2

        weights_2 = weights - (eta / np.sqrt(v_w + eps)) * dw
        bais_2 = bais - (eta / np.sqrt(v_b + eps)) * db
To overcome the problem of Adagrad there is many other optimizers algorithm available. One of them is Adadelta.

Adadelta

In Adadelta, we do not need to set a default reading rate as we take the effective rate of past steps to the current gradient.
There are three major problems that arise with the Adagrad algorithm.
  • The learning rate is monotonically decreasing.
  • The learning rate during the late training period is very low.
  • it needs to be set by doing the initial global learning rate.
Adadelta is an extension of Adagrad and also tries to reduce Adagrad's rate of learning, excessively.
It does this by limiting the gradient window that has been exceeded to a certain size w.  Running average at time t then depends on the previous average and the current gradient.  
In Adadelta, we don't have to set the default learning rate as we take the ratio of the running average of the previous time steps to the current gradient.
Let’s see and understand how its work
In the Adadelta optimizer algorithm, it will try not to accumulate all past squared gradients values. It instead tries to restrict the window of accumulated past gradients to some fixed size (say w).
Here, it Instead of inefficiently storing w previous squared gradients value, the sum of gradients is recursively defined as a decaying average of all past squared gradients.
The running average E[g2]t at time step t then depends (as a fraction γ similarly to the Momentum term) only on the previous average and the current gradient value:
current gradient value | insideaiml
current gradient value | insideaiml
Next, we set γ to a similar value as the momentum term, say around 0.8. To be more specific, lets now rewrite our vanilla SGD update as shown in the image below according  to the parameter update vector Δθt:
update vector | insideaiml
update vector | insideaiml
The parameter update vector of Adagrad that we derived previously can also be written as shown below:
vector of Adagrad | insideaiml
vector of Adagrad | insideaiml
Now we can simply replace the diagonal matrix Gt with the decaying average over past squared gradients E[g2]t as shown below:
past squared gradients | insideaiml
past squared gradients | insideaiml
Now as the denominator is just the root mean squared (RMS) error of the gradient, we can replace it with the criterion short-hand as shown below:
root mean squared (RMS) error of the gradient | insideaiml
root mean squared (RMS) error of the gradient | insideaiml
Note: The units in this update (as well as in SGD, Momentum, or Adagrad)  are incompatible, meaning that the update must have the same assumptions as of the parameter. To realize this, we have to first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates. It is shown below:
squared gradients | insideaiml
squared gradients | insideaiml
The root mean squared error of parameter updates can be given by as follows:
root mean squared error | insideaiml
root mean squared error | insideaiml
Since RMS[Δθ]t is unknown, we approximate it with the RMS of parameter updates until the previous time step. Replacing the learning rate η in the previous update rule with RMS[Δθ]t−1. Finally, the Adadelta update rule can be given as shown below:
RMS of parameter updates | insideaiml
RMS of parameter updates | insideaiml
Note: With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
Python code to write your own function for Adadelta
# Adadalta
def adadelta(params, sqrs, deltas, rho, batch_size):
eps_stable = 1e-5
for param, sqr, delta in zip(params, sqrs, deltas): 
g = param.grad / batch_size
sqr[:] = rho * sqr + (1. - rho) * nd.square(g)    
cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta

# update weight    
param[:] -= cur_delta
Adadelta can be visualize as:
Adadelta | insideaiml
Adadelta | insideaiml
I hope after reading this article, finally, you came to know what is Adagrad and Adadelta optimizers algorithms and their importance. In the next articles, I will come with a detailed explanation of some other types of optimizers. For more blogs/courses on data science, machine learning, artificial intelligence and new technologies do visit us at InsideAIML.
Thanks for reading…
Related courses for you:
        
Related blogs for you :

Submit Review