Sanober Ibrahim

9 months ago

Gradient

Most of us are already familiar with the process of training a deep learning neural network, but here is a brief reminder.

We train deep learning neural networks using the gradient descent optimization algorithm, which tells us in which direction, and by how much, to adjust the weights to reach the minimum error so that the deep learning model performs at its best.

In this optimization algorithm, the error for the current state of the model is estimated repeatedly. This requires choosing an **error function** (**loss function**) that calculates the loss of the model, so that the weights can be updated and the loss reduced at the next evaluation.
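To make the loop described above concrete, here is a minimal sketch of gradient descent on a one-weight model `y = w * x`. The toy data, the learning rate, and the helper names `mse` and `mse_gradient` are my own assumptions for illustration; they do not come from any particular library.

```python
# Toy data generated from y = 2x (hypothetical values, for illustration only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse(w):
    """Mean squared error of the model y_hat = w * x on the toy data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # initial weight
learning_rate = 0.05  # step size (an arbitrary choice for this sketch)
history = []
for step in range(50):
    history.append(mse(w))                # estimate the error for the current state
    w -= learning_rate * mse_gradient(w)  # update the weight against the gradient

print(f"learned w = {w:.4f}, final loss = {mse(w):.6f}")
```

The loss recorded in `history` shrinks as `w` approaches 2, the slope used to generate the toy data.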

You should now have a basic idea of how a deep neural network gets trained. Let’s move further and try to understand it in a better way.

In the simplest terms, a loss function is a method of evaluating how well your algorithm models your dataset.

In terms of optimization techniques, the function used to evaluate a candidate solution is referred to as the **objective function.** We may want to maximize or minimize the objective function, that is, to find the solution with the highest or lowest score respectively.

Typically, for a deep learning neural network, we want to minimize the error value, so the objective function here is known as a **cost function** or a **loss function**, and the value of this objective function is simply referred to as the **“loss”.**

One point worth clarifying here: although the terms cost function and loss function are often used interchangeably, they are actually slightly different.

When it is computed for a single training example, it is known as the loss function. It is also sometimes called an **error function**. A cost function, on the other hand, is the **average loss** over the entire training dataset.
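The distinction can be sketched in a few lines. I use the squared error as the per-example loss here only as an assumption for the example; the function names and the sample values are hypothetical.

```python
def squared_error(y_true, y_pred):
    """Loss: the error for a single training example."""
    return (y_true - y_pred) ** 2

def cost(y_true_all, y_pred_all):
    """Cost: the average of the per-example losses over the whole dataset."""
    losses = [squared_error(t, p) for t, p in zip(y_true_all, y_pred_all)]
    return sum(losses) / len(losses)

y_true = [3.0, 5.0, 7.0]  # hypothetical targets
y_pred = [2.0, 5.0, 9.0]  # hypothetical predictions

print("loss of first example:", squared_error(y_true[0], y_pred[0]))
print("cost over the dataset:", cost(y_true, y_pred))
```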

Now that we are familiar with what a loss function and a loss are, we need to know which functions to use, when, and why.

There are mainly three types of loss functions, and each type is further divided as shown below.

Regression loss functions

- Mean Squared Error Loss
- Mean Squared Logarithmic Error Loss
- Mean Absolute Error Loss
- L1 Loss
- L2 Loss
- Huber Loss
- Pseudo Huber Loss

Binary classification loss functions

- Binary Cross-Entropy
- Hinge Loss
- Squared Hinge Loss

Multi-class classification loss functions

- Multi-class Cross Entropy Loss
- Sparse Multiclass Cross-Entropy Loss
- Kullback Leibler Divergence Loss

By now, you are probably quite familiar with linear regression problems. A linear regression problem deals with modeling a linear relationship between a dependent variable, Y, and several independent variables, the X’s. We essentially fit a line in space to these variables to get the best model with minimum error. More generally, a regression problem involves predicting a real-valued quantity.

In this article I will take you through some of these loss functions, and later I will try to write a separate article on each of them.

*L1* and *L2* are two common loss functions in machine learning/deep learning, and both are used to minimize the error.

The L1 loss function is also known as Least Absolute Deviations, or LAD for short. The L2 loss function is also known as Least Squares Error, or LS for short.

Let's get a brief idea of these two loss functions.

The L1 loss function is used to minimize the error, which is the sum of all the absolute differences between the true values and the predicted values.

L1 loss function

L1 loss is also known as the **Absolute Error**, and the corresponding cost is the Mean of these Absolute Errors (MAE).

The L2 loss function is likewise used to minimize the error, which in this case is the sum of all the squared differences between the true values and the predicted values.

L2 loss function

The corresponding cost function is the Mean of these Squared
Errors (MSE).
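As a small sketch of the two definitions above, the L1 and L2 losses and their mean versions can be written in plain Python. The function names and the sample values are my own, chosen only for illustration.

```python
def l1_loss(y_true, y_pred):
    """Sum of absolute differences (Least Absolute Deviations)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

def l2_loss(y_true, y_pred):
    """Sum of squared differences (Least Squares Error)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean Absolute Error: the L1 loss averaged over the examples."""
    return l1_loss(y_true, y_pred) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: the L2 loss averaged over the examples."""
    return l2_loss(y_true, y_pred) / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]  # hypothetical true values
y_pred = [1.5, 2.0, 2.0, 6.0]  # hypothetical predictions

print("L1:", l1_loss(y_true, y_pred), "MAE:", mae(y_true, y_pred))
print("L2:", l2_loss(y_true, y_pred), "MSE:", mse(y_true, y_pred))
```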

Because the differences are squared, L2 loss is much more sensitive to outliers than L1 loss. For example, suppose the true value is 1 and we make 10 predictions: one prediction is 1000 and the others are about 1. The loss value is then obviously dominated by the single prediction of 1000.
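The outlier example can be checked numerically. In this sketch I assume the nine "about 1" predictions are all 1.1; that value is my own choice, not something given in the example.

```python
y_true = [1.0] * 10
y_pred = [1.1] * 9 + [1000.0]  # nine predictions near 1, one outlier at 1000

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
sq_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

mae = sum(abs_errors) / len(abs_errors)
mse = sum(sq_errors) / len(sq_errors)

# Fraction of the total loss contributed by the single outlier
outlier_share_l1 = abs_errors[-1] / sum(abs_errors)
outlier_share_l2 = sq_errors[-1] / sum(sq_errors)

print(f"MAE = {mae:.2f}, MSE = {mse:.2f}")
print(f"outlier share of L1 = {outlier_share_l1:.6f}")
print(f"outlier share of L2 = {outlier_share_l2:.8f}")
```

The outlier dominates both sums, but squaring inflates the L2-based cost by several orders of magnitude compared with the L1-based cost.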

```
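# Import libraries (this version assumes TensorFlow 2.x, which runs eagerly)
import tensorflow as tf
import matplotlib.pyplot as plt

# Predicted values swept across [-1, 1]; the true value is fixed at 0
x_pre = tf.linspace(-1.0, 1.0, 100)
x_actual = tf.constant(0.0, dtype=tf.float32)

# Element-wise L1 (absolute) and L2 (squared) losses
l1_loss = tf.abs(x_pre - x_actual)
l2_loss = tf.square(x_pre - x_actual)

# No tf.Session is needed in TensorFlow 2.x; convert the tensors to NumPy arrays for plotting
plt.plot(x_pre.numpy(), l1_loss.numpy(), label='l1_loss')
plt.plot(x_pre.numpy(), l2_loss.numpy(), label='l2_loss')
plt.legend()
plt.show()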
```

Loss curve

Huber loss is often used in regression problems. Compared with L2 loss, Huber loss is less sensitive to outliers because it is a piecewise function: when the residual is too large, the loss is a linear function of the residual rather than a quadratic one.

The Huber loss combines the best properties of MSE and MAE. It is quadratic for smaller errors and is linear otherwise (and similarly for its gradient). It is identified by its delta parameter.

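Huber loss function:

$$
L_\delta(y, f(x)) =
\begin{cases}
\frac{1}{2}\,(y - f(x))^2 & \text{if } |y - f(x)| \le \delta \\
\delta\,|y - f(x)| - \frac{1}{2}\,\delta^2 & \text{otherwise}
\end{cases}
$$

Here, 𝛿 is a set parameter, 𝑦 represents the real value and f(x) represents the predicted value.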

The advantage of this is that when the residual is small, the loss function behaves like the L2 norm, and when the residual is large, it is a linear function like the L1 norm.
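Here is a minimal sketch of the Huber loss for a single example, assuming the standard piecewise definition (quadratic up to 𝛿, linear beyond it). The function name and the default `delta=1.0` are my own choices.

```python
def huber(y_true, y_pred, delta=1.0):
    """Huber loss for one example; delta is the switch point (assumed default 1.0)."""
    residual = abs(y_true - y_pred)
    if residual <= delta:
        return 0.5 * residual ** 2              # quadratic, L2-like region
    return delta * residual - 0.5 * delta ** 2  # linear, L1-like region

for r in [0.5, 1.0, 3.0, 10.0]:
    print(f"residual = {r:5.1f}  huber = {huber(0.0, r):7.3f}  squared = {0.5 * r ** 2:7.3f}")
```

For large residuals the Huber value grows linearly while the squared value grows quadratically, which is why outliers have less influence.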

The Pseudo-Huber loss is a smooth approximation of the Huber loss that ensures derivatives of every order exist.

Pseudo-Huber loss function:

$$
L_\delta(a) = \delta^2 \left( \sqrt{1 + (a/\delta)^2} - 1 \right), \qquad a = y - f(x)
$$

Here δ is the set parameter: the larger its value, the steeper the linear part on both sides. You can observe this in the plot given below.

Loss curves
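The effect of δ can also be checked numerically. This sketch assumes the usual Pseudo-Huber form δ²(√(1 + (a/δ)²) − 1), with a the residual; the function name and sample values are hypothetical.

```python
import math

def pseudo_huber(y_true, y_pred, delta=1.0):
    """Pseudo-Huber loss for one example (smooth everywhere)."""
    a = y_true - y_pred
    return delta ** 2 * (math.sqrt(1.0 + (a / delta) ** 2) - 1.0)

# Near zero the loss behaves like 0.5 * a^2 ...
small = pseudo_huber(0.0, 0.01)

# ... and far from zero its slope approaches delta, so a larger delta is steeper
slope_delta_1 = pseudo_huber(0.0, 1001.0, delta=1.0) - pseudo_huber(0.0, 1000.0, delta=1.0)
slope_delta_2 = pseudo_huber(0.0, 1001.0, delta=2.0) - pseudo_huber(0.0, 1000.0, delta=2.0)

print(f"loss at residual 0.01: {small:.8f}")
print(f"far-field slope with delta=1: {slope_delta_1:.4f}")
print(f"far-field slope with delta=2: {slope_delta_2:.4f}")
```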

Binary classification simply means classifying an object into one of two classes. This classification is based on a rule applied to the input feature vector. For example, deciding whether or not it will rain today is a binary classification problem. Let’s see some of the loss functions associated with it.

Hinge loss is often used for binary classification problems in which the ground truth is t = **1 or -1** and the predicted value is y.

In the SVM classifier, the definition of hinge loss is:

$$
\ell(y) = \max(0,\ 1 - t \cdot y)
$$

In other words, the loss is zero once y has the same sign as t and a magnitude of at least 1, and it grows linearly as y moves away from t in the wrong direction.
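A minimal sketch of the hinge loss, assuming the standard SVM form max(0, 1 − t·y) with labels t in {1, -1}. The function name and the sample predictions are my own.

```python
def hinge(t, y):
    """Hinge loss for a label t in {1, -1} and a raw predicted value y."""
    return max(0.0, 1.0 - t * y)

print(hinge(1, 2.0))   # correct side, beyond the margin: 0.0
print(hinge(1, 0.5))   # correct side, inside the margin: 0.5
print(hinge(-1, 0.5))  # wrong side: 1.5
print(hinge(1, -2.0))  # far on the wrong side: 3.0
```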

Cross-entropy loss

The cross-entropy loss considered here is applied to binary classification problems. The predicted value is a probability, and the loss is defined according to the cross-entropy. Note the value range: the predicted value of y should be a probability, so it must lie in [0,1].

Graph of Cross-entropy loss
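As a sketch, binary cross-entropy for one example can be written as follows. I assume labels t in {0, 1} (not the 1/-1 convention used for hinge loss) and clip the probability with a small `eps`, which is my own safeguard against taking log(0).

```python
import math

def binary_cross_entropy(t, p, eps=1e-12):
    """Cross-entropy between a label t in {0, 1} and a predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(t * math.log(p) + (1 - t) * math.log(1.0 - p))

for p in [0.9, 0.5, 0.1]:
    print(f"label = 1, predicted p = {p}: loss = {binary_cross_entropy(1, p):.4f}")
```

The loss is small when the predicted probability agrees with the label and grows quickly as the prediction becomes confidently wrong.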

The cross-entropy loss above requires the predicted value to be a probability. Generally, we calculate scores = x * w + b. Passing this value through the sigmoid function compresses its range to (0,1).

It can be seen that the sigmoid function smooths the predicted value. For example, the raw inputs 0.1 and 0.01 differ by 0.09, but after passing through the sigmoid they become roughly 0.525 and 0.5025, a much smaller difference. As a result, the sigmoid cross-entropy loss does not grow as steeply when the predicted value is far from the label.

Sigmoid-Cross-entropy loss
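The sketch below applies the sigmoid to a raw score and then computes the cross-entropy. The numerically stable expression in `sigmoid_cross_entropy` is a common rearrangement of the same formula, and labels t in {0, 1} are assumed; the function names are my own.

```python
import math

def sigmoid(z):
    """Squash a raw score z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_cross_entropy(t, z):
    """Cross-entropy on the raw score z with label t in {0, 1}.

    Equivalent to -t*log(sigmoid(z)) - (1-t)*log(1-sigmoid(z)),
    rearranged as max(z, 0) - z*t + log(1 + exp(-|z|)) to avoid overflow.
    """
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))

print(f"sigmoid(0.1)  = {sigmoid(0.1):.4f}")
print(f"sigmoid(0.01) = {sigmoid(0.01):.4f}")
print(f"loss for label 1, score  2.0: {sigmoid_cross_entropy(1, 2.0):.4f}")
print(f"loss for label 1, score -2.0: {sigmoid_cross_entropy(1, -2.0):.4f}")
```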

First, the softmax function converts a vector of scores into the corresponding vector of probabilities. Here is the definition of the softmax function:

$$
\mathrm{softmax}(f)_k = \frac{e^{f_k}}{\sum_j e^{f_j}}
$$

Softmax cross-entropy loss

As shown above, softmax 'squashes' a k-dimensional vector of real values into a k-dimensional vector whose entries lie in the [0,1] range, while ensuring that they sum to 1.
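A short sketch of softmax in plain Python shows both properties. Subtracting the maximum score before exponentiating is a standard stability trick that does not change the result; the function name and the sample scores are hypothetical.

```python
import math

def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)                           # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 4) for p in probs])
print("sum =", sum(probs))
```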

According to the definition of cross-entropy, probabilities are required as input. Sigmoid cross-entropy loss uses the sigmoid function to convert the scores into probabilities, and softmax cross-entropy loss uses the softmax function to convert the score vector into a probability vector.

According to the definition of cross-entropy loss, the loss for example i is:

$$
L_i = -\log\left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right)
$$

Here $f_j$ is the score of the j-th class, taken over all possible categories, and $f_{y_i}$ is the score of the ground-truth class.
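Putting the pieces together, the softmax cross-entropy loss for one example can be sketched directly from the class scores. I assume the loss is the negative log of the softmax probability of the ground-truth class, computed here in log-sum-exp form for stability; the function name and the scores are my own.

```python
import math

def softmax_cross_entropy(scores, true_class):
    """Negative log softmax probability of the ground-truth class.

    scores     : list of class scores f_j
    true_class : index y_i of the ground-truth class
    """
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(f - m) for f in scores))
    return log_sum_exp - scores[true_class]

scores = [2.0, 1.0, 0.1]  # hypothetical class scores
print(f"true class 0: loss = {softmax_cross_entropy(scores, 0):.4f}")
print(f"true class 2: loss = {softmax_cross_entropy(scores, 2):.4f}")
```

The loss is lowest when the ground-truth class has the highest score.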

I hope you enjoyed reading this article and that you now have a better understanding of **Loss Functions in Deep Learning.**

For more such blogs/courses on data science, machine
learning, artificial intelligence and emerging new technologies do visit us at InsideAIML.

Thanks for reading…

Happy Learning…