##### Loss Functions in Deep Learning

Sanober Ibrahim

9 months ago

Most of us are probably already familiar with the process of training a deep learning neural network, but here is a brief reminder.
We train deep learning neural networks with the gradient descent optimization algorithm, which tells us how far and in which direction to move the weights to reach the minimum error, so that the model performs at its best.
In this optimization algorithm, the error for the current state of the model is estimated repeatedly. This requires choosing an error function (loss function) to calculate the loss of the model, so that the weights can be updated and the loss reduced at the next evaluation.
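To make the loop described above concrete, here is a minimal sketch of gradient descent on a one-variable linear model with a mean squared error loss. The toy data, the learning rate, and the number of steps are assumptions chosen only for illustration; a real deep learning framework computes the gradients automatically.

```python
import numpy as np

# Hypothetical toy data generated from y = 2x + 1 (no noise)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0        # initial weights
learning_rate = 0.1    # assumed step size
losses = []

for step in range(500):
    y_pred = w * x + b              # prediction for the current state of the model
    error = y_pred - y
    loss = np.mean(error ** 2)      # loss of the current state (mean squared error)
    losses.append(loss)

    # Gradients of the loss with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)

    # Step against the gradient so the next evaluation has a lower loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.3f}, b={b:.3f}, first loss={losses[0]:.4f}, last loss={losses[-1]:.2e}")
```

The recorded `losses` shrink from one step to the next, which is exactly the "estimate the error, update the weights, evaluate again" cycle.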
You should now have a basic idea of how a deep neural network is trained. Let’s move on and look at it in more detail.
## What Is a Loss Function?
Put simply, a loss function is a method of evaluating how well your algorithm models your dataset.
In optimization, the function used to evaluate a candidate solution is called the objective function. We may want to maximize or minimize the objective function to get the highest or lowest score, respectively.
For a deep learning neural network we typically want to minimize the error, so the objective function is known as a cost function or a loss function, and its value is simply called the “loss”.
##### NOTE: Is there any difference between a Loss Function and a Cost Function?
To be clear: although the terms cost function and loss function are often used interchangeably, they are actually slightly different.
The loss function is computed for a single training example; it is also sometimes called an error function. The cost function, on the other hand, is the average loss over the entire training dataset.
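As a small illustration of this distinction, the sketch below computes one squared-error loss per training example and then averages them into a single cost. The numbers are made up for the example.

```python
import numpy as np

# Hypothetical true values and predictions for three training examples
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])

# Loss: one value per training example (squared error here)
per_example_loss = (y_true - y_pred) ** 2

# Cost: the average loss over the whole training set
cost = per_example_loss.mean()

print(per_example_loss, cost)
```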
Now that we know what a loss function and a loss are, we need to know which functions to use, when, and why.

## Types of loss functions

Loss functions fall into three main types, each of which is further divided as shown below:
##### Regression Loss Functions
• Mean Squared Error Loss
• Mean Squared Logarithmic Error Loss
• Mean Absolute Error Loss
• L1 Loss
• L2 Loss
• Huber Loss
• Pseudo Huber Loss
##### Binary Classification Loss Functions
• Binary Cross-Entropy
• Hinge Loss
• Squared Hinge Loss
##### Multi-class Classification Loss Functions
• Multi-class Cross Entropy Loss
• Sparse Multiclass Cross-Entropy Loss
• Kullback Leibler Divergence Loss

## Regression Loss Functions

By now, you are probably familiar with linear regression problems. Linear regression models a linear relationship between a dependent variable, Y, and several independent variables, the X’s. We essentially fit a line in the space of these variables to get the best model with minimum error. More generally, a regression problem involves predicting a real-valued quantity.
In this article I will take you through some of these loss functions; later I will try to write a separate article on each of them.

## Squared Error Loss

### L1 and L2 loss

L1 and L2 are two common loss functions in machine learning and deep learning; training minimizes them to reduce the error.
The L1 loss function is also known as Least Absolute Deviations (LAD). The L2 loss function is also known as Least Squares Error (LS).
Let's take a brief look at these two loss functions.

### L1 Loss function

It minimizes the error defined as the sum of all the absolute differences between the true value and the predicted value:
$$L_1 = \sum_{i=1}^{n} \left| y_i - f(x_i) \right|$$
where $y_i$ is the true value and $f(x_i)$ is the predicted value.
L1 loss is also known as the Absolute Error, and the corresponding cost is the Mean of these Absolute Errors (MAE).

### L2 Loss Function

It minimizes the error defined as the sum of all the squared differences between the true value and the predicted value:
$$L_2 = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$$
where $y_i$ is the true value and $f(x_i)$ is the predicted value.
The corresponding cost function is the Mean of these Squared Errors (MSE).
Note: The disadvantage of the L2 loss is that when there are outliers, those points account for most of the loss.
For example, suppose the true value is 1 and the model makes 10 predictions: one prediction is 1000 and the other nine are about 1. The loss value is then dominated by the single prediction of 1000.
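The outlier example can be checked numerically. The sketch below uses made-up predictions (nine values near 1 and one value of 1000) and compares how much the outlier contributes under each loss.

```python
import numpy as np

# True value is 1 for all ten predictions; the last prediction is an outlier
y_true = np.ones(10)
y_pred = np.array([1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0, 1000.0])

abs_err = np.abs(y_true - y_pred)   # per-prediction L1 terms
sq_err = (y_true - y_pred) ** 2     # per-prediction L2 terms

mae = abs_err.mean()                # cost for L1 (mean absolute error)
mse = sq_err.mean()                 # cost for L2 (mean squared error)

print("outlier term under L1:", abs_err[-1])
print("outlier term under L2:", sq_err[-1])
print("MAE:", mae, "MSE:", mse)
```

The outlier contributes 999 to the L1 sum but 998001 to the L2 sum, so squaring magnifies its weight by roughly a factor of a thousand.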
Plotting L1 and L2 loss using TensorFlow
```python
# Import libraries (assumes TensorFlow 2.x, where eager execution is the default)
import tensorflow as tf
import matplotlib.pyplot as plt

# Predicted values from -1 to 1; the true value is fixed at 0
x_pre = tf.linspace(-1., 1., 100)
x_actual = tf.constant(0, dtype=tf.float32)

# Element-wise L1 (absolute) and L2 (squared) losses
l1_loss = tf.abs(x_pre - x_actual)
l2_loss = tf.square(x_pre - x_actual)

# Convert the tensors to NumPy arrays and plot both curves
plt.plot(x_pre.numpy(), l1_loss.numpy(), label='l1_loss')
plt.plot(x_pre.numpy(), l2_loss.numpy(), label='l2_loss')
plt.legend()
plt.show()
```
Output: The code above produces the following plot:
Loss curve

## Huber Loss

Huber loss is often used in regression problems. Compared with L2 loss, it is less sensitive to outliers, because it is a piecewise function: when the residual is large, the loss is a linear rather than a quadratic function of the residual.
The Huber loss combines the best properties of MSE and MAE. It is quadratic for small errors and linear otherwise (so its gradient is linear for small errors and constant otherwise). It is characterized by its delta parameter:
$$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}\left(y - f(x)\right)^2 & \text{if } \left|y - f(x)\right| \le \delta \\ \delta \left|y - f(x)\right| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
Here, $\delta$ is a chosen parameter, $y$ is the real value, and $f(x)$ is the predicted value.
The advantage is that when the residual is small the loss behaves like the L2 loss, and when the residual is large it behaves like the L1 loss and grows linearly.
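A minimal NumPy sketch of the piecewise definition follows. The function name `huber_loss` and the default `delta=1.0` are illustrative choices, not part of any particular library.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Element-wise Huber loss: quadratic for small residuals, linear for large ones."""
    residual = np.abs(y_true - y_pred)
    quadratic = 0.5 * residual ** 2                  # used when |residual| <= delta
    linear = delta * residual - 0.5 * delta ** 2     # used when |residual| > delta
    return np.where(residual <= delta, quadratic, linear)

# A small residual (0.5) and a large residual (3.0) with delta = 1
print(huber_loss(np.array([0.0, 0.0]), np.array([0.5, 3.0]), delta=1.0))
```

With `delta=1`, the residual of 0.5 falls in the quadratic branch and the residual of 3.0 falls in the linear branch.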

### Pseudo-Huber loss function

A smooth approximation of the Huber loss that is differentiable to every order:
$$L_\delta(a) = \delta^2 \left( \sqrt{1 + \left(\frac{a}{\delta}\right)^2} - 1 \right)$$
where $a = y - f(x)$ is the residual and $\delta$ is the chosen parameter: the larger its value, the steeper the linear part on both sides. You can see this in the plot below.
Loss curves
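The pseudo-Huber loss can be sketched in a couple of lines of NumPy. The function name and the sample values of delta are assumptions for illustration; the comparison at the end shows that a larger delta gives a steeper linear part.

```python
import numpy as np

def pseudo_huber_loss(y_true, y_pred, delta=1.0):
    """Smooth approximation of Huber loss; the residual is a = y_true - y_pred."""
    a = y_true - y_pred
    return delta ** 2 * (np.sqrt(1.0 + (a / delta) ** 2) - 1.0)

residuals = np.array([0.0, 0.01, 10.0])
zeros = np.zeros_like(residuals)

loss_small_delta = pseudo_huber_loss(residuals, zeros, delta=1.0)
loss_large_delta = pseudo_huber_loss(residuals, zeros, delta=5.0)

print(loss_small_delta)
print(loss_large_delta)
```

Near zero the loss is approximately `0.5 * a**2`, and for large residuals it grows roughly like `delta * |a|`.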

## Binary Classification Loss Functions

Binary classification simply means classifying an object into one of two classes. The classification is based on a rule applied to the input feature vector. For example, predicting whether or not it will rain today is a binary classification problem. Let’s look at some of the loss functions associated with it.

## Hinge Loss

Hinge loss is often used for binary classification problems where the ground truth is t = 1 or -1 and the predicted value is
y = wx + b
In the SVM classifier, hinge loss is defined as:
$$\ell(y) = \max(0, 1 - t \cdot y)$$
In other words, the loss is zero when y has the same sign as t and |y| ≥ 1; otherwise it grows linearly as y moves away from t.
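A minimal sketch of the standard SVM hinge loss, max(0, 1 - t·y), in NumPy. The function name and the sample scores are illustrative only.

```python
import numpy as np

def hinge_loss(t, y):
    """Hinge loss for labels t in {-1, +1} and raw scores y = wx + b."""
    return np.maximum(0.0, 1.0 - t * y)

t = np.array([1.0, 1.0, -1.0])    # ground-truth labels
y = np.array([2.0, 0.5, 0.5])     # hypothetical raw scores

print(hinge_loss(t, y))
```

The first score is on the correct side with a margin of at least 1, so its loss is zero; the second is correct but inside the margin; the third is on the wrong side and is penalized the most.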

## Cross-entropy loss

$$L(t, y) = -\left[ t \log y + (1 - t) \log (1 - y) \right]$$
Cross-entropy loss, shown above with the true label written as $t \in \{0, 1\}$ and the predicted value as $y$, is mainly applied to binary classification problems. The predicted value is a probability, and the loss is defined according to the cross-entropy. Note the value range: the predicted value y must be a probability in [0,1].
Graph of Cross-entropy loss
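Here is a minimal sketch of binary cross-entropy, assuming labels t in {0, 1} and a predicted probability y. The clipping constant `eps` is an added safeguard against log(0), not part of the definition.

```python
import numpy as np

def binary_cross_entropy(t, y, eps=1e-12):
    """Binary cross-entropy for a label t in {0, 1} and a predicted probability y."""
    y = np.clip(y, eps, 1.0 - eps)   # keep y strictly inside (0, 1) to avoid log(0)
    return -(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# For a positive example (t = 1), a confident correct prediction costs little,
# while a confident wrong prediction costs a lot.
confident_right = binary_cross_entropy(1.0, 0.9)
confident_wrong = binary_cross_entropy(1.0, 0.1)

print(confident_right, confident_wrong)
```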

## Sigmoid-Cross-entropy loss

The cross-entropy loss above requires the predicted value to be a probability. In general, we compute scores = x * w + b; passing this value through the sigmoid function compresses it into the range (0,1).
The sigmoid function smooths the predicted value: feeding in 0.1 and 0.01 directly gives a much larger difference than first passing 0.1 and 0.01 through the sigmoid. As a result, the sigmoid cross-entropy loss grows less steeply when the predicted value is far from the label.
Sigmoid-Cross-entropy loss
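The sketch below computes sigmoid cross-entropy directly from raw scores, assuming labels in {0, 1}. It uses a commonly used algebraic rewrite, max(x, 0) - x·t + log(1 + exp(-|x|)), which avoids overflow for large scores; the function names and sample values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy(scores, t):
    """Cross-entropy of sigmoid(scores) against labels t in {0, 1}, computed stably."""
    scores = np.asarray(scores, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.maximum(scores, 0.0) - scores * t + np.log1p(np.exp(-np.abs(scores)))

scores = np.array([-2.0, 0.0, 3.0])   # hypothetical raw scores x * w + b
labels = np.array([0.0, 1.0, 1.0])

stable = sigmoid_cross_entropy(scores, labels)

# Naive version for comparison: apply the sigmoid first, then the cross-entropy
p = sigmoid(scores)
naive = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

print(stable)
print(naive)
```

For moderate scores the two versions agree; the rewritten form stays finite when the scores are very large in magnitude.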

## Softmax cross-entropy loss

First, the softmax function converts a vector of scores into the corresponding vector of probabilities. Here is the definition of the softmax function:
$$\text{softmax}(f)_c = \frac{e^{f_c}}{\sum_j e^{f_j}}$$
As shown above, softmax 'squashes' a k-dimensional vector of real values into a k-dimensional vector with entries in the range [0,1], while ensuring that the entries sum to 1.
By the definition of cross-entropy, a probability is required as input. Sigmoid cross-entropy loss uses the sigmoid function to convert the score vector into a probability vector, and softmax cross-entropy loss uses the softmax function to do the same.
By the definition of cross-entropy loss, the loss for training example $i$ is:
$$L_i = -\log\left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right)$$
where $f_j$ is the score of class $j$, with the sum running over all possible classes, and $f_{y_i}$ is the score of the ground-truth class.
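A minimal NumPy sketch of softmax and the softmax cross-entropy loss for a single example follows. Subtracting the maximum score before exponentiating is a standard stability trick that does not change the result; the function names and the example scores are assumptions for illustration.

```python
import numpy as np

def softmax(f):
    """Convert a 1-D score vector f into a probability vector."""
    shifted = f - np.max(f)          # shift for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

def softmax_cross_entropy(f, y_i):
    """Loss for one example: -log of the softmax probability of the true class y_i."""
    shifted = f - np.max(f)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[y_i]

scores = np.array([1.0, 2.0, 3.0])   # hypothetical scores for three classes
probs = softmax(scores)

loss_true_is_best = softmax_cross_entropy(scores, 2)    # true class has the highest score
loss_true_is_worst = softmax_cross_entropy(scores, 0)   # true class has the lowest score

print(probs, loss_true_is_best, loss_true_is_worst)
```

The loss is small when the ground-truth class receives the highest score and larger when it receives the lowest.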