Binary Cross-Entropy

Kajal Pawar

10 months ago

Cross-entropy is one of the most widely used loss functions for optimizing classification models.
Classification is a type of supervised learning problem, which involves the prediction of a class label using one or more input variables.
Classification problems whose target variable has just two labels are referred to as binary classification problems, while problems with more than two labels/classes are referred to as categorical or multi-class classification problems.
What is Binary Cross Entropy?
Binary cross-entropy is a loss function that is used in binary classification problems. The main aim of these tasks is to answer a question with only two choices.
For example, we might be asking:
  • Will it rain today? Yes or No
  • Is the email Spam or Not Spam?
Let me take a simple classification example and explain how it actually works.
Let's take 10 random data points, shown below, and call them the feature x:
x = [-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]
Now, let's represent these data points on a number line as shown below.
Feature X
Let's assign different colors to the data points to represent their different classes or labels, as shown below.
Data points with different colors
We can see that our classification task here is quite simple and straightforward.
Task: Given the feature x, we have to classify the data points as red or green.
This is a binary classification task. We can also phrase it as "Is the data point green?" or, better, "What is the probability of the point being green?"
Ideally, green points would have a probability of 1 of being green, and red points would have a probability of 0 of being green.
In the above-mentioned scenario, green points belong to the positive class, i.e., Yes, they are green, while the red points belong to the negative class, i.e., No, they are not green.
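The positive/negative encoding described above can be sketched in a few lines of Python. The color assigned to each point is inferred from the labels used in the article's later Python example (1.0 for green, 0.0 for red); the variable names are illustrative only.

```python
# Feature values from the article
x = [-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]

# Color of each point, matching the labels used in the article's Python example
colors = ["red", "red", "green", "red", "green",
          "green", "green", "green", "green", "green"]

# Positive class (green) -> 1, negative class (red) -> 0
y = [1 if c == "green" else 0 for c in colors]

print(y)  # [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
```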
Now, our task is to build a model to perform this classification: it will predict the probability of being green for each data point. Given what we know about the color of the points, how can we evaluate how good or bad the model's predictions are?
This is where the loss function comes into the picture. It tells us whether our model is performing well or poorly: it returns high values for bad predictions and low values for good predictions.
In our case, for binary classification, the typical loss function is called binary cross-entropy / log loss.
Loss Function: Binary Cross-Entropy / Log Loss
The binary cross-entropy loss function calculates the average cross-entropy across all examples.
The formula of this loss function can be given by:
$$\text{Log loss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log\big(p(y_i)\big) + (1 - y_i)\log\big(1 - p(y_i)\big)\Big]$$
Binary Cross-Entropy / Log loss
  • Here, y represents the label / class (1 for the green points and 0 for the red points)
  • p(y) represents the predicted probability of the data point being green, and N is the number of data points.
Let me explain what the formula above actually tells you.
For each green point (y=1), it adds log(p(y)), the log probability of the point being green, to the sum. Conversely, for each red point (y=0), it adds log(1-p(y)), the log probability of the point being red. The sum is then negated and averaged over all N points.
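The description above translates almost directly into code. The following is a minimal sketch rather than a reference implementation; the function name `binary_cross_entropy` and the three example points are hypothetical and chosen only for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred):
    """Average binary cross-entropy over all N points."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # A green point (y=1) contributes log(p);
        # a red point (y=0) contributes log(1 - p)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    # Negating turns the (negative) log values into a positive loss
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities of being green
y_true = [1, 0, 1]
p_pred = [0.9, 0.2, 0.6]

loss = binary_cross_entropy(y_true, p_pred)
print(round(loss, 4))
```

Note that this sketch assumes every probability lies strictly between 0 and 1; a probability of exactly 0 or 1 would make `math.log` fail on one of the terms.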
Computing Binary Cross-Entropy in a Visual Way
Let's first see how we can compute binary cross-entropy in a visual way, and then I will take you through how to implement it in Python.
Let's consider the same example as above.
Data points with different colors
First, let's split these data points according to their respective classes, i.e., positive and negative, as shown below.
Figure 5: Split data points
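The same split can be sketched in Python. The labels are the ones used in the article's Python example (1 for green, 0 for red); the variable names are illustrative.

```python
# Feature values and labels from the article (1 = green, 0 = red)
x = [-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]
y = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

# Separate the points by class
positive = [xi for xi, yi in zip(x, y) if yi == 1]  # green points
negative = [xi for xi, yi in zip(x, y) if yi == 0]  # red points

print(positive)  # [-0.8, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]
print(negative)  # [-2.2, -1.4, 0.2]
```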
Next, let's build a logistic regression model to classify the given data points. The fitted logistic regression model is a sigmoid curve that represents the probability of a point being green for any given x. It looks like this:
Figure 6: Fitting a logistic regression model
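To make the sigmoid curve concrete, here is a small sketch of how a one-feature logistic regression model turns x into a probability. The weight `w` and bias `b` below are hypothetical placeholders, not the values fitted in the article, and the helper names are illustrative.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_green_probability(x, w, b):
    """Probability of a point being green under a one-feature logistic model."""
    return sigmoid(w * x + b)

# Hypothetical parameters (NOT the fitted values from the article)
w, b = 1.0, 0.0

for x in [-2.0, 0.0, 2.0]:
    print(x, round(predict_green_probability(x, w, b), 3))
```

With a positive weight, larger values of x map to higher probabilities of being green, which matches the shape of the curve in the figure.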
Now, you might be wondering what probabilities our model predicts for the points belonging to the positive class (green) and the negative class (red).
Let me show you what they look like.
For the positive class, these are the green bars under the sigmoid curve, at the x coordinates corresponding to the points, as shown in the figure below.
Figure 6: Bars showing probabilities of the positive class
For the negative class (red), they look as shown below:
Figure 7: Bars showing probabilities of the negative class
Combining the two figures above, we get:
Figure 8: Bars showing probabilities of the positive and negative classes
Now that we have the predicted probabilities, let's calculate the binary cross-entropy / log loss.
We only need the bars themselves, since they give us all the probabilities we need. They are shown below:
Figure 9: Probabilities of the positive and negative classes
Rearranging the bars of the above figure, we get:
Figure 10: Rearranged probabilities of the positive and negative classes
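What the bars represent is the probability the model assigns to each point's true class: p for a green point and 1 - p for a red point. A minimal sketch with hypothetical values (the helper name `true_class_probability` is illustrative):

```python
def true_class_probability(y, p_green):
    """Probability the model assigns to the point's actual class."""
    # Green point: use p(green); red point: use 1 - p(green)
    return p_green if y == 1 else 1.0 - p_green

# Hypothetical labels and predicted probabilities of being green
y = [1, 0, 1, 0]
p_green = [0.9, 0.3, 0.4, 0.8]

p_true = [true_class_probability(yi, pi) for yi, pi in zip(y, p_green)]
print([round(p, 2) for p in p_true])  # [0.9, 0.7, 0.4, 0.2]
```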
As we are trying to compute a loss, we need to penalize bad predictions. But how? Let's see.
If the probability associated with the true class is 1.0, we need its loss to be zero. On the other hand, if that probability is low, say, 0.02, we need its loss to be large.
Taking the (negative) log of the probability suits this purpose well: since the log of a value between 0 and 1 is negative, we take the negative log to obtain a positive value for the loss. For a deeper understanding we need the math behind it, which comes from cross-entropy.
From the plot below, we can see that as the predicted probability of the true class gets closer to zero, the loss increases sharply and without bound:
Figure 10: Log loss for different probabilities
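The shape of this curve can be checked numerically. The sketch below evaluates the negative log at a few hand-picked probabilities; the helper name `point_loss` and the chosen probabilities are illustrative.

```python
import math

def point_loss(p_true):
    """Loss of a single point: negative log of the true-class probability."""
    return -math.log(p_true)

# Losses for a few true-class probabilities, from confident to very wrong
for p in [0.99, 0.9, 0.5, 0.1, 0.02]:
    print(f"p = {p:<5} loss = {point_loss(p):.3f}")

# A perfect prediction (p = 1.0) costs nothing
assert point_loss(1.0) == 0
```

A true-class probability of 0.02 gives a loss of roughly 3.9, while 0.9 gives only about 0.1, so confident mistakes dominate the average.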
So, now let's take the (negative) log of the probabilities; these are the corresponding losses of each point.
Finally, we compute the mean of all these losses, as shown below.
Figure 11: The loss
So, the binary cross-entropy / log loss for this example comes out to be 0.3329.
Implementation of Binary Cross-entropy / log loss using Python
Let's implement it in Python.
# import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
import numpy as np

x = np.array([-2.2, -1.4, -.8, .2, .4, .8, 1.2, 2.2, 2.9, 4.6])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

logr = LogisticRegression(solver='lbfgs')
logr.fit(x.reshape(-1, 1), y)

y_pred = logr.predict_proba(x.reshape(-1, 1))[:, 1].ravel()
loss = log_loss(y, y_pred)

print('x = {}'.format(x))
print('y = {}'.format(y))
print('p(y) = {}'.format(np.round(y_pred, 2)))
print('Log Loss / Cross Entropy = {:.4f}'.format(loss))

Output:
x = [-2.2 -1.4 -0.8  0.2  0.4  0.8  1.2  2.2  2.9  4.6]
y = [0. 0. 1. 0. 1. 1. 1. 1. 1. 1.]
p(y) = [0.19 0.33 0.47 0.7  0.74 0.81 0.86 0.94 0.97 0.99]
Log Loss / Cross Entropy = 0.3329
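As a sanity check, the printed probabilities can be plugged back into the formula by hand. The sketch below uses only the standard library and the values from the output above. Because those probabilities are rounded to two decimals, the result is only expected to agree with 0.3329 approximately.

```python
import math

# Labels and (rounded) predicted probabilities from the output above
y = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
p = [0.19, 0.33, 0.47, 0.70, 0.74, 0.81, 0.86, 0.94, 0.97, 0.99]

# Per-point loss: negative log of the probability of the true class
losses = [-math.log(pi if yi == 1 else 1 - pi) for yi, pi in zip(y, p)]

# Binary cross-entropy is the mean of the per-point losses
mean_loss = sum(losses) / len(losses)
print(round(mean_loss, 2))  # 0.33
```

The largest single loss comes from the red point at x = 0.2, which the model considers 70% likely to be green.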

Let's take an example and write complete Python code that trains a model with the binary cross-entropy / log loss.
# mlp for the circles problem with cross entropy loss
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot

# generate 2d classification dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))
# recent Keras versions use `learning_rate`; the older `lr` argument is deprecated
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)

# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# plot loss during training
pyplot.subplot(211)
pyplot.title('Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()

# plot accuracy during training
pyplot.subplot(212)
pyplot.title('Accuracy')
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

Output:
The above code first prints the classification accuracy of the model on the train and test datasets:
	Train: 0.840, Test: 0.853
Then it plots the loss and the accuracy on the train and test sets over the training epochs, as shown below:
Figure 12: Line Plots of Cross Entropy Loss and Classification Accuracy over Training Epochs on the Two Circles Binary Classification Problem
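The `Train` and `Test` numbers printed by the script are accuracies (from `metrics=['accuracy']`), while the quantity being minimized is the binary cross-entropy. The difference between the two can be sketched in plain Python. The predictions below are hypothetical, and the 0.5 decision threshold is an assumption made for this illustration.

```python
import math

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of points whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, p_pred) if (p >= threshold) == (y == 1))
    return hits / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    """Mean negative log-probability of the true class."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred))
    return -total / len(y_true)

# Hypothetical labels and two sets of predictions
y_true = [1, 0, 1, 0]
confident = [0.95, 0.05, 0.90, 0.10]  # correct and sure
hesitant = [0.55, 0.45, 0.60, 0.40]   # correct but unsure

for name, p in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(y_true, p), round(binary_cross_entropy(y_true, p), 3))
```

Both sets of predictions reach the same accuracy, but the hesitant one has a higher loss. This is why the loss curve can keep improving after the accuracy curve has flattened.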
After reading this article, you should now understand the importance of binary cross-entropy / log loss. For more blogs/courses on data science, machine learning, artificial intelligence and new technologies, do visit us at InsideAIML.
Thanks for reading…
