SoftMax Activation Function

Kajal Pawar

8 months ago

normalized exponential function.
In mathematics, the softmax function is also known as softargmax or the normalized exponential function. It is often described as a generalization of the sigmoid function to more than two classes. A sigmoid function returns a value between 0 and 1, which can be treated as the probability that a data point belongs to a particular class. That is why sigmoid functions are mainly used for binary classification problems.
The SoftMax function, on the other hand, can be used for multiclass classification problems. It gives the probability that a data point belongs to each individual class.
In deep learning, the term logits is popularly used for the output of the last layer of a classification network, which produces raw prediction values as real numbers in the range (-infinity, +infinity). (Source: Wikipedia)

What are logits?

Logits are the raw score values produced by the last layer of the neural network before any activation function is applied to them.

Why SoftMax function?

The SoftMax function turns logits into probabilities by exponentiating each output and then dividing each result by the sum of those exponentials, so that the entire output vector adds up to one.
The equation of the SoftMax function can be given as:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$
Here z_i is the i-th raw output (logit) and K is the number of classes.
The softmax function is similar to the sigmoid function, except that the denominator sums the exponentials of all the raw outputs. In simple words, when we calculate the softmax of a single raw output (e.g. z1), we cannot use the value of z1 alone. With four outputs, we have to include z1, z2, z3, and z4 in the denominator:
$$\mathrm{softmax}(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$
The softmax function ensures that the sum of all our output probability values will always be equal to one.
That means if we are classifying a dog, cat, boat and airplane and applying a softmax function to our outputs, then for the network to increase the probability that a particular example is classified as an “airplane”, it needs to decrease the probabilities that the example is classified as one of the other classes, such as a dog, cat or boat. We will work through an example of this later.
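The trade-off between classes can be seen in a small NumPy sketch. The scores below are made up for this illustration, and the helper name `softmax` is just a convenient label; raising only the "airplane" score pushes its probability up and every other probability down, while the total stays at one.

```python
import numpy as np

def softmax(z):
    """Exponentiate the scores and divide by the sum of the exponentials."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

classes = ["dog", "cat", "boat", "airplane"]

# Hypothetical raw scores, chosen only for illustration.
scores_before = np.array([1.0, 2.0, 0.5, 3.0])
scores_after = scores_before.copy()
scores_after[3] += 2.0  # raise only the "airplane" score

p_before = softmax(scores_before)
p_after = softmax(scores_after)

# Print each class probability before and after the change.
for name, before, after in zip(classes, p_before, p_after):
    print(f"{name:>8}: {before:.4f} -> {after:.4f}")
```

None of the dog, cat or boat scores were touched, yet their probabilities shrink, because all four scores share the same denominator.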

Comparison between sigmoid and softmax outputs:

Comparison between sigmoid and softmax outputs
The graph can be represented as:
softmax activation function
From the above graph, we can see there is not much difference between the sigmoid function graph and softmax function graph.
The softmax function has many applications in multiclass classification and neural networks. SoftMax differs from the ordinary max function: max outputs only the largest value, whereas SoftMax assigns smaller values a smaller probability instead of discarding them. The denominator of the SoftMax function combines all of the original output values, which means that the probabilities it produces are related to each other.
In the case of binary classification, for Sigmoid, the equation will be:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
For Softmax when K = 2, the equation will be:
$$\mathrm{softmax}(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}$$
Dividing the numerator and the denominator by e^{z_1} gives:
$$\mathrm{softmax}(z_1) = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$$
So, it can be seen from the equation that in the case of binary classification, Softmax reduces to the Sigmoid function applied to the difference of the two logits.
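The two-class case can also be checked numerically. The sketch below is a minimal check with made-up logits `z1` and `z2`: the softmax probability of the first class should match the sigmoid of the difference `z1 - z2`.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    """Exponentiate the scores and divide by the sum of the exponentials."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

# Hypothetical logits for a two-class problem.
z1, z2 = 2.3, -0.4

p_softmax = softmax([z1, z2])[0]  # probability of class 1 from softmax
p_sigmoid = sigmoid(z1 - z2)      # sigmoid applied to the logit difference

print(p_softmax, p_sigmoid)
```

Both printed values agree up to floating-point rounding, which is why a single sigmoid output is enough for a binary classifier.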
When we build a network for a multiclass problem, the output layer has as many neurons as there are classes in the target, as shown below:
multi class classification
For example, if we have three different classes, there will be three neurons in the output layer.
Now suppose the neurons output [0.7, 1.5, 4.8].
If we apply the softmax function to these outputs, we get [0.01573172, 0.03501159, 0.94925668].
These outputs represent the probabilities of the data point belonging to each class.
Note that the sum of all the output values will always be 1.
Now let’s take an example to understand the softmax function better.

A Real Softmax example.

To understand how softmax actually works, let us consider the example below.

compute cross-entropy loss
In the above example, our aim is to classify the image as a dog, cat, boat or airplane.
We can clearly see that the image shows an airplane. But let’s see whether our softmax function classifies it correctly.
The figure above shows the output of our scoring function f for each of the four classes. These scores are our unnormalized log probabilities for the four classes.
Note: I have picked the scoring values arbitrarily for this example. In practice, these values would not be arbitrary; they would be the output of your scoring function f.
When we exponentiate the output of the scoring function, we get unnormalized probabilities, as shown in the figure below:
Exponentiating the output values from the scoring function gives us our unnormalized probabilities
Our next step is to compute the denominator by summing the exponentiated values, and then divide each exponentiated value by that sum. This gives the actual probabilities associated with each class label, as shown below:
To obtain the actual probabilities, we divide each individual unnormalized probability by the sum of unnormalized probabilities.
Finally, we can take the negative log, which will give us our final loss:
Taking the negative log of the probability for the correct ground-truth class yields the final loss for the data point
So, from the above example we can see that our Softmax classifier correctly classifies the image as an “airplane” with a confidence of 93.15%.
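The three steps of this example (exponentiate, normalize, take the negative log) can be sketched in a few lines of NumPy. The figure's actual scores are not reproduced in the text, so the scores below are hypothetical values chosen so that the airplane probability comes out close to the 93.15% quoted above.

```python
import numpy as np

classes = ["dog", "cat", "boat", "airplane"]

# Hypothetical scoring-function outputs (assumed for illustration,
# not read from the original figure).
scores = np.array([-3.44, 1.16, -0.81, 3.91])
true_class = classes.index("airplane")

# Step 1: exponentiate the scores to get unnormalized probabilities.
unnormalized = np.exp(scores)

# Step 2: divide by the sum to get probabilities that add up to 1.
probabilities = unnormalized / unnormalized.sum()

# Step 3: the loss is the negative log of the true class's probability.
loss = -np.log(probabilities[true_class])

predicted = classes[int(np.argmax(probabilities))]
print(predicted, probabilities[true_class], loss)
```

A confident, correct prediction gives a loss near zero; the loss grows as the probability assigned to the true class shrinks.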
This is how the Softmax function works behind the scenes.
Let’s see how we can implement the softmax function in Python with a simple example:
import numpy as np

# define the softmax function
def softmax_function(x):
    a = np.exp(x)    # exponentiate each raw score
    z = a / a.sum()  # normalize so the outputs sum to 1
    return z

# call the function
softmax_function([0.7, 1.5, 4.8])
# array([0.01573172, 0.03501159, 0.94925668])
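One practical caveat: exponentiating very large scores overflows in floating point. A common remedy, sketched below under the name `stable_softmax` (a name chosen here, not taken from the article), is to subtract the maximum score before exponentiating. The shift cancels out in the ratio, so the probabilities are unchanged.

```python
import numpy as np

def stable_softmax(x):
    """Softmax that subtracts the maximum score before exponentiating."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()  # largest entry becomes 0, so exp() cannot overflow
    e = np.exp(shifted)
    return e / e.sum()

# Same input as the simple example: the result should match it.
small = stable_softmax([0.7, 1.5, 4.8])

# Scores this large would overflow np.exp() without the shift.
large = stable_softmax([1000.0, 1001.0, 1002.0])

print(small)
print(large)
```

Deep learning libraries typically apply a similar shift internally, so this matters mostly when writing softmax by hand.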

Why is Softmax used in neural networks?

There are two main reasons why the SoftMax function is so widely used in neural networks:
  • We cannot use argmax directly; instead, we approximate its outcome with SoftMax, because argmax is neither continuous nor differentiable. Therefore, argmax cannot be used when training neural networks with gradient descent-based optimization techniques.
  • SoftMax normalizes its outputs into a probability distribution and is differentiable. Hence, it is very useful for optimizing the neural network.
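To make the differentiability point concrete, the sketch below compares the analytic softmax Jacobian, J[i, j] = p_i (δ_ij − p_j), with a finite-difference estimate. The helper names are chosen for this illustration. It also shows that argmax does not react to a small nudge of the input, which is why it gives gradient descent nothing to work with.

```python
import numpy as np

def softmax(z):
    """Softmax with the usual max-shift for numerical safety."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Analytic Jacobian: J[i, j] = p[i] * (delta_ij - p[j])."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([0.7, 1.5, 4.8])
analytic = softmax_jacobian(z)

# Central finite differences as an independent check of the derivative.
eps = 1e-6
numeric = np.zeros((z.size, z.size))
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    numeric[:, j] = (softmax(z + step) - softmax(z - step)) / (2 * eps)

jacobians_match = np.allclose(analytic, numeric, atol=1e-7)

# argmax is piecewise constant: nudging one score slightly leaves it unchanged.
nudged = z + np.array([1e-3, 0.0, 0.0])
argmax_unchanged = np.argmax(z) == np.argmax(nudged)

print(jacobians_match, argmax_unchanged)
```

Softmax responds smoothly to every small change in the scores, whereas argmax only changes when two scores cross, and then it jumps.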

Implementation of SoftMax with Keras

In this example, we will use Keras to build a deep neural network that classifies data into one of four classes, using the Softmax activation function in the output layer.
Before jumping into the code, make sure that you have the following dependencies installed. You can install them using pip as shown below:
pip install keras tensorflow matplotlib numpy scikit-learn
Now that the required libraries and dependencies are installed, we can start coding and build the neural network.
Let’s start.

  Keras model for the example, using the Softmax activation function.

#Import libraries
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

# Set Configuration options
total_num_samples = 1000
training_split = 250
cluster_centers = [(15,0), (15,15), (0,15), (30,15)]
num_classes = len(cluster_centers)
loss_function = 'categorical_crossentropy'

# Generate data for experiment
# (each cluster center is a 2-D point, so every sample has 2 features)
x, targets = make_blobs(n_samples = total_num_samples, centers = cluster_centers, n_features = 2, center_box=(0, 1), cluster_std = 1.50)
categorical_targets = to_categorical(targets)
X_training = x[training_split:, :]
X_testing = x[:training_split, :]
Targets_training = categorical_targets[training_split:]
Targets_testing = categorical_targets[:training_split].astype(int)

# Set shape based on data
feature_vector_length = len(X_training[0])
input_shape = (feature_vector_length,)
print(f'Feature shape: {input_shape}')

# Generate scatter plot for training data
plt.scatter(X_training[:,0], X_training[:,1])
plt.title('Nonlinear data')

# Create the model
model = Sequential()
model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(8, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(num_classes, activation='softmax'))

# Configure the model and start training
model.compile(loss=loss_function, optimizer=keras.optimizers.Adam(learning_rate=0.001), metrics=['accuracy'])
history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2)

# Test the model after training
results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Results - Loss: {results[0]} - Accuracy: {100 * results[1]}%')


When you run the above code, you should find an extremely well-performing model which produces a result like the one shown below:
Results - Loss: 0.002027431168 - Accuracy: 100.0%
But in real-world problems, the result will not always be as good as you desire. For the best result, you have to perform many different experiments and trials, and then pick the best one.
Generally, we use softmax activation instead of sigmoid with the cross-entropy loss because softmax activation distributes the probability across all output nodes. For binary classification, however, using sigmoid is equivalent to using softmax. For multi-class classification, use softmax with cross-entropy.
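To show what pairing softmax with cross-entropy computes for a batch, here is a NumPy-only sketch. The helpers `one_hot` and `categorical_crossentropy` are hypothetical stand-ins written for this illustration, not Keras's actual implementation, and the probabilities are made-up softmax outputs for two samples.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer labels into one-hot rows."""
    return np.eye(num_classes)[np.asarray(labels)]

def categorical_crossentropy(targets, probs, eps=1e-12):
    """Mean over samples of -sum(target * log(prob))."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(targets * np.log(probs), axis=1)))

# Two samples, four classes; the true classes are 3 and 0.
labels = [3, 0]
targets = one_hot(labels, num_classes=4)

# Made-up softmax outputs; each row sums to 1.
probs = np.array([[0.05, 0.05, 0.10, 0.80],
                  [0.70, 0.10, 0.10, 0.10]])

loss = categorical_crossentropy(targets, probs)
predicted = probs.argmax(axis=1)  # most probable class per sample

print(loss, predicted)
```

Because the targets are one-hot, only the probability assigned to the true class contributes to each sample's loss.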

What’s the difference between sigmoid and softmax function?

difference between sigmoid and softmax function
In this article, we saw that Softmax is an activation function which converts the raw outputs of the last layer of your neural network into a discrete probability distribution over the target classes. Softmax ensures that the criteria for a probability distribution are met: the probabilities are nonnegative and they sum to 1.
Having read this article, you now know why the softmax activation function matters. For more blogs and courses on data science, machine learning, artificial intelligence and new technologies, visit us at InsideAIML.
Thanks for reading…
