
Kajal Pawar

8 months ago


In mathematics, the **softmax function** is also known as **softargmax** or the **normalized exponential function**. The **softmax function** is often described as a multi-class generalization of the sigmoid function. A sigmoid function returns values in the range between **0** and **1**, which can be treated as the probability of a data point belonging to a particular class. That is why sigmoid functions are mainly used for binary classification problems.

On the other hand, the **softmax function** can be used for multiclass classification problems. The **softmax activation function** gives the probability of a data point belonging to each individual class.

In deep learning, the term **logits** is popularly used for the last neuron layer of a neural network for a classification task, which produces raw prediction values as real numbers ranging from -infinity to +infinity. (Wikipedia)

Logits are the raw score values produced by the last layer of the neural network before any activation function is applied to them.

The softmax function turns logits into probabilities by taking the exponential of each output and then normalizing each value by the sum of those exponentials, so that the entire output vector adds up to one.

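The equation of the softmax function can be given as:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$

where $z_1, \dots, z_K$ are the raw outputs and $K$ is the number of classes.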

The softmax function is similar to the sigmoid function, except that here the denominator sums the exponentials of all the raw outputs. In simple words, when we calculate the value of softmax on a single raw output (e.g. z1), we cannot use the z1 value alone. We have to consider z1, z2, z3, and z4 in the denominator as shown below:

$$\mathrm{softmax}(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$

The softmax function ensures that the sum of all our output probability values will always be equal to one.

That means if we are classifying a dog, cat, boat and airplane and applying a softmax function to our outputs, then for the network to increase the probability that a particular example is classified as an “airplane”, it needs to decrease the probabilities that the example is classified as one of the other classes, such as a dog, cat or boat. We will see an example of this later.
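The trade-off between classes can be sketched in a few lines of NumPy. The raw scores below are hypothetical values chosen only for illustration, and `softmax` is a helper name assumed for this sketch.

```python
import numpy as np

def softmax(z):
    """Exponentiate each score and normalize by the sum of the exponentials."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

classes = ["dog", "cat", "boat", "airplane"]
scores = np.array([1.0, 2.0, 0.5, 3.0])  # hypothetical raw scores

before = softmax(scores)

# Raise only the "airplane" score; the other three scores stay unchanged.
boosted = scores.copy()
boosted[3] += 2.0
after = softmax(boosted)

for name, p_before, p_after in zip(classes, before, after):
    print(f"{name:9s} {p_before:.4f} -> {p_after:.4f}")
```

Even though only one score changed, the probabilities of the other three classes shrink, because every output shares the same denominator.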

Comparison between sigmoid and softmax outputs

The graph can be represented as:

softmax activation function

From the above graph, we can see that there is not much difference between the graph of the sigmoid function and the graph of the softmax function.

The softmax function has many applications in multiclass classification and neural networks. Softmax is different from the normal max function: the max function only outputs the largest value, whereas softmax assigns smaller values a smaller probability rather than discarding them outright. The denominator of the softmax function combines all of the original output values, which means that the probabilities obtained from the softmax function are related to each other.
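As a rough illustration of that difference, the sketch below compares a hard maximum (a one-hot vector marking the winner) with softmax on the same scores. The helper names `softmax` and `hard_max` are assumptions made for this example, not part of any library.

```python
import numpy as np

def softmax(z):
    """Exponentiate each score and normalize by the sum of the exponentials."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

def hard_max(z):
    """One-hot vector that keeps only the position of the largest score."""
    z = np.asarray(z, dtype=float)
    one_hot = np.zeros_like(z)
    one_hot[np.argmax(z)] = 1.0
    return one_hot

logits = [0.7, 1.5, 4.8]
hard = hard_max(logits)   # only the winning class survives
soft = softmax(logits)    # every class keeps a nonzero share

print(hard)
print(soft)
```

Both pick the same winning class, but softmax still reports how much probability the smaller scores receive.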

In the case of binary classification, the sigmoid equation is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

For softmax when K = 2, the equation is:

$$\mathrm{softmax}(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}$$

Dividing the numerator and the denominator by the common factor $e^{z_1}$ gives:

$$\mathrm{softmax}(z_1) = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$$

So, it can be seen from the equation that in the case of binary classification, softmax reduces to the sigmoid function.
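The two-class case can also be checked numerically: for K = 2, the first softmax output should equal the sigmoid of the difference between the two logits. This is a minimal sketch with hypothetical logit values; `sigmoid` and `softmax` are helper names assumed for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    """Exponentiate each score and normalize by the sum of the exponentials."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

z1, z2 = 2.0, -1.0                    # hypothetical logits for the two classes
p_softmax = softmax([z1, z2])[0]      # probability of class 1 from softmax
p_sigmoid = sigmoid(z1 - z2)          # sigmoid applied to the logit difference

print(p_softmax, p_sigmoid)
```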

When we build a network for a multiclass problem, the output layer has as many neurons as there are classes in the target, as shown below:

multi class classification

For example, if we have three different classes, there will be three neurons in the output layer.

Now suppose you received the outputs **[0.7, 1.5, 4.8]** from the neurons.

If we apply the softmax function to these outputs, we get **[0.01573172, 0.03501159, 0.94925668]**.

These outputs represent the probabilities of the data belonging to the different classes.

Note that the sum of all the output values will always be 1.

Now let’s take an example to understand the softmax function better.

compute cross-entropy loss

In the above example, our aim is to classify whether the image is an image of a dog, cat, boat or airplane.

From the image we can clearly see that it is an “airplane” image. However, let’s see whether our softmax function classifies it correctly.

As we can see from the above figure, I have taken the output of our scoring function f for each of the four classes. These scoring values are our unnormalized log probabilities for the four different classes.

Now, when we exponentiate the output of the scoring function, this results in unnormalized probabilities, as shown in the figure below:

Exponentiating the output values from the scoring function gives us our unnormalized probabilities

Our next step is to sum the exponentiated values and divide each of them by that sum, which gives us the actual probabilities associated with each of the class labels, as shown below:

To obtain the actual probabilities, we divide each individual unnormalized probability by the sum of unnormalized probabilities.

Finally, we can take the negative log, which
will give us our final loss:

Taking the negative log of the probability for the correct ground-truth class yields the final loss for the data point

So, from the above example we can see that our softmax classifier correctly classifies the image as an **“airplane”** with a confidence of **93.15%**.
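The three steps (exponentiate, normalize, take the negative log of the correct class) can be sketched as follows. The figure's actual score values are not reproduced in the text, so the scores here are assumed illustrative values, chosen so that the airplane probability lands close to the 93.15% quoted above.

```python
import numpy as np

classes = ["dog", "cat", "boat", "airplane"]
# Assumed illustrative scores (unnormalized log probabilities), not taken from the text.
scores = np.array([-3.44, 1.16, -0.81, 3.91])
true_class = classes.index("airplane")

# Step 1: exponentiate the scores to get unnormalized probabilities.
unnormalized = np.exp(scores)

# Step 2: divide by the sum to get probabilities that add up to one.
probs = unnormalized / unnormalized.sum()

# Step 3: negative log of the probability of the correct class is the loss.
loss = -np.log(probs[true_class])

for name, p in zip(classes, probs):
    print(f"{name:9s} {p:.4f}")
print(f"loss: {loss:.4f}")
```

The more probability the classifier puts on the correct class, the closer the loss gets to zero.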

This is how the softmax function works behind the scenes.

Let’s see how we can implement the softmax function in Python with a simple example:

```
import numpy as np

# Define the softmax function
def softmax_function(x):
    a = np.exp(x)      # exponentiate each raw output
    z = a / a.sum()    # normalize so that the outputs sum to 1
    return z

# Call the function
softmax_function([0.7, 1.5, 4.8])
```

`array([0.01573172, 0.03501159, 0.94925668])`

There are some important points about the softmax function that explain why it is so widely used in neural networks.

- We cannot use argmax directly; instead, we approximate its outcome with softmax, because argmax is neither differentiable nor continuous. Therefore, argmax cannot be used while training neural networks with gradient descent-based optimization techniques.

- Softmax has nice properties with regard to normalization, and it is differentiable. Hence, it is very useful for optimizing the neural network.
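To make the differentiability point concrete, here is a small sketch of the softmax derivative. It uses the standard result that the derivative of output i with respect to input j is s_i * (delta_ij - s_j), and compares it against finite differences. The function names are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    """Exponentiate each score and normalize by the sum of the exponentials."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

def softmax_jacobian(z):
    """Analytic Jacobian: J[i, j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(z, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    z = np.asarray(z, dtype=float)
    jac = np.zeros((z.size, z.size))
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = eps
        jac[:, j] = (softmax(z + step) - softmax(z - step)) / (2 * eps)
    return jac

z = np.array([0.7, 1.5, 4.8])
analytic = softmax_jacobian(z)
numeric = numerical_jacobian(z)
print(np.allclose(analytic, numeric, atol=1e-6))
```

Because these derivatives exist everywhere, gradient descent can push the probabilities in the right direction, which is not possible with a hard argmax.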

In this example, we will build a deep neural network using Keras that classifies data into one of four classes and makes use of the softmax activation function for classification.

Before jumping into the code, make sure that the dependencies used below (Keras, Matplotlib, NumPy and scikit-learn) are installed. You can install them using pip.

Now that the required libraries and dependencies are installed, we can start coding and build the neural network.

Let’s start

```
# Import libraries
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

# Set configuration options
total_num_samples = 1000
training_split = 250
cluster_centers = [(15, 0), (15, 15), (0, 15), (30, 15)]
num_classes = len(cluster_centers)
loss_function = 'categorical_crossentropy'

# Generate data for the experiment (each point has two features, X1 and X2)
x, targets = make_blobs(n_samples=total_num_samples, centers=cluster_centers, n_features=2, center_box=(0, 1), cluster_std=1.50)
categorical_targets = to_categorical(targets)
X_training = x[training_split:, :]
X_testing = x[:training_split, :]
Targets_training = categorical_targets[training_split:]
Targets_testing = categorical_targets[:training_split].astype(int)

# Set shape based on data
feature_vector_length = len(X_training[0])
input_shape = (feature_vector_length,)
print(f'Feature shape: {input_shape}')

# Generate scatter plot for training data
plt.scatter(X_training[:, 0], X_training[:, 1])
plt.title('Nonlinear data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()

# Create the model; the output layer has one softmax neuron per class
model = Sequential()
model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(8, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(num_classes, activation='softmax'))

# Configure the model and start training
# (recent Keras releases spell this Adam(learning_rate=...); older ones used lr=...)
model.compile(loss=loss_function, optimizer=keras.optimizers.Adam(learning_rate=0.001), metrics=['accuracy'])
history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2)

# Test the model after training
results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Results - Loss: {results[0]} - Accuracy: {results[1] * 100}%')
```

When you run the above code, you should find an extremely well-performing model, which produces a result like the one shown below:

`Results - Loss: 0.002027431168 - Accuracy: 100.0%`

But in real-world problems, the results will not always be as good as you desire. For the best result, you have to run many different experiments and trials and then pick the best one.

Generally, we use softmax activation instead of sigmoid with the cross-entropy loss because softmax activation distributes the probability across all output nodes. For binary classification, however, using sigmoid is the same as using softmax. For multi-class classification, use softmax with cross-entropy.

difference between sigmoid and softmax function

In this article, we saw that softmax is an activation function which converts the raw outputs of the last layer of your neural network into a discrete probability distribution over the target classes. Softmax ensures that the criteria of a probability distribution are met: the probabilities are nonnegative and they sum to 1.

After reading this article, you now know the **importance of softmax activation functions**. For more blogs and courses on data science, machine learning, artificial intelligence and new technologies, do visit us at InsideAIML.

Thanks for reading…