#### World's Best AI Learning Platform with profoundly Demanding Certification Programs

Designed by IITians, only for AI Learners.

Kajal Pawar

a month ago

- What are logits?

- Why SoftMax function?

- Comparison between sigmoid and softmax outputs:

- Why Softmax is used in neural networks?

- Implementation of SoftMax with Keras

- Keras model to the example with Softmax activation function.

- Results

- What’s the difference between sigmoid and softmax function?

In mathematics, the **softmax function** is also known as **softargmax** or the **normalized exponential function**. The **softmax function** can be described as a generalization of the sigmoid function to multiple classes. Sigmoid functions return values between **0** and **1**, which can be treated as the probability of a data point belonging to a particular class. That’s why sigmoid functions are mainly used for binary classification problems.

The **softmax function**, on the other hand, can be used for multiclass classification problems. The **softmax activation function** gives the probability of a data point belonging to each individual class.

In deep learning, the term **logits** is popularly used for the last neuron layer of a classification network, which produces raw prediction values as real numbers ranging from -infinity to +infinity (Wikipedia).

Logits are the raw score values produced by the last
layer of the neural network before applying any activation function on it.

The SoftMax function turns logit values into probabilities by taking the exponential of each output and then normalizing each number by the sum of those exponentials, so that the entire output vector adds up to one.

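The equation of the SoftMax function for a vector of raw outputs $z$ over $K$ classes can be given as:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$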

The softmax function is similar to the sigmoid function, except that the denominator sums the exponentials of all the raw outputs. In simple words, when we calculate the softmax of a single raw output (e.g. z1), we cannot use the value of z1 alone. With four outputs, we have to include z1, z2, z3 and z4 in the denominator:

$$\mathrm{softmax}(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$

The
softmax function ensures that the sum of all our output probability values will
always be equal to one.

That means if we are classifying a dog, cat, boat and airplane and applying a softmax function to our outputs, then for the network to increase the probability that a particular example is classified as an “airplane”, it needs to decrease the probabilities that the example belongs to the other classes (dog, cat or boat). We will see an example of this later.
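
As a rough illustration of this coupling between classes, the sketch below raises only the airplane logit and prints how every probability changes. The logit values are hypothetical and chosen purely for illustration.

```python
import numpy as np

def softmax(z):
    """Exponentiate each score and normalize by the sum of the exponentials."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

classes = ["dog", "cat", "boat", "airplane"]

# Hypothetical logits (assumed values, not taken from the article)
before = softmax([1.0, 1.0, 1.0, 1.0])  # all four classes equally likely
after = softmax([1.0, 1.0, 1.0, 3.0])   # only the airplane logit is raised

for name, p0, p1 in zip(classes, before, after):
    print(f"{name:9s} {p0:.3f} -> {p1:.3f}")
```

Although the dog, cat and boat logits are untouched, their probabilities drop, because the outputs must still sum to one.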

The graph
can be represented as:

From the
above graph, we can see there is not much difference between the sigmoid
function graph and softmax function graph.

The softmax function has many applications in multiclass classification and neural networks. SoftMax is different from the normal max function: the max function outputs only the largest value, whereas SoftMax assigns smaller values a smaller probability instead of discarding them outright. The denominator of the SoftMax function combines all the original output values, which means that the probabilities obtained from the SoftMax function are related to each other.
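
A small sketch of this difference, assuming NumPy and an arbitrary example score vector: the hard max keeps only the winner, while softmax gives every score a nonzero share.

```python
import numpy as np

scores = np.array([0.7, 1.5, 4.8])  # example scores, chosen for illustration

# Hard max as a one-hot vector: everything except the winner is discarded
hard = np.zeros_like(scores)
hard[np.argmax(scores)] = 1.0

# Softmax: smaller scores get smaller, but nonzero, probabilities
exp_scores = np.exp(scores)
soft = exp_scores / exp_scores.sum()

print("hard max:", hard)
print("softmax :", soft)
```

Both pick the same winning index, but only the softmax output keeps information about how the remaining scores compare.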

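In the case of binary classification, the Sigmoid equation is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

For Softmax with K = 2, the equation is:

$$\mathrm{softmax}(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}$$

Dividing the numerator and the denominator by the common factor $e^{z_1}$ gives:

$$\mathrm{softmax}(z_1) = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$$

So, in the case of binary classification, Softmax reduces to the Sigmoid function applied to the difference of the two logits.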
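
The reduction of two-class softmax to sigmoid can also be checked numerically. This is a minimal sketch; the two logit values are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

# Hypothetical logits for the two classes
z1, z2 = 2.0, -0.5

p_softmax = softmax([z1, z2])[0]  # probability of class 1 under softmax
p_sigmoid = sigmoid(z1 - z2)      # sigmoid of the logit difference

print(p_softmax, p_sigmoid, np.isclose(p_softmax, p_sigmoid))
```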

When we build a network for a multiclass problem, the output layer has as many neurons as there are classes in the target, as shown below:

For example, if we have three different classes, there will be three neurons in the output layer.

Now suppose the output neurons produce **[0.7, 1.5, 4.8]**.

If we apply the softmax function to these outputs, we get **[0.01573172, 0.03501159, 0.94925668]**.

These outputs represent the probability for the data belonging to different classes.

Note that the sum of all the output values will always be 1.

Now let’s take an example to understand how the softmax activation function actually works.

In the above example, our aim is to classify the image as a dog, cat, boat or airplane.

From the image, we can clearly see that it shows an airplane. But does our softmax classifier classify it correctly?

As the figure above shows, I have taken the output of our scoring function f for each of the four classes. These scores are our unnormalized log probabilities for the four classes.

Now, when we exponentiate the output of the scoring function, we get unnormalized probabilities, as shown in the figure below:

Next, we sum the exponentiated values to form the denominator and divide each exponentiated value by that sum, which gives the actual probabilities associated with each of the class labels, shown below:

Finally, we can take the negative log, which
will give us our final loss:

So, from the above example, we can see that our Softmax classifier correctly classifies the image as an **“airplane”** with a confidence of **93.15%**.

This is how the Softmax function works behind the scenes.
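
The steps above (score, exponentiate, normalize, negative log) can be sketched in NumPy. The article's figures are not reproduced in the text, so the scores below are assumed values, chosen only so that the airplane probability comes out close to the 93.15% quoted above.

```python
import numpy as np

classes = ["dog", "cat", "boat", "airplane"]

# Assumed scoring-function outputs (unnormalized log probabilities)
scores = np.array([-3.44, 1.16, -0.81, 3.91])

# Step 1: exponentiate to get unnormalized probabilities
unnormalized = np.exp(scores)

# Step 2: divide by the sum to get probabilities that add up to one
probs = unnormalized / unnormalized.sum()

# Step 3: negative log of the true class probability gives the loss
true_class = classes.index("airplane")
loss = -np.log(probs[true_class])

for name, p in zip(classes, probs):
    print(f"{name:9s} {p:.4f}")
print(f"loss = {loss:.4f}")
```

The closer the probability of the true class is to one, the closer this loss is to zero.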

Let’s see how we can implement the softmax function in Python with a simple example:

```
import numpy as np

# Define the softmax function
def softmax_function(x):
    a = np.exp(x)      # exponentiate each raw score
    z = a / a.sum()    # normalize so the outputs sum to 1
    return z

# Call the function
softmax_function([0.7, 1.5, 4.8])
```

`array([0.01573172, 0.03501159, 0.94925668])`
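
One practical caveat: exponentiating very large scores directly can overflow. A common remedy, sketched here as an optional variant (the name `stable_softmax` is only illustrative), is to subtract the maximum score before exponentiating. The result is unchanged because the shift cancels between numerator and denominator.

```python
import numpy as np

def stable_softmax(x):
    """Softmax that subtracts the maximum score first to avoid overflow."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()   # the largest entry becomes 0, so exp() stays <= 1
    e = np.exp(shifted)
    return e / e.sum()

print(stable_softmax([0.7, 1.5, 4.8]))           # same values as the simple version
print(stable_softmax([1000.0, 1001.0, 1002.0]))  # finite, where a plain exp() would overflow
```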

There are a few important points about the SoftMax function that explain why it is so widely used in neural networks.

- We cannot use argmax directly; instead, we approximate its outcome with SoftMax, because argmax is not continuous and its gradient is zero wherever it is defined. Therefore, argmax cannot be used when training neural networks with gradient-descent-based optimization.

- SoftMax normalizes its outputs into a probability distribution and is differentiable everywhere. Hence, it is very useful for optimizing the neural network.
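
To make the differentiability point concrete, the sketch below computes the softmax Jacobian with the standard closed form, diag(p) - p pᵀ, and compares it against a finite-difference estimate. It also shows that argmax does not respond to a small change in the input. The helper names are illustrative, not from any library.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Entry (i, j) is d softmax_i / d z_j = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([0.7, 1.5, 4.8])
analytic = softmax_jacobian(z)

# Central finite differences, one input coordinate at a time
eps = 1e-6
numeric = np.zeros((z.size, z.size))
for j in range(z.size):
    step = np.zeros(z.size)
    step[j] = eps
    numeric[:, j] = (softmax(z + step) - softmax(z - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))

# argmax, by contrast, gives no gradient signal: a tiny nudge changes nothing
nudged = z + np.array([eps, 0.0, 0.0])
print(np.argmax(z) == np.argmax(nudged))
```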

In this example, we will use Keras to build a deep neural network that classifies data into one of four classes, using the Softmax activation function in the output layer.

Before jumping into the code, make sure the dependencies are installed. The model below imports Keras, NumPy, Matplotlib and scikit-learn, all of which can be installed with pip.

Now that the required libraries and dependencies are installed, we can start coding and build the neural network.

Let’s start.

```
# Import libraries
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

# Set configuration options
total_num_samples = 1000
training_split = 250  # the first 250 samples are held out for testing
cluster_centers = [(15, 0), (15, 15), (0, 15), (30, 15)]
num_classes = len(cluster_centers)
loss_function = 'categorical_crossentropy'

# Generate data for the experiment: one 2-D blob per class.
# The number of features is taken from the explicit centers, so it is not passed.
x, targets = make_blobs(n_samples=total_num_samples, centers=cluster_centers, cluster_std=1.50)
categorical_targets = to_categorical(targets)  # one-hot encode the labels
X_training = x[training_split:, :]
X_testing = x[:training_split, :]
Targets_training = categorical_targets[training_split:]
Targets_testing = categorical_targets[:training_split].astype(np.int64)

# Set shape based on data
feature_vector_length = len(X_training[0])
input_shape = (feature_vector_length,)
print(f'Feature shape: {input_shape}')

# Generate scatter plot for training data
plt.scatter(X_training[:, 0], X_training[:, 1])
plt.title('Nonlinear data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()

# Create the model: two hidden ReLU layers and a softmax output layer
model = Sequential()
model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(8, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(num_classes, activation='softmax'))

# Configure the model and start training
model.compile(loss=loss_function, optimizer=keras.optimizers.Adam(learning_rate=0.001), metrics=['accuracy'])
history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2)

# Test the model after training
results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Results - Loss: {results[0]} - Accuracy: {results[1] * 100}%')
```

When you run the above code, you should find an extremely well-performing model that produces a result like the one shown below:

`Results - Loss: 0.002027431168 - Accuracy: 100.0%`

But in real-world problems, the results will not always be as good as you would like. To get the best result, you have to run many different experiments and trials and then pick the best one.

Generally, we use softmax activation instead of sigmoid with the cross-entropy loss because softmax activation distributes the probability across all the output nodes. For binary classification, however, using sigmoid is the same as using softmax. For multi-class classification, use softmax with cross-entropy.
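
As a rough sketch of how softmax and cross-entropy fit together, the NumPy code below computes the categorical cross-entropy for a small batch of one-hot targets. The logits and targets are made-up values, and the helper functions are illustrative rather than the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax for a batch of logit vectors."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, one_hot_targets):
    """Mean negative log probability assigned to the true class."""
    picked = (probs * one_hot_targets).sum(axis=1)
    return -np.log(picked).mean()

# Hypothetical batch: two samples, three classes
logits = np.array([[0.7, 1.5, 4.8],
                   [2.0, 0.1, 0.3]])
targets = np.array([[0, 0, 1],
                    [1, 0, 0]])

probs = softmax(logits)
loss = categorical_cross_entropy(probs, targets)
print(probs)
print(f"loss = {loss:.4f}")
```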

| Softmax Function | Sigmoid Function |
| --- | --- |
| Used for multiclass classification (multinomial logistic regression). | Used for binary classification (logistic regression). |
| The probabilities sum to 1. | The probabilities need not sum to 1. |
| Typically used in the output layer of a neural network. | Can be used as an activation function in hidden or output layers. |
| The largest input gets the highest probability relative to the other inputs. | A large input gets a high probability, but each output is computed independently of the others. |
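
The difference in how the outputs sum can be seen with a quick check, sketched here with NumPy on an arbitrary example score vector.

```python
import numpy as np

scores = np.array([0.7, 1.5, 4.8])  # example scores, chosen for illustration

# Sigmoid treats each score independently
sigmoid_out = 1.0 / (1.0 + np.exp(-scores))

# Softmax normalizes across all scores
softmax_out = np.exp(scores) / np.exp(scores).sum()

print("sigmoid sum:", sigmoid_out.sum())  # generally not 1
print("softmax sum:", softmax_out.sum())  # 1, up to floating-point rounding
```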

In this article, we saw that Softmax is an activation function that converts the outputs of the last layer of your neural network into a discrete probability distribution over the target classes. Softmax ensures that the criteria of a probability distribution are met: the probabilities are nonnegative and they sum to 1.

After reading this article, you should now understand the **importance of softmax activation functions**. For more blogs and courses on data science, machine learning, artificial intelligence and new technologies, do visit us at InsideAIML.
