In our previous article, we saw what an Artificial Neural Network (ANN) is and how it actually works. In this tutorial we will learn about another type of neural network architecture, known as the Convolutional Neural Network, or commonly, CNN.
We can think of Convolutional Neural Networks as a combination of biology and math with a little computer science sprinkled in, and these networks have been some of the most influential innovations in the field of computer vision.
2012 was the first year that neural nets grew to prominence, when Alex Krizhevsky used them to win that year's ImageNet competition (basically, the annual Olympics of computer vision), dropping the classification error record from 26% to 15%, an astounding improvement at the time. Ever since then, a host of companies have been using deep learning at the core of their services: Facebook uses neural nets for its automatic tagging algorithms, Google for photo search, Amazon for product recommendations, Pinterest for home feed personalization, and Instagram for its search infrastructure.
So, you might be wondering: what are the applications of Convolutional Neural Networks?
Convolutional Neural Networks (ConvNets or CNNs) have many applications. Some of the most common are image recognition, image classification, object detection, face recognition and so on.
Now that you have a brief introduction to how CNNs came about and where they are mostly used, let's dive deep and learn about them in detail.
What is a Convolutional Neural Network, or CNN?
Convolutional Neural Networks are very similar to ordinary Neural Networks. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity.
The whole network still expresses a single differentiable score function: from
the raw image pixels on one end to class scores at the other. And they still
have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and
all the tips/tricks we developed for learning regular Neural Networks still
apply.
So, what’s the main
difference between them?
Convolutional Neural Network architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
Let’s visualize it.
Simple Neural Network
Figure: Simple neural
network with input layer, two hidden layers and an output layer.
Convolutional Neural
Network
Figure: Convolutional Neural Network
From the images we can see that, while a simple 3-layer Neural Network keeps its neurons in flat layers, a Convolutional Neural Network arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a CNN transforms a 3D input volume into a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height are the dimensions of the image, and the depth is 3, representing the Red, Green and Blue (RGB) channels of the image.
Let's take an image classification problem as an example and try to understand how Convolutional Neural Networks work.
Imagine you have a dog and a cat at your home and many different photos of them. Let's say the animals are quite small and look somewhat similar to each other. Looking at the photos, you can differentiate the two quite easily. But how can a computer do this? How can it look at an image and tell you whether it is a cat image or a dog image? This is where deep learning Convolutional Neural Networks (CNNs) come in.
A CNN image classifier takes an image as input, processes it and classifies it under certain categories (e.g., Dog, Cat). A computer sees an input image as an array of pixels whose size depends on the image resolution: it sees h x w x d (h = height, w = width, d = depth, i.e. the number of channels).
For example, when a computer sees an image (takes an image as input), it actually sees an array of pixel values. Depending on the resolution and size of the image, it might see a 28 x 28 x 3 array of numbers (the 3 refers to the RGB channels). Let's say we have a color image in JPG format whose size is 300 x 300; the representative array will be 300 x 300 x 3. Each of these numbers has a value from 0 to 255, which represents the pixel intensity at that point. These numbers, while meaningless to us when we perform image classification ourselves, are the only inputs available to the computer. The idea is that you give the computer this array of numbers and it outputs numbers that describe the probability of the image being a certain class (0.80 for cat, 0.20 for dog, etc.).
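To make this concrete, here is a minimal Python sketch (assuming NumPy and the Pillow imaging library are installed; "photo.jpg" is a hypothetical file name) of how an image becomes the array of numbers the computer actually receives:

import numpy as np
from PIL import Image  # Pillow, assumed installed

# Load a 300 x 300 color JPG as a pixel array (hypothetical file name).
img = Image.open("photo.jpg")
pixels = np.array(img)

print(pixels.shape)                # (300, 300, 3): height x width x RGB channels
print(pixels.min(), pixels.max())  # intensities lie in the range 0..255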
A Convolutional Neural Network (CNN) basically contains three types of layers:
1. Convolutional layer
2. Pooling layer
3. Fully connected layer
Basically, to train a deep learning CNN model, each input image is passed through a series of convolutional layers with filters (kernels), pooling layers and fully connected (FC) layers, and then an activation function, say the SoftMax activation function, is applied to classify the image with probabilistic values between 0 and 1. The image is assigned to the class with the maximum probability value.
Let's see the complete flow of a CNN as it processes an input image and classifies the objects in it, with the help of an image.
Figure 2:
Neural network with many convolutional layers
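As a hedged sketch of this complete flow, here is one possible small CNN written with the Keras API (assuming TensorFlow is installed; the layer sizes and the cat/dog classes are illustrative choices of our own, not the exact network in the figure):

from tensorflow.keras import layers, models

# Convolution -> pooling -> convolution -> pooling -> fully connected -> SoftMax.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 3)),
    layers.MaxPooling2D((2, 2)),            # pooling layer: down-samples feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                       # flatten the 3D volume into a vector
    layers.Dense(64, activation="relu"),    # fully connected (FC) layer
    layers.Dense(2, activation="softmax"),  # probabilities between 0 and 1 (cat, dog)
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()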
Let's now try to understand each layer in detail and learn how CNNs work.
Convolutional
layer
The convolutional layer is the core building block of any Convolutional Neural Network and does most of the computational heavy lifting. The first layer in a CNN is always a convolutional layer. It preserves the relationship between pixels by learning image features from small squares of input data. Convolution is a mathematical operation that takes two inputs, an image matrix and a filter (or kernel), and computes element-wise multiplications between them, summing up the results.
Figure: Image matrix multiplied by a kernel or filter matrix
Let's consider a 5 x 5 image matrix whose pixel values are 0s and 1s, and a 3 x 3 filter matrix with some weights (here, also 0s and 1s), as shown in the figure.
Figure: Image matrix multiplied by the kernel or filter matrix
As the filter slides, or convolves, around the input image matrix, it multiplies its values with the original pixel values of the image (i.e., computes element-wise multiplications and sums them up), producing a matrix called a "Feature Map" or "Activation Map", as shown in the figure below.
Figure: 3 x
3 Feature map
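The sliding computation described above is easy to write out in NumPy. Here is a plain sketch for a single channel with a stride of 1 and no padding (the 0/1 values below are illustrative, standing in for the matrices in the figure):

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])   # 5 x 5 image of 0s and 1s

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])        # 3 x 3 filter

h, w = image.shape
kh, kw = kernel.shape
feature_map = np.zeros((h - kh + 1, w - kw + 1))

# Slide the filter over the image; at each position, multiply
# element-wise and sum to fill one entry of the feature map.
for i in range(h - kh + 1):
    for j in range(w - kw + 1):
        feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)

print(feature_map)   # the 3 x 3 feature (activation) map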
By convolving an image with different filters, we can perform operations such as edge detection, blurring and sharpening. The example below shows various convolved images after applying different types of filters (kernels).
Figure: Different filters/kernels
What are Strides?
Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is equal to 1, we move the filter 1 pixel at a time; when the stride is equal to 2, we move the filter 2 pixels at a time, and so on. The figure below shows how convolution works with a stride equal to 2.
So we can see that in the first convolution step, the image pixel values and the weights of the filter/kernel are multiplied element-wise and summed up (say 108, from the image), and the result is placed in the top-left corner of the output. The filter then moves two steps to the right, and again the image pixel values and filter weights are multiplied and summed up (say 126, from the image) and filled in. This process continues, and once the filter has slid over the whole image, it has produced the feature map.
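In general, for a W x W input, an F x F filter, padding P and stride S, the size of the output feature map is:

Output size = ((W − F + 2P) / S) + 1

For example, a 7 x 7 input convolved with a 3 x 3 filter at stride 2 and no padding gives ((7 − 3 + 0) / 2) + 1 = 3, i.e. a 3 x 3 feature map.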
Next, an important term comes into the picture: padding.
So, what is padding and why is it required?
Sometimes the filter or kernel does not fit the input image perfectly, so we have to apply some padding to the image to solve this problem.
Padding is an extra border of pixels that we add around an image. For example, in the figure below an extra border has been added around the 4 x 4 image, enlarging it.
Figure: Padding
So now we have an extra frame that covers the edge pixels of the image, and more information from those edge pixels means more for the network to learn from. Apart from that, the padded input is larger than the original image: the convolution will still shrink it, but the output stays much closer in size to the original than it would without padding. That's how padding works.
We have some options for applying padding:
· Pad the picture with zeros (zero-padding) so that the filter fits (see the sketch right after this list).
· Drop the part of the image where the filter does not fit. This is called valid padding, and it keeps only the valid part of the image.
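Zero-padding itself is a one-line operation; here is a small NumPy sketch (note that a one-pixel border on all four sides turns a 4 x 4 image into a 6 x 6 one):

import numpy as np

image = np.arange(16).reshape(4, 4)   # a 4 x 4 example image

# Add a one-pixel border of zeros around the image (zero-padding).
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)   # (6, 6): the original edge pixels now sit inside the border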
Now that we have our feature map, the next step is to apply an activation function to it (say, the ReLU activation function).
ReLU stands for Rectified Linear Unit and is a non-linear operation. The output of the ReLU function is given as ƒ(x) = max(0, x).
So the next question arises: why is ReLU important?
The ReLU activation function is applied to introduce non-linearity into our Convolutional Neural Network, since most of the real-world data we would want our CNN to learn is non-linear (convolution itself is a linear operation).
Figure: ReLU operation
Other non-linear functions, such as tanh or sigmoid, can also be used instead of ReLU, but most data scientists and researchers use the ReLU activation function since, performance-wise, ReLU works better than the other activation functions.
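Since ReLU is just an element-wise maximum, applying it to a feature map is a one-liner; here is a minimal NumPy sketch with made-up values:

import numpy as np

feature_map = np.array([[ 3, -1,  4],
                        [-2,  0,  5],
                        [ 1, -3,  2]])   # example feature map with negatives

# ReLU: f(x) = max(0, x), applied element-wise; negatives become 0.
rectified = np.maximum(0, feature_map)

print(rectified)
# [[3 0 4]
#  [0 0 5]
#  [1 0 2]]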
Next comes the pooling layer in Convolutional Neural Networks.
Pooling Layer
Pooling layers are applied to reduce the number of parameters when the images are too large. Spatial pooling (also called subsampling or down-sampling) reduces the dimensionality of each feature map but retains the important information. Spatial pooling can be of different types:
· Max Pooling
· Average Pooling
· Sum Pooling
Max pooling takes the largest element from each window of the rectified feature map. Taking the average of the elements instead is called average pooling, and taking the sum of all elements in the window is called sum pooling.
Figure: Max Pooling
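Here is a small NumPy sketch of 2 x 2 max pooling with a stride of 2 (the feature map values are illustrative, not taken from the figure):

import numpy as np

feature_map = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])   # 4 x 4 rectified feature map

pool, stride = 2, 2
h, w = feature_map.shape
pooled = np.zeros((h // stride, w // stride))

# Keep only the maximum of each non-overlapping 2 x 2 window.
for i in range(0, h, stride):
    for j in range(0, w, stride):
        pooled[i // stride, j // stride] = feature_map[i:i + pool, j:j + pool].max()

print(pooled)
# [[6. 8.]
#  [3. 4.]]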
Next comes the final layer, the fully connected layer, which is used mainly for classification.
Fully Connected Layer
In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer, just like in an ordinary neural network.
Figure: After the pooling layer, the output is flattened and fed into the FC layer
In the above diagram, the feature map matrix is converted into a vector (x1, x2, x3, ...). With the fully connected layers, we combine these features together to create a model. Finally, we apply an activation function such as SoftMax or sigmoid to classify the outputs as cat, dog, etc.
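To see this last step numerically, here is a tiny NumPy sketch of flattening a pooled map and turning the final scores into class probabilities with SoftMax (the weights and biases here are made-up illustrations, not learned values):

import numpy as np

pooled = np.array([[6., 8.],
                   [3., 4.]])

x = pooled.flatten()   # flatten the matrix into a vector (x1, x2, x3, x4)

# A hypothetical fully connected layer with two output classes (cat, dog).
W = np.array([[ 0.2, -0.1],
              [ 0.1,  0.3],
              [-0.2,  0.1],
              [ 0.4,  0.0]])
b = np.array([0.1, -0.1])
scores = x @ W + b   # dot product plus bias: raw class scores

# SoftMax squashes the scores into probabilities that sum to 1.
probs = np.exp(scores) / np.sum(np.exp(scores))
print(probs)         # approximately [0.75 0.25] for (cat, dog)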
Enjoyed reading this blog? Then why not share it with others. Help us make this AI community stronger.
To learn more about such concepts related to Artificial Intelligence, visit our insideAIML blog page.
You can also ask direct queries related to Artificial Intelligence, Deep Learning, Data Science and Machine Learning on our live insideAIML discussion forum.