In our previous tutorial, we saw what an Artificial Neural Network (ANN) is and how it actually works. In this tutorial, we will learn about another type of neural network architecture, known as the Convolutional Neural Network, or CNN for short.
We can think of Convolutional Neural Networks as a combination of biology and math with a little CS sprinkled in, and these networks have been some of the most influential innovations in the field of computer vision.
2012 was the first year that neural nets grew to
prominence, as Alex Krizhevsky used them to win that year’s ImageNet competition
(basically, the annual Olympics of computer vision), dropping the
classification error record from 26% to 15%, an astounding improvement at the
time. Ever since then, a host of companies have been using deep learning at the
core of their services. Facebook uses neural nets for their automatic tagging
algorithms, Google for their photo search, Amazon for their product
recommendations, Pinterest for their home feed personalization, and Instagram
for their search infrastructure.
So, you might be wondering: what are the applications of Convolutional Neural Networks?
Convolutional Neural Networks (ConvNets or CNNs) have many applications. Some of the most common are image recognition, image classification, object detection, and face recognition.
Now that you have had a brief introduction to where CNNs came from and where they are mostly used, let’s take a deep dive and learn about them in detail.
What is a Convolutional Neural Network (CNN)?
Convolutional Neural Networks are very similar to ordinary Neural Networks. They are
made up of neurons that have learnable weights and biases. Each neuron receives
some inputs, performs a dot product, and optionally follows it with a
non-linearity. The whole network still expresses a single differentiable score
function: from the raw image pixels on one end to class scores at the other.
And they still have a loss function (e.g. SVM/Softmax) on the last
(fully-connected) layer and all the tips/tricks we developed for learning
regular Neural Networks still apply.
So,
what’s the main difference between them?
Convolutional Neural Network architectures make the explicit assumption that the inputs are
images, which allows us to encode certain properties into the architecture.
These then make the forward function more efficient to implement and vastly
reduce the number of parameters in the network.
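To get a feel for how large the saving can be, here is a rough count. The sizes are illustrative assumptions (a 300 x 300 RGB image and a single 3 x 3 filter), not figures from any particular network:

```python
# Illustrative sizes (assumptions): a 300 x 300 RGB image and a 3 x 3 filter.
height, width, depth = 300, 300, 3

# A single fully connected neuron needs one weight per input value, plus a bias.
fc_weights_per_neuron = height * width * depth + 1

# A single convolutional filter reuses the same small set of weights at every
# position of the image: 3 x 3 weights per channel, plus a bias.
filter_size = 3
conv_weights_per_filter = filter_size * filter_size * depth + 1

print(fc_weights_per_neuron)    # 270001
print(conv_weights_per_filter)  # 28
```

The filter gets away with so few weights because it is shared across the whole image instead of having one weight per pixel.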
Let’s
visualize it.
Simple Neural Network
Convolutional
Neural Network
The images show a simple 3-layer Neural Network and a Convolutional Neural Network, which arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a CNN transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3, which represents the RGB channels of the image (Red, Green, and Blue).
Let’s
take an example of an image classification problem and try to understand how
Convolutional Neural Networks work.
Imagine you have a dog and a cat at home, and you have many different photos of them. Let’s imagine they are quite small and look somewhat similar to each other. Looking at the photos, you are able to tell them apart quite easily. But how can a computer do this? How can it look at an image and tell you whether it shows a cat or a dog? This is where deep learning and Convolutional Neural Networks (CNNs) come in.
A CNN image classifier takes an image as input, processes it, and classifies it under certain categories (e.g., Dog, Cat). A computer sees an input image as an array of pixels, and the size of that array depends on the image resolution. Based on the resolution, it will see an h x w x d array (h = height, w = width, d = depth, i.e. the number of color channels).
For example, when a computer sees an image (takes an image as input), it sees an array of pixel values. Depending on the resolution and size of the image, it might see a 28 x 28 x 3 array of numbers (the 3 refers to the RGB values). Let’s say we have a color image in JPG format and its size is 300 x 300. The representative array will be 300 x 300 x 3. Each of these numbers has a value from 0 to 255, which represents the pixel intensity at that point. These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer. The idea is that you give the computer this array of numbers and it outputs numbers that describe the probability of the image belonging to each class (.80 for a cat, .20 for a dog, etc.).
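To make the h x w x d idea concrete, here is a minimal sketch in plain Python. The 2 x 2 image and its pixel values are made up for illustration; a real photo would simply have many more rows and columns:

```python
# A tiny 2 x 2 RGB image written by hand (pixel values are made up).
# Each pixel is [red, green, blue], and each value is between 0 and 255.
image = [
    [[255, 0, 0], [0, 255, 0]],      # row 0: a red pixel, a green pixel
    [[0, 0, 255], [255, 255, 255]],  # row 1: a blue pixel, a white pixel
]

h = len(image)        # height: number of rows
w = len(image[0])     # width: number of pixels per row
d = len(image[0][0])  # depth: number of color channels

print((h, w, d))    # (2, 2, 3)
print(image[1][0])  # [0, 0, 255], the blue pixel at row 1, column 0
```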
Convolutional Neural Networks (CNNs) basically contain three types of layers:
1. Convolutional layer
2. Pooling layer
3. Fully connected layer
To train a deep learning CNN model, each input image passes through a series of convolution layers with filters (kernels), pooling layers, and fully connected (FC) layers. An activation function, say the SoftMax activation function, is then applied to classify the image with probabilistic values between 0 and 1. The image is assigned to the class with the highest probability.
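This last step can be sketched in a few lines. The raw scores below are hypothetical values standing in for the output of the final layer; they were chosen so the probabilities land close to the .80 / .20 split mentioned earlier:

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum score keeps exp() from overflowing.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from the last layer for two classes.
classes = ["cat", "dog"]
scores = [2.0, 0.6]

probabilities = softmax(scores)
# The predicted class is the one with the highest probability.
predicted = classes[probabilities.index(max(probabilities))]

print([round(p, 2) for p in probabilities])  # [0.8, 0.2]
print(predicted)                             # cat
```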
With the help of an image, let’s see the complete flow of how a CNN processes an input image and classifies the objects in it.
Let’s now try to understand each layer in detail and learn how CNNs work.
Convolutional Layer
The convolutional layer is the core building block of any Convolutional Neural Network, and it does most of the computational heavy lifting. The first layer in a CNN is always a convolutional layer. It preserves the relationship between pixels by learning image features using small squares of input data. Convolution is a mathematical operation that takes two inputs, an image matrix and a filter (or kernel), multiplies them element-wise, and sums the results. In other words, it is a dot product between the filter and the patch of the image that the filter currently covers.
Let’s consider a 5 x 5 image matrix whose pixel values
are 0 and 1 and a filter matrix of 3 x 3 with some random weights (here say 0
and 1) as shown in the figure.
As the filter slides, or convolves, around the input image matrix, it multiplies the values in the filter with the original pixel values of the image (i.e. it computes element-wise multiplications) and sums them up. The result is a matrix called the “Feature Map” or “Activation Map”, as shown in the figure below.
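Here is a minimal sketch of that sliding computation in plain Python. The 5 x 5 image and the 3 x 3 filter are example values of 0 and 1 chosen for illustration, so they may not match the figure exactly, and `convolve2d` is just a helper name used here:

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Multiply the kernel with the patch under it, element by element,
            # and add everything up to get one value of the feature map.
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Example values (assumed for illustration).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Note that a 5 x 5 image and a 3 x 3 filter give a 3 x 3 feature map, because the filter only fits in three positions along each direction.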
By convolving an image with different filters, we can perform operations such as edge detection, blurring, and sharpening. The example below shows the images produced by applying different types of filters (kernels).
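As a rough sketch of this idea, the kernels below are commonly used textbook choices for blur, sharpen, and edge detection. They are assumptions picked for illustration and may differ from the ones in the figure. The test image is a made-up 5 x 5 patch with a vertical edge between a dark region (0) and a bright region (10):

```python
def apply_filter(image, kernel):
    """Convolve a square kernel over the image with stride 1 and no padding."""
    k = len(kernel)
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(len(image[0]) - k + 1)
        ]
        for i in range(len(image) - k + 1)
    ]

# Commonly used 3 x 3 kernels (illustrative choices).
kernels = {
    "blur": [[1 / 9] * 3 for _ in range(3)],               # average of the patch
    "sharpen": [[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
    "edge": [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],
}

# Made-up image: dark on the left, bright on the right.
image = [[0, 0, 10, 10, 10] for _ in range(5)]

results = {name: apply_filter(image, k) for name, k in kernels.items()}

print(results["edge"][0])     # [-30, 30, 0]
print(results["sharpen"][0])  # [-10, 20, 10]
```

The edge kernel responds strongly next to the boundary and gives 0 in the flat bright region, which is exactly why it is useful for finding edges.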
What are Strides?
Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is equal to 1, we move the filter 1 pixel at a time. When the stride is equal to 2, we move the filter 2 pixels at a time, and so on.
The figure below shows how the convolution would work with a stride equal to 2.
We can see that, in the first convolution step, the image pixel values and the weights of the filter (kernel) are multiplied element-wise and summed up (108 in the figure), and the result is written to the top left corner of the output. The filter then moves two steps to the right, and again the image pixel values and the filter weights are multiplied and summed up (126 in the figure) and written to the next cell. This process continues, and once the filter has slid over the whole image, it has produced a feature map.
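The effect of the stride can be sketched as follows. The 5 x 5 input (the numbers 1 to 25) and the all-ones filter are made-up values, so the sums differ from the 108 and 126 in the figure, and `convolve2d_strided` is a hypothetical helper name:

```python
def convolve2d_strided(image, kernel, stride=1):
    """Convolve a square kernel over the image, moving `stride` pixels per step."""
    k = len(kernel)
    # Number of positions where the filter fits along each direction.
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    return [
        [
            sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]

# Made-up 5 x 5 image holding the numbers 1..25, and a 3 x 3 filter of ones.
image = [[5 * i + j + 1 for j in range(5)] for i in range(5)]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

stride_1 = convolve2d_strided(image, kernel, stride=1)
stride_2 = convolve2d_strided(image, kernel, stride=2)

print(len(stride_1), len(stride_1[0]))  # 3 3
print(stride_2)                         # [[63, 81], [153, 171]]
```

A larger stride skips positions, so the feature map gets smaller: 3 x 3 with a stride of 1, but only 2 x 2 with a stride of 2.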
Next, an important term comes into the picture: “padding”.
So, what is padding and why is it required?
Sometimes the filter (kernel) does not fit the input image perfectly. In that case, we have to apply some padding to the image to solve this problem.
Padding is an additional layer that we can add to the border of an image. For example, in the figure below, one extra layer of pixels has been added around a 4 x 4 image, which turns it into a 6 x 6 image.
With this extra frame, the filter can also cover the edge pixels of the image properly, so less information from the borders is lost. The more information the network keeps, the better it can learn.
Apart from that, the padded image is now larger than the original image. The convolution will still shrink it, but the resulting feature map stays closer to the original size than it would without padding. So that’s how padding works.
We have two options when the filter does not fit:
· Pad the picture with zeros (zero-padding) so that it fits.
· Drop the part of the image where the filter did not fit. This is called valid padding, which keeps only the valid part of the image.
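Both options can be sketched in plain Python. The helper names `zero_pad` and `output_size` are hypothetical, and the 3 x 3 image is made up for illustration. The size formula used here, (n + 2p - f) / s + 1 rounded down, is the standard one for an n x n input, an f x f filter, padding p, and stride s:

```python
def zero_pad(image, pad):
    """Surround the image with `pad` rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]           # zero rows on top
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # zeros left and right
    padded.extend([0] * width for _ in range(pad))       # zero rows at the bottom
    return padded

def output_size(n, f, pad=0, stride=1):
    """Side length of the feature map for an n x n input and an f x f filter."""
    return (n + 2 * pad - f) // stride + 1

# Made-up 3 x 3 image.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

padded = zero_pad(image, pad=1)
for row in padded:
    print(row)

print(output_size(3, 3))         # 1  (valid padding: the output shrinks)
print(output_size(3, 3, pad=1))  # 3  (zero-padding: same size as the input)
```

With one layer of zeros, a 3 x 3 filter produces a feature map that is as large as the original image, while valid padding lets it shrink.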
Now that we have our feature map, the next step is to apply an activation function to it (say, the ReLU activation function).
ReLU stands for Rectified Linear Unit, and it is a non-linear operation.
The output of the ReLU function is given as
ƒ(x) = max(0, x).
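Applied to a feature map, this simply means replacing every negative value with zero and leaving the rest untouched. A minimal sketch, with made-up feature map values:

```python
def relu(feature_map):
    """Apply f(x) = max(0, x) to every value of the feature map."""
    return [[max(0, value) for value in row] for row in feature_map]

# Made-up feature map containing some negative values.
feature_map = [
    [-30, 30, 0],
    [5, -2, 7],
]

rectified = relu(feature_map)
print(rectified)  # [[0, 30, 0], [5, 0, 7]]
```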
So, the next question arises: why is ReLU important?
The ReLU activation function is applied to introduce non-linearity into our Convolutional Neural Network. Convolution by itself is a linear operation, while most of the real-world data we want our CNN to learn is non-linear. ReLU adds that non-linearity by replacing every negative value in the feature map with zero.
Other non-linear functions, such as tanh or sigmoid, can also be used instead of ReLU. But most data scientists and researchers use the ReLU activation function, since in practice it usually performs better than the other activation functions.
Next comes the pooling layer of a Convolutional Neural Network.
Pooling Layer
Pooling layers are applied to reduce the number of parameters when the images are too large. Spatial pooling, also called subsampling or down-sampling, reduces the dimensionality of each feature map but retains the important information. Spatial pooling can be of different types:
Max Pooling
Average Pooling
Sum Pooling
Max pooling takes the largest element from each window of the rectified feature map. Average pooling takes the average of the elements in each window instead of the largest one. Sum pooling takes the sum of all elements in each window.
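The three variants differ only in how each window is reduced to a single number. Here is a minimal sketch, assuming non-overlapping 2 x 2 windows (a common choice, but an assumption here) and a made-up 4 x 4 feature map. `pool2d` is a hypothetical helper name:

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size windows (the stride equals the size)."""
    reducers = {
        "max": max,
        "sum": sum,
        "average": lambda window: sum(window) / len(window),
    }
    reduce_window = reducers[mode]
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            # Collect the values inside the current window.
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled

# Made-up 4 x 4 rectified feature map.
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(pool2d(feature_map, mode="max"))      # [[6, 8], [3, 4]]
print(pool2d(feature_map, mode="average"))  # [[3.25, 5.25], [2.0, 2.0]]
print(pool2d(feature_map, mode="sum"))      # [[13, 21], [8, 8]]
```

In every case the 4 x 4 map shrinks to 2 x 2, which is the down-sampling effect described above.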
Next comes the final layer, a fully connected layer, which is applied mainly for classification.
Fully Connected Layer
In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer, just like in an ordinary neural network.
In the above diagram, the feature map matrix is converted into a vector (x1, x2, x3, …). With the fully connected layers, we combine these features together to create a model. Finally, we apply an activation function such as SoftMax or sigmoid to classify the outputs as a cat, dog, etc.
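Flattening and the fully connected layer can be sketched like this. The pooled feature map, the weights, and the biases are all made-up values. In a trained network the weights and biases would be learned, not written by hand:

```python
def flatten(feature_map):
    """Turn a 2D feature map into a flat vector (x1, x2, x3, ...)."""
    return [value for row in feature_map for value in row]

def fully_connected(x, weights, biases):
    """One score per class: dot product of x with that class's weights, plus a bias."""
    return [
        sum(w * xi for w, xi in zip(class_weights, x)) + b
        for class_weights, b in zip(weights, biases)
    ]

# Made-up pooled feature map.
pooled = [
    [6, 8],
    [3, 4],
]

# Hypothetical weights and biases: one row of weights per class.
classes = ["cat", "dog"]
weights = [
    [0.5, -0.25, 0.25, 0.0],   # cat
    [-0.5, 0.25, 0.0, 0.5],    # dog
]
biases = [0.5, 0.0]

x = flatten(pooled)
scores = fully_connected(x, weights, biases)

print(x)       # [6, 8, 3, 4]
print(scores)  # [2.25, 1.0]
print(classes[scores.index(max(scores))])  # cat
```

An activation function such as SoftMax would then turn these raw scores into class probabilities.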
Complete Convolutional Neural Network
Summary
1. Provide the input image to the convolution layer.
2. Choose the parameters, and apply filters with strides and padding if required. Perform convolution on the image and apply the ReLU activation to the resulting matrix.
3. Perform pooling to reduce the dimensionality.
4. Add as many convolutional layers as required.
5. Flatten the output and feed it into a fully connected layer (FC layer).
6. Output the class using an activation function (logistic regression with cost functions) and classify the image.
There are many architectures based on Convolutional Neural Networks, such as AlexNet, VGGNet, GoogLeNet, and ResNet. Later, I will try to explain each architecture in detail.
I hope that after reading this article you now know what a Convolutional Neural Network is, the different terminology used in CNNs, and how it all actually works.
In the next articles, I will come back with detailed explanations of some other topics. For more blogs and courses on data science, machine learning, artificial intelligence, and new technologies, do visit us at InsideAIML.