Convolutional neural networks

Kajal Pawar

8 months ago

Table of Contents
  • What is Convolutional neural network or CNN?
  • Simple Neural Network
  • Convolutional Neural Network
      1. Convolutional layer
      2. Pooling layer
      3. Fully connected layer
In our previous articles, we saw what an artificial neural network (ANN) is and how it actually works. In this tutorial, we will learn about another type of neural network architecture, known as the convolutional neural network, or CNN for short.
We can think of convolutional neural networks as a combination of biology and math with a little CS sprinkled in, and these networks have been some of the most influential innovations in the field of computer vision.
2012 was the first year that neural nets grew to prominence, when Alex Krizhevsky used them to win that year’s ImageNet competition (basically, the annual Olympics of computer vision), dropping the classification error record from 26% to 15%, an astounding improvement at the time. Ever since then, a host of companies have been using deep learning at the core of their services. Facebook uses neural nets for their automatic tagging algorithms, Google for their photo search, Amazon for their product recommendations, Pinterest for their home feed personalization, and Instagram for their search infrastructure.
So, you might be wondering: what are the applications of convolutional neural networks?
Convolutional neural networks (ConvNets or CNNs) have many applications. Some of the most common are image recognition, image classification, object detection and face recognition.
Now that you have had a brief introduction to how CNNs rose to prominence and where they are mostly used, let’s dive deeper and learn about them in detail.

What is Convolutional neural network or CNN?

Convolutional neural networks are very similar to ordinary neural networks. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. They still have a loss function (e.g. SVM/Softmax) on the last (fully connected) layer, and all the tips and tricks we developed for learning regular neural networks still apply.
So, what’s the main difference between them?
Convolutional neural network architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
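To get a feel for the reduction in parameters, here is a rough back-of-the-envelope sketch. The 300 x 300 RGB image and the 3 x 3 filter are sizes assumed purely for illustration, and biases are ignored.

```python
# Illustrative sizes only (assumptions, not values from a real model).
height, width, channels = 300, 300, 3

# Fully connected: a single hidden neuron is wired to every input value.
weights_per_dense_neuron = height * width * channels

# Convolutional: a single 3 x 3 filter spans all channels and is reused
# at every position of the image, so its weights are shared.
filter_size = 3
weights_per_conv_filter = filter_size * filter_size * channels

print(weights_per_dense_neuron)  # 270000
print(weights_per_conv_filter)   # 27
```

Under these assumptions, one fully connected neuron needs 270,000 weights, while one convolutional filter needs only 27.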
Let’s visualize it.

Simple Neural Network

Figure: Simple neural network with input layer, two hidden layers and an output layer.

Convolutional Neural Network

                               Figure: Convolutional Neural Network
The two figures contrast a simple 3-layer neural network with a convolutional neural network, which arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a CNN transforms a 3D input volume into a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height are the dimensions of the image, and its depth is 3, one for each of the red, green and blue (RGB) channels.
Let’s take the example of an image classification problem and try to understand how convolutional neural networks work.
Imagine you have a dog and a cat at home and many different photos of them. Let’s say they are both quite small and look somewhat similar to each other. When looking at the photos, you can tell the two apart quite easily. But how can a computer do this? How can it look at an image and tell you whether it shows a cat or a dog? This is where deep learning with convolutional neural networks (CNNs) comes in.
A CNN image classifier takes an image as input, processes it and classifies it under certain categories (e.g., dog, cat). A computer sees an input image as an array of pixels whose size depends on the image resolution. Based on the resolution, it sees an h x w x d array (h = height, w = width, d = depth, i.e. the number of color channels).
For example, when a computer takes an image as input, it sees an array of pixel values. For a small color image, this might be a 28 x 28 x 3 array of numbers (the 3 refers to the RGB values). For a color image in JPG format of size 300 x 300, the representative array is 300 x 300 x 3. Each of these numbers has a value from 0 to 255, which represents the pixel intensity at that point. These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer. The idea is that you give the computer this array of numbers and it outputs numbers that describe the probability of the image belonging to each class (.80 for cat, .20 for dog, etc.).
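As a small sketch of what "an array of pixel values" looks like in practice, the snippet below builds a stand-in for a 300 x 300 color image with NumPy. The pixel values are random and only illustrate the shape and value range; a real photo would be loaded from a file instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a decoded 300 x 300 RGB image: random intensities in 0..255.
image = rng.integers(0, 256, size=(300, 300, 3), dtype=np.uint8)

print(image.shape)  # (300, 300, 3) -> height x width x depth
print(image.dtype)  # uint8, so every value lies between 0 and 255

# Each channel is an ordinary 2D matrix of intensities.
red_channel = image[:, :, 0]
print(red_channel.shape)  # (300, 300)
```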
Convolutional neural networks (CNNs) basically contain three types of layers:
1.    Convolutional layer
2.    Pooling layer
3.    Fully connected layer
Basically, to train a deep learning CNN model, each input image passes through a series of convolutional layers with filters (kernels), pooling layers and fully connected (FC) layers, after which an activation function such as softmax is applied to classify the image with probability values between 0 and 1. The image is assigned to the class with the highest probability.
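The last step, turning raw class scores into probabilities and picking the most likely class, can be sketched as follows. The two scores are made-up numbers standing in for the output of a trained network.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

class_names = ["cat", "dog"]
scores = np.array([2.0, 0.6])  # hypothetical raw scores from the last layer

probs = softmax(scores)
predicted = class_names[int(np.argmax(probs))]

print(probs)      # two values between 0 and 1 that sum to 1
print(predicted)  # cat
```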
Let’s look at the complete flow a CNN uses to process an input image and classify the objects in it, with the help of a figure.
                             Figure 2: Neural network with many convolutional layers
Let’s now try to understand each layer in detail and learn how CNNs work.
Convolutional layer
The convolutional layer is the core building block of any convolutional neural network and does most of the computational heavy lifting. The first layer in a CNN is always a convolutional layer. It preserves the relationship between pixels by learning image features using small squares of input data. Convolution is a mathematical operation that takes two inputs, an image matrix and a filter (or kernel), multiplies the filter element-wise with each small patch of the image and sums the results, which amounts to a dot product between the filter and the patch.
                      Figure: Image matrix multiplies with kernel or filter matrix
Let’s consider a 5 x 5 image matrix whose pixel values are 0 and 1 and a 3 x 3 filter matrix with some weights (here also 0 and 1), as shown in the figure.
                             Figure: Image matrix multiplies kernel or filter matrix
As the filter slides, or convolves, around the input image matrix, it multiplies the values in the filter with the original pixel values of the image (element-wise multiplication) and sums them into a single number at each position. The resulting matrix is called a “feature map” or “activation map”, as shown in the figure below.
                                                        Figure: 3 x 3 Feature map
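The sliding-window computation can be sketched in a few lines of NumPy. The 5 x 5 image and 3 x 3 filter below are illustrative values made of 0s and 1s and are not necessarily the ones shown in the figure. Note that, as in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, without padding."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    feature_map = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k_h, j:j + k_w]
            # Element-wise multiplication followed by a sum.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Illustrative 5 x 5 image and 3 x 3 filter (assumed values).
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

A 5 x 5 input and a 3 x 3 filter give a 3 x 3 feature map because the filter fits in 5 - 3 + 1 = 3 positions along each axis.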
We can convolve an image with different filters to perform operations such as edge detection, blurring and sharpening. The example below shows the results of applying different types of filters (kernels) to an image.
                                          Figure: Different filters/kernels                                                      
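As a rough illustration, here are three commonly used 3 x 3 kernels applied to a toy image with a single vertical edge. The kernel values are standard textbook choices, not necessarily those in the figure, and `apply_filter` is a helper name made up for this sketch (it relies on `sliding_window_view`, available in NumPy 1.20 and later).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def apply_filter(image, kernel):
    """Apply `kernel` at every position where it fully fits (no padding)."""
    windows = sliding_window_view(image, kernel.shape)  # (out_h, out_w, k_h, k_w)
    return np.einsum("ijkl,kl->ij", windows, kernel)

kernels = {
    "edge_detect": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float),
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    "box_blur": np.full((3, 3), 1.0 / 9.0),
}

# Toy image: dark left half, bright right half, i.e. one vertical edge.
image = np.zeros((5, 6))
image[:, 3:] = 10.0

results = {name: apply_filter(image, kernel) for name, kernel in kernels.items()}
for name, result in results.items():
    print(name)
    print(result.round(2))
```

The edge-detection kernel responds only around the boundary between the dark and bright halves and gives zero in the flat regions, while the blur kernel smooths the jump into a gradual ramp.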
What are Strides?
Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time. When the stride is 2, we move the filter 2 pixels at a time, and so on. The figure below shows how the convolution works with a stride of 2.
                                               Figure: Stride equal to 2
So we can see that, in the first step, the image pixel values and the filter weights are multiplied element-wise and summed (giving, say, 108 in the figure), and the result is written to the top-left cell of the output. The filter then moves two steps to the right, and the pixel values and filter weights are again multiplied and summed (giving, say, 126 in the figure) to fill the next cell. This process continues until the filter has slid over the whole image, producing the feature map.
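A minimal sketch of a strided convolution is shown below. The 7 x 7 input holding the numbers 1 to 49 and the all-ones filter are assumptions chosen to keep the arithmetic easy to check; they do not reproduce the 108 and 126 from the figure.

```python
import numpy as np

def convolve2d_strided(image, kernel, stride=1):
    """Convolve without padding, moving the filter `stride` pixels per step."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + k_h, left:left + k_w]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(1, 50).reshape(7, 7)  # hypothetical 7 x 7 input: 1..49
kernel = np.ones((3, 3), dtype=int)     # hypothetical filter: all ones

feature_map = convolve2d_strided(image, kernel, stride=2)
print(feature_map.shape)  # (3, 3)
print(feature_map[0, 0])  # 81, the sum of the top-left 3 x 3 patch
print(feature_map[0, 1])  # 99, the patch two columns to the right
```

With a stride of 2, the filter visits only every second position, so the 7 x 7 input produces a 3 x 3 feature map instead of the 5 x 5 map a stride of 1 would give.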
Next, an important term comes into the picture: “padding”.
So, what is padding and why is it required?
Sometimes the filter does not fit the input image perfectly, so at times we have to apply some padding to the image to solve this problem.
Padding is an additional border of pixels that we add around an image. For example, in the figure below, one more layer of pixels is added around the 4 x 4 image. Adding one pixel on every side grows each dimension by two, so the image becomes 6 x 6.
                                                                  Figure: Padding 
So, now the filter can also be placed over the edge pixels of the image, which means they contribute to as many output values as the pixels in the middle. Without padding, the information at the borders is used less often and is partly lost.
Apart from that, the padded input is now larger than the original image. Each convolution still shrinks its input, but starting from the padded image the feature map ends up closer to the original size (or exactly the same size) than it would without padding. So that’s how padding works.
We have some options to apply padding:
·       Pad the picture with zeros (zero-padding) so that it fits
·       Drop the part of the image where the filter does not fit. This is called valid padding, and it keeps only the valid part of the image.
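The effect of zero-padding on the output size can be sketched with `np.pad`. The helper `output_size` is a name made up for this example and implements the usual formula (n + 2p - f) / s + 1, rounded down, for an n x n input, an f x f filter, padding p and stride s.

```python
import numpy as np

def output_size(n, f, padding=0, stride=1):
    """Side length of the feature map for an n x n input and an f x f filter."""
    return (n + 2 * padding - f) // stride + 1

image = np.arange(1, 17).reshape(4, 4)  # hypothetical 4 x 4 image

# Zero-padding: add a one-pixel border of zeros on every side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (6, 6)

# 3 x 3 filter, stride 1:
print(output_size(4, 3, padding=0))  # 2 -> valid padding, the output shrinks
print(output_size(4, 3, padding=1))  # 4 -> zero-padding keeps the 4 x 4 size
```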
Now that we have our feature map, the next step is to apply an activation function to it (say, the ReLU activation function).
ReLU stands for Rectified Linear Unit, a non-linear operation.
The output of the ReLU function is given by
                                                                ƒ(x) = max(0,x).
So, the next question arises: why is ReLU important?
ReLU is applied to introduce non-linearity into our convolutional neural network. Convolution by itself is a linear operation, whereas most of the real-world data we want a CNN to learn from is non-linear. ReLU provides the non-linearity by replacing every negative value in the feature map with zero and leaving the non-negative values unchanged.
                                                              Figure: ReLU operation
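In code, ReLU is a one-liner. The small feature map below uses made-up values, just to show the negative entries being clipped to zero.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

feature_map = np.array([[3, -5],
                        [-1, 2]])  # hypothetical values

rectified = relu(feature_map)
print(rectified)
# [[3 0]
#  [0 2]]
```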
Other non-linear functions, such as tanh or sigmoid, can also be used instead of ReLU. But most data scientists and researchers use ReLU, since in practice it generally performs better than the other activation functions in this kind of network.
Next comes the pooling layer in convolutional neural networks.
Pooling Layer
Pooling layers are applied to reduce the size of the feature maps, and with it the amount of computation and the number of parameters in later layers, when the images are too large. Spatial pooling, also called subsampling or down-sampling, reduces the dimensionality of each map but retains the important information. Spatial pooling can be of different types:
·       Max Pooling
·       Average Pooling
·       Sum Pooling
Max pooling takes the largest element from each window of the rectified feature map. Instead of the largest element, we could take the average of the elements in each window, which is called average pooling. Taking the sum of all elements in each window is called sum pooling.
                                                       Figure: Max Pooling
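The three pooling variants can be sketched with a single helper. This is a minimal version that assumes 2 x 2 windows with a stride of 2 and a feature map whose height and width are even; the 4 x 4 values are made up.

```python
import numpy as np

def pool2x2(feature_map, reducer):
    """Apply `reducer` (np.max, np.mean or np.sum) to each 2 x 2 window."""
    h, w = feature_map.shape
    # Regroup the map into non-overlapping windows: (h/2, w/2, 2, 2).
    windows = feature_map.reshape(h // 2, 2, w // 2, 2).swapaxes(1, 2)
    return reducer(windows, axis=(2, 3))

feature_map = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
])

max_pooled = pool2x2(feature_map, np.max)
avg_pooled = pool2x2(feature_map, np.mean)
sum_pooled = pool2x2(feature_map, np.sum)

print(max_pooled)  # [[6 5] [7 9]]
print(avg_pooled)  # [[3.5 2. ] [2.5 6. ]]
print(sum_pooled)  # [[14  8] [10 24]]
```

In each case the 4 x 4 map is reduced to 2 x 2, so only a quarter of the values are passed on to the next layer.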
Next comes the final layer, the fully connected layer, which is applied mainly for classification.
Fully Connected Layer
In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer, as in an ordinary neural network.
                                    Figure: After pooling layer, flattened as FC layer
In the above diagram, the feature map matrix is converted into a vector (x1, x2, x3, …). With the fully connected layers, we combine these features together to create a model. Finally, we apply an activation function such as softmax or sigmoid to classify the outputs as cat, dog, etc.
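Flattening and the fully connected layer can be sketched as below. The pooled feature maps and the weights are random placeholders, so the resulting probabilities are meaningless; in a real network the weights would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 4 feature maps of size 2 x 2.
pooled = rng.random((2, 2, 4))

# Flatten the 3D volume into the vector (x1, x2, x3, ...).
x = pooled.reshape(-1)
print(x.shape)  # (16,)

class_names = ["cat", "dog"]
# Untrained placeholder weights: one row of weights per class.
weights = rng.normal(size=(len(class_names), x.size))
bias = np.zeros(len(class_names))

# Fully connected layer: every output is connected to every input.
scores = weights @ x + bias

# Softmax turns the scores into class probabilities.
exp = np.exp(scores - scores.max())
probs = exp / exp.sum()
print(dict(zip(class_names, probs.round(3))))
```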
