
Convolutional Neural Networks Explained

Neha Kumawat

2 years ago

Table of Contents
  • Introduction
  • What is Convolutional Neural Network or CNN?
  • Simple Neural Network
  • Convolutional Neural Network
  • Convolutional Layer
  • Pooling Layer
  • Fully Connected Layer
  • Complete Convolutional Neural Network
  • Summary


          In our previous article, we saw what an Artificial Neural Network (ANN) is and how it actually works. In this tutorial, we will learn about another type of neural network architecture known as the Convolutional Neural Network, commonly called a CNN.
We can think of Convolutional Neural Networks as a combination of biology and math with a little computer science sprinkled in, and these networks have been some of the most influential innovations in the field of computer vision.
2012 was the first year that neural nets grew to prominence, when Alex Krizhevsky used them to win that year’s ImageNet competition (basically, the annual Olympics of computer vision), dropping the classification error record from 26% to 15%, an astounding improvement at the time. Ever since then, a host of companies have used deep learning at the core of their services. Facebook uses neural nets for its automatic tagging algorithms, Google for photo search, Amazon for product recommendations, Pinterest for home feed personalization, and Instagram for its search infrastructure.
So, you might be wondering: what are the applications of Convolutional Neural Networks?
Convolutional Neural Networks (ConvNets or CNNs) have many applications. Some of the most common are image recognition, image classification, object detection, face recognition, and so on.
Now that you have a brief introduction to how CNNs came about and where they are mostly used, let’s dive deep and learn about them in detail.

What is Convolutional Neural Network or CNN?

          Convolutional Neural Networks are very much similar to ordinary Neural Networks. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply.
So, what’s the main difference between them?
Convolutional Neural Networks architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
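To see why this assumption matters, here is a rough back-of-the-envelope comparison (the 300 x 300 image size and 3 x 3 filter size are illustrative, not from a specific network): a single fully connected neuron must have one weight per input pixel value, while a convolutional filter reuses the same small set of weights at every position in the image.

```python
# Parameter-count sketch: one fully connected neuron on a 300x300 RGB image
# needs a weight for every input value, while a 3x3 convolutional filter
# over 3 channels reuses just 27 weights across the whole image.
fc_weights_per_neuron = 300 * 300 * 3    # 270,000 weights for ONE neuron
conv_weights_per_filter = 3 * 3 * 3      # 27 shared weights for one filter

print(fc_weights_per_neuron)    # 270000
print(conv_weights_per_filter)  # 27
```

That difference, repeated across many neurons and many filters, is why CNNs have vastly fewer parameters than fully connected networks on image inputs.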
Let’s visualize it.

Simple Neural Network

Simple neural network with input layer, two hidden layers and an output layer | insideAIML

Convolutional Neural Network

Convolutional Neural Network | insideAIML
          From the images above, we can see a simple 3-layer neural network, and how a Convolutional Neural Network arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a CNN transforms a 3D input volume into a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height are the dimensions of the image, and its depth is 3, representing the image’s RGB (Red, Green, Blue) channels.
Let’s take an example of an image classification problem and try to understand how Convolutional Neural Networks work.
Imagine you have a dog and a cat at home, along with many photos of them. Let’s say the animals are quite small and look somewhat similar to each other. Looking at the photos, you can differentiate the two quite easily. But how can a computer do this? How can it look at an image and tell you whether it’s a cat image or a dog image? This is where deep learning Convolutional Neural Networks (CNNs) come in.
CNN image classification takes an image as input, processes it, and classifies it under certain categories (e.g., dog, cat). Computers see an input image as an array of pixels whose size depends on the image resolution: h x w x d (h = height, w = width, d = depth, i.e., the number of color channels).
For example, when a computer sees an image (takes an image as input), it sees an array of pixel values. Depending on the resolution and size of the image, it might see a 28 x 28 x 3 array of numbers (the 3 refers to the RGB channels). Say we have a color image in JPG format of size 300 x 300; the representative array will be 300 x 300 x 3. Each of these numbers has a value from 0 to 255, representing the pixel intensity at that point. These numbers, while meaningless to us when we classify images ourselves, are the only inputs available to the computer. The idea is that you give the computer this array of numbers and it outputs numbers that describe the probability of the image belonging to each class (0.80 for a cat, 0.20 for a dog, etc.).
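As a minimal sketch of this, here is how such an image would look as an array in Python (the 28 x 28 x 3 shape and random pixel values are purely illustrative):

```python
import numpy as np

# A computer "sees" an image as a 3-D array of pixel intensities in [0, 255]:
# height x width x color channels. Random values stand in for a real photo.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28, 3), dtype=np.uint8)

print(image.shape)  # (28, 28, 3)
print(image[0, 0])  # the RGB values of the top-left pixel
```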
Convolutional Neural Network (CNNs) basically contains three types of layers:
1.     Convolutional layer
2.     Pooling layer
3.     Fully connected layer
Basically, to train a deep learning CNN model, each input image passes through a series of convolutional layers with filters (kernels), pooling layers, and fully connected (FC) layers, and then an activation function such as softmax is applied to classify the image with probabilistic values between 0 and 1. The image is assigned to the class with the maximum probability value.
Let’s look at the complete flow of a CNN as it processes an input image and classifies the objects in it, with the help of an image.
Neural network with many convolutional layers | insideAIML
Let’s now try to understand each layer in detail and learn how CNNs work.

Convolutional Layer

          The convolutional layer is the core building block of any Convolutional Neural Network and does most of the computational heavy lifting. The first layer in a CNN is always a convolutional layer. It preserves the relationship between pixels by learning image features from small squares of input data. Convolution is a mathematical operation that takes two inputs, an image matrix and a filter (or kernel), computes an element-wise multiplication between them, and sums the results.
Image matrix multiplies with kernel or filter matrix | insideAIML
Let’s consider a 5 x 5 image matrix whose pixel values are 0 and 1, and a 3 x 3 filter matrix with some weights (here, 0s and 1s), as shown in the figure.
Image matrix multiplies kernel or filter matrix | insideAIML
As the filter slides, or convolves, across the input image matrix, it multiplies the values in the filter with the original pixel values of the image (i.e., computes element-wise multiplications) and sums them, producing a matrix called a “feature map” or “activation map”, as shown in the figure below.
3 x 3 Feature map | insideAIML
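The sliding multiply-and-sum described above can be sketched in a few lines of NumPy. The 5 x 5 binary image and 3 x 3 binary filter below are illustrative stand-ins for the values in the figure:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution with stride 1: slide the kernel over the image,
    multiply element-wise, and sum to fill one cell of the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

print(convolve2d(image, kernel))  # a 3 x 3 feature map
```

A 3 x 3 filter over a 5 x 5 image fits in 3 x 3 positions, which is why the feature map comes out 3 x 3.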
By convolving an image with different filters, we can perform operations such as edge detection, blurring, and sharpening. The example below shows various convolved images after applying different types of filters (kernels).
Different filters/kernels | insideAIML
What are Strides?
          Stride is the number of pixels the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time. When the stride is 2, we move the filter 2 pixels at a time, and so on. The figure below shows how the convolution works with a stride of 2.
Stride equal to 2 | insideAIML
So we can see that in the first convolution, the image pixel values and the filter/kernel weights are multiplied element-wise and summed up (say, 108 in the image) and placed in the first cell of the feature map. Then the filter moves two steps to the right, and the image pixel values and filter weights are again multiplied and summed up (say, 126 in the image). This process continues, and once the filter has slid over the entire image, it produces the feature map.
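The same loop as before can be extended with a stride parameter, as a sketch (the 7 x 7 input and all-ones filter are illustrative): the window now jumps `stride` pixels at a time, so the feature map shrinks faster.

```python
import numpy as np

def convolve2d_stride(image, kernel, stride=2):
    # Element-wise multiply-and-sum as before, but the window jumps
    # `stride` pixels between positions, producing a smaller feature map.
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(49).reshape(7, 7)   # toy 7x7 image
kernel = np.ones((3, 3))              # toy 3x3 filter

print(convolve2d_stride(image, kernel, stride=2).shape)  # (3, 3)
```

With stride 1 the same setup would give a 5 x 5 feature map; stride 2 cuts it down to 3 x 3.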
Next, an important term comes into picture “padding”.
So, what is padding and why is it required?
Sometimes the filter does not fit the input image perfectly, so we apply some padding to the image to solve this problem.
Padding is an additional border of pixels added around an image. For example, in the figure below, one extra layer of pixels is added around a 4 x 4 image, converting it into a 6 x 6 image.
Padding | insideAIML
So, now the frame also covers the edge pixels of the image. More information means more accuracy, and that is how a neural network works.
Apart from that, the padded input is now larger than the original image. The convolution will still shrink the output, but the result stays much closer to the original size than it would without padding. That’s how padding works.
We have some options to apply padding:
  • Pad the picture with zeros (zero-padding) so that it fits.
  • Drop the part of the image where the filter does not fit. This is called valid padding, which keeps only the valid part of the image.
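Zero-padding is easy to see in code. A minimal sketch (the 4 x 4 all-ones image is illustrative): adding a one-pixel border of zeros turns a 4 x 4 image into a 6 x 6 image, so a 3 x 3 filter can also be centered on the edge pixels.

```python
import numpy as np

# Zero-padding: surround the image with a one-pixel border of zeros.
image = np.ones((4, 4))
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)  # (6, 6)
print(padded)        # zeros along the border, original values inside
```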
Now that we have our feature map, the next step is to apply an activation function to it (say, the ReLU activation function).
ReLU stands for Rectified Linear Unit; it is a non-linear operation.
The output of the ReLU function is given as
                                                                ƒ(x) = max(0,x).
So, the next question arises: why is ReLU important?
The ReLU activation function is applied to introduce non-linearity into our Convolutional Neural Network, since real-world data is mostly non-linear and we want our CNN to learn such non-linear relationships.
ReLU operation | insideAIML
There are other non-linear functions available, such as tanh or sigmoid, that can be used instead of ReLU. However, most data scientists and researchers use the ReLU activation function because it generally performs better than the other activation functions.
Next comes the pooling layer in Convolutional Neural Networks.

Pooling Layer

          Pooling layers are applied to reduce the number of parameters when the images are too large. Spatial pooling (also called subsampling or down-sampling) reduces the dimensionality of each feature map but retains the important information. Spatial pooling can be of different types:
  • Max Pooling
  • Average Pooling
  • Sum Pooling
Max pooling takes the largest element from each window of the rectified feature map. Average pooling takes the average of the elements in the window instead, and sum pooling takes the sum of all elements in the window.
Max Pooling | insideAIML
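Max pooling can be sketched with the same sliding-window pattern as convolution, except each window keeps only its largest value (the 4 x 4 feature map below is illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Slide a non-overlapping window over the feature map and
    # keep only the largest value in each window.
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = feature_map[r:r + size, c:c + size].max()
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 2],
                 [7, 2, 9, 0],
                 [1, 8, 3, 4]])

print(max_pool(fmap))  # [[6. 5.] [8. 9.]]
```

A 2 x 2 window with stride 2 halves each spatial dimension, cutting the number of activations by a factor of four while keeping the strongest responses.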
Next comes the final layer, the fully connected layer, which is mainly used for classification.

Fully Connected Layer

          In the layer we call the FC layer, we flatten our matrix into a vector and feed it into a fully connected layer, just like in an ordinary neural network.
After pooling layer, flattened as FC layer | insideAIML
In the diagram above, the feature map matrix is converted into a vector (x1, x2, x3, …). The fully connected layers combine these features together to create a model. Finally, we apply an activation function such as softmax or sigmoid to classify the output as cat, dog, etc.
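The flatten-plus-fully-connected step can be sketched as follows. The sizes and weights here are random and purely illustrative; in a trained network the weights would come from training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is a pooled feature map coming out of the earlier layers.
pooled = rng.standard_normal((4, 4))
x = pooled.flatten()                 # shape (16,): the vector (x1, x2, ...)

# One fully connected layer mapping 16 features to 2 classes (cat, dog).
W = rng.standard_normal((2, 16))
b = np.zeros(2)
logits = W @ x + b

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

probs = softmax(logits)
print(probs)  # two non-negative values that sum to 1
```

Softmax turns the raw scores into a probability distribution over the classes, which is exactly the (0.80 cat, 0.20 dog)-style output described earlier.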

Complete Convolutional Neural Network

Complete Convolutional Neural Network | insideAIML


1.     Provide the input image to the convolutional layer.
2.     Choose parameters and apply filters with strides, and padding if required. Perform convolution on the image and apply the ReLU activation to the resulting matrix.
3.     Perform pooling to reduce the dimensionality.
4.     Add as many convolutional layers as required.
5.     Flatten the output and feed it into a fully connected (FC) layer.
6.     Output the class using an activation function such as softmax (essentially a logistic-regression-style classifier with a cost function) and classify the image.
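The steps above can be strung together in a minimal NumPy forward pass. All sizes and weights below are random and illustrative, so the prediction itself is meaningless; this only sketches how data flows through the pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, k):                  # step 2: valid convolution, stride 1
    h, w = img.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def relu(x):                         # step 2: non-linearity
    return np.maximum(0, x)

def max_pool(x, s=2):                # step 3: down-sampling
    h, w = x.shape
    return x[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).max(axis=(1, 3))

def softmax(z):                      # step 6: class probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

img = rng.standard_normal((8, 8))    # step 1: toy grayscale input
kernel = rng.standard_normal((3, 3))

feat = max_pool(relu(conv2d(img, kernel)))  # steps 2-3: conv -> ReLU -> pool
x = feat.flatten()                          # step 5: flatten
W = rng.standard_normal((2, x.size))        # FC layer for 2 classes
probs = softmax(W @ x)                      # step 6

print(probs)  # two probabilities summing to 1
```

A real CNN would stack several such conv/pool stages (step 4) and learn all the weights by backpropagation; the structure of the computation, however, is exactly this.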
There are many architectures available such as AlexNet, VGGNet, GoogLeNet, and ResNet based on Convolutional Neural networks. Later, I will try to explain to you each architecture in detail.
I hope that after reading this article you finally know what a Convolutional Neural Network is, the different terminologies used in CNNs, and how it actually works.
In the next articles, I will come with a detailed explanation of some other topics. For more blogs/courses on data science, machine learning, artificial intelligence, and new technologies do visit us at InsideAIML.
Thanks for reading…
