Convolutional Neural Networks (CNNs) are particularly useful for spatial data analysis, image recognition, computer vision, natural language processing, signal processing, and a variety of other purposes. They are biologically motivated by the way neurons in the visual cortex respond to visual stimuli.
What makes CNNs much more powerful than other feed-forward networks for image recognition is that they do not require as much human intervention or as many parameters as networks such as the MLP do. This is primarily because CNNs have neurons arranged in three dimensions.
CNNs make all of this magic happen by taking a set of inputs and passing them through one or more of the following main hidden layers to generate an output.
- Convolution Layers
- Pooling Layers
- Fully Connected Layers
Let’s dig deeper into the utility of each of the above layers.
Convolution Layers– Before we move this discussion any further, let’s remember that any image or similar object can be represented as a matrix of numbers ranging between 0 and 255. The size of this matrix is determined by the size of the image in the following fashion-
Height X Width X Channels
Channels = 1 for grey-scale images
Channels = 3 for colored images
For example, if we feed in a grey-scale image that is 28 by 28 pixels square, it will be represented as a 28*28*1 matrix. Each of the 784 pixels can take any value between 0 and 255, depending on the grey-scale intensity.
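The image-as-matrix idea above can be sketched in a few lines of NumPy (the random images here are just stand-ins for real pixel data):

```python
import numpy as np

# A grey-scale image is simply a 2-D array of intensities in [0, 255].
rng = np.random.default_rng(0)
grey = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)      # 28 x 28 x 1
color = rng.integers(0, 256, size=(28, 28, 3), dtype=np.uint8)  # 3 channels (RGB)

print(grey.size)    # 784 pixel values
print(color.shape)  # (28, 28, 3)
```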
Now let’s talk about what happens in a convolution layer. The main objective of this layer is to derive features of an image by sliding a smaller matrix, called a kernel or filter, over the entire image through convolution.
What is convolution? Convolution means taking a dot product between the filter and each local region of the image.
Kernels can be of many types, such as edge detection, blob of color, sharpening, blurring, etc. You can find some main kernels over here. Please note that while we can specify the number of filters during the network training process, the network will learn the filter values on its own.
As a result of the convolution layers, the network creates a number of feature maps. The size of each feature map depends on the number of filters (kernels), the size of the filters, the padding (zero padding to preserve size), and the strides (steps by which a filter scans the original image). Please note that a non-linear activation function such as ReLU or Tanh is applied at each convolution layer to generate modified feature maps.
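The sliding dot product described above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation; the sharpening kernel is one of the common kernel types mentioned earlier, and the output size follows the usual formula (image size − filter size + 2 × padding) / stride + 1:

```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the image, taking a dot product at each position."""
    if padding:
        image = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(region * kernel)  # dot product with the local region
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4 x 4 "image"
sharpen = np.array([[0, -1, 0],
                    [-1, 5, -1],
                    [0, -1, 0]], dtype=float)     # a classic sharpening kernel

feature_map = convolve2d(image, sharpen)
activated = np.maximum(feature_map, 0)  # ReLU non-linearity applied to the feature map
print(feature_map.shape)  # (4 - 3 + 0) / 1 + 1 = 2, so (2, 2)
```

In a real CNN the kernel values are learned during training rather than hand-crafted as here.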
Pooling Layer– The arrays generated from the convolution layers are generally very big, so a pooling layer is used predominantly to reduce the feature maps while retaining their most important aspects. In other words, it facilitates “downsampling” using algorithms such as max pooling or average pooling. Moreover, because it truncates the number of parameters in the network, this layer also helps avoid overfitting. It is common to place pooling layers between convolution layers.
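Max pooling, the most common downsampling choice, simply keeps the largest value in each window. A minimal NumPy sketch:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Downsample by keeping the maximum value in each size x size window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 4, 3, 8]], dtype=float)

pooled = max_pool(fm)
print(pooled)  # [[6. 4.]
               #  [7. 9.]]
```

Note that a 2 x 2 pool with stride 2 quarters the number of values, which is exactly the parameter reduction mentioned above.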
Fully Connected Layer– Here every neuron is connected to all of the neurons in the previous and next layers. This layer takes the matrix inputs from the previous layers, flattens them, and passes the result on to the output layer, which in turn makes a prediction such as a classification probability.
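The flatten-then-predict step can be sketched as a single matrix multiply followed by a softmax. The feature-map sizes and random weights below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
pooled = rng.standard_normal((8, 12, 12))  # e.g. 8 pooled feature maps of 12 x 12

flat = pooled.reshape(-1)                  # flatten to a 1152-dimensional vector
W = rng.standard_normal((10, flat.size)) * 0.01  # fully connected weights, 10 classes
b = np.zeros(10)

scores = W @ flat + b                      # every output neuron sees every input
probs = np.exp(scores - scores.max())      # softmax turns scores into probabilities
probs /= probs.sum()

print(probs.shape)  # (10,) — one probability per digit class
```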
Here is an excellent write-up which provides further details on all of the above steps.
Now that we know enough about how a CNN works, let’s code-
In this example, we will work with the MNIST dataset and build a CNN to recognize handwritten digits from 0 to 9, using classification accuracy as the metric to evaluate the model’s performance. Please see the link for a working MNIST CNN.
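The linked code has the full details; as a minimal sketch (assuming TensorFlow 2.x with Keras), a small MNIST CNN stacking the three layer types discussed above might look like this:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Convolution -> pooling -> convolution -> pooling -> fully connected layers.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                 # 28 x 28 grey-scale input
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),          # probability per digit class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training (commented out so the sketch runs instantly; uncomment to train):
# (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# x_train = x_train[..., None] / 255.0
# model.fit(x_train, y_train, batch_size=128, epochs=5, validation_split=0.1)

# An untrained forward pass still shows the output shape:
probs = model.predict(np.zeros((1, 28, 28, 1)), verbose=0)
print(probs.shape)  # (1, 10)
```

The layer counts, filter sizes, and epochs here are illustrative choices, not the exact configuration in the linked example.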
Please note that CNNs need a very high amount of computational power and memory, so it’s recommended that you run this on GPUs or in the cloud; CPUs may not be able to fit the model. Furthermore, you may need to reduce the batch size to ensure the algorithm runs successfully.
As you can see, the above model achieves over 99% classification accuracy.