How a CNN (Convolutional Neural Network) Works
Convolutional Neural Network
A convolutional neural network (CNN or ConvNet) is a type of feed-forward neural network inspired by the organization of the visual cortex.
As we will see later, a CNN consists of multiple stages and, as in the visual cortex, each stage specializes in a different task. Without going into detail, we can say that the human brain simplifies information to allow us to recognize objects, and our neural network will do the same. For example, one of the intermediate stages in the brain specializes in extracting the shapes or characteristics of the image being viewed. We will find the same thing in CNNs.
A convolutional neural network generally operates like all other feed-forward networks. It consists of an input block, one or more hidden layers that perform calculations using activation functions (for example, ReLU), and an output block that performs the actual classification. The difference from classic feed-forward networks lies in the presence of convolutional layers.
So, what role do these convolutional layers play in the process?
Convolutional layers perform a crucial job as they extract features from images using filters.
Unlike a traditional feed-forward network that processes the “general information of the image”, a CNN classifies the image based on specific characteristics. Depending on the type of filter used, different features can be identified in the reference image, such as the contours of shapes, vertical lines, horizontal lines, diagonals, etc.
Let’s simplify the concept.
Compared to a simple feed-forward network, a CNN works on more specific information (local features rather than the image as a whole), which makes it more efficient. The operation of a CNN can be represented as:
Input->Conv->ReLU->Pool->Conv->ReLU->Pool->Conv->ReLU->Pool->FullyConnected
Furthermore, considering that the ReLU function is an integral part of the Conv layer, we can reduce the CNN to the following schema:
Input->Conv->Pool->Conv->Pool->Conv->Pool->FullyConnected
Let’s understand the role of each block.
Typically, each convolutional layer is followed by a Max-Pooling layer, gradually reducing the matrix size while increasing the level of “abstraction”. This transition goes from elementary filters, such as vertical and horizontal lines, to more sophisticated filters capable of recognizing features like headlights, windshields, etc., until the last level where it can distinguish a car from a truck.
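To make the gradual reduction of the matrix size concrete, here is a minimal sketch that tracks the side length of a square input through three Conv and Pool stages. The specific numbers are assumptions made for illustration, not values given in this article: a 32 x 32 input, 3 x 3 filters applied without padding and with a stride of 1, and 2 x 2 pooling windows. The helper names are hypothetical.

```python
def conv_output_size(size, kernel=3):
    """Side length after a convolution with no padding and stride 1 (assumed)."""
    return size - kernel + 1


def pool_output_size(size, window=2):
    """Side length after non-overlapping pooling with a square window (assumed)."""
    return size // window


size = 32          # assumed side length of the input image
sizes = [size]     # side length after each step
for _ in range(3):  # three Conv -> Pool stages, as in the schema above
    size = conv_output_size(size)
    sizes.append(size)
    size = pool_output_size(size)
    sizes.append(size)

print(sizes)  # [32, 30, 15, 13, 6, 4, 2]
```

Under these assumptions the matrix shrinks from 32 x 32 to 2 x 2 before reaching the fully connected layer. Real networks often use padding or different window sizes, which changes these numbers.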
The input layer consists of a sequence of neurons capable of receiving the image’s information. This layer will receive the data vector representing the image pixels. For example, for a 32 x 32 color image, the input vector will have a length of 32 x 32 x 3; for every pixel in the 32 x 32 image, there will be three values representing the image’s three colors in RGB format (Red, Green, and Blue).
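The arithmetic above can be checked with a short sketch. It uses plain Python lists to stay dependency-free; the nested-list image layout and the `flatten_image` helper are assumptions made for illustration only.

```python
HEIGHT, WIDTH, CHANNELS = 32, 32, 3  # the 32 x 32 RGB example from the text


def flatten_image(image):
    """Return every channel value of the image as one flat list."""
    return [value for row in image for pixel in row for value in pixel]


# A dummy all-black image (rows -> pixels -> (R, G, B)), used only to check sizes.
image = [[(0, 0, 0) for _ in range(WIDTH)] for _ in range(HEIGHT)]
input_vector = flatten_image(image)

print(len(input_vector))  # 3072, i.e. 32 * 32 * 3
```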
The Convolutional Layer (Conv) is the main layer of the network. Its goal is to detect, with high precision, patterns such as curves, angles, circles, or squares depicted in an image. There can be multiple filters, and the more filters there are, the greater the variety of features that can be detected. But how does the convolutional layer work? In practice, a digital filter (a small mask) slides over different positions of the input image; for each position, an output value is generated by performing the dot product between the mask and the covered portion of the input (both treated as vectors).
In the example in the figure, the filter is a 3×3 matrix, so the sliding window considers only a 3×3 portion of the input image at a time, which is multiplied element by element with the convolution filter.
Following is the result with an example filter:
In conclusion, the image will be scanned piece by piece, resulting in a smaller matrix of values that represents the “characterized image”.
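The sliding-mask procedure described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the 5 x 5 image and the vertical-edge filter are made-up values, the `convolve2d` name is hypothetical, and the code assumes no padding and a stride of 1. Like most CNN libraries, it does not flip the filter before sliding it.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the result."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Dot product between the mask and the patch it currently covers.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Made-up 5 x 5 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Made-up 3 x 3 filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each of the three rows prints [3, 3, 0]
```

The 5 x 5 input becomes a 3 x 3 matrix. The high values mark the positions where the window covers the dark-to-bright transition, and the value drops to 0 where the window covers a uniform area.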
The ReLU Layer (Rectified Linear Unit) discards the unhelpful values produced by the previous layers by setting every negative value to zero. It is placed after the convolutional layers.
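As a minimal sketch, ReLU applied to a small matrix of values looks like this. The input values are made up for illustration, and the `relu` helper is a hypothetical name.

```python
def relu(feature_map):
    """Replace every negative value with 0 and keep the others unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]


example = [
    [3, -2, 0],
    [-5, 1, 4],
]
print(relu(example))  # [[3, 0, 0], [0, 1, 4]]
```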
The Pooling Layer checks whether the feature detected by the previous layer is present in each region and coarsens the image while retaining that feature. In other words, the pooling layer aggregates information, generating smaller feature maps.
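A short sketch of max pooling, the variant mentioned earlier, shows how the aggregation works. It assumes non-overlapping 2 x 2 windows, and the input values and the `max_pool2d` name are made up for illustration.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: window and stride are both `size`."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Collect the values inside the current window and keep the largest.
            window = [
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


example = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool2d(example))  # [[4, 2], [2, 8]]
```

Each 2 x 2 block is replaced by its largest value, so the 4 x 4 matrix becomes 2 x 2 while the strongest responses are kept.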
The FC Layer (Fully Connected) is the layer that actually classifies the images.
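One common way to picture this step, offered here as an assumption rather than as the method this article prescribes, is a weighted sum of the flattened features for each class, followed by picking the class with the highest score. All numbers below are made up, the `fully_connected` helper is hypothetical, and the labels reuse the car and truck example from earlier. In a trained network the weights and biases are learned from data.

```python
def fully_connected(features, weights, biases):
    """Return one score per class: weighted sum of the features plus a bias."""
    return [
        sum(w * x for w, x in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]


labels = ["car", "truck"]
features = [0.9, 0.1, 0.4]   # made-up flattened output of the last pooling layer
weights = [
    [1.0, -1.0, 0.5],        # made-up weights for "car"
    [-0.5, 2.0, 0.0],        # made-up weights for "truck"
]
biases = [0.0, 0.1]          # made-up biases

scores = fully_connected(features, weights, biases)
predicted = labels[scores.index(max(scores))]
print(predicted)  # car
```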