
Page 75 Bah et al. Intell Robot 2022;2(1):72-88 | http://dx.doi.org/10.20517/ir.2021.16

Figure 1. Architecture model of the proposed framework.

dataset.
5. Confirm the validity of the model by the number of parameters, run-time, and accuracy recorded on the
Cohn–Kanade (CK+) and FERGIT datasets.

2. METHODS
In this section, we introduce our improved Convolutional Neural Network (CNN) with Residual Blocks. The
model is an end-to-end deep learning framework for classifying emotions from human faces. It has seven
blocks in total: three Convolutional Blocks, three Residual Blocks, and one Classification Block. This study
considered strategies, such as CNNs, that can be used indefinitely in quick, responsive systems with short
processing and reaction times.

2.1. Proposed model architecture
In our proposed architecture, the feature extraction part consists of twelve convolutional sub-blocks, each
using a Rectified Linear Unit (ReLU) activation function and a kernel initializer set to he_normal in the
convolutional layers. A Residual Block is added after every four convolutional layers. This block, also called
a skip connection or identity mapping, consists of two convolutional layers, each followed by a batch
normalization layer; the output of each Residual Block is added to the output of the preceding convolution
and then activated. In the basic network, each pair of layers is followed by a batch normalization layer, a
max-pooling layer, and a dropout layer. These are followed by a global average pooling layer and a dense
layer as the output. In the final output layer, we use the softmax activation function to classify the seven
emotions. These details are shown in the architecture of the proposed framework in Figure 1.
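
The block layout described above can be sketched as a plain list of labels. This is a structural illustration only, not the authors' implementation: the function names and layer labels are hypothetical, the text gives no filter counts, and the exact ordering of the batch normalization, max-pooling, and dropout layers relative to each Residual Block is one possible reading of the description.

```python
# Structural sketch: the strings below are illustrative labels, not a real API.

def conv_block(index, num_convs=4):
    """One Convolutional Block: conv layers with ReLU and he_normal.

    Assumption: each pair of conv layers is followed by batch
    normalization, max-pooling and dropout.
    """
    layers = []
    for i in range(1, num_convs + 1):
        layers.append(f"conv{index}_{i} [ReLU, he_normal]")
        if i % 2 == 0:  # after each pair of conv layers
            layers += ["batch_norm", "max_pool", "dropout"]
    return {"type": "convolutional", "layers": layers}


def residual_block(index):
    """One Residual Block: two conv layers, each followed by batch norm,
    then the skip connection is added and the sum is activated."""
    layers = [
        f"res{index}_conv1", "batch_norm",
        f"res{index}_conv2", "batch_norm",
        "add(skip connection)", "ReLU",
    ]
    return {"type": "residual", "layers": layers}


def classification_block(num_classes=7):
    """Global average pooling followed by a softmax dense layer."""
    layers = ["global_average_pooling", f"dense({num_classes}) [softmax]"]
    return {"type": "classification", "layers": layers}


def build_plan(num_conv_blocks=3, num_classes=7):
    """Alternate Convolutional and Residual Blocks, then classify."""
    plan = []
    for b in range(1, num_conv_blocks + 1):
        plan.append(conv_block(b))      # four conv layers
        plan.append(residual_block(b))  # one residual block per four convs
    plan.append(classification_block(num_classes))
    return plan


def count_layers(plan, block_type, prefix):
    """Count layers whose label starts with `prefix` in blocks of one type."""
    return sum(
        1
        for block in plan
        if block["type"] == block_type
        for layer in block["layers"]
        if layer.startswith(prefix)
    )


if __name__ == "__main__":
    for block in build_plan():
        print(block["type"], "->", ", ".join(block["layers"]))
```

Under these assumptions the plan contains seven blocks and twelve convolutional layers in the feature-extraction part, which agrees with the counts given in the text.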

In our framework, we found the best positions for the residual blocks by trial and error. As a result, the
number of parameters is considerably lower than in the original Deep ResNet [14], and the network was fast
to train (see Table 1).
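
To make the notion of a parameter count concrete, the following minimal sketch applies the standard formulas for the trainable parameters of a convolutional layer and a dense layer. The channel and filter sizes in the example are hypothetical placeholders chosen for illustration; they are not taken from the paper or from Table 1.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Trainable parameters of a 2D convolution: one kernel of size
    kernel_h x kernel_w x in_channels per output filter, plus one bias each."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def dense_params(in_features, out_features, bias=True):
    """Trainable parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


if __name__ == "__main__":
    # Hypothetical sizes: a 5 x 5 kernel on a single-channel input with
    # 64 filters, and a 7-class output fed by 256 pooled features.
    print("first conv:", conv2d_params(5, 5, in_channels=1, out_channels=64))
    print("classifier:", dense_params(in_features=256, out_features=7))
```

Global average pooling has no trainable parameters, so the size of the final dense layer depends only on the number of channels produced by the last convolution and on the number of classes.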

2.1.1. Convolution
Because of its structure, the CNN is arguably the most suitable architecture for computer vision
tasks [21]. The basic operation is the convolution, which consists of merging two sets of information.
The convolutional layer's job is to multiply the previous layer's image pixels by a learnable convolutional
kernel at the corresponding positions and then calculate the weighted sum of the multiplied results [22].
For the first convolution operation, we applied a convolution filter (kernel) of size 5 × 5 with a stride of 1
and padding of 2; the padding maintains the shape of the input image. The output shape is obtained using this
