
Page 75                                Bah et al. Intell Robot 2022;2(1):72-88  |  http://dx.doi.org/10.20517/ir.2021.16
























                                          Figure 1. Architecture model of the proposed framework.


                  dataset.
               5. Confirm the validity of the model by the number of parameters, run-time, and accuracy recorded on the
                  Cohn–Kanade (CK+) and FERGIT datasets.



               2. METHODS
               In this section, we introduce our improved Convolutional Neural Network (CNN) with Residual Blocks. This
               model is an end-to-end deep learning framework for classifying emotions on human faces. The model has a
               total of 7 blocks (3 Convolutional Blocks, 3 Residual Blocks, and one Classification Block). This study looked
               at strategies that can be used indefinitely, such as CNNs, for quick and responsive systems with short processing
               and reaction times.


               2.1. Proposed model architecture
               In our proposed architecture, the feature extraction part consists of twelve convolutional sub-blocks, with a
               Rectified Linear Unit (ReLU) activation function and the kernel initializer set to he_normal in the convolutional
               layers. A Residual Block is added after every four convolutional layers. This block, also called a skip connection
               or identity mapping, consists of two convolutional layers, each followed by a batch normalization layer;
               the result of each Residual Block is added to the output of the previous convolution and then activated. In the
               basic network, each pair of layers is followed by a batch normalization layer, a max-pooling layer, and a dropout
               layer. These are followed by a global average pooling layer and a dense layer as the output. In the final output
               layer, we used the softmax activation function to classify the seven emotions. These details can be seen in the
               architecture of our proposed framework, shown in Figure 1.
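
               The data flow described above (a skip connection that adds a block's input back to its output before
               activation, followed by global average pooling and a softmax over the classes) can be illustrated with a
               minimal pure-Python sketch. This is not the authors' implementation: the helper names residual_block,
               scale_and_shift, global_average_pool, and softmax are hypothetical, and scale_and_shift is only a
               stand-in for a "convolution + batch normalization" pair so that the example stays self-contained.

```python
import math
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def relu(v: Vector) -> Vector:
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, x) for x in v]


def scale_and_shift(weight: float, bias: float) -> Layer:
    """Hypothetical stand-in for one 'convolution + batch normalization' pair."""
    return lambda v: [weight * x + bias for x in v]


def residual_block(x: Vector, layers: List[Layer]) -> Vector:
    """Run the layers in sequence, add the block input back, then activate."""
    out = x
    for layer in layers:
        out = layer(out)
    # Skip connection: element-wise sum of the block input and the block output.
    summed = [a + b for a, b in zip(x, out)]
    return relu(summed)


def global_average_pool(feature_maps: List[List[List[float]]]) -> Vector:
    """Reduce each 2-D feature map to its mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

               For example, residual_block([1.0, -2.0], [scale_and_shift(0.5, 0.0), scale_and_shift(2.0, 0.0)]) sums
               the input with the transformed input and clips negative values to zero, and softmax applied to seven
               logits returns seven probabilities that sum to one, one per emotion class.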


               In our framework, we located the best positions for the residual blocks by trial and error. As a result,
               the number of parameters is reduced considerably compared to the original Deep ResNet [14], and the
               network is fast to train (see Table 1).


               2.1.1. Convolution
               Because of its structure, the CNN is arguably the most suitable architecture for computer vision
               tasks [21]. The basic operation is the convolution, which consists of merging two sets of information.
               The convolutional layer's job is to multiply the previous layer's image pixels by a learnable convolutional kernel
               at the corresponding positions and then calculate the weighted sum of the multiplied results [22]. For the first
               convolution operation, we applied a convolution filter (kernel) of size 5 × 5 with a stride of 1 and padding of
               2; the padding is used to keep the output the same shape as the input image. The output shape is obtained using this