

2.1.4. Batch normalization
Batch Normalization (BN) is a regularization technique [29] that speeds up and stabilizes the training of Deep
Neural Networks (DNNs). BN avoids the problem of massive gradient updates, which cause divergent loss and
uncontrollable activations as network depth increases. It normalizes the activation vectors of hidden layers
using the mean and variance of the current batch [30]. In this research, we placed the BN layer after the
activation in the simple Convolutional Blocks and before the activation in the Residual Blocks, see Figure 1.
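As an illustration of these two placements, a minimal sketch is given below, assuming a Keras/TensorFlow implementation (the framework, filter counts, and kernel sizes here are assumptions for illustration, not the paper's exact configuration):

    from tensorflow.keras import layers

    def simple_conv_block(x, filters):
        # Simple Convolutional Block: BN placed AFTER the activation.
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.Activation("relu")(x)
        x = layers.BatchNormalization()(x)
        return x

    def residual_branch(x, filters):
        # Residual branch: BN placed BEFORE the activation.
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        return x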

               2.1.5. Max pooling
Pooling is performed to reduce the dimensionality of the convolved image [24]. Applying a pooling operation
reduces the number of parameters and helps fight overfitting. Max pooling takes the maximum pixel value
within a given window [31]; the operation itself has no learnable parameters. In our work, we used a 2 × 2
window and a stride of 2 for all max-pooling layers. The output size is again given by Equation (1), with
padding set to 0. With these parameters, the height and width of each feature map are divided by 2.
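For concreteness, this pooling configuration can be expressed as a single layer, again as a sketch under the assumption of a Keras/TensorFlow implementation (the 32-channel shape in the comment is only an example):

    from tensorflow.keras import layers

    # 2 x 2 max pooling with stride 2 and no padding: halves the height and
    # width of each feature map and introduces no learnable parameters.
    pool = layers.MaxPooling2D(pool_size=(2, 2), strides=2, padding="valid")
    # e.g. a (48, 48, 32) feature map becomes (24, 24, 32)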

               2.1.6. Dropout
Dropout [32] is by far the most widely used regularization technique for Deep Neural Networks. It boosts the
accuracy of the model and helps avoid overfitting. The idea of dropout is to randomly prevent some neurons
from firing at a given step with a probability given by the rate p [33], while the outputs of the remaining neurons
are scaled up by 1/(1 − p) so that the expected sum inside the neuron remains unchanged. The same neuron may
be active again at the next step, and so forth. p is the hyper-parameter of the dropout layer; in our study we
found that the best value of p is 0.3 for the early layers of the feature extraction part and 0.4 for the last
Convolutional Block.
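A minimal sketch of these dropout settings, assuming a Keras/TensorFlow implementation (Keras applies the 1/(1 − p) scaling automatically at training time):

    from tensorflow.keras import layers

    # Rates reported above: 0.3 in the early feature-extraction blocks and
    # 0.4 in the last Convolutional Block. During training, activations are
    # dropped with probability p and the survivors are scaled by 1 / (1 - p).
    early_dropout = layers.Dropout(rate=0.3)
    late_dropout = layers.Dropout(rate=0.4)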

               2.1.7. Residual block
The Residual Block, also known as an identity shortcut connection, was used in our study. It computes the function

                                                         H(x) = F(x) + x                                (8)

where H(x) represents the output mapping, x is the input, and F(x) is the residual function [14]. The advantage
of this structure in our study is that it considerably reduced the loss during training and increased the accuracy
on the test set. The residual block is used to mitigate the problem of vanishing gradients: by skipping some
connections, gradients can back-propagate through the entire network, which yields better performance.
In our implementation, we found that using a 1 × 1 convolution in the shortcut branch is not suitable, as it
does not help to reduce overfitting, see Figure 1.
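The identity-shortcut block of Equation (8) can be sketched as follows, again assuming a Keras/TensorFlow implementation (kernel sizes are illustrative, and the input is assumed to already have `filters` channels so the addition is shape-compatible; this is not the authors' exact code):

    from tensorflow.keras import layers

    def residual_block(x, filters):
        shortcut = x                                     # identity shortcut, no 1 x 1 convolution
        y = layers.Conv2D(filters, 3, padding="same")(x)
        y = layers.BatchNormalization()(y)               # BN before activation (Section 2.1.4)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.Add()([y, shortcut])                  # F(x) + x
        return layers.Activation("relu")(y)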


               2.1.8. Global average pooling
Most CNN research uses a flatten layer [34] to wrap the features extracted by the preceding convolutional
layers into a 1-D vector and forward them to the fully connected layers. Global Average Pooling is a pooling
technique used to substitute the fully connected layers of traditional CNNs [22]. In this study, with a global
average pooling layer, the resulting vector, which contains the average of each feature map, is fed directly into
the softmax layer instead of constructing fully connected layers on top of the feature maps.
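A sketch of this classification head, under the same Keras/TensorFlow assumption (the 7 output classes correspond to the FER-2013 emotion labels):

    from tensorflow.keras import layers

    def classification_head(feature_maps, num_classes=7):
        # Each feature map is reduced to its spatial average, and the resulting
        # vector feeds the softmax layer directly, with no fully connected
        # layers in between.
        x = layers.GlobalAveragePooling2D()(feature_maps)
        return layers.Dense(num_classes, activation="softmax")(x)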

               2.2. Data description
In this study, we mainly used the FERGIT dataset, which is a combination of the FER-2013 and muxspace
datasets. The FER-2013 database was collected from the internet, and most pictures were captured in the wild
using search-engine queries. Human accuracy on this dataset is relatively low, at about 65% [35]. The FERGIT
dataset comprises 49,300 detected faces in grayscale at 48 × 48 pixels. The images shown in Figure 2 are sample
emotions from the FER-2013 dataset.

The FER-2013 dataset itself has many problems, thus making it very difficult for deep learning architectures to achieve