

Where $W$ represents the height or width (assuming height = width in this study), $K$ is the size of one side of the kernel, $P$ is the padding (a zero padding is applied here), and $S$ represents the stride. An input grayscale image of size 48 × 48 passing through the first convolution layer therefore keeps the same output shape of 48 × 48; see the details in Equation (2).



$$\frac{48 - 5 + 2 \times 2}{1} + 1 = \frac{47}{1} + 1 = 47 + 1 = 48 \qquad (2)$$
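As a quick check of Equation (2), the arithmetic can be reproduced with a short helper; the function name and signature below are our own illustration, not the paper's code:

```python
def conv_output_size(input_size: int, kernel_size: int, padding: int, stride: int) -> int:
    """Output size of a convolution: (W - K + 2P) / S + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# First convolution layer of the paper: 48x48 input, 5x5 kernel, zero padding of 2, stride 1.
print(conv_output_size(48, kernel_size=5, padding=2, stride=1))  # -> 48
```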

We started the convolution with 16 filters and doubled the number at each block, with the final convolutional layer having 512 filters. When convolving the first layer to output the feature map, a 5 × 5 kernel was chosen to achieve a more detailed extraction of facial expressions at various scales, while also significantly reducing the number of parameters. The deeper we move into the network, the more the number of convolution kernels grows, the stronger the features the network learns, and the higher the recognition accuracy. In this work, we chose a moderate, appropriate number of filters after several trials, which reduced the number of parameters and thus the computational time and overfitting. The reason for not using many more filters is that, in FER, the main parts the network should focus on are the mouth (towards the corners of the lips), the nose, the eyebrows, the crow's feet, the eyelids, and the eyes [23].
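A minimal Keras-style sketch of the convolutional stack described above is given here; only the 16-to-512 doubling of filters, the 5 × 5 first kernel, zero ("same") padding, and the 48 × 48 grayscale input come from the text, while the 3 × 3 later kernels and the pooling schedule are our assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_feature_extractor(input_shape=(48, 48, 1)):
    """Sketch: 16 filters doubled per block up to 512, zero ('same') padding throughout."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for i, filters in enumerate((16, 32, 64, 128, 256, 512)):
        kernel = 5 if i == 0 else 3          # 5x5 first kernel per the text; 3x3 later is assumed
        model.add(layers.Conv2D(filters, kernel, padding="same", activation="relu",
                                kernel_initializer="he_normal"))
        if i < 4:                            # pooling schedule assumed: 48 -> 24 -> 12 -> 6 -> 3
            model.add(layers.MaxPooling2D(pool_size=2))
    return model

print(build_feature_extractor().output_shape)  # (None, 3, 3, 512)
```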

               2.1.2. Rectified linear unit
The convolution operation is given by the following formula:

$$\sum (x \times w) + b \qquad (3)$$

Where $x$ is the input, $w$ the weight, and $b$ the bias. Equation (3) is linear, so it follows the mathematical rules:

$$f(x + y) = f(x) + f(y) \qquad (4)$$

$$f(\lambda x) = \lambda f(x) \qquad (5)$$
Therefore, to keep the entire network from collapsing into a single equivalent convolutional layer, a nonlinear activation function is needed [24]. The Rectified Linear Unit (ReLU) [25] is one of the most widely used nonlinear activation functions for convolution layers in the studied literature [25]. Its function is:

$$f(x) = \max(0, x) \qquad (6)$$

Where $x$ is the input; the result will be 0 if $x < 0$ and $x$ if $x > 0$. We used this activation function in our study because, with the framework being deep, it considerably reduces the training time.
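Equation (6) amounts to an elementwise maximum with zero; the short NumPy sketch below is our own illustration, not the paper's code:

```python
import numpy as np

def relu(x):
    """ReLU from Equation (6): elementwise max(0, x)."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # -> [0.  0.  0.  1.5 3. ]
```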


               2.1.3. Initializers
               Bias in the neural network is like a constant in a linear function, and research has proved that it plays an impor-
               tant role in a Convolutional Neural Network. It helps the model to match the given data better by adjusting the
output [26]. The goal of initializing the weights and biases is to keep layer activation outputs from exploding or vanishing during the forward pass of a deep neural network [27], because if that happens, the gradients will be either too large or too small, causing the network to converge slowly or not at all. He Normal [28] weight initialization has been used in this study. In this case, the weights are randomly initialized and then multiplied by the following factor:
$$\sqrt{\frac{2}{\text{size}_{l-1}}} \qquad (7)$$
Where $\text{size}_{l-1}$ is the size of layer $l-1$. This strategy ensures that the weights are neither too large nor too small. The biases are initialized to zero, since this is the common technique and it proved to be efficient.
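A minimal NumPy sketch of this initialization, assuming the standard He Normal recipe of scaling standard-normal draws by the factor in Equation (7); the helper name and the example shapes are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(shape, fan_in):
    """He Normal: standard-normal draws scaled by sqrt(2 / fan_in), per Equation (7)."""
    return rng.standard_normal(shape) * np.sqrt(2.0 / fan_in)

# Example: 5x5 kernel, 1 input channel, 16 filters -> fan_in (size of layer l-1) = 5 * 5 * 1.
W = he_normal((5, 5, 1, 16), fan_in=25)
b = np.zeros(16)              # biases initialized to zero, as in the paper
print(round(W.std(), 2))      # close to sqrt(2 / 25) ≈ 0.28
```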