Where the terms denote the input height or width (taken to be equal in this study), the kernel size,
the padding (zero padding is applied here), and the stride.
The 48 × 48 input grayscale image passing through the first convolution layer yields an output of the same
size, 48 × 48, as detailed in Equation (2).
$$\frac{48 - 5 + 2 \times 2}{1} + 1 = \frac{47}{1} + 1 = 47 + 1 = 48 \qquad (2)$$
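The arithmetic of Equation (2) can be checked with a short sketch. The function name `conv_output_size` and its parameter names are illustrative and do not come from the original work; floor division is assumed for strides that do not divide evenly, which is a common convention rather than something stated here.

```python
def conv_output_size(n: int, kernel: int, padding: int, stride: int) -> int:
    """Output size of a convolution along one spatial dimension."""
    # (input size - kernel size + 2 * padding) / stride + 1
    return (n - kernel + 2 * padding) // stride + 1


# 48 x 48 input, 5 x 5 kernel, padding of 2, stride of 1
print(conv_output_size(48, 5, 2, 1))  # 48
```

With a padding of 2 and a stride of 1, the 5 × 5 kernel leaves the 48 × 48 spatial size unchanged, which matches the result of Equation (2).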
We started the convolution with 16 filters and doubled the number at each block, with the final convolutional
layer having 512 filters. For the first layer, which outputs the initial feature map, a 5 × 5 kernel was
chosen to achieve a more detailed extraction of facial expressions at various scales while also significantly
reducing the number of parameters. The deeper the network goes and the larger the convolution kernels become,
the stronger the features the network learns and the higher the recognition accuracy. In this work, after
several trials, we chose a moderate number of filters, which reduced the number of parameters and thus the
computational time and overfitting. We did not use more filters because, in FER, the main regions the network
should focus on are the mouth (particularly the corners of the lips), the nose, the eyebrows, the crow's feet,
the eyelids, and the eyes [23].
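As a rough illustration of the filter progression, the sketch below assumes that the filter count doubles at each block, which is the reading consistent with going from 16 to 512 filters. It also counts the trainable parameters of the first layer, assuming one bias per filter and a single input channel (grayscale). The helper names `filter_schedule` and `conv_params` are hypothetical.

```python
def filter_schedule(first: int, last: int) -> list:
    """Filter counts obtained by doubling from `first` until `last` is reached."""
    counts = [first]
    while counts[-1] < last:
        counts.append(counts[-1] * 2)
    return counts


def conv_params(kernel: int, in_channels: int, filters: int) -> int:
    """Trainable parameters of a conv layer: weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


print(filter_schedule(16, 512))  # [16, 32, 64, 128, 256, 512]
print(conv_params(5, 1, 16))     # 416
```

Under these assumptions, the first layer carries only 416 parameters, which shows how a small initial filter count keeps the model compact.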
2.1.2. Rectified linear unit
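The convolution operation is given by the following formula:
$$x \times w + b \qquad (3)$$
Where $x$ is the input, $w$ the weight and $b$ the bias. Equation (3) is linear (apart from the constant bias term), so it follows the mathematical rules:
$$f(x + y) = f(x) + f(y) \qquad (4)$$
$$f(ax) = a f(x) \qquad (5)$$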
Therefore, to prevent the entire network from collapsing into a single equivalent convolutional layer, a
nonlinear activation function is needed [24]. The Rectified Linear Unit (ReLU) [25] is one of the most widely
used nonlinear activation functions for convolution layers in the literature [25]. It is defined as:
$$f(x) = \max(0, x) \qquad (6)$$
Where $x$ is the input, and the result is 0 if $x < 0$ and $x$ if $x \geq 0$. We used this activation function in our
study because, the framework being deep, it considerably reduces the training time.
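The collapse argument and Equation (6) can be illustrated with a minimal NumPy sketch. Small dense matrices stand in for convolutions here purely for brevity (convolution is likewise a linear operation); the shapes, seed, and variable names are arbitrary choices, not taken from the original work.

```python
import numpy as np


def relu(x):
    """Equation (6): element-wise max(0, x)."""
    return np.maximum(0, x)


rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
w2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

# Two stacked layers without an activation in between...
stacked = w2 @ (w1 @ x + b1) + b2
# ...are equivalent to one layer with merged weight and bias.
merged = (w2 @ w1) @ x + (w2 @ b1 + b2)
print(np.allclose(stacked, merged))  # True

# Placing ReLU between the layers breaks this equivalence in general.
with_relu = w2 @ relu(w1 @ x + b1) + b2

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```

Because the merged weight `w2 @ w1` reproduces the stacked output exactly, depth adds no expressive power without a nonlinearity between layers.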
2.1.3. Initializers
Bias in a neural network is like the constant in a linear function, and research has shown that it plays an
important role in a Convolutional Neural Network. It helps the model fit the given data better by shifting the
output [26]. The goal of initializing the weights and biases is to keep layer activation outputs from exploding
or vanishing during the forward pass of a deep neural network [27]; if that happens, the gradients become either
too large or too small, causing the network to converge slowly or not at all. He Normal [28] weight
initialization was used in this study. In this scheme, the weights are randomly initialized and multiplied by
the following factor:
$$\sqrt{\frac{2}{\text{size}_{l-1}}}$$
Where $\text{size}_{l-1}$ is the size of layer $l - 1$. This strategy ensures that the weights are neither too large nor
too small. The biases are initialized to zero, since this is common practice and proved to be effective.
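A minimal NumPy sketch of this initialization follows, assuming weights drawn from a standard normal distribution and scaled by the factor above. The layer sizes are hypothetical, and the sampler used by the authors' framework may differ (some libraries draw from a truncated normal instead).

```python
import numpy as np


def he_normal(fan_in: int, fan_out: int, rng: np.random.Generator) -> np.ndarray:
    """Draw standard-normal weights and scale them by sqrt(2 / fan_in)."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)


rng = np.random.default_rng(0)
fan_in, fan_out = 400, 256  # hypothetical sizes of layer l-1 and layer l
weights = he_normal(fan_in, fan_out, rng)
biases = np.zeros(fan_out)  # biases start at zero

print(weights.shape)
print(weights.std(), np.sqrt(2.0 / fan_in))  # empirical vs. target standard deviation
```

The empirical standard deviation of the sampled weights should sit close to the target value, so the weight scale shrinks as the preceding layer grows.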