Bah et al. Intell Robot 2022;2(1):72-88 | http://dx.doi.org/10.20517/ir.2021.16         Page 74


               emotions. FER research has used a variety of methods to extract visual features from images, such as
               weighted random forests (WRF) [15]. Hasani and Mahoor [16] used a novel network called ResNet-LSTM
               to capture spatio-temporal information, which connects lower-level features directly to LSTMs. Deep learning
               networks have become the most widely used approach in FER due to their powerful feature extraction capacity.


               Using histogram of oriented gradients (HOG) features in the wavelet domain, Nigam et al. [11] proposed a
               four-step process for efficient FER: face processing, domain transformation, feature extraction and expression
               recognition. In the expression recognition step, the authors used a tree-based multi-class SVM to classify the
               HOG features extracted in the discrete wavelet transform (DWT) domain. The system was trained and tested on the
               CK+, JAFFE and Yale datasets. The accuracies observed on the test sets of these three datasets were 90%, 71.43%
               and 75%, respectively.
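
The feature extraction step described above can be illustrated with a minimal sketch. This is not the implementation of Nigam et al.: the Haar wavelet, the single decomposition level, the 9-bin unsigned histogram and the function names `haar_ll` and `orientation_histogram` are all assumptions made for illustration, and block normalization and the SVM classifier are omitted.

```python
import math

def haar_ll(image):
    """One-level 2D Haar approximation (LL) subband: each 2x2 block sum divided by 2."""
    height = len(image) // 2 * 2
    width = len(image[0]) // 2 * 2
    return [[(image[i][j] + image[i][j + 1]
              + image[i + 1][j] + image[i + 1][j + 1]) / 2.0
             for j in range(0, width, 2)]
            for i in range(0, height, 2)]

def orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0 to 180 degrees)."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for i in range(1, len(cell) - 1):
        for j in range(1, len(cell[0]) - 1):
            gx = cell[i][j + 1] - cell[i][j - 1]  # central difference along x
            gy = cell[i + 1][j] - cell[i - 1][j]  # central difference along y
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / bin_width) % bins] += magnitude
    return hist

# Hypothetical 8x8 input: a horizontal intensity ramp, so every gradient points along x.
ramp = [[float(col) for col in range(8)] for _ in range(8)]
features = orientation_histogram(haar_ll(ramp))
```

For the ramp image, all gradient energy falls into the first orientation bin. In a full pipeline, histograms like this one would be computed per cell, concatenated, and passed to the classifier.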


               After analyzing the facial expression recognition problem in depth, Minaee et al. proposed using an
               attentional convolutional neural network [17] instead of adding layers or neurons. They also suggested
               a visualization technique that finds the parts of the face that are important for detecting
               different emotions, based on the classifier’s output. Their architecture includes a feature extraction part and
               a spatial transformer network that takes the input and uses an affine transformation to warp it to the output.
               They achieved a validation accuracy of 70.02% for classification of the 7 classes in the FER2013
               dataset.
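
The affine warping performed by a spatial transformer can be sketched as follows. This is a generic illustration, not the code of Minaee et al.: in a real spatial transformer the 2x3 matrix `theta` is predicted by a localization network, whereas here it is passed in by hand. The function names, the normalized coordinate range of [-1, 1] and the zero padding outside the image are assumptions.

```python
import math

def bilinear_sample(image, x, y):
    """Bilinearly interpolate image at pixel coordinates (x, y); pixels outside count as 0."""
    height, width = len(image), len(image[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for row in (y0, y0 + 1):
        for col in (x0, x0 + 1):
            if 0 <= row < height and 0 <= col < width:
                weight = (1.0 - abs(x - col)) * (1.0 - abs(y - row))
                value += weight * image[row][col]
    return value

def affine_warp(image, theta):
    """Warp image (at least 2x2) with a 2x3 affine matrix acting on normalized output coordinates."""
    height, width = len(image), len(image[0])
    output = []
    for i in range(height):
        y_t = 2.0 * i / (height - 1) - 1.0  # normalized target row in [-1, 1]
        out_row = []
        for j in range(width):
            x_t = 2.0 * j / (width - 1) - 1.0  # normalized target column in [-1, 1]
            # Map each output location back to a source location.
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            # Convert normalized source coordinates back to pixel coordinates.
            px = (x_s + 1.0) * (width - 1) / 2.0
            py = (y_s + 1.0) * (height - 1) / 2.0
            out_row.append(bilinear_sample(image, px, py))
        output.append(out_row)
    return output

image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mirror = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # flips the image left to right
same = affine_warp(image, identity)
flipped = affine_warp(image, mirror)
```

Because sampling is a weighted sum of neighboring pixels, the warp is differentiable with respect to `theta`, which is what allows the transformation parameters to be learned together with the rest of the network.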


               With the help of the Residual Masking Network [18], the authors focused on a deep architecture with the attention
               mechanism. They used a segmentation network to refine feature maps, enabling the network to focus on
               relevant information to make the correct decision. Their work was divided into 2 parts: the residual masking
               block, which contains a residual layer, and the ensemble method for the combination with 7 different CNNs.
               In the end, they obtained an overall accuracy of 74.14% on the test set of the FER2013 dataset.


               Pu and Zhu [19] developed a FER framework that combines a feature extraction network with a
               pre-trained model. The feature extraction network uses supervised learning of optical flow based on residual
               blocks, and the classifier is the Inception architecture. Testing their method on the CK+ and FER2013 datasets,
               they achieved average accuracies of 95.74% and 73.11%, respectively. To address the fact that CNNs
               require substantial computational resources to train and to perform emotion recognition, Chowanda [20] proposed
               a separable CNN. The experiment compared four networks: networks with and without
               separable modules, using flatten and fully connected layers, and using global average pooling. The proposed
               architecture was faster, had fewer parameters and achieved an accuracy of 99.4% on the CK+ dataset.
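
The parameter savings behind these design choices can be checked with simple arithmetic. The sketch below uses the standard parameter-count formulas for convolution and dense layers. The layer sizes (3x3 kernels, 64 input and 128 output channels, a 6x6x128 final feature map, 7 classes) are hypothetical and are not taken from Chowanda's architecture, and the global average pooling head is assumed to be followed by a single dense layer.

```python
def standard_conv_params(kernel, c_in, c_out, bias=True):
    """Weights of a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

def separable_conv_params(kernel, c_in, c_out, bias=True):
    """Depthwise (one kernel x kernel filter per input channel) plus pointwise 1x1 convolution."""
    depthwise = kernel * kernel * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

def flatten_fc_params(height, width, channels, num_classes):
    """Dense classifier applied to the flattened feature map."""
    return height * width * channels * num_classes + num_classes

def gap_fc_params(channels, num_classes):
    """Dense classifier applied after global average pooling (one value per channel)."""
    return channels * num_classes + num_classes

# Hypothetical layer sizes, chosen only for illustration.
conv_standard = standard_conv_params(3, 64, 128)
conv_separable = separable_conv_params(3, 64, 128)
head_flatten = flatten_fc_params(6, 6, 128, 7)
head_gap = gap_fc_params(128, 7)
```

With these sizes, the separable layer needs 8,960 parameters instead of 73,856, and the pooled head needs 903 instead of 32,263.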


               Deep learning methods have recently attracted considerable interest, and much current research uses
               them to recognize emotions from facial expressions. This study proposes the accurate
               identification of facial emotion using a deep residual-based neural network architecture model. ResNet
               was chosen as the study’s foundation because residual-based network models have been shown to be effective in a
               variety of image recognition applications and have also overcome the problem of overfitting. In our work, we
               used emotional expressions such as happiness, surprise, anger, sadness, disgust, neutral, and fear to pick up
               emotional changes on individual faces. The main contributions of this work are:
               1. Propose a lighter version of CNN using Residual Blocks, with fewer trainable parameters than the
                  more than 23 million of the original ResNet network.
               2. Locate the best position for the Residual Blocks to avoid overfitting and obtain satisfactory
                  performance.
               3. Show the importance of using Residual Blocks compared to the architecture without them.
               4. Weight the cross-entropy loss function in order to deal with the imbalance problem that affects the FERGIT