

               where systems interact with their environment in a meaningful way. In EAI, recognizing and responding
               to human emotions through facial expressions can significantly enhance the adaptability and functionality
               of these systems in various applications such as human-computer interaction, smart healthcare, and safety
               monitoring in driving scenarios.


The complexity of facial expressions arises from dynamic changes in facial muscle movements and subtle variations in facial features, which reflect a person's emotional state and intentions. As shown in Figure 1, Ekman et al. categorized facial expressions into seven basic emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral, which have been extensively used as a foundation in the development of facial expression recognition (FER) technologies [3]. As computer vision technology has advanced, data-driven approaches to
               FER have progressively become more sophisticated, adapting to the diverse and spontaneous nature of human
               expressions captured in both controlled laboratory environments and in more challenging in-the-wild settings.

               Despite the advancements in FER algorithms, the performance of these systems still requires enhancements
               to cope with the diversity and complexity of real-world human expressions. This ongoing development is
               emblematic of EAI’s core challenge: to create systems that not only perceive but also understand and appro-
               priately respond to human cues in a manner that mirrors human cognitive abilities. Datasets for FER, such as
CK+ [4], JAFFE [5], Oulu-CASIA [6], RAF-DB [7], SFEW [8], and FERPlus [9], play a crucial role in training these
               intelligent systems, offering a spectrum of expressions from controlled to naturalistic environments. These
               datasets help in refining the algorithms to achieve higher accuracy and reliability in expression recognition,
               thus enabling EAI systems to engage more naturally and effectively with humans.
Convolutional neural networks (CNNs) have achieved significant performance in FER, as they have in other fields. Mollahosseini et al. designed a deep neural network consisting of two convolution layers, two pooling layers, and four inception layers, with a single-component architecture [10]. It achieved satisfactory results on different public datasets. Shao et al. proposed three different kinds of convolutional neural networks: Light-CNN, dual-branch CNN, and pre-trained CNN, which achieved robust results for facial expressions in the wild [11]. Gursesli et al. designed a lightweight CNN for facial emotion recognition, called the custom lightweight CNN-based model (CLCM), based on the MobileNetV2 architecture [12]. It achieved performance comparable to or better than MobileNetV2 and ShuffleNetV2.
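To make the CNN pipeline concrete, the following is a minimal PyTorch sketch of a small CNN that classifies a face crop into the seven basic emotions. The layer widths, kernel sizes, and the 48x48 grayscale input are illustrative assumptions, not the exact configurations of [10-12].

```python
# A minimal sketch of a small CNN for 7-class FER. Channel widths, kernel
# sizes, and the 48x48 grayscale input are illustrative assumptions, not
# the architectures described in [10-12].
import torch
import torch.nn as nn

class TinyFERCNN(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # local filters over small regions
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # 48x48 -> 24x24
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # 24x24 -> 12x12
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),                 # logits over the 7 basic emotions
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Usage: a batch of eight 48x48 grayscale face crops.
logits = TinyFERCNN()(torch.randn(8, 1, 48, 48))
print(logits.shape)  # torch.Size([8, 7])
```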


The advantage of a CNN is that it exchanges local information within a region through the convolution operation, which focuses on modeling local relationships: each convolutional filter covers only a small receptive field. Although a CNN can extract more abstract features in its deeper layers, it still cannot extract enough global features for FER. In contrast, the vision transformer can capture long-distance dependencies between pixels through its self-attention mechanism, which gives it an advantage in global feature extraction and can compensate for this shortcoming of CNNs.
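The following is a minimal sketch of single-head self-attention over image patch tokens, showing how every patch attends to every other patch in a single step; the token dimension and patch grid are illustrative assumptions rather than the configuration of any specific vision transformer.

```python
# A minimal sketch of single-head self-attention over patch tokens,
# illustrating the global pairwise interactions discussed above.
# Dimensions and the patch grid are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_attention(tokens: torch.Tensor, wq, wk, wv) -> torch.Tensor:
    """tokens: (batch, num_patches, dim). Every patch attends to every other."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (B, N, N): all patch pairs
    return F.softmax(scores, dim=-1) @ v                     # globally weighted mix of values

dim = 64
x = torch.randn(2, 36, dim)  # e.g., a 6x6 grid of patch embeddings
wq, wk, wv = (torch.randn(dim, dim) * dim ** -0.5 for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # torch.Size([2, 36, 64])
```

Unlike a convolution, the (B, N, N) score matrix relates every patch to every other patch regardless of spatial distance, which is what enables the global feature extraction described above.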


Attention mechanisms have also been used extensively to address occlusion and pose variation in FER. In FER tasks, the features most useful for recognition are concentrated in key areas such as the eyes, nose, and mouth, and attention mechanisms improve expression recognition by increasing the weights assigned to these key features.
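A generic spatial-attention pattern can illustrate this re-weighting: the sketch below scores each location of a CNN feature map and scales the features accordingly. It is a minimal illustration of the idea, not the specific attention modules of [13-15].

```python
# A minimal sketch of spatial attention re-weighting a CNN feature map so
# that informative regions (e.g., eyes, mouth) receive larger weights.
# A generic pattern for illustration, not the modules proposed in [13-15].
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv scores each spatial location; sigmoid maps scores to (0, 1).
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        weights = torch.sigmoid(self.score(feat))  # (B, 1, H, W) attention map
        return feat * weights                      # emphasize key facial regions

feat = torch.randn(2, 64, 12, 12)  # feature map from a CNN backbone
out = SpatialAttention(64)(feat)
print(out.shape)  # torch.Size([2, 64, 12, 12])
```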
Sun et al. proposed AR-TE-CATFFNet, integrating three core components: attention-rectified convolution for feature selection, local binary pattern (LBP)/GLCM-based texture enhancement, and cross-attention transformers to fuse RGB-texture features globally, achieving enhanced accuracy and cross-domain generalization [13]. Tao et al. introduced a hierarchical attention network with progressive fusion of local-global contexts and illumination-robust gradients through hierarchical attention modules (HAM), adaptively amplifying discriminative facial cues while suppressing irrelevant regions for improved robustness [14]. A multilayer perceptual attention network was presented by Liu et al. and is capable of learning the potential diversity and essential details of various expressions [15]. Furthermore, the perceptual attention network can