where systems interact with their environment in a meaningful way. In EAI, recognizing and responding
to human emotions through facial expressions can significantly enhance the adaptability and functionality
of these systems in various applications such as human-computer interaction, smart healthcare, and safety
monitoring in driving scenarios.
The complexity of facial expressions arises from dynamic changes in facial muscle movements and subtle variations in facial features, which reflect a person's emotional state and intentions. As shown in Figure 1, Ekman et al. categorized facial expressions into seven basic emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral, which have been extensively used as a foundation in the development of facial expression recognition (FER) technologies [3]. As computer vision technology has advanced, data-driven approaches to
FER have progressively become more sophisticated, adapting to the diverse and spontaneous nature of human
expressions captured in both controlled laboratory environments and in more challenging in-the-wild settings.
Despite the advancements in FER algorithms, the performance of these systems still requires enhancements
to cope with the diversity and complexity of real-world human expressions. This ongoing development is
emblematic of EAI’s core challenge: to create systems that not only perceive but also understand and appro-
priately respond to human cues in a manner that mirrors human cognitive abilities. Datasets for FER, such as
CK+ [4], JAFFE [5], Oulu-CASIA [6], RAF-DB [7], SFEW [8], and FERPlus [9], play a crucial role in training these
intelligent systems, offering a spectrum of expressions from controlled to naturalistic environments. These
datasets help in refining the algorithms to achieve higher accuracy and reliability in expression recognition,
thus enabling EAI systems to engage more naturally and effectively with humans.
Convolutional neural networks (CNNs) have achieved significant performance in FER and related fields. Mollahosseini et al. designed a deep neural network consisting of two convolution layers, two pooling layers, and four inception layers in a single-component architecture [10]; it achieved satisfactory results on different public datasets. Shao et al. proposed three different kinds of convolutional neural networks: Light-CNN, dual-branch CNN, and pre-trained CNN, which achieved robust results for facial expressions in the wild [11]. Gursesli et al. designed a lightweight CNN for facial emotion recognition, called the custom lightweight CNN-based model (CLCM), based on the MobileNetV2 architecture [12], achieving performance comparable to or better than MobileNetV2 and ShuffleNetV2.
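To make the scale of such CNN classifiers concrete, the following is a minimal sketch of a small convolutional classifier for the seven basic emotions, assuming 48x48 grayscale inputs as in FER-style datasets; the layer sizes and the name SmallFERCNN are illustrative choices, not the architecture of any cited work.

import torch
import torch.nn as nn

# Minimal sketch: two convolution + pooling stages followed by a
# linear classifier over the seven basic emotion categories.
class SmallFERCNN(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # local 3x3 receptive field
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # 48x48 -> 24x24
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # 24x24 -> 12x12
        )
        self.classifier = nn.Linear(64 * 12 * 12, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

logits = SmallFERCNN()(torch.randn(8, 1, 48, 48))  # -> (8, 7) class scores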
The advantage of a CNN is that it performs local information exchange within a region through the convolution operation, which focuses on modeling local relationships: each convolutional filter attends to a small region. Although a CNN can extract more abstract features at deeper levels, it still cannot extract sufficient global features for FER. In contrast, the vision transformer can capture long-distance dependencies between pixels through its self-attention mechanism, which gives it an advantage in global feature extraction and can compensate for this shortcoming of CNNs.
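The contrast with local convolution is visible directly in the attention computation: every patch token attends to every other token in a single step. The following is a minimal single-head sketch; the patch count and dimensions are illustrative assumptions, not values from the cited works.

import torch
import torch.nn.functional as F

def self_attention(tokens, w_q, w_k, w_v):
    # tokens: (batch, num_patches, dim); each patch attends to ALL patches,
    # so dependencies between distant facial regions are modeled in one step.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (B, N, N) pairwise
    return F.softmax(scores, dim=-1) @ v                     # globally mixed features

B, N, D = 2, 196, 64                    # e.g., a 14x14 grid of face patches
x = torch.randn(B, N, D)
w_q, w_k, w_v = (torch.randn(D, D) / D ** 0.5 for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # (2, 196, 64)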
Attention mechanisms have also been extensively used to address occlusion and pose variation in FER. In FER tasks, the features useful for recognition are concentrated in key areas such as the eyes, nose, and mouth; the attention mechanism improves expression recognition by increasing the weights of these key features, as sketched below.
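As a concrete illustration of this reweighting, the following is a minimal sketch of a spatial-attention gate that learns a per-location saliency mask over a CNN feature map. It is a generic illustration, not the specific module of any of the works cited below, and the channel count is an assumption.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # 1x1 convolution maps each location's channel vector to a saliency score
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); mask lies in (0, 1) per spatial location
        mask = torch.sigmoid(self.score(feat))
        return feat * mask  # key regions (eyes, nose, mouth) amplified, others suppressed

feat = torch.randn(4, 64, 12, 12)
out = SpatialAttention()(feat)  # same shape, spatially re-weighted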
Sun et al. proposed AR-TE-CATFFNet, integrating three core components: attention-rectified convolution for feature selection, local binary pattern (LBP)/gray-level co-occurrence matrix (GLCM)-based texture enhancement, and cross-attention transformers to fuse RGB-texture features globally, achieving enhanced accuracy and cross-domain generalization [13]. Tao et al. introduced a hierarchical attention network with progressive fusion of local-global contexts and illumination-robust gradients through hierarchical attention modules (HAM), adaptively amplifying discriminative facial cues while suppressing irrelevant regions for improved robustness [14]. A multilayer perceptual attention network was presented by Liu et al., capable of learning the potential diversity and essential details of various expressions [15]. Furthermore, the perceptual attention network can

