He et al. Intell. Robot. 2025, 5(2), 313-32. http://dx.doi.org/10.20517/ir.2025.16
2. RELATED WORK
In recent years, many effective FER methods have been proposed. In this part, we mainly introduce previous
related methods.
2.1. FER based on traditional approaches
Early FER work relied mainly on hand-crafted features and traditional machine-learning classifiers. The hand-crafted features can be divided into appearance-based features and geometric features. The commonly used appearance-based features include LBP [17], Gabor wavelets [5], and histograms of oriented gradients (HOG) [18]. Geometric features are obtained by measuring the relative positions of significant facial components, such as the eyes, nose, and mouth [19,20]. Moreover, the support vector machine (SVM) is the most common and effective machine-learning classification algorithm; Ghimire et al. proposed a FER method using a combination of appearance and geometric features with SVM classification [21]. Although these traditional methods perform well on in-the-lab FER datasets, their performance degrades significantly on in-the-wild datasets, because hand-crafted features are easily affected by lighting, noise, and other factors.
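To make the appearance-based pipeline concrete, the LBP descriptor can be sketched in a few lines of NumPy. This is an illustrative simplification, not the implementation used in the cited works; a real system would compute such histograms block-wise and feed them, possibly together with geometric features, to an SVM [21].

```python
import numpy as np

def lbp_image(gray):
    """8-neighbour local binary pattern: each interior pixel gets an 8-bit
    code recording which of its neighbours are >= the centre pixel."""
    g = gray.astype(np.int32)
    centre = g[1:-1, 1:-1]
    h, w = g.shape
    # neighbour offsets, clockwise from top-left; each contributes one bit
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(centre)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neigh >= centre).astype(np.int32) << bit
    return code

def lbp_histogram(gray):
    """Normalised 256-bin histogram of LBP codes, used as a texture feature."""
    hist, _ = np.histogram(lbp_image(gray), bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)
```

On a flat patch every neighbour equals its centre, so all mass falls into code 255; textured patches spread the histogram across codes, which is what makes the descriptor discriminative, but also why lighting and noise perturb it.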
2.2. FER based on CNN models
Compared with traditional machine learning methods, deep neural networks, especially CNNs, can learn directly from the input, reducing the dependence on pre-processing. With the rapid development of deep learning, many deep neural networks, such as AlexNet [22], VGG [23], and ResNet [24], have been widely used in FER tasks and have shown good performance. Wu et al. proposed FER-CHC, which uses cross-hierarchy contrast to enhance CNN-based models through critical feature exploitation [25]. Teng et al. designed the typical facial expression network (TFEN),
combining dual 2D/3D CNNs for robust video FER across four benchmarks [26] . Zhao et al. developed a cross-
modality attention CNN (CM-CNN) that fused grayscale, LBP, and depth features via hierarchical attention
mechanisms, effectively addressing illumination and pose variations and enhancing the recognition of subtle expressions [27]. Cai et al. introduced the probabilistic attribute tree CNN (PAT-CNN), which addresses identity-induced intra-class variations through probabilistic attribute modeling [28]. Liu et al. combined CNN-extracted facial
features with GCN-modeled high-aggregation subgraphs (HASs) to boost recognition robustness [29] .
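The contrast with hand-crafted filters can be made concrete with the basic sliding-window operation a convolutional layer applies; this is a minimal illustrative sketch, not code from any of the cited models, and in a CNN the kernel weights are learned end-to-end rather than designed by hand.

```python
import numpy as np

def conv2d_valid(img, kern):
    """Naive 'valid' cross-correlation (the convolution used in deep-learning
    frameworks): slide the kernel over the image and sum elementwise products."""
    kh, kw = kern.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out
```

A horizontal-difference kernel such as [[1, -1]] responds only at vertical intensity edges; a CNN stacks many such learned kernels with non-linearities, which is why it needs far less feature engineering than the traditional pipeline.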
2.3. FER based on attention mechanism
The attention mechanism has been widely applied in FER tasks owing to its effectiveness in focusing the network on
useful regions relevant to expression recognition. Zhang et al. proposed a cross-fusion dual-attention network
with three innovations: grouped dual-attention for multi-scale refinement, adaptive C2 activation mitigating
computational bottlenecks, and a distillation-residual closed-loop framework enhancing feature purity [30]. Li et
al. developed SPWFA-SE combining Slide-Patch/Whole-Face attention with SE blocks to jointly capture local
details and global contexts, improving FER accuracy [31]. Zhang et al. designed the lightweight GSDNet, which uses gradual self-distillation for inter-layer knowledge transfer and an ACAM with learnable coefficients for adaptive
enhancement [32]. Tao et al. introduced a hierarchical attention network integrating local-global gradient features via multi-context aggregation, employing attention gates to amplify discriminative regions [14]. Chen et al. proposed a similar hierarchical attention design, likewise employing attention gates to amplify discriminative regions [33].
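Of the mechanisms above, the SE block used in SPWFA-SE [31] is the simplest to sketch. The following NumPy version is illustrative only; the weight matrices `w1` and `w2` stand in for parameters that would be learned during training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feat, w1, w2):
    """Squeeze-and-Excitation channel attention over a (C, H, W) feature map.
    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights."""
    s = feat.mean(axis=(1, 2))          # squeeze: global average pooling -> (C,)
    z = np.maximum(w1 @ s, 0.0)         # excitation: bottleneck FC + ReLU
    a = sigmoid(w2 @ z)                 # per-channel attention weights in (0, 1)
    return feat * a[:, None, None]      # rescale each channel
```

Because the attention weights lie in (0, 1), the block can only suppress or preserve channels, letting the network emphasise the feature maps most relevant to the expression while damping the rest.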
2.4. FER based on visual transformer
Transformers [34] have been widely used in natural language processing (NLP) tasks and have shown significant performance. They are good at capturing long-distance relations between words through their self-attention mechanism. Inspired by the success of transformers, Dosovitskiy et al. proposed ViT [35], a pure transformer applied to image patches for classification tasks, which has shown significant performance in the field of computer vision, including object detection [36], object tracking [37], and instance segmentation [38]. Visual transformers have also been applied to FER by some researchers. Ma et al. introduced a transformer-augmented network (TAN)

