
Page 316                        He et al. Intell. Robot. 2025, 5(2), 313-32 | http://dx.doi.org/10.20517/ir.2025.16


               2. RELATED WORK
               In recent years, many effective FER methods have been proposed. In this section, we review the previous
               related methods.


               2.1. FER based on traditional approaches
               Early FER work relied mainly on hand-crafted features combined with traditional machine-learning classifiers.
               Hand-crafted features can be divided into appearance-based features and geometric features. Commonly used
               appearance-based features include local binary patterns (LBP) [17], Gabor wavelets [5], and histograms of
               oriented gradients (HOG) [18]. Geometric features are obtained by measuring the relative positions of salient
               facial components, such as the eyes, nose, and mouth [19,20]. Among traditional classifiers, the support vector
               machine (SVM) [21] is the most common and effective. Ghimire et al. proposed a FER method that combines
               appearance and geometric features with SVM classification [21]. Although traditional methods perform well on
               in-the-lab FER datasets, their performance degrades significantly on in-the-wild datasets, which can be ascribed
               to the sensitivity of hand-crafted features to lighting, noise, and other factors.
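               The traditional pipeline described above (appearance features fed to an SVM) can be illustrated with a minimal sketch. Note that the basic 3x3 LBP encoding, the synthetic "noise texture vs. smooth gradient" toy data, and the linear kernel are illustrative assumptions for a self-contained example; real FER systems operate on face crops and typically combine LBP with richer descriptors such as HOG or Gabor responses.

```python
import numpy as np
from sklearn.svm import SVC

def lbp_histogram(img, bins=256):
    """Basic 3x3 local binary pattern: each interior pixel is encoded
    by thresholding its 8 neighbours against the centre value; the
    normalised code histogram serves as an appearance feature."""
    c = img[1:-1, 1:-1]
    code = np.zeros_like(c, dtype=np.uint8)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        n = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (n >= c).astype(np.uint8) << bit
    hist, _ = np.histogram(code, bins=bins, range=(0, bins))
    return hist / hist.sum()

# Toy stand-ins for two "expressions": noisy texture vs. smooth gradient.
rng = np.random.default_rng(0)
X, y = [], []
for _ in range(20):
    X.append(lbp_histogram(rng.integers(0, 256, (16, 16))))
    y.append(0)
    smooth = np.tile(np.arange(16) * 16, (16, 1)) + rng.integers(0, 8, (16, 16))
    X.append(lbp_histogram(smooth))
    y.append(1)

clf = SVC(kernel="linear").fit(np.array(X), y)  # SVM on LBP histograms
```

In a real system, the per-image feature vector would concatenate several such descriptors (and geometric distances between landmarks) before classification.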


               2.2. FER based on CNN models
               Compared with traditional machine learning methods, deep neural networks, especially CNNs, can learn
               directly from the input, reducing the dependence on pre-processing. With the rapid development of deep
               learning, many deep neural networks such as AlexNet [22], VGG [23], and ResNet [24] are widely used in FER
               tasks and have shown good performance. Wu et al. proposed FER-CHC, which uses cross-hierarchy contrast to
               enhance CNN-based models through critical feature exploitation [25]. Teng et al. designed the typical facial
               expression network (TFEN), combining dual 2D/3D CNNs for robust video FER across four benchmarks [26].
               Zhao et al. developed a cross-modality attention CNN (CM-CNN) that fuses grayscale, LBP, and depth features
               via hierarchical attention mechanisms, effectively addressing illumination and pose variations and enhancing
               recognition of subtle expressions [27]. Cai et al. introduced the probabilistic attribute tree CNN (PAT-CNN),
               which addresses identity-induced intra-class variations through probabilistic attribute modeling [28]. Liu et al.
               combined CNN-extracted facial features with GCN-modeled high-aggregation subgraphs (HASs) to boost
               recognition robustness [29].


               2.3. FER based on attention mechanism
               Attention mechanisms have been widely applied in FER tasks owing to their effectiveness in focusing the
               network on useful regions relevant to expression recognition. Zhang et al. proposed a cross-fusion dual-attention network
               with three innovations: grouped dual-attention for multi-scale refinement, adaptive C2 activation mitigating
               computational bottlenecks, and distillation-residual closed-loop framework enhancing feature purity [30] . Li et
               al. developed SPWFA-SE combining Slide-Patch/Whole-Face attention with SE blocks to jointly capture local
               details and global contexts, improving FER accuracy [31] . Zhang et al. designed lightweight GSDNet using
               gradual self-distillation for inter-layer knowledge transfer and ACAM with learnable coefficients for adaptive
               enhancement [32] . Tao et al. introduced a hierarchical attention network integrating local-global gradient fea-
               tures via multi-context aggregation, employing attention gates to amplify discriminative regions [14] . Chen
               et al. introduced a hierarchical attention network integrating local-global gradient features via multi-context
               aggregation, employing attention gates to amplify discriminative regions [33] .
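               Several of the methods above (e.g., the SE blocks in SPWFA-SE [31]) rescale feature channels with learned gates. The following minimal NumPy sketch illustrates the generic squeeze-and-excitation idea only; the weight shapes, random initialization, and reduction ratio are illustrative assumptions, not any cited model's parameters.

```python
import numpy as np

def squeeze_excite(feat, w1, w2):
    """SE-style channel attention on a (C, H, W) feature map.
    Squeeze: global average pooling yields one statistic per channel.
    Excite: a two-layer bottleneck produces sigmoid gates in (0, 1)
    that rescale each channel, emphasising informative ones."""
    z = feat.mean(axis=(1, 2))                  # squeeze -> (C,)
    h = np.maximum(w1 @ z, 0.0)                 # reduction + ReLU -> (C/r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))      # sigmoid gates -> (C,)
    return feat * gate[:, None, None]           # channel-wise reweighting

rng = np.random.default_rng(1)
C, r = 8, 2                                     # channels, reduction ratio
feat = rng.standard_normal((C, 5, 5))
w1 = rng.standard_normal((C // r, C)) * 0.1     # C -> C/r bottleneck
w2 = rng.standard_normal((C, C // r)) * 0.1     # C/r -> C expansion
out = squeeze_excite(feat, w1, w2)
```

Spatial attention variants follow the same pattern but produce an (H, W) gate map instead of per-channel scalars; hierarchical designs stack such gates across network stages.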


               2.4. FER based on visual transformer
               Transformers [34] have been widely used in natural language processing (NLP) tasks and have shown significant
               performance. They are good at capturing long-distance relations between words through their self-attention
               mechanism. Inspired by the success of transformers, Dosovitskiy et al. proposed ViT [35], a pure transformer
               applied to image patches for classification, which has shown significant performance in the field of computer
               vision, including object detection [36], object tracking [37], and instance segmentation [38]. Visual transformers
               have also been applied to FER by some researchers. Ma et al. introduced a transformer-augmented network (TAN)