Figure 2. Overview of our proposed method for FER. The proposed method consists of three components: the LFEM, the GFEM, and the GLFM. The images and labels are from RAF-DB. FER: facial expression recognition; LFEM: local feature extraction module; GFEM: global feature extraction module; GLFM: global-local feature fusion module.


combining intra-patch transformers (position-disentangled attention) and inter-patch transformers to capture local features and cross-region dependencies, integrated with online label correction for noise reduction [39]. Zhang et al. developed transformer-based multimodal emotional perception (T-MEP) with triple transformers for audio/image/text features, aligning multimodal semantics through fused self-/cross-attention in a visual latent space [40]. Liu et al. proposed the patch attention convolutional vision transformer (PACVT), which also extracts local and global features but combines them with a simple add operation [41]. Different from these methods, our method obtains rich emotional information from local and global features without a cropping strategy and integrates the relation between local and global features in a learnable way, which makes it effective for recognizing facial expression images.
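To make the contrast with the simple add operation concrete, the following minimal sketch compares fixed element-wise addition with one possible learnable fusion, a channel-wise gate. The gate is only an illustrative assumption and is not the GLFM defined later in this paper.

```python
import torch
import torch.nn as nn

# Fixed fusion in the style of a plain element-wise add:
# no parameters to weigh local against global evidence.
def additive_fusion(local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
    return local_feat + global_feat

# One possible learnable fusion (illustrative only, not the GLFM itself):
# a channel-wise gate predicted from both features decides their mixing ratio.
class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        return g * local_feat + (1.0 - g) * global_feat

# Example with 512-dimensional feature vectors for a batch of four images.
local_feat, global_feat = torch.randn(4, 512), torch.randn(4, 512)
print(additive_fusion(local_feat, global_feat).shape)   # torch.Size([4, 512])
print(GatedFusion(512)(local_feat, global_feat).shape)  # torch.Size([4, 512])
```

The point of the contrast is only that a learnable combination lets the network weight local and global evidence per sample, whereas a fixed add cannot.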




               3. PROPOSED METHOD
               3.1. Overview
As shown in Figure 2, our proposed MSAFNet consists of three components: the LFEM, the GFEM, and the GLFM. The LFEM adopts a CNN (ResNet) as the backbone to extract local features. Specifically, an original facial expression image is fed into both the LFEM and the GFEM as input. Concurrently, an MSA block is embedded into the local module to direct the model to focus on regions that are crucial for expression recognition. The GFEM converts the original facial image into tokens with positional information through a linear embedding layer. The tokens are then fed to the transformer encoder, which extracts global features of the original image. Finally, the GLFM compatibly models the relationship between local and global features, and the output features are used for FER.
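The sketch below mirrors the data flow of Figure 2 under stated assumptions rather than reproducing the authors' implementation: torchvision's ResNet18 stands in for the LFEM backbone (the MSA block of Section 3.2 is omitted), a convolutional patch embedding with learnable positional embeddings followed by a standard transformer encoder stands in for the GFEM, and a concatenation-projection placeholder stands in for the GLFM. Layer counts, dimensions, and the fusion form are illustrative only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MSAFNetSketch(nn.Module):
    def __init__(self, num_classes: int = 7, dim: int = 512,
                 patch: int = 16, img_size: int = 224):
        super().__init__()
        # LFEM stand-in: ResNet18 backbone with the average pool and FC removed,
        # leaving the spatial feature maps after the Conv5 block. The MSA block
        # from Section 3.2 is omitted in this sketch.
        cnn = resnet18()
        self.lfem = nn.Sequential(*list(cnn.children())[:-2])       # (B, 512, H/32, W/32)

        # GFEM stand-in: the image becomes patch tokens with positional
        # information via a linear (convolutional) embedding, then a
        # transformer encoder extracts global features.
        num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.gfem = nn.TransformerEncoder(layer, num_layers=4)

        # GLFM placeholder: a learnable projection over the concatenated local
        # and global features (the actual fusion module is defined later).
        self.glfm = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feat = self.lfem(x).mean(dim=(2, 3))                    # (B, 512)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)       # (B, N, dim)
        global_feat = self.gfem(tokens + self.pos_embed).mean(dim=1)  # (B, dim)
        fused = self.glfm(torch.cat([local_feat, global_feat], dim=-1))
        return self.classifier(fused)                                 # (B, num_classes)

# Example: two 224x224 RGB faces mapped to logits over 7 expression classes.
logits = MSAFNetSketch()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```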


               3.2. LFEM
Inspired by Res2Net [42], we design an MSA block. In this module, we use ResNet18 [24] as the backbone to extract features from the given facial images for the LFEM, and the average pooling and fully connected layers after the Conv5 block are removed. To optimize the salient information in local regions, this module designs an MSA block, which is