Figure 2. Overview of the proposed method for FER. The method consists of three components: the LFEM, the GFEM, and the GLFM. The images and labels are from RAF-DB. FER: Facial expression recognition; LFEM: local feature extraction module; GFEM: global feature extraction module; GLFM: global-local feature fusion module.
combining intra-patch transformers (position-disentangled attention) and inter-patch transformers to capture local features and cross-region dependencies, integrated with online label correction for noise reduction [39]. Zhang et al. developed transformer-based multimodal emotional perception (T-MEP) with triple transformers for audio/image/text features, aligning multimodal semantics through fused self/cross-attention in visual latent space [40]. Liu et al. proposed the patch attention convolutional vision transformer (PACVT), which also extracts local and global features but combines them with a simple addition operation [41]. Unlike these methods, our method obtains rich emotional information from local and global features without a cropping strategy, and employs a learnable mechanism to integrate the relationship between local and global features, which makes it effective for recognizing facial expression images.
3. PROPOSED METHOD
3.1. Overview
As shown in Figure 2, our proposed MSAFNet consists of three components: the LFEM, the GFEM, and the GLFM. The LFEM adopts a CNN (ResNet) backbone to extract local features. Specifically, an original facial expression image is fed into both the LFEM and the GFEM as input. Within the LFEM, an MSA block is embedded to direct the model to focus more on regions that are crucial for expression recognition. The GFEM converts the original facial image into tokens with positional information via a linear embedding layer; the tokens are then fed to a transformer encoder, which extracts global features of the original image. Finally, the GLFM models the relationship between local and global features, and the output features are used for FER.
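To make the data flow concrete, below is a minimal PyTorch sketch of this three-branch layout. All internals are illustrative assumptions rather than the authors' implementation: the MSA block is stubbed with a simple channel-attention gate, the GFEM uses a standard transformer encoder, the GLFM's learnable fusion is approximated with cross-attention, and names such as the patch size, embedding dimension, and the 7-class output are hypothetical.

```python
# Minimal sketch of the MSAFNet layout described above (illustrative only;
# module internals are placeholders, not the authors' implementation).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class LFEM(nn.Module):
    """Local branch: ResNet18 backbone (avgpool/fc removed) plus a stand-in
    attention gate where the paper's MSA block would go."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to and including the Conv5 block (layer4).
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Placeholder for the MSA block: a simple channel-attention gate.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(512, 512, 1), nn.Sigmoid()
        )

    def forward(self, x):                   # x: (B, 3, 224, 224)
        f = self.backbone(x)                # (B, 512, 7, 7)
        return f * self.attn(f)             # attention-weighted local features

class GFEM(nn.Module):
    """Global branch: linear patch embedding with positional information,
    followed by a standard transformer encoder."""
    def __init__(self, img=224, patch=16, dim=512, depth=4, heads=8):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim) tokens
        return self.encoder(t + self.pos)             # global features

class GLFM(nn.Module):
    """Fusion: a learnable interaction between local and global features
    (approximated here by cross-attention), then classification."""
    def __init__(self, dim=512, classes=7):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.fc = nn.Linear(dim, classes)

    def forward(self, local_f, global_f):
        q = local_f.flatten(2).transpose(1, 2)        # local tokens as queries
        fused, _ = self.cross(q, global_f, global_f)  # attend to global tokens
        return self.fc(fused.mean(dim=1))             # expression logits

class MSAFNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.lfem, self.gfem, self.glfm = LFEM(), GFEM(), GLFM()

    def forward(self, x):
        return self.glfm(self.lfem(x), self.gfem(x))

logits = MSAFNet()(torch.randn(2, 3, 224, 224))       # -> (2, 7)
```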
3.2. LFEM
Inspired by Res2Net [42], we design an MSA block. In this module, we adopt ResNet18 [24] as the backbone of the LFEM to extract features from the given facial images, and the average pooling and fully connected layers after the Conv5 block are removed.
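For reference, the Res2Net idea cited here splits the channels of a feature map into groups and processes them hierarchically, so that later groups aggregate progressively larger receptive fields. The following is a minimal sketch of that splitting pattern only; it is not the paper's actual MSA block, whose definition continues beyond this excerpt, and the names MultiScaleSplit, channels, and scales are hypothetical.

```python
import torch
import torch.nn as nn

class MultiScaleSplit(nn.Module):
    """Res2Net-style hierarchical split: channels are divided into `scales`
    groups; each group after the first is convolved together with the
    previous group's output, so later groups cover larger receptive fields."""
    def __init__(self, channels=512, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        w = channels // scales
        self.convs = nn.ModuleList(
            nn.Conv2d(w, w, 3, padding=1) for _ in range(scales - 1)
        )

    def forward(self, x):
        xs = torch.chunk(x, self.scales, dim=1)   # split channel groups
        out, prev = [xs[0]], xs[0]
        for conv, xi in zip(self.convs, xs[1:]):
            prev = conv(xi + prev)                # reuse previous scale's output
            out.append(prev)
        return torch.cat(out, dim=1)              # recombine multi-scale features

feat = MultiScaleSplit()(torch.randn(1, 512, 7, 7))  # shape preserved
```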
Conv5 block. To optimize the salient information in local regions, this module designs a MSA block which is

