
He et al. Intell. Robot. 2025, 5(2), 313-32 | http://dx.doi.org/10.20517/ir.2025.16  Page 315
























                                  Figure 1. Samples of the seven basic emotions from RAF-DB, FERPlus, and FER2013.


               adaptively focus on the local regions with robustness to different datasets. In addition, Zhao et al. designed
               a geometry-guided framework integrating GCN-transformers, constructing spatial-temporal graphs from
               facial landmarks to model local/non-local dependencies, and employing spatiotemporal attention to prioritize
               critical regions/frames for video emotion recognition [16].



               To overcome the above shortcomings, we propose an end-to-end multi-scale attention (MSA) and
               convolution-transformer fusion network (MSAFNet) for FER tasks, which learns local and global features
               and adaptively models the relationship between them. The proposed network has three components: the
               local feature extraction module (LFEM), the global feature extraction module (GFEM), and the global-local
               feature fusion module (GLFM). An MSA block is embedded into the LFEM; it adaptively captures the
               importance of the regions relevant to FER, overcoming the inherent limitations of traditional single-scale
               feature modeling. The MSA block captures key facial information from different perspectives and improves
               performance under occlusion and pose variation. The GFEM compensates for the shortcomings of the LFEM
               by capturing long-distance relationships across the whole image. The GLFM models the relationship between
               local and global features. Working together, these modules significantly enhance micro-expression
               sensitivity and cross-domain generalization, with the fusion mechanism dynamically recalibrating feature
               importance to optimize recognition performance under real-world complexities.
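
The three-module data flow described above can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`local_features`, `global_features`, `fuse`, `msafnet_like`) are hypothetical, the "convolution" is a plain neighbourhood average, the "transformer" is unparameterised dot-product self-attention, and the sigmoid-gated fusion is only one assumed way in which feature importance could be recalibrated. The sketch shows only how a local branch and a global branch can run on the same input and be merged position by position.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_features(tokens, window=1):
    """Convolution-like branch (assumed): average each token with its neighbours."""
    n = len(tokens)
    out = []
    for i in range(n):
        neigh = tokens[max(0, i - window):min(n, i + window + 1)]
        out.append([sum(col) / len(neigh) for col in zip(*neigh)])
    return out


def global_features(tokens):
    """Transformer-like branch (assumed): every token attends to all tokens."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out


def fuse(local, glob):
    """Gated fusion (assumed): a per-channel sigmoid gate mixes the two branches."""
    fused = []
    for l_vec, g_vec in zip(local, glob):
        gate = [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(l_vec, g_vec)]
        fused.append([s * a + (1.0 - s) * b for s, a, b in zip(gate, l_vec, g_vec)])
    return fused


def msafnet_like(tokens):
    """Run both branches on the same input and fuse them position by position."""
    return fuse(local_features(tokens), global_features(tokens))


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
    for row in msafnet_like(patches):
        print([round(v, 3) for v in row])
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the local and global value at that position, so neither branch can be discarded entirely in this simplified form.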




               In summary, the main contributions of our work are as follows:


               1. An MSAFNet for FER tasks is proposed, which captures key information from local and global
                  features and adaptively models the relationship between them.
               2. An LFEM and a GFEM are proposed. Furthermore, an MSA block is designed and embedded in the LFEM;
                  it combines attention information from the spatial and channel dimensions without relying on
                  cropping strategies or facial landmark detectors.
               3. A GLFM is designed to model the relationship between local and global features, which
                  effectively improves recognition performance.
               4. Experimental results on three different FER datasets show that MSAFNet obtains competitive results
                  compared with other state-of-the-art methods, demonstrating the model's validity.
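
To make the second contribution more concrete, the following minimal sketch shows one generic way of combining channel attention with spatial attention on a feature map stored as nested lists `[channel][row][column]`. It is an assumption-laden simplification and not the paper's MSA block: there are no learned weights and no multi-scale branches, and the helper names (`channel_weights`, `spatial_weights`, `attend`) are hypothetical. It illustrates only that both kinds of weights can be computed from the whole feature map, with no face cropping and no landmark detector.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_weights(fmap):
    """One weight per channel, from global average pooling over all locations."""
    return [
        sigmoid(sum(sum(row) for row in ch) / (len(ch) * len(ch[0])))
        for ch in fmap
    ]


def spatial_weights(fmap):
    """One weight per location, from the mean activation across channels."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [sigmoid(sum(fmap[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]


def attend(fmap):
    """Rescale every activation by its channel weight and its location weight."""
    cw = channel_weights(fmap)
    sw = spatial_weights(fmap)
    h, w = len(sw), len(sw[0])
    return [
        [[fmap[k][i][j] * cw[k] * sw[i][j] for j in range(w)] for i in range(h)]
        for k in range(len(fmap))
    ]


if __name__ == "__main__":
    # Two 2x2 channels; only channel 1 is active, and only at the top-left location.
    fmap = [
        [[0.0, 0.0], [0.0, 0.0]],
        [[2.0, 0.0], [0.0, 0.0]],
    ]
    print(channel_weights(fmap))
    print(spatial_weights(fmap))
    print(attend(fmap))
```

In this toy input the active channel and the active location both receive weights above 0.5, while inactive ones stay at exactly 0.5. A learned block would replace the fixed pooling-plus-sigmoid with trainable layers.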