Table 9. The number of parameters, FLOPs and accuracy on the RAF-DB dataset

Method          Params     FLOPs    Accuracy (%)
MA-Net [52]     50.54 M    3.65 G   88.40
AMP-Net [15]    105.67 M   4.73 G   89.25
MSAFNet (ours)  23.42 M    3.60 G   90.06

FLOPs: floating point operations; MA-Net: multi-scale and local attention network; AMP-Net: adaptive multilayer perceptual attention network; MSAFNet: multi-scale attention and convolution-transformer fusion network.
Figure 8. The CAM of LFEM and GFEM. The images and labels are from FER2013 and RAF-DB. CAM: class activation mapping; LFEM: local feature extraction module; GFEM: global feature extraction module.
As shown in Table 9, the parameters and FLOPs of our MSAFNet are 23.42 M and 3.60 G, respectively, significantly lower than those of MA-Net [52] and AMP-Net [15]. These results demonstrate that MSAFNet has lower complexity while achieving better performance than the other methods.
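For reference, model-size and cost figures like those in Table 9 can be reproduced with a standard profiler. The sketch below uses PyTorch with the thop package and a stand-in ResNet-18 backbone; the paper does not state which counting tool it used, and note that thop counts multiply-accumulate operations (MACs), which papers often quote as FLOPs.

# Sketch: counting parameters and FLOPs as in Table 9 (assumed tooling).
# "thop" (pip install thop) is one common profiler, not necessarily the
# authors' choice; the backbone here is a stand-in, not MSAFNet.
import torch
from thop import profile
from torchvision import models

model = models.resnet18(weights=None)   # stand-in backbone
dummy = torch.randn(1, 3, 224, 224)     # one RGB face crop

macs, params = profile(model, inputs=(dummy,))
print(f"Params: {params / 1e6:.2f} M, MACs: {macs / 1e9:.2f} G")

# Plain-PyTorch cross-check of the parameter count:
print(f"Direct count: {sum(p.numel() for p in model.parameters()) / 1e6:.2f} M")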
4.6. Visualization
In this section, to better validate the performance of MSA, we use gradient-weighted class activation mapping (Grad-CAM) [66] to visualize SE, CBAM, ECA, and our MSA. As shown in Figure 8, LFEM generates highly localized activations focusing on fine-grained facial components, while GFEM produces broader activation patterns capturing the holistic facial structure. This contrast validates the complementary roles of LFEM in micro-feature extraction and GFEM in macro-context modeling. As shown in Figure 9, our MSA enables the network to focus more precisely on key areas such as the eyes, nose, and mouth. Under facial occlusion or pose variation, MSA still attends to all of the eye, nose, and mouth regions, whereas the other attention methods capture only some of them. These results further illustrate that our MSA captures the important information in the regions relevant to FER, verifying the effectiveness of our method.
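For readers unfamiliar with the visualization, Grad-CAM [66] spatially averages the gradients of the target class score with respect to a chosen convolutional layer's activations, uses them as channel weights, and passes the weighted sum through a ReLU to obtain a heatmap. The PyTorch sketch below follows that generic recipe with a stand-in backbone and a dummy input; it is not the paper's code.

# Sketch: generic Grad-CAM on an assumed ResNet-18 backbone.
# Model, target layer, and input are illustrative, not the paper's setup.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()
target_layer = model.layer4[-1]          # last convolutional block

store = {}
target_layer.register_forward_hook(
    lambda m, i, o: store.__setitem__("act", o.detach()))
target_layer.register_full_backward_hook(
    lambda m, gi, go: store.__setitem__("grad", go[0].detach()))

x = torch.randn(1, 3, 224, 224)          # dummy face image
logits = model(x)
logits[0, logits[0].argmax()].backward() # score of the predicted class

# Channel weights = spatially averaged gradients; weighted sum + ReLU.
weights = store["grad"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                    align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # [0, 1] heatmap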
5. CONCLUSION
In this paper, we propose an end-to-end MSAFNet for FER tasks that can learn local and global features and adaptively model the relationship between them. Our network contains three modules, the LFEM, the GFEM, and the GLFM, which obtain different kinds of facial information and are robust to real-world facial expression datasets. In addition, an MSA block is designed to adaptively capture the importance of regions relevant to FER. The

