

Table 9. The number of parameters, FLOPs and accuracy on the RAF-DB dataset

Method          Params      FLOPs     Accuracy (%)
MA-Net [52]     50.54 M     3.65 G    88.40
AMP-Net [15]    105.67 M    4.73 G    89.25
Our MSAFNet     23.42 M     3.60 G    90.06

FLOPs: floating point operations; MA-Net: multi-scale and local attention network; AMP-Net: adaptive multilayer perceptual attention network; MSAFNet: multi-scale attention and convolution-transformer fusion network.

Figure 8. Class activation maps (CAM) of LFEM and GFEM. The images and labels are from FER2013 and RAF-DB. CAM: class activation mapping; LFEM: local feature extraction module; GFEM: global feature extraction module.


As shown in Table 9, the parameters and FLOPs of our MSAFNet are only 23.42 M and 3.60 G, significantly lower than those of MA-Net [52] and AMP-Net [15]. These results demonstrate that our MSAFNet has lower complexity while achieving better performance than the other methods.


               4.6. Visualization
In this section, to better validate the performance of MSA, we employ gradient-weighted class activation mapping (Grad-CAM) [66] to visualize SE, CBAM, ECA, and our MSA. As shown in Figure 8, LFEM generates highly localized activations focused on fine-grained facial components, while GFEM produces broader activation patterns that capture holistic facial structure. This contrast confirms the complementary roles of LFEM in micro-feature extraction and GFEM in macro-context modeling. As shown in Figure 9, our MSA enables the network to focus more precisely on key areas such as the eyes, nose, and mouth. Under facial occlusion or pose variation, MSA still attends to the eye, nose, and mouth regions simultaneously, whereas the other attention methods attend to only one of these regions. These results further illustrate that our MSA captures the important information in the regions relevant to FER, verifying the effectiveness of our method.
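To make the visualization procedure concrete, the following is a minimal Grad-CAM sketch in the spirit of [66], not the authors' implementation: the ResNet-18 backbone, the choice of target layer, and the seven-class output are illustrative assumptions in place of MSAFNet. The class score is back-propagated to a chosen feature layer, each channel is weighted by the spatial mean of its gradient, and the weighted sum is upsampled to image size.

```python
# Minimal Grad-CAM sketch (assumed stand-in model, not the authors' code).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def grad_cam(model, target_layer, image, class_idx=None):
    """Return an [H, W] class activation map for `image` (1 x 3 x H x W)."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove()
    h2.remove()

    # Weight each feature map by the spatial mean of its gradient (Grad-CAM).
    acts = activations["value"]              # [1, C, h, w]
    grads = gradients["value"]               # [1, C, h, w]
    weights = grads.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]

# Usage with a stand-in backbone and a random image (both assumed).
model = resnet18(num_classes=7)              # 7 expression classes (assumed)
x = torch.randn(1, 3, 224, 224)
cam = grad_cam(model, model.layer4[-1], x)
print(cam.shape)                             # torch.Size([224, 224])
```

The resulting map can be overlaid on the input face image to inspect which regions drive the prediction, which is how the comparisons in Figures 8 and 9 are read.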



               5. CONCLUSION
In this paper, we propose an end-to-end MSAFNet for FER tasks that can learn local and global features and adaptively model the relationship between them. Our network contains three modules, the LFEM, the GFEM, and the GLFM, which extract different kinds of facial information and are robust to real-world facial expression datasets. In addition, an MSA block is designed to adaptively capture the importance of regions relevant to FER. The