Table 1. Comparison with other methods on the RAF-DB dataset

Methods          Year   Accuracy (%)
gACNN [49]       2018   85.07
RAN [45]         2020   86.90
SCN [48]         2020   87.01
OADN [50]        2020   87.16
DACL [51]        2021   87.78
MA-Net [52]      2021   88.40
FDRL [53]        2021   89.47
VTFF [54]        2021   88.14
AMP-Net [15]     2022   89.25
ADDL [55]        2022   89.34
PACVT [41]       2023   88.21
GSDNet [32]      2024   90.91
DBFN [56]        2024   87.65
MSAFNet (ours)   2025   90.06
The bold format is used to indicate the best (highest) accuracy. gACNN: region attention mechanism; RAN: region attention networks; SCN: self-cure networks; OADN: occlusion-adaptive deep network; DACL: deep attentive center loss; MA-Net: multi-scale and local attention network; FDRL: feature decomposition and reconstruction learning; VTFF: visual transformers with feature fusion; AMP-Net: adaptive multilayer perceptual attention network; ADDL: adaptive deep disturbance-disentangled learning; PACVT: patch attention convolutional vision transformer; DBFN: dual-branch fusion network; MSAFNet: multi-scale attention and convolution-transformer fusion network.
4.3. Comparison with state-of-the-art methods
This section compares the proposed approach MSAFNet with several state-of-the-art methods on RAF-DB, FERPlus, FER2013, Occlusion-RAF-DB, Pose-RAF-DB, Occlusion-FERPlus, and Pose-FERPlus. MSAFNet consistently achieves high accuracy and demonstrates stable performance across these benchmarks. Notably, it exhibits strong generalization capabilities, particularly in complex scenarios involving diverse facial expressions and emotion categories.
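The numbers reported throughout this section are standard top-1 classification accuracy over the emotion categories of each benchmark. As a minimal illustrative sketch (the function name, the category ordering, and the toy data below are ours, not from the paper):

import numpy as np

# Seven basic emotion categories used by RAF-DB (illustrative ordering).
EMOTIONS = ["surprise", "fear", "disgust", "happiness", "sadness", "anger", "neutral"]

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    # Fraction of samples whose highest-scoring class matches the ground-truth label.
    preds = logits.argmax(axis=1)
    return float((preds == labels).mean())

# Toy example: 4 samples, 7 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, len(EMOTIONS)))
labels = np.array([3, 0, 6, 3])
print(f"top-1 accuracy: {top1_accuracy(logits, labels) * 100:.2f}%")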
4.3.1 Results on RAF-DB
Comparison results with other recent state-of-the-art methods on RAF-DB with seven emotion categories are shown in Table 1. The multi-scale and local attention network (MA-Net) [52] utilized global and local features to address both occlusion and pose variation, achieving an accuracy of 88.40%. The adaptive multilayer perceptual attention network (AMP-Net) [15] used features of different granularities to extract global, local, and salient information, obtaining a recognition accuracy of 89.25% on the RAF-DB dataset. As shown in Table 1, our proposed method MSAFNet obtains a recognition accuracy of 90.06% on RAF-DB, an improvement of 1.66% and 0.81% over MA-Net [52] and AMP-Net [15], respectively.
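The quoted gains are absolute percentage-point differences against the Table 1 accuracies. A quick sketch reproducing them (the dictionary and variable names are ours; the values are copied from Table 1):

# Accuracies (%) from Table 1 on RAF-DB.
accuracy = {"MA-Net": 88.40, "AMP-Net": 89.25, "VTFF": 88.14, "MSAFNet": 90.06}
for baseline in ("MA-Net", "AMP-Net", "VTFF"):
    delta = accuracy["MSAFNet"] - accuracy[baseline]
    print(f"MSAFNet vs. {baseline}: +{delta:.2f} points")
# Prints +1.66, +0.81, and +1.92, matching the improvements reported in the text.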
Compared to the visual transformers with feature fusion (VTFF) [54], which used transformers and attention selective fusion, our method achieves a 1.92% improvement. PACVT [41] can also extract local and global features with attention weights

