3.4. GLFM
In this module, the GLFM integrates features from the LFEM and the GFEM. Its primary objective is to combine local and global features to capture richer spatial and channel information, thereby enhancing FER performance. Unlike traditional neural networks that process local and global features separately, our model exploits the complementarity between local and global features. As shown in Figure 4, this approach allows both the fine details and the overall characteristics of facial expressions to be described more accurately.
Given local feature maps $X_l \in \mathbb{R}^{C_1 \times H_1 \times W_1}$ extracted from the LFEM, where $C_1$, $H_1$, and $W_1$ are the channel dimension, height, and width, and global feature maps $X_g \in \mathbb{R}^{C_2 \times N_2}$ extracted from the GFEM, where $C_2$ and $N_2$ are the channel dimension and the number of tokens, the local feature maps $X_l$ are rearranged to $X'_l \in \mathbb{R}^{C_1 \times N_1}$, where $N_1 = H_1 \times W_1$. This rearrangement of the local feature maps is an essential step: by reshaping the feature maps, local and global information can be fused within the same dimensional space, which streamlines the subsequent operations. The local feature maps and global feature maps are first fused via an element-wise summation for information interaction:

$$X_i = X'_l \oplus X_g \tag{9}$$
where $X_i \in \mathbb{R}^{C_3 \times N_3}$ denotes the interactive features, and $\oplus$ denotes element-wise summation. To better perform information interaction, the interactive features are transposed into $X_i^{\top} \in \mathbb{R}^{N_3 \times C_3}$ and fed into an MLP block, consisting of two fully connected layers and a non-linearity layer, that communicates the information at the token level. This yields the feature $Y \in \mathbb{R}^{N_3 \times C_3}$, which is transposed back into $Y_1 \in \mathbb{R}^{C_3 \times N_3}$. The features $Y_1$ are then fed into two different MLP blocks to interact the information at the channel level. The details can be defined as follows:
$$Y = W_4\,\mathrm{GELU}\!\left(W_3\,\mathrm{LN}\!\left(X_i^{\top}\right)\right) \tag{10}$$

$$Y_2 = W_6\,\mathrm{GELU}\!\left(W_5\,\mathrm{LN}\!\left(Y_1\right)\right) \tag{11}$$

$$Y_3 = W_8\,\mathrm{GELU}\!\left(W_7\,\mathrm{LN}\!\left(Y_1\right)\right) \tag{12}$$
where $\mathrm{LN}$ denotes layer normalization, which normalizes the output of a specific layer in a neural network; it helps maintain stability during training by preventing difficulties caused by large differences between the outputs of different layers, thereby reducing training time, improving stability, and enhancing convergence. $\mathrm{GELU}$ denotes the GELU activation function, which is similar to ReLU but with a smoother output for small negative values; it helps mitigate exploding or vanishing gradients during training and is commonly used in deep learning to transform a model's outputs. $W_3$, $W_4$, $W_5$, $W_6$, $W_7$, and $W_8$ denote the fully connected operations. The final fusion features are then obtained as:
$$Z = \left(X'_l \otimes Y_2\right) \oplus \left(X_g \otimes Y_3\right) \tag{13}$$
where $\otimes$ denotes element-wise multiplication, and $\oplus$ denotes element-wise summation.
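To make the data flow of Eqs. (9)-(13) concrete, below is a minimal PyTorch sketch of the GLFM. It is an illustration under stated assumptions, not the authors' implementation: the `MLPBlock` helper, the hidden width, and the choice of $X'_l$ and $X_g$ as the operands gated by $Y_2$ and $Y_3$ in Eq. (13) reflect our reading of the equations.

```python
# Minimal sketch of the GLFM fusion of Eqs. (9)-(13); the hidden width and
# the MLPBlock helper are assumptions, not the paper's code.
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    """LN -> FC -> GELU -> FC, the W GELU(W LN(.)) pattern of Eqs. (10)-(12)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(nn.functional.gelu(self.fc1(self.ln(x))))


class GLFM(nn.Module):
    """Fuses LFEM local maps with GFEM tokens; assumes C1 == C2 and N1 == N2."""

    def __init__(self, channels: int, tokens: int, hidden: int = 256):
        super().__init__()
        self.token_mlp = MLPBlock(channels, hidden)    # Eq. (10), applied to X_i^T
        self.channel_mlp2 = MLPBlock(tokens, hidden)   # Eq. (11)
        self.channel_mlp3 = MLPBlock(tokens, hidden)   # Eq. (12)

    def forward(self, x_local: torch.Tensor, x_global: torch.Tensor) -> torch.Tensor:
        # x_local: (B, C1, H1, W1) from the LFEM; x_global: (B, C2, N2) from the GFEM.
        b, c, h, w = x_local.shape
        x_l = x_local.reshape(b, c, h * w)         # rearrange: N1 = H1 x W1
        x_i = x_l + x_global                       # Eq. (9): element-wise summation
        y = self.token_mlp(x_i.transpose(1, 2))    # Eq. (10): token-level interaction
        y1 = y.transpose(1, 2)                     # back to (B, C3, N3)
        y2 = self.channel_mlp2(y1)                 # Eq. (11): channel-level MLP
        y3 = self.channel_mlp3(y1)                 # Eq. (12): channel-level MLP
        return x_l * y2 + x_global * y3            # Eq. (13): fusion output Z


# Example: a 7x7 local map with 512 channels fused with 49 global tokens.
glfm = GLFM(channels=512, tokens=49)
z = glfm(torch.randn(2, 512, 7, 7), torch.randn(2, 512, 49))
print(z.shape)  # torch.Size([2, 512, 49])
```

The two transposes mirror the description above: the first MLP block processes each token's representation independently, while the two parallel MLP blocks process each channel's map, so every element of $Z$ depends on both the local and the global stream.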
4. EXPERIMENT RESULTS
4.1. Datasets
RAF-DB [7] is a real-world FER dataset containing facial images downloaded from the Internet. The facial images are labeled with seven classes of basic expressions or 12 classes of compound expressions by 40 trained human coders. In our experiments, the proposed method utilizes only the seven basic expressions: anger, disgust, fear, happiness, sadness, surprise, and neutral. The dataset provides 12,271 images for training and 3,068 images for testing.
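For reference, the official train/test partition of RAF-DB is encoded in the file-name prefixes of its label list; a minimal sketch of recovering the two subsets is shown below (the list path reflects a common local layout and is an assumption, not something specified by the paper):

```python
# Minimal sketch: split RAF-DB by the train_/test_ prefixes in its label list.
# The path below is an assumed local layout, not specified by the paper.
from pathlib import Path


def read_split(label_list: str):
    train, test = [], []
    for line in Path(label_list).read_text().splitlines():
        name, label = line.split()          # e.g., "train_00001.jpg 5"
        subset = train if name.startswith("train_") else test
        subset.append((name, int(label)))
    return train, test


train, test = read_split("RAF-DB/basic/EmoLabel/list_patition_label.txt")
print(len(train), len(test))  # expected: 12271 and 3068
```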
FER2013 [44] was collected from the Internet and was first used for the ICML 2013 Challenges in Representation Learning. It contains 35,887 facial images gathered via the Google search engine, with 28,709 images for training, 3,589 images for validation, and 3,589 images for testing.
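FER2013 is commonly distributed as a single `fer2013.csv` file whose `Usage` column encodes this split; a minimal sketch of recovering the three subsets follows (the local path is an assumption):

```python
# Minimal sketch: recover the FER2013 splits from the commonly distributed
# fer2013.csv (columns: emotion, pixels, Usage). The path is an assumption.
import numpy as np
import pandas as pd

df = pd.read_csv("fer2013.csv")


def to_arrays(rows: pd.DataFrame):
    # Each 'pixels' entry is a space-separated string of 48x48 grayscale values.
    x = np.stack([np.array(p.split(), dtype=np.uint8).reshape(48, 48)
                  for p in rows["pixels"]])
    return x, rows["emotion"].to_numpy()


x_train, y_train = to_arrays(df[df["Usage"] == "Training"])    # 28,709 images
x_val, y_val = to_arrays(df[df["Usage"] == "PublicTest"])      # 3,589 images
x_test, y_test = to_arrays(df[df["Usage"] == "PrivateTest"])   # 3,589 images
```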

