
Page 320 | He et al. Intell. Robot. 2025, 5(2), 313-32 | http://dx.doi.org/10.20517/ir.2025.16


               3.4. GLFM
The GLFM integrates the features extracted by the LFEM and the GFEM. Its primary objective is to combine local and global features in order to capture richer spatial and channel information, thereby enhancing FER performance. Unlike traditional neural networks that process local and global features separately, our model exploits the complementarity between the two. As shown in Figure 4, this allows a more accurate description of both the fine details and the overall characteristics of facial expressions. Given the local feature maps $F_l \in \mathbb{R}^{C_1 \times H_1 \times W_1}$ extracted from the LFEM, where $C_1$, $H_1$, $W_1$ denote the channel dimension, height, and width, and the global feature maps $F_g \in \mathbb{R}^{C_2 \times N_2}$ extracted from the GFEM, where $C_2$ and $N_2$ denote the channel dimension and the number of tokens, the local feature maps $F_l$ are first rearranged into $F_l' \in \mathbb{R}^{C_1 \times N_1}$, where $N_1 = H_1 \times W_1$. This rearrangement is an essential step: by reshaping the feature maps, local and global information can be fused within the same dimensional space, simplifying the subsequent operations. We then fuse the local feature maps $F_l'$ and the global feature maps $F_g$ via an element-wise summation for information interaction:

$F_i = F_l' \oplus F_g$    (9)
where $F_i \in \mathbb{R}^{C_3 \times N_3}$ denotes the interactive features and $\oplus$ denotes element-wise summation. To better perform the information interaction, the interactive features $F_i$ are transposed into $F_i^{\top} \in \mathbb{R}^{N_3 \times C_3}$ and fed into an MLP block, consisting of two fully connected layers and a non-linearity, which interacts the information at the token level; the output is transposed back into $F_1 \in \mathbb{R}^{C_3 \times N_3}$. The features $F_1$ are then fed into two different MLP blocks to interact the information at the channel level. The details are defined as follows:
$F_1 = \left( W_4 \, \mathrm{GELU}\left( W_3 \, \mathrm{LN}\left( F_i^{\top} \right) \right) \right)^{\top}$    (10)

$F_2 = W_6 \, \mathrm{GELU}\left( W_5 \, \mathrm{LN}\left( F_1 \right) \right)$    (11)

$F_3 = W_8 \, \mathrm{GELU}\left( W_7 \, \mathrm{LN}\left( F_1 \right) \right)$    (12)
where LN denotes layer normalization, which normalizes the output of a given layer in a neural network; it helps maintain stability during training by preventing difficulties caused by large differences between the outputs of different layers, thereby reducing training time, improving stability, and enhancing convergence. GELU denotes the GELU activation function, which is similar to ReLU but with a smoother output for small negative values; it helps mitigate issues such as exploding or vanishing gradients during training and is commonly used in deep learning. $W_3$, $W_4$, $W_5$, $W_6$, $W_7$, and $W_8$ denote fully connected operations. Finally, we obtain the fused features as:

$F_{fusion} = \left( F_l' \otimes \sigma\left( F_2 \right) \right) \oplus \left( F_g \otimes \sigma\left( F_3 \right) \right)$    (13)

where $\otimes$ denotes element-wise multiplication, $\oplus$ denotes element-wise summation, and $\sigma$ denotes the Sigmoid gating function.
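The fusion pipeline of Eqs. (9)-(13) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the fused dimensions match ($C_1 = C_2 = C_3$ and $N_1 = N_2 = N_3$), replaces learned weights with random matrices, takes the gating function in Eq. (13) to be the Sigmoid, and uses illustrative function and variable names throughout.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last axis (no learned affine parameters in this sketch).
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glfm(local_feats, global_tokens, W3, W4, W5, W6, W7, W8):
    """Sketch of the GLFM fusion, Eqs. (9)-(13).

    local_feats:   (C, H, W) from the LFEM; global_tokens: (C, N) from the GFEM,
    with N = H * W so the shapes line up after rearrangement.
    W3, W4: (C, C) weights for the token-level MLP; W5..W8: (N, N) weights
    for the two channel-level MLP branches.
    """
    C, H, W = local_feats.shape
    local_tokens = local_feats.reshape(C, H * W)   # rearrange F_l into F_l'
    Fi = local_tokens + global_tokens              # Eq. (9): element-wise summation
    # Eq. (10): token-level MLP on the transposed features, then transpose back.
    F1 = (gelu(layer_norm(Fi.T) @ W3) @ W4).T
    # Eqs. (11)-(12): two channel-level MLP branches on F1.
    F2 = gelu(layer_norm(F1) @ W5) @ W6
    F3 = gelu(layer_norm(F1) @ W7) @ W8
    # Eq. (13): Sigmoid-gated fusion of the local and global streams.
    return local_tokens * sigmoid(F2) + global_tokens * sigmoid(F3)

# Toy shapes: C=8 channels, a 4x4 local grid, N=16 global tokens.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
N = H * W
Fl = rng.standard_normal((C, H, W))
Fg = rng.standard_normal((C, N))
W3, W4 = (0.1 * rng.standard_normal((C, C)) for _ in range(2))
W5, W6, W7, W8 = (0.1 * rng.standard_normal((N, N)) for _ in range(4))
fused = glfm(Fl, Fg, W3, W4, W5, W6, W7, W8)
print(fused.shape)  # (8, 16)
```

Because the gates $\sigma(F_2)$ and $\sigma(F_3)$ lie in $(0, 1)$, the final sum is a soft, position-wise weighting of the local and global streams rather than a hard selection of one branch.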



               4. EXPERIMENT RESULTS
               4.1. Datasets
               RAF-DB  [7]  is a real-world FER dataset containing facial images downloaded from the Internet. The facial
               images are labeled with seven classes of basic expressions or 12 classes of compound expressions by 40 trained
               human coders. In our experiment, the proposed method only utilizes seven basic expressions, including anger,
               disgust, fear, happiness, sadness, surprise, and neutral. It involves 12,271 images for training and 3,069 images
               for testing.

FER2013 [44]  is collected from the Internet and was first introduced for the ICML 2013 Challenges in Representation Learning. It contains 35,887 facial images collected via the Google search engine, with 28,709 images for training, 3,589