Figure 3. The details of the MSA block. MSA: Multi-scale attention; GAP: global average pooling; FC: fully connected layer; Sig: sigmoid function.


               3.3. GFEM
Given the significant performance of vision transformers (ViT) [35] in numerous computer vision tasks, transformer architectures have been widely adopted. The transformer architecture is good at capturing long-distance dependencies between pixels. Therefore, the GFEM adopts a transformer architecture following Mixformer [43]. As shown in Figure 2, the GFEM combines local-window self-attention (W-MSA) with depth-wise convolution in a parallel design, so that each branch provides complementary information to the other. The ViT branch employs W-MSA to model global facial semantics: non-overlapping patches undergo dynamic correlation analysis, and the channel interaction (output $CL_{out}$) enhances critical regions via element-wise recalibration. The CNN branch extracts localized details via depth-wise convolution, with the spatial interaction (output $SL_{out}$) amplifying discriminative regions. The fused features combine both streams channel-wise and are processed by the MIX function for robust classification. This dual-stream design enables complementary modeling: W-MSA captures spatial dependencies, while the CNN branch refines channel-wise features, achieving multi-scale representation through hierarchical fusion. The channel interaction consists of a global average pooling layer and two 1×1 convolution layers with batch normalization (BN) and GELU activation, followed by a sigmoid function that generates the attention map; this attention is applied to the value in W-MSA. The spatial interaction consists of two 1×1 convolution layers with BN and GELU, followed by a sigmoid function that generates the attention map.


$$CL_{out} = \sigma\big(\mathrm{Conv}_{1\times1}\big(\phi\big(\mathrm{Conv}_{1\times1}(\mathrm{GAP}(CL_{in}))\big)\big)\big) \quad (5)$$

$$SL_{out} = \sigma\big(\mathrm{Conv}_{1\times1}\big(\phi\big(\mathrm{Conv}_{1\times1}(SL_{in})\big)\big)\big) \quad (6)$$

where $CL_{in}$ and $CL_{out}$ represent the input and output of the channel interaction; $SL_{in}$ and $SL_{out}$ represent the input and output of the spatial interaction; $\phi$ denotes the activation function; $\sigma$ denotes the sigmoid function.
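To make the two interaction modules concrete, the following PyTorch-style sketch mirrors Equations (5) and (6). It is an illustrative implementation rather than the released code: the hidden channel width (a reduction ratio of 4) and the single-channel output of the spatial attention map are assumptions, while the GAP, 1×1 convolution, BN, GELU, and sigmoid ordering follows the description above.

```python
import torch
import torch.nn as nn


class ChannelInteraction(nn.Module):
    """Channel interaction of Eq. (5): GAP -> 1x1 conv -> BN -> GELU -> 1x1 conv -> sigmoid."""
    def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.gap = nn.AdaptiveAvgPool2d(1)               # global average pooling
        self.attn = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),                                # attention weights in (0, 1)
        )

    def forward(self, cl_in):
        # cl_in: (B, C, H, W); output: (B, C, 1, 1) channel attention,
        # broadcast-multiplied onto the value of W-MSA
        return self.attn(self.gap(cl_in))


class SpatialInteraction(nn.Module):
    """Spatial interaction of Eq. (6): 1x1 conv -> BN -> GELU -> 1x1 conv -> sigmoid."""
    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or max(channels // 4, 1)         # hidden width is an assumption
        self.attn = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, 1, kernel_size=1),         # single-channel spatial map (assumption)
            nn.Sigmoid(),
        )

    def forward(self, sl_in):
        # sl_in: (B, C, H, W); output: (B, 1, H, W) spatial attention map
        return self.attn(sl_in)
```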

Based on the parallel design, the mix transformer block can be formulated as follows:

$$\hat{z}^{l+1} = \mathrm{MIX}\big(\text{W-MSA}(z^{l}),\ \mathrm{DWConv}(z^{l})\big) + z^{l} \quad (7)$$

$$z^{l+1} = \mathrm{FFN}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1} \quad (8)$$

where $\mathrm{MIX}$ represents a function that mixes the features from W-MSA and depth-wise convolution; $\mathrm{DWConv}$ denotes depth-wise convolution; $\mathrm{LN}$ represents layer normalization; $\mathrm{FFN}$ denotes the feed-forward network; $\hat{z}^{l+1}$ and $z^{l+1}$ denote the output features of the $\mathrm{MIX}$ operation and the $\mathrm{FFN}$, respectively.
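A minimal sketch of the mix transformer block in Equations (7) and (8) is given below. Standard multi-head attention stands in for W-MSA, and MIX is approximated by channel-wise concatenation followed by a 1×1 convolution; these simplifications, along with the FFN expansion ratio, are assumptions made for illustration and are not the paper's exact operators.

```python
import torch
import torch.nn as nn


class MixBlockSketch(nn.Module):
    """Illustrative mix transformer block following Eqs. (7)-(8)."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        # nn.MultiheadAttention stands in for local-window self-attention (W-MSA)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depth-wise conv branch
        self.mix = nn.Conv2d(2 * dim, dim, kernel_size=1)  # MIX approximated as concat + 1x1 conv
        self.norm = nn.LayerNorm(dim)                      # LN in Eq. (8)
        self.ffn = nn.Sequential(                          # feed-forward network in Eq. (8)
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z, H, W):
        # z: (B, H*W, C) tokens of an H x W feature map
        B, N, C = z.shape
        attn_out, _ = self.attn(z, z, z)                                  # W-MSA branch
        conv_out = self.dwconv(z.transpose(1, 2).reshape(B, C, H, W))     # depth-wise conv branch
        fused = self.mix(torch.cat(
            [attn_out.transpose(1, 2).reshape(B, C, H, W), conv_out], dim=1))
        z_hat = fused.flatten(2).transpose(1, 2) + z                      # Eq. (7): MIX(...) + z^l
        return self.ffn(self.norm(z_hat)) + z_hat                         # Eq. (8): FFN(LN(z_hat)) + z_hat
```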