
               embedded after the Conv5 block.

As shown in Figure 3, the feature maps extracted after the Conv5 block are fed into the attention block. After a 1 × 1 convolution, the feature maps $X_1 \in \mathbb{R}^{C_1 \times W_1 \times H_1}$ are obtained, where $C_1$, $W_1$, and $H_1$ represent the number of channels, the width, and the height of the feature map after the convolution operation, respectively. The module splits the feature maps $X_1$ into $s$ groups of feature map subsets, denoted by $x_i$, where $i \in \{0, 1, 2, \cdots, s-1\}$. The spatial size of each group feature map subset is the same as that of the input feature maps $X_1$, while the number of channels is $C_1' = C_1 / s$; hence the $i$-th group feature map subset satisfies $x_i \in \mathbb{R}^{C_1' \times W_1 \times H_1}$, $i \in \{0, 1, 2, \cdots, s-1\}$. Each $x_i$ is processed by a corresponding 3 × 3 convolution, denoted by $K_i(\cdot)$, whose output is $y_i \in \mathbb{R}^{C_1' \times W_1 \times H_1}$; within the module, the input and output of $K_i(\cdot)$ have the same dimension. When $i \geq 1$, the $i$-th group feature subset is summed with the output $y_{i-1}$ and the result is fed as the input of $K_i(\cdot)$. Thus, $y_i$ can be written as:

$$
y_i =
\begin{cases}
K_i(x_i), & i = 0 \\
K_i(x_i + y_{i-1}), & 1 \leq i \leq s-1
\end{cases}
\tag{1}
$$
When each group feature subset $\{x_i, 0 \leq i \leq s-1\}$ goes through its 3 × 3 convolution, the outputs $\{y_j, 1 \leq j \leq s-1\}$ acquire a larger receptive field than the inputs $\{x_i, i \leq j\}$. After that, each $y_i$ contains characteristic information of feature subsets at different receptive field scales, thus capturing multi-scale spatial information. Different values of $s$ learn different information, and a larger $s$ may capture richer scale information. The module sets $s = 4$, a value chosen through ablation studies as the configuration that balances accuracy and computational cost.
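
To make the hierarchical split concrete, the following is a minimal PyTorch sketch of the group splitting and the cascaded 3 × 3 convolutions in Eq. (1); the class and variable names, and the assumption that each $K_i$ preserves the channel count $C_1'$, are illustrative rather than the authors' exact implementation.

```python
# Minimal sketch of Eq. (1): split X1 into s subsets and apply cascaded 3x3 convolutions.
# Names (MultiScaleSplit, scale, width) are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class MultiScaleSplit(nn.Module):
    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0, "C1 must be divisible by s"
        self.scale = scale
        self.width = channels // scale            # C1' = C1 / s
        # One 3x3 convolution K_i per group feature subset x_i
        self.convs = nn.ModuleList(
            nn.Conv2d(self.width, self.width, kernel_size=3, padding=1)
            for _ in range(scale)
        )

    def forward(self, x1: torch.Tensor) -> list[torch.Tensor]:
        # Split X1 of shape (B, C1, H1, W1) into s subsets x_i with C1' channels each
        xs = torch.chunk(x1, self.scale, dim=1)
        ys = []
        for i, (conv, xi) in enumerate(zip(self.convs, xs)):
            # Eq. (1): y_0 = K_0(x_0); y_i = K_i(x_i + y_{i-1}) for i >= 1
            inp = xi if i == 0 else xi + ys[i - 1]
            ys.append(conv(inp))
        return ys
```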

Following the multi-scale spatial attention information, we compute attention weights along the channel dimension. Using global average pooling, each output $y_i$ of the convolution applied to group feature subset $x_i$ is condensed into a vector. Then, we employ two fully connected layers to model the channel correlations. In neural networks, activation functions introduce non-linearity, enabling the network to learn complex patterns. We use a sigmoid activation function to normalize the output, which yields the channel attention weight of each group feature subset, $z_i \in \mathbb{R}^{C_1' \times 1 \times 1}$. It can be defined as:

$$
z_i = \sigma\left(W_2\,\delta\left(W_1 y_i\right)\right)
\tag{2}
$$

where $\sigma$ denotes the sigmoid function, which normalizes the output to the range $[0, 1]$, effectively transforming it into a probability value; $\delta$ denotes the ReLU activation function, a typical non-linear function defined as $\delta(x) = \max(0, x)$, which maps the input signal to the feature space; $W_1$ and $W_2$ denote the FC operations; and $z_i$ denotes the channel attention weight of the corresponding group feature subset. Furthermore, the module concatenates the attention weights to acquire the final MSA weights $Z \in \mathbb{R}^{C_1 \times 1 \times 1}$:

$$
Z = \mathrm{Concat}\left(\left[z_0, z_1, z_2, \cdots, z_{s-1}\right]\right)
\tag{3}
$$
$$
Out = X_1 \otimes Z
\tag{4}
$$

Finally, we acquire the output $Out$ by multiplying the feature maps $X_1$ with the MSA weights $Z$.
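
As a companion sketch to Eqs. (2)-(4), the snippet below applies global average pooling to each $y_i$, passes the result through two FC layers with ReLU and sigmoid, concatenates the per-subset weights into $Z$, and rescales $X_1$. Whether the FC pair is shared across subsets and what reduction ratio $W_1$ uses are not stated in the text, so both are assumptions here.

```python
# Sketch of Eqs. (2)-(4): per-subset channel attention, concatenation, and rescaling.
# The per-subset FC layers and the reduction ratio are assumptions for illustration.
import torch
import torch.nn as nn

class MSAWeights(nn.Module):
    def __init__(self, channels: int, scale: int = 4, reduction: int = 4):
        super().__init__()
        width = channels // scale                  # C1'
        self.gap = nn.AdaptiveAvgPool2d(1)         # global average pooling
        # For each subset: W1 (reduce), ReLU, W2 (restore), sigmoid, as in Eq. (2)
        self.fc = nn.ModuleList(
            nn.Sequential(
                nn.Linear(width, width // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(width // reduction, width),
                nn.Sigmoid(),
            )
            for _ in range(scale)
        )

    def forward(self, x1: torch.Tensor, ys: list[torch.Tensor]) -> torch.Tensor:
        zs = []
        for fc, yi in zip(self.fc, ys):
            # Eq. (2): z_i = sigmoid(W2 * ReLU(W1 * GAP(y_i)))
            v = self.gap(yi).flatten(1)            # (B, C1')
            zs.append(fc(v))
        # Eq. (3): Z = Concat([z_0, ..., z_{s-1}]), reshaped to (B, C1, 1, 1)
        z = torch.cat(zs, dim=1).unsqueeze(-1).unsqueeze(-1)
        # Eq. (4): Out = X1 * Z, broadcast over the H1 x W1 spatial dimensions
        return x1 * z
```

Composing MultiScaleSplit and MSAWeights after the initial 1 × 1 convolution reproduces the overall flow of Figure 3, with the product in Eq. (4) broadcast over the spatial dimensions.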


Different from Res2Net, our MSA not only captures the multi-scale information from the feature map subsets, but also calculates channel information and aggregates all information from the feature map subsets, which makes the attention information richer. It considers both spatial and channel semantic information and effectively combines the two dimensions. This design, leveraging self-attention, reduces redundancy, accelerates training, and improves convergence. By emphasizing comprehensive feature fusion and diverse interactions, our MSA ensures more efficient information flow, better addressing gradient vanishing and enhancing training stability in deep architectures.