embedded after the Conv5 block.
As shown in Figure 3, the feature maps extracted after the Conv5 block are fed into the attention block. After a 1 × 1 convolution, the feature maps $X_1 \in \mathbb{R}^{C_1 \times W_1 \times H_1}$ are obtained, where $C_1$, $W_1$, and $H_1$ represent the number of channels, width, and height of the feature map after the convolution operation, respectively. The module splits the feature maps $X_1$ into $s$ group feature map subsets, denoted by $x_i$, where $i \in \{0, 1, 2, \cdots, s-1\}$. The spatial size of each group feature map subset is the same as that of the input feature maps $X_1$, while the number of channels is $C_1' = C_1/s$, so the $i$-th group feature map subset satisfies $x_i \in \mathbb{R}^{C_1' \times W_1 \times H_1}$, $i \in \{0, 1, 2, \cdots, s-1\}$.
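This splitting step can be illustrated with a minimal sketch. PyTorch is assumed here as the framework, and the channel sizes are illustrative values rather than the paper's exact configuration; only the 1 × 1 convolution and the partition of $X_1$ into $s$ group feature map subsets are shown.

```python
# Minimal sketch (assumed PyTorch) of the 1x1 reduction and the split into s subsets.
# C_in, C1 and the spatial size are illustrative assumptions; s = 4 follows the text.
import torch
import torch.nn as nn

C_in, C1, s = 512, 256, 4                 # channels after Conv5, channels after 1x1 conv, groups
conv1x1 = nn.Conv2d(C_in, C1, kernel_size=1)

feats = torch.randn(2, C_in, 14, 14)      # feature maps from the Conv5 block (batch of 2)
X1 = conv1x1(feats)                       # X1 with C1 channels and spatial size W1 x H1
subsets = torch.chunk(X1, s, dim=1)       # s subsets x_i, each with C1/s channels
print([x.shape for x in subsets])         # every x_i keeps the spatial size of X1
```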
Each $x_i$ is processed by a corresponding 3 × 3 convolution, designated by $K_i(\cdot)$, and the output is denoted by $y_i \in \mathbb{R}^{C_1' \times W_1 \times H_1}$. According to the module, the input and output have the same dimension. When $i \geq 1$, the $i$-th group feature subset $x_i$ is summed with the output $y_{i-1}$ and then fed as the input of $K_i(\cdot)$. Thus, $y_i$ can be written as:

$$
y_i =
\begin{cases}
K_i(x_i) & i = 0 \\
K_i(x_i + y_{i-1}) & 1 \leq i \leq s-1
\end{cases}
\tag{1}
$$
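A minimal sketch of Equation (1) follows, again assuming PyTorch; the number of channels per group and the spatial size are illustrative. Each subset passes through its own 3 × 3 convolution $K_i$, and for $i \geq 1$ the previous output $y_{i-1}$ is added to $x_i$ before the convolution.

```python
# Minimal sketch of Equation (1): y_0 = K_0(x_0), y_i = K_i(x_i + y_{i-1}) for i >= 1.
# Group count s = 4 follows the text; channels per group are an illustrative assumption.
import torch
import torch.nn as nn

s, c = 4, 64                                        # groups, channels per group (C1/s)
convs = nn.ModuleList(
    [nn.Conv2d(c, c, kernel_size=3, padding=1) for _ in range(s)]
)

x = [torch.randn(2, c, 14, 14) for _ in range(s)]   # subsets x_0 ... x_{s-1}
y = []
for i in range(s):
    inp = x[i] if i == 0 else x[i] + y[i - 1]       # add previous output for i >= 1
    y.append(convs[i](inp))                         # y_i keeps the same dimension as x_i
```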
When each group feature subset $\{x_i, 0 \leq i \leq s-1\}$ goes through a 3 × 3 convolution, the output $\{y_i, 1 \leq i \leq s-1\}$ acquires a larger receptive field than $\{x_j, j \leq i\}$. As a result, each $y_i$ contains characteristic information from feature subsets at different receptive field scales, thereby providing multi-scale spatial information. Different values of $s$ capture different information, and a larger $s$ may yield richer scale information. This module sets $s$ to 4, a value chosen through ablation studies to balance computational complexity and model performance; that analysis supports $s = 4$ as a configuration that achieves high accuracy while keeping computational demands reasonable.
Following the multi-scale spatial attention information, we subsequently compute attention weights along the channel dimension. Using global average pooling, each output $y_i$ of the group-wise 3 × 3 convolution is condensed into a vector. We then employ two fully connected (FC) layers to model the channel correlations. In neural networks, activation functions introduce the non-linearity that enables the network to learn complex patterns; here, a sigmoid activation function normalizes the output, yielding the channel attention weight of each group feature subset, $z_i \in \mathbb{R}^{C_1' \times 1 \times 1}$. It can be defined as:

$$
z_i = \sigma\big(W_2\,\delta(W_1\,\mathrm{GAP}(y_i))\big)
\tag{2}
$$
where $\sigma$ denotes the sigmoid function, which normalizes the output to the range $[0, 1]$, effectively transforming it into a probability value; $\delta$ denotes the ReLU activation function, a typical non-linear function defined as $\delta(x) = \max(0, x)$, which maps the input signal to the feature space; $W_1$ and $W_2$ denote the FC operations; $\mathrm{GAP}(\cdot)$ denotes global average pooling; and $z_i$ denotes the channel attention weight of each group feature subset.
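The channel attention of Equation (2) for a single subset can be sketched as below; PyTorch and the reduction ratio between the two FC layers (set to 4 here) are illustrative assumptions, not values fixed by this section.

```python
# Minimal sketch of Equation (2) for one subset y_i: global average pooling condenses
# y_i into a vector, two FC layers (W1, W2) model channel correlations, and ReLU /
# sigmoid play the roles of delta and sigma. The reduction ratio r is an assumption.
import torch
import torch.nn as nn

c, r = 64, 4                                  # channels per group, assumed reduction ratio
gap = nn.AdaptiveAvgPool2d(1)                 # global average pooling
fc1 = nn.Linear(c, c // r)                    # W1
fc2 = nn.Linear(c // r, c)                    # W2

y_i = torch.randn(2, c, 14, 14)               # output of K_i for one group
g = gap(y_i).flatten(1)                       # pooled vector, shape (batch, c)
z_i = torch.sigmoid(fc2(torch.relu(fc1(g))))  # z_i = sigma(W2 * delta(W1 * GAP(y_i)))
z_i = z_i.view(-1, c, 1, 1)                   # channel attention weight, shape c x 1 x 1
```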
Furthermore, the module concatenates the attention weights of all subsets to acquire the final MSA weights $Z \in \mathbb{R}^{C_1 \times 1 \times 1}$:

$$
Z = \mathrm{Concat}\big([z_0, z_1, z_2, \cdots, z_{s-1}]\big)
\tag{3}
$$

$$
X_{\mathrm{out}} = X_1 \otimes Z
\tag{4}
$$

Finally, we acquire the output $X_{\mathrm{out}}$ by multiplying the feature maps $X_1$ with the MSA weights $Z$.
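Putting the steps together, the following is a compact, hypothetical sketch of the whole MSA block as described in this section (1 × 1 convolution, split into $s$ subsets, cascaded 3 × 3 convolutions, per-subset channel attention, concatenation of the weights, and reweighting of $X_1$). The class name, layer sizes, and reduction ratio are illustrative assumptions rather than the paper's exact implementation.

```python
# Hypothetical end-to-end sketch of the MSA block described above (assumed PyTorch).
import torch
import torch.nn as nn

class MSABlock(nn.Module):
    def __init__(self, in_channels: int, channels: int, s: int = 4, r: int = 4):
        super().__init__()
        assert channels % s == 0
        c = channels // s                                     # channels per group, C1/s
        self.s = s
        self.conv1x1 = nn.Conv2d(in_channels, channels, kernel_size=1)
        self.convs = nn.ModuleList(                           # the K_i 3x3 convolutions
            [nn.Conv2d(c, c, kernel_size=3, padding=1) for _ in range(s)]
        )
        self.gap = nn.AdaptiveAvgPool2d(1)                    # global average pooling
        self.fcs = nn.ModuleList(                             # W1, ReLU, W2, sigmoid per group
            [nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True),
                           nn.Linear(c // r, c), nn.Sigmoid())
             for _ in range(s)]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        X1 = self.conv1x1(feats)
        xs = torch.chunk(X1, self.s, dim=1)                   # split into s subsets x_i
        ys, zs = [], []
        for i in range(self.s):
            inp = xs[i] if i == 0 else xs[i] + ys[i - 1]      # Eq. (1)
            y_i = self.convs[i](inp)
            z_i = self.fcs[i](self.gap(y_i).flatten(1))       # Eq. (2)
            ys.append(y_i)
            zs.append(z_i.unsqueeze(-1).unsqueeze(-1))
        Z = torch.cat(zs, dim=1)                              # Eq. (3): MSA weights
        return X1 * Z                                         # Eq. (4): reweighted output

# Usage with illustrative sizes: a Conv5-style feature map with 512 channels.
out = MSABlock(in_channels=512, channels=256)(torch.randn(2, 512, 14, 14))
print(out.shape)   # torch.Size([2, 256, 14, 14])
```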
Different from Res2Net, our MSA not only captures multi-scale information from the feature map subsets, but also calculates channel information and aggregates the information from all feature map subsets, which makes the attention information richer. It considers both spatial and channel semantic information and effectively combines the two dimensions. This design, leveraging self-attention, reduces redundancy, accelerates training, and improves convergence. By emphasizing comprehensive feature fusion and diverse interactions, our MSA ensures more efficient information flow, better addressing gradient vanishing and enhancing training stability in deep architectures.

