                              Figure 4. Illustration of the proposed SAFE module. SAFE: Scale-adaptive feature extraction.


               Dynamic selection: Features of different scales have varying representation capabilities for salient objects. To
               measure this difference, we perform dynamic selection after completing the multi-scale feature interaction.
               We use an element-wise summation to integrate the feature maps output by different branches, namely,

                                                                 
$$ F' = F'_1 + \sum_{i=2}^{N} F'_i \qquad (3) $$
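As a concrete illustration of Equation (3), the fusion is a plain element-wise sum over the branch outputs. The following is a minimal PyTorch-style sketch (not the authors' code; tensor shapes, the branch count $N$, and all variable names are our assumptions), assuming every branch produces a feature map of identical shape:

```python
import torch

# Hypothetical branch outputs F'_1, ..., F'_N after the multi-scale
# feature interaction; all are assumed to share the shape (B, C, H, W).
B, C, H, W, N = 2, 64, 32, 32, 4
branch_feats = [torch.randn(B, C, H, W) for _ in range(N)]

# Equation (3): element-wise summation keeps the channel count at C,
# whereas channel concatenation would produce N * C channels.
F_fused = branch_feats[0]
for f in branch_feats[1:]:
    F_fused = F_fused + f            # F' = F'_1 + sum_{i=2..N} F'_i

assert F_fused.shape == (B, C, H, W)
```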
Here we use element-wise summation instead of concatenation because concatenation would greatly increase the number of channels, resulting in heavier computational complexity and more network parameters. Next, we process $F'$ with a $3 \times 3$ convolution and then perform the dynamic measurement module (DMM) operation. The DMM consists of a global average pooling (GAP) operation and an MLP. We gather global contextual information in the form of channel-wise statistics by using GAP: this step embeds the input $F'$ into a latent vector $s \in \mathbb{R}^{C \times 1 \times 1}$ by performing GAP on $F'$ over the spatial dimensions. Thus, the $c$-th component of $s$ is given as follows:

                                                          
$$ s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F'(c, i, j), \quad c = 0, 1, \ldots, C - 1 \qquad (4) $$
where $H$ stands for the height, i.e., the number of pixels of $F'$ in the vertical direction, and $W$ stands for the width, i.e., the number of pixels of $F'$ in the horizontal direction. Because each element of $s$ indicates the importance of the corresponding feature slice of $F'$, $s$ can be used as the channel-wise attention of all branches.
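As a minimal sketch of this GAP step (Equation (4)), again assuming PyTorch tensors with the fused feature $F'$ of shape (B, C, H, W) (variable names are ours):

```python
import torch

# Fused feature F' after the 3x3 convolution; shape (B, C, H, W) assumed.
F_prime = torch.randn(2, 64, 32, 32)

# Equation (4): averaging over the spatial dimensions yields one
# channel-wise statistic per feature slice, i.e. s in R^{C x 1 x 1}.
s = F_prime.mean(dim=(2, 3), keepdim=True)      # shape (B, C, 1, 1)

# Equivalent formulation using the built-in pooling layer:
# s = torch.nn.AdaptiveAvgPool2d(1)(F_prime)
```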
Next, we perform an additional embedding of $s$ via an MLP consisting of two fully-connected layers, a ReLU non-linearity, and a softmax operation. The MLP outputs a vector of size $(N \times C) \times 1 \times 1$, which is then split into $N$ parts corresponding to the $N$ different branches through a split operation; the $i$-th part is $w_i \in \mathbb{R}^{C \times 1 \times 1}$.
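The MLP-and-split step can be sketched as below (an illustrative PyTorch module, not the paper's implementation; the hidden width and the choice of applying the softmax across the $N$ branches for each channel are our assumptions):

```python
import torch
import torch.nn as nn


class DynamicMeasurementMLP(nn.Module):
    """Sketch of the DMM's MLP: two fully-connected layers with a ReLU in
    between, a softmax, and a split into N channel-attention vectors."""

    def __init__(self, channels: int, num_branches: int, hidden: int = 32):
        super().__init__()
        self.channels = channels
        self.num_branches = num_branches
        self.fc1 = nn.Linear(channels, hidden)                  # first FC layer
        self.fc2 = nn.Linear(hidden, num_branches * channels)   # second FC layer

    def forward(self, s: torch.Tensor) -> list[torch.Tensor]:
        # s: channel statistics from GAP, shape (B, C, 1, 1).
        b = s.size(0)
        z = torch.relu(self.fc1(s.flatten(1)))                  # (B, hidden)
        z = self.fc2(z)                                         # (B, N * C)
        # Assumption: the softmax is taken across the N branches per channel,
        # so the weights assigned to one channel sum to one over all branches.
        w = torch.softmax(z.view(b, self.num_branches, self.channels), dim=1)
        # Split into N attention vectors w_i of shape (B, C, 1, 1).
        return [w[:, i].view(b, self.channels, 1, 1)
                for i in range(self.num_branches)]


# Example: attention weights for N = 4 branches with C = 64 channels.
dmm = DynamicMeasurementMLP(channels=64, num_branches=4)
weights = dmm(torch.randn(2, 64, 1, 1))
print(len(weights), weights[0].shape)    # 4 torch.Size([2, 64, 1, 1])
```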
Since the MLP is learnable, different attention weights can be dynamically assigned to the features of each scale. The dynamic attention weight $w_i$ of the $i$-th branch