
Liu et al. Intell Robot 2024;4(4):503-23 | http://dx.doi.org/10.20517/ir.2024.29   Page 511

























               Figure 5. Illustration of the proposed MFA module. MFA: Multi-scale feature aggregation; DDS: dilated DSConv3×3 operation; ⓒ: a
               concatenation operation.


               is calculated as follows:

               $$ W_i = \mathrm{Split}(\cdots), \quad i = 1, 2, \ldots, N \tag{5} $$

               where $W_i \in \mathbb{R}^{C \times 1 \times 1}$. After that, we use channel-wise multiplication to combine $W_i$ with the corresponding
               branch features and integrate the multiplied results by element-wise summation to obtain the feature map $\tilde{F}$:

               $$ \tilde{F} = F_1 \times W_1 + \sum_{i=2}^{N} F_i' \times W_i. \tag{6} $$

               After $\tilde{F}$ undergoes a Conv1×1, it is added to the original input feature map $X$ through a residual connection
               to obtain the final output $Y$, namely,

               $$ Y = f(\tilde{F}) + X, \tag{7} $$

               where $f$ denotes a Conv1×1 operation.
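               The weighted fusion and residual step of Equations (6) and (7) can be sketched in NumPy as follows. This is a
               minimal illustration under stated assumptions, not the authors' implementation: the function name
               `fuse_branches`, the array shapes, the random example weights, and the modelling of the Conv1×1 as a plain
               channel-mixing matrix `conv_w` (no bias, no normalisation) are all hypothetical choices made for this example.

```python
import numpy as np


def fuse_branches(x, branches, weights, conv_w):
    """Weighted branch fusion, then a 1x1 convolution and a residual add.

    x        : (C, H, W) input feature map
    branches : list of N arrays, each (C, H, W), the branch features
    weights  : (N, C) array, one weight per branch and channel
    conv_w   : (C, C) matrix standing in for the Conv1x1 (assumption: no bias)
    """
    fused = np.zeros_like(x, dtype=float)
    for feat, w in zip(branches, weights):
        # Channel-wise multiplication: (C,) -> (C, 1, 1) broadcasts over H and W.
        fused += feat * w[:, None, None]
    # A 1x1 convolution mixes channels independently at every pixel.
    projected = np.einsum("oc,chw->ohw", conv_w, fused)
    return projected + x  # residual connection


# Usage with arbitrary illustrative sizes and weights.
rng = np.random.default_rng(0)
C, H, W, N = 4, 8, 8, 3
x = rng.normal(size=(C, H, W))
branches = [rng.normal(size=(C, H, W)) for _ in range(N)]
weights = rng.random((N, C))  # placeholder weights, not learned values
out = fuse_branches(x, branches, weights, np.eye(C))
print(out.shape)  # (4, 8, 8)
```

               The output keeps the shape of the input, which is what allows the residual addition in Equation (7).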

               As shown in Figure 3, the SAFE module is the basic unit of the SANet backbone network. The number of
               branches N introduced above is a hyperparameter. The subsequent ablation experiments show that the higher
               the resolution of the input feature map, the more branches are needed, so we set different numbers of
               branches at different stages of the network.

               3.3. MFA-based decoder
               We first use the backbone network composed of SAFE modules to extract image features and then design the
               decoder. As shown in Figure 3, the decoder consists of two parts: the feature fusion module and the MFA module.

               The feature fusion module fuses features from three directions and uses a depthwise separable $k \times k$ dilated
               convolution to integrate the fused results. A single convolution layer is not enough to process the fused
               features. In particular, the output features of the PPM module cross a relatively large scale span during
               upsampling (up to 16 times), so connections across scales need to be established properly. Therefore, we
               design an MFA module to further process the fused features.
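               As a rough illustration of why a depthwise separable dilated convolution suits a lightweight decoder, the
               sketch below applies the standard formulas for weight counts and for the spatial extent of a dilated kernel.
               The function names and the example values (3×3 kernel, 64 channels, dilation rate 2) are illustrative
               assumptions and are not taken from this paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a pointwise 1 x 1 conv (bias ignored)."""
    return c_in * k * k + c_in * c_out


def effective_kernel_size(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


print(standard_conv_params(64, 64, 3))        # 36864
print(depthwise_separable_params(64, 64, 3))  # 4672
print(effective_kernel_size(3, 2))            # 5
```

               With these example values, the separable form uses about one eighth of the weights, while dilation widens
               the receptive field without adding any.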

               As shown in Figure 5, the output feature map of the feature fusion module serves as the input of the MFA
               module and is split into four parts along the channel dimension, denoted by $x_i$, where $i \in \{1, 2, 3, 4\}$. Then, information