Page 42 - Read Online
P. 42
Liu et al. Intell Robot 2024;4(4):503-23 I http://dx.doi.org/10.20517/ir.2024.29 Page 511
Figure 5. Illustration of the proposed MFA module. MFA: Multi-scale feature aggregation; DDS: dilated DSConv3×3 operation; c ⃝: a
concatenation operation.
is calculated as follows:
= Split ( ( ( ( ( ))))) , = 1, 2, ..., (5)
where ∈ R ×1×1 . After that, we use channel-wise multiplication to combine with the corresponding
branch features and integrate the multiplied results by element-wise summation to obtain the feature map :
˜
Õ
˜ ′ (6)
= 1 × 1 + × .
=2
˜
After undergoes a Conv1×1, it is added to the original input feature map through a residual connection
to obtain the final output , namely,
˜
= + , (7)
where denotes a Conv1×1 operation.
As shown in Figure 3, the SAFE module is the basic unit that forms the backbone network of SANet. The
number of branches N mentioned above is set as a hyperparameter. In the subsequent ablation experiments,
we prove that the higher the resolution of the input feature map, the more branches are needed, so we will set
different numbers of branches at multiple stages of the network.
3.3. MFA-based decoder
This paper first uses the backbone network composed of SAFE modules to extract image features, and then we
design the decoder. As shown in Figure 3, the decoder consists of two parts: feature fusion and MFA modules.
The feature fusion module fuses features from three directions and uses a depthwise separable × dilated con-
volution to integrate the fused results. After feature fusion, it is not enough to simply use a layer of convolution
for processing. In particular, the output features of the PPM module will go through a relatively large scale
span (maximum span of 16 times) during the upsampling process, and it is necessary to establish connections
across scales reasonably. Therefore, we designed a MFA module to process further the fused features.
As shown in Figure 5, we use the output feature map of the feature fusion module as the input of the MFA
module and split it into four parts by channel, represented by , where ∈ {1, 2, 3, 4}. Then, information

