
on different scales is extracted through a depth-separable convolution with a dilation rate of $r_i$, forming four different branches and obtaining feature maps $F_i'$, where $i \in \{1, 2, 3, 4\}$. So, $F_i'$ can be expressed as follows:

$$F_i' = \Phi_i(F), \quad i = 1, 2, 3, 4, \tag{8}$$
where $\Phi_i$ represents DSConv3×3 with dilation rate $r_i$ at branch $i$. To achieve cross-scale feature aggregation, we use residual connections between different branches and get $F_i''$ by element-wise summation:
$$F_i'' = \begin{cases} F_i' + F_{i-1}', & i = 2, \\ F_i' + F_{i-1}'', & i = 3, 4. \end{cases} \tag{9}$$
               Then, each branch’s results are merged as output through a concatenation operation, namely,

$$\tilde{F} = \mathrm{Concat}\left(F_1', F_2'', F_3'', F_4''\right). \tag{10}$$
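For illustration, the cross-scale aggregation of Eqs. (8)-(10) can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' released code: the class names, the dilation rates (1, 2, 4, 8), and the 1×1 fusion convolution are our own assumptions.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise-separable 3x3 convolution with a configurable dilation rate."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MFA(nn.Module):
    """Four dilated DSConv branches (Eq. 8), cross-scale residual
    connections (Eq. 9), and concatenation of the branches (Eq. 10)."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):  # rates are assumed
        super().__init__()
        self.branches = nn.ModuleList([DSConv(channels, d) for d in dilations])
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)  # assumed

    def forward(self, x):
        f1 = self.branches[0](x)        # F'_1
        f2 = self.branches[1](x) + f1   # F''_2 = F'_2 + F'_1
        f3 = self.branches[2](x) + f2   # F''_3 = F'_3 + F''_2
        f4 = self.branches[3](x) + f3   # F''_4 = F'_4 + F''_3
        return self.fuse(torch.cat([f1, f2, f3, f4], dim=1))
```

The residual chain lets each dilated branch reuse the smaller-scale response of its predecessor, which is what makes the aggregation cross-scale rather than four independent parallel paths.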
Let $D_i$ represent the feature maps output by the decoder at each stage, and $E_i$ represent the feature maps output by the encoder at each stage, where $i \in \{1, 2, 3, 4, 5\}$. So,

$$D_5 = \mathrm{MFA}\left(C_{3\times3}^{5}\left(\mathrm{Up}\left(\mathrm{PPM}(E_5)\right)\right) + E_5\right), \tag{11}$$

where MFA represents the MFA module, $C_{3\times3}^{5}$ means DSConv3×3 at the fifth stage, and $\mathrm{Up}$ indicates the upsampling operation. In summary, we have

$$D_i = \mathrm{MFA}\left(C_{3\times3}^{i}\left(\mathrm{Up}\left(\mathrm{PPM}(E_5)\right)\right) + \mathrm{Up}\left(C_{3\times3}(D_{i+1})\right) + E_i\right), \quad i = 1, 2, 3, 4, \tag{12}$$

where $C_{3\times3}^{i}$ represents DSConv3×3 at the $i$-th stage.
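Reading Eqs. (11) and (12) bottom-up gives a top-down decoder in which every stage fuses the PPM global guidance, the upsampled deeper decoder feature, and the encoder skip connection. A schematic PyTorch rendering, under the assumptions that all features share one channel width, upsampling is bilinear, and a single shared lateral DSConv is applied to $D_{i+1}$:

```python
import torch.nn.functional as F

def upsample_to(x, ref):
    """Bilinearly resize x to the spatial size of ref."""
    return F.interpolate(x, size=ref.shape[2:], mode='bilinear',
                         align_corners=False)

def decode(E, ppm, stage_convs, lateral_conv, mfas):
    """E: encoder features [E1..E5]; ppm: pyramid pooling module;
    stage_convs[i]: DSConv3x3 at stage i+1; lateral_conv: DSConv3x3 on
    D_{i+1} (assumed shared across stages); mfas[i]: MFA at stage i+1."""
    g = ppm(E[4])  # PPM(E5): global guidance reused at every stage
    D = [None] * 5
    # Eq. (11): D5 = MFA(C5_3x3(Up(PPM(E5))) + E5)
    D[4] = mfas[4](stage_convs[4](upsample_to(g, E[4])) + E[4])
    # Eq. (12): Di = MFA(Ci_3x3(Up(PPM(E5))) + Up(C_3x3(D_{i+1})) + Ei)
    for i in range(3, -1, -1):
        D[i] = mfas[i](stage_convs[i](upsample_to(g, E[i]))
                       + upsample_to(lateral_conv(D[i + 1]), E[i])
                       + E[i])
    return D
```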
               3.4. Saliency reasoning
               We use deep supervision to improve the transparency of the hidden layer learning process. As shown in
               Figure 3, for the fusion features at different stages, we use a Conv1×1 and sigmoid activation function to
generate multiple predictions, namely $P_i$, where $i \in \{1, 2, 3, 4, 5\}$. We adopt the standard binary cross-entropy
               loss for training, which is defined as follows:

$$\mathcal{L} = \ell_{\mathrm{bce}}(P_1, G) + \lambda \sum_{i=2}^{5} \ell_{\mathrm{bce}}(P_i, G), \tag{13}$$

where $\ell_{\mathrm{bce}}$ is the standard binary cross-entropy loss function, and $G$ denotes the ground-truth saliency map. $\lambda$ denotes the weighting scalar for loss balance, which is set to 0.4.
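Eq. (13) translates almost line for line into PyTorch. The sketch below assumes each $P_i$ is a logit map already upsampled to the ground-truth resolution; `binary_cross_entropy_with_logits` folds the sigmoid of the prediction head into the loss for numerical stability.

```python
import torch.nn.functional as F

def deep_supervision_loss(preds, gt, lam=0.4):
    """preds: list of 5 logit maps [P1..P5] at ground-truth resolution;
    gt: binary ground-truth saliency map; lam: balance weight (0.4)."""
    loss = F.binary_cross_entropy_with_logits(preds[0], gt)   # main output P1
    for p in preds[1:]:                                       # auxiliary P2..P5
        loss = loss + lam * F.binary_cross_entropy_with_logits(p, gt)
    return loss
```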


               4. RESULTS

               4.1. Experimental setup
               4.1.1 Implementation details
               This paper uses the PyTorch library to implement the proposed method. Our model is pre-trained on the
ImageNet dataset. The training set of the DUTS [45] dataset (DUTS-TR) is used for model training. In addition,
               we also validate our proposed method on the traffic dataset TSOD, using its first 2,000 images for training and
the rest for testing. All experiments are performed using the Adam optimizer, with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay of $10^{-4}$, and batch size of 20. We use a poly learning rate scheduler, so that the learning rate for the $t$-th epoch is $lr\_init \times \left(1 - \frac{t}{\#epochs}\right)^{power}$, where $lr\_init = 5 \times 10^{-4}$ and $power = 0.9$. We trained the proposed model for 300 epochs, i.e., $\#epochs = 300$. All experiments are run on a server with an NVIDIA RTX 3090 GPU and an AMD Ryzen Threadripper 3960X (2.2 GHz) CPU.
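The stated schedule maps directly onto PyTorch's `LambdaLR`; below is a minimal sketch of the optimizer and scheduler setup under these hyperparameters (the one-layer `model` is a placeholder for the proposed network):

```python
import torch

model = torch.nn.Conv2d(3, 1, 3)  # placeholder for the proposed network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
num_epochs, power = 300, 0.9

# Multiplies the base lr (5e-4) by (1 - t / #epochs)^power at epoch t.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda t: (1.0 - t / num_epochs) ** power)

for epoch in range(num_epochs):
    # ... one training epoch: forward pass, Eq. (13) loss, optimizer.step() ...
    scheduler.step()
```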