
on different scales is extracted through a depth-separable convolution with a dilation rate of $r_i$, forming four different branches and obtaining feature maps $F_i'$, where $i \in \{1, 2, 3, 4\}$. So, $F_i'$ can be expressed as follows:

$$F_i' = \Phi_i(F), \quad i = 1, 2, 3, 4, \tag{8}$$
where $\Phi_i$ represents DSConv3×3 with dilation rate $r_i$ at branch $i$. To achieve cross-scale feature aggregation, we use residual connections between different branches and get $F_i''$ by element-wise summation:
$$F_i'' = \begin{cases} F_i' + F_{i-1}', & i = 2, \\ F_i' + F_{i-1}'', & i = 3, 4. \end{cases} \tag{9}$$
               Then, each branch’s results are merged as output through a concatenation operation, namely,

$$\tilde{F} = \mathrm{Concat}\left(F_1', F_2'', F_3'', F_4''\right). \tag{10}$$
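For illustration, the cross-scale aggregation of Eqs. (8)-(10) can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' released code: the class names, the dilation rates (1, 2, 4, 8), and the 1×1 fusion convolution are our own assumptions.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise-separable 3x3 convolution with a configurable dilation rate."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MFA(nn.Module):
    """Four dilated DSConv branches (Eq. 8), cross-scale residual
    connections (Eq. 9), and concatenation of the branches (Eq. 10)."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):  # rates are assumed
        super().__init__()
        self.branches = nn.ModuleList([DSConv(channels, d) for d in dilations])
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)  # assumed

    def forward(self, x):
        f1 = self.branches[0](x)        # F'_1
        f2 = self.branches[1](x) + f1   # F''_2 = F'_2 + F'_1
        f3 = self.branches[2](x) + f2   # F''_3 = F'_3 + F''_2
        f4 = self.branches[3](x) + f3   # F''_4 = F'_4 + F''_3
        return self.fuse(torch.cat([f1, f2, f3, f4], dim=1))
```

The residual chain lets each dilated branch reuse the smaller-scale response of its predecessor, which is what makes the aggregation cross-scale rather than four independent parallel paths.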
Let $D_i$ represent the feature maps output by the decoder at each stage, and $E_i$ represent the feature maps output by the encoder at each stage, where $i \in \{1, 2, 3, 4, 5\}$. So,

$$D_5 = \mathrm{MFA}\left(C_{3\times3}^{5}\left(\mathrm{Up}\left(\mathrm{PPM}(E_5)\right)\right) + E_5\right), \tag{11}$$

where MFA represents the MFA module, $C_{3\times3}^{5}$ means DSConv3×3 at the fifth stage, and $\mathrm{Up}$ indicates the upsampling operation. In summary, we have

$$D_i = \mathrm{MFA}\left(C_{3\times3}^{i}\left(\mathrm{Up}\left(\mathrm{PPM}(E_5)\right)\right) + \mathrm{Up}\left(C_{3\times3}(D_{i+1})\right) + E_i\right), \quad i = 1, 2, 3, 4, \tag{12}$$

where $C_{3\times3}^{i}$ represents DSConv3×3 at the $i$-th stage.
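Reading Eqs. (11) and (12) bottom-up gives a top-down decoder in which every stage fuses the PPM global guidance, the upsampled deeper decoder feature, and the encoder skip connection. A schematic PyTorch rendering, under the assumptions that all features share one channel width, upsampling is bilinear, and a single shared lateral DSConv is applied to $D_{i+1}$:

```python
import torch.nn.functional as F

def upsample_to(x, ref):
    """Bilinearly resize x to the spatial size of ref."""
    return F.interpolate(x, size=ref.shape[2:], mode='bilinear',
                         align_corners=False)

def decode(E, ppm, stage_convs, lateral_conv, mfas):
    """E: encoder features [E1..E5]; ppm: pyramid pooling module;
    stage_convs[i]: DSConv3x3 at stage i+1; lateral_conv: DSConv3x3 on
    D_{i+1} (assumed shared across stages); mfas[i]: MFA at stage i+1."""
    g = ppm(E[4])  # PPM(E5): global guidance reused at every stage
    D = [None] * 5
    # Eq. (11): D5 = MFA(C5_3x3(Up(PPM(E5))) + E5)
    D[4] = mfas[4](stage_convs[4](upsample_to(g, E[4])) + E[4])
    # Eq. (12): Di = MFA(Ci_3x3(Up(PPM(E5))) + Up(C_3x3(D_{i+1})) + Ei)
    for i in range(3, -1, -1):
        D[i] = mfas[i](stage_convs[i](upsample_to(g, E[i]))
                       + upsample_to(lateral_conv(D[i + 1]), E[i])
                       + E[i])
    return D
```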
               3.4. Saliency reasoning
               We use deep supervision to improve the transparency of the hidden layer learning process. As shown in
               Figure 3, for the fusion features at different stages, we use a Conv1×1 and sigmoid activation function to
generate multiple predictions, namely $P_i$, where $i \in \{1, 2, 3, 4, 5\}$. We adopt the standard binary cross-entropy
               loss for training, which is defined as follows:

$$\mathcal{L} = \ell_{\mathrm{bce}}(P_1, G) + \lambda \sum_{i=2}^{5} \ell_{\mathrm{bce}}(P_i, G), \tag{13}$$

where $\ell_{\mathrm{bce}}$ is the standard binary cross-entropy loss function, and $G$ denotes the ground-truth saliency map. $\lambda$ denotes the weighting scalar for loss balance, which is set to 0.4.
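Eq. (13) translates almost line for line into PyTorch. The sketch below assumes each $P_i$ is a logit map already upsampled to the ground-truth resolution; `binary_cross_entropy_with_logits` folds the sigmoid of the prediction head into the loss for numerical stability.

```python
import torch.nn.functional as F

def deep_supervision_loss(preds, gt, lam=0.4):
    """preds: list of 5 logit maps [P1..P5] at ground-truth resolution;
    gt: binary ground-truth saliency map; lam: balance weight (0.4)."""
    loss = F.binary_cross_entropy_with_logits(preds[0], gt)   # main output P1
    for p in preds[1:]:                                       # auxiliary P2..P5
        loss = loss + lam * F.binary_cross_entropy_with_logits(p, gt)
    return loss
```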


               4. RESULTS

               4.1. Experimental setup
               4.1.1 Implementation details
               This paper uses the PyTorch library to implement the proposed method. Our model is pre-trained on the
ImageNet dataset. The training set of the DUTS [45] dataset (DUTS-TR) is used for model training. In addition,
               we also validate our proposed method on the traffic dataset TSOD, using its first 2,000 images for training and
the rest for testing. All experiments are performed using the Adam optimizer, with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay of $10^{-4}$, and batch size of 20. We use a poly learning rate scheduler, so that the learning rate for the $t$-th epoch is $lr\_init \times \left(1 - \frac{t}{\#epochs}\right)^{power}$, where $lr\_init = 5 \times 10^{-4}$ and $power = 0.9$. We trained the proposed model for 300 epochs, i.e., $\#epochs = 300$. All experiments are run on a server with an NVIDIA RTX 3090 GPU and an AMD Ryzen Threadripper 3960X (2.2 GHz) CPU.
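The stated schedule maps directly onto PyTorch's `LambdaLR`; below is a minimal sketch of the optimizer and scheduler setup under these hyperparameters (the one-layer `model` is a placeholder for the proposed network):

```python
import torch

model = torch.nn.Conv2d(3, 1, 3)  # placeholder for the proposed network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
num_epochs, power = 300, 0.9

# Multiplies the base lr (5e-4) by (1 - t / #epochs)^power at epoch t.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda t: (1.0 - t / num_epochs) ** power)

for epoch in range(num_epochs):
    # ... one training epoch: forward pass, Eq. (13) loss, optimizer.step() ...
    scheduler.step()
```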