
Page 508 | Liu et al. Intell Robot 2024;4(4):503-23 | http://dx.doi.org/10.20517/ir.2024.29






























               Figure 3. Overall encoder-decoder architecture of the proposed SANet. E_i represents the encoder at the i-th stage, and D_i indicates
               the decoder at the i-th stage. The figure also labels the output feature maps of the encoder and the decoder at the i-th stage, the
               predicted saliency maps (the one with index 1 is the final prediction result), and the ground-truth saliency map. PPM: pyramid
               pooling module; MFA: multi-scale feature aggregation.


               CNN is that it lacks rotation invariance and therefore requires a large amount of training data [36]. To this end,
               Hinton of Google Brain proposed the capsule network (CapsNet), which can capture structural information [37].
               CapsNet uses a dynamic routing algorithm to capture the part-whole relationships in an image and thereby
               enhance the equivariance of the network. Building on this advantage, a series of studies and applications based
               on the CapsNet structure have been proposed. Saqur et al. proposed a new algorithm, CapsGAN, after showing
               the weakness of the CNN-based generative adversarial network (GAN) architecture in generating 3D images [38].
               Cheng et al. proposed the complex-valued dense CapsNet (Cv-CapsNet) and the complex-valued diverse CapsNet
               (Cv-CapsNet++) for image classification [39]. Sun et al. proposed a deep tensor capsule network that uses a
               new tensor capsule-based routing algorithm and the corresponding convolution operation [40].
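
               The routing idea mentioned above can be illustrated with a minimal sketch of routing-by-agreement as it is
               commonly described for CapsNet. This is not the SANet authors' code and not the implementation of any cited
               work. The function names (squash, softmax, dynamic_routing), the list-based data layout, and the default of
               three iterations are assumptions made only for illustration.

```python
import math


def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iters=3):
    """Route part capsules to whole capsules by agreement.

    u_hat[i][j] is the prediction vector (a list of floats) that part
    capsule i makes for whole capsule j. Returns (v, c): the whole-capsule
    output vectors and the coupling coefficients of the last iteration.
    """
    n_parts, n_wholes = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    # Routing logits start at zero, so every part initially votes uniformly.
    b = [[0.0] * n_wholes for _ in range(n_parts)]
    v, c = [], []
    for _ in range(num_iters):
        # Each part distributes its vote over all wholes.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_wholes):
            # Weighted sum of the predictions for whole j, then squash.
            s_j = [
                sum(c[i][j] * u_hat[i][j][k] for i in range(n_parts))
                for k in range(dim)
            ]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the output (dot product).
        for i in range(n_parts):
            for j in range(n_wholes):
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v, c
```

               In this sketch, parts whose predictions agree on a whole reinforce each other: their coupling coefficients
               toward that whole grow over the iterations, while a whole that receives conflicting predictions ends up
               with a short output vector.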

               Due to the unique advantages of CapsNet, it has also been successfully applied to the task of SOD. For
               example, Liu et al. optimized deep unsupervised SOD by using the part-whole relationship characteristics of
               CapsNet [41]. Zhang et al. used the attention mechanism to let CNN and CapsNet features interact and thus
               better detect salient objects [42]. Liu et al. integrated the advantages of CNN and CapsNet, extracting different
               semantic information from each and letting the two interact to generate better saliency prediction maps [43].
               The design of the SAFE module in this paper also draws on some ideas from CapsNet.



               3. THE PROPOSED METHOD
               In this section, the proposed SOD framework is presented. Section 3.1 describes the overall network structure.
               Section 3.2 introduces the SAFE module, which can adaptively extract and filter features according to the scale
               differences of salient objects. Section 3.3 explains the decoder design based on the MFA module.


               3.1. Overall network architecture
               As shown in Figure 3, the overall network structure of SANet comprises a bottom-up encoder, a top-down
               decoder, and a lateral connection between them. The encoder is built with SAFE modules as units and is
               divided into five stages. In these five stages, we downsample the input using dilated DSConv3×3 with a stride