

               4.1.2 Datasets
               We validate our proposed method on six common datasets: DUTS, DUT-OMRON [46], ECSSD [47],
               PASCAL-S [48], HKU-IS [49], and SOD [50]. In addition, we verify the advantages of our method over
               other SOD methods in traffic scenes on the traffic dataset TSOD.


               The DUTS dataset comprises two subsets: the training set, DUTS-TR, and the test set, DUTS-TE. DUTS-TR
               is used for SANet training, while DUTS-TE is reserved for testing. DUTS-TR includes 10,553 images
               from ImageNet, each annotated at the pixel level. The test set contains 5,019 images selected from ImageNet
               and SUN, together with their pixel-level labels. DUT-OMRON features 5,168 images depicting complex scenes
               with rich content, accompanied by pixel-level labels. ECSSD consists of 1,000 images with pixel-level labels;
               the high level of interference in both the foreground and background makes it a challenging dataset.
               PASCAL-S includes 850 images and their pixel-level labels, showcasing relatively complex scenes.
               HKU-IS contains 4,447 images and their pixel-level labels, and almost all images have multiple salient objects.
               SOD contains 300 images and their pixel-level labels, in which the color contrast between salient objects and the
               background is low. TSOD consists of 2,316 images of traffic scenes with relatively complex content, along with
               their pixel-level labels.

               4.1.3 Evaluation criteria
               This paper evaluates the effectiveness of the proposed model using the maximum F-measure (maxF), average
               F-measure (avgF), mean absolute error (MAE), and S-measure (S). Additionally, the efficiency of the model is
               assessed through the number of model parameters (#Param), the number of floating-point operations (FLOPs),
               and the inference speed in frames per second (FPS).


               F-measure is an evaluation method that comprehensively considers precision and recall, which is defined as
               follows:

               $$F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R}, \qquad (14)$$

               where $P$ and $R$ represent precision and recall, respectively. We set $\beta^2 = 0.3$ to emphasize the
               importance of precision.
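
               As an illustration, Eq. (14) can be evaluated from a binarized prediction as in the following Python sketch;
               the threshold sweep used to obtain maxF and avgF, as well as the function names, are our own assumptions
               rather than the paper's released code:

               ```python
               import numpy as np

               def f_beta(pred, gt, threshold, beta2=0.3):
                   """F-measure of Eq. (14) at one binarization threshold.

                   pred: float saliency map in [0, 1]; gt: binary ground-truth mask.
                   beta2 is beta^2 = 0.3, emphasizing precision as in the paper.
                   """
                   binary = (pred >= threshold).astype(np.float32)
                   tp = (binary * gt).sum()
                   precision = tp / (binary.sum() + 1e-8)
                   recall = tp / (gt.sum() + 1e-8)
                   return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

               def max_and_avg_f(pred, gt, num_thresholds=255):
                   """maxF / avgF: sweep thresholds and take the maximum / mean F-measure."""
                   scores = [f_beta(pred, gt, t / num_thresholds) for t in range(1, num_thresholds)]
                   return max(scores), float(np.mean(scores))
               ```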
               MAE aims to measure the difference between the predicted saliency map $S$ and the ground truth $G$, which is
               calculated as follows:

               $$\mathrm{MAE}(S, G) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| S_{ij} - G_{ij} \right|, \qquad (15)$$

               where $H$ and $W$ represent the height and width of the saliency map, respectively, and $S_{ij}$ and $G_{ij}$
               represent the pixel values at the $i$-th row and $j$-th column of $S$ and $G$.
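
               A direct NumPy transcription of Eq. (15) might look like the sketch below (function and argument names are
               illustrative):

               ```python
               import numpy as np

               def mae(pred, gt):
                   """Mean absolute error of Eq. (15): mean of |S_ij - G_ij| over all H x W pixels.

                   Both maps are assumed to be normalized to [0, 1] and to share the same shape.
                   """
                   pred = pred.astype(np.float64)
                   gt = gt.astype(np.float64)
                   return np.abs(pred - gt).mean()
               ```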


               $S$ is used to evaluate the structural similarity between the predicted saliency map and the ground truth and is
               calculated by:

               $$S = \alpha \times S_o + (1 - \alpha) \times S_r, \qquad (16)$$

               where $S_o$ represents the target structure similarity, $S_r$ represents the regional structure similarity, and
               $\alpha$ is set to 0.5.
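
               Equation (16) itself is only a weighted combination; the $S_o$ and $S_r$ terms are computed separately
               following the original S-measure formulation. Assuming those two terms have already been obtained, the
               combination might be written as in the following sketch (function and variable names are illustrative):

               ```python
               def combine_s_measure(s_object, s_region, alpha=0.5):
                   """Eq. (16): S = alpha * S_o + (1 - alpha) * S_r.

                   s_object and s_region are assumed to be the precomputed target (object-level)
                   and regional structure similarity terms, each in [0, 1].
                   """
                   return alpha * s_object + (1.0 - alpha) * s_region

               # Example: with S_o = 0.82, S_r = 0.78, and alpha = 0.5, S = 0.80.
               print(combine_s_measure(0.82, 0.78))  # 0.8
               ```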


               In this paper, #Param is measured in millions (M) and FLOPs are measured in giga (G). FLOPs are used to
               measure the computational effort of the model. FPS indicates the number of images that the model can infer
               per second on an NVIDIA RTX 3090 GPU. For all SOD methods, we use 336×336 inputs and the same
               hardware and training strategy.
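
               As a reference point (not the authors' released profiling code), #Param and FPS can be measured with standard
               PyTorch calls roughly as follows; FLOPs are usually obtained with a third-party profiler such as thop or fvcore,
               which we omit here. The sketch assumes a CUDA device and an already constructed model:

               ```python
               import time
               import torch

               def count_parameters_m(model):
                   """#Param in millions (M): total number of learnable parameters."""
                   return sum(p.numel() for p in model.parameters()) / 1e6

               @torch.no_grad()
               def measure_fps(model, input_size=336, runs=100, device="cuda"):
                   """FPS: images inferred per second for a single 336x336 input on the GPU."""
                   model = model.to(device).eval()
                   x = torch.randn(1, 3, input_size, input_size, device=device)
                   for _ in range(10):          # warm-up iterations before timing
                       model(x)
                   torch.cuda.synchronize()
                   start = time.time()
                   for _ in range(runs):
                       model(x)
                   torch.cuda.synchronize()
                   return runs / (time.time() - start)
               ```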