4.1.2 Datasets
We validate our proposed method on six common datasets: DUTS, DUT-OMRON [46], ECSSD [47],
PASCAL-S [48], HKU-IS [49], and SOD [50]. In addition, we verify the advantages of our method over
other SOD methods in traffic scenarios on the traffic dataset TSOD.
The DUTS dataset comprises two subsets: the training set, DUTS-TR, and the test set, DUTS-TE. DUTS-
TR is used for SANet training, while DUTS-TE is reserved for testing. DUTS-TR includes 10,553 images
from ImageNet, each annotated at the pixel level. The test set contains 5,019 images selected from ImageNet
and SUN, together with their pixel-level labels. DUT-OMRON features 5,168 images depicting complex scenes with rich
content, accompanied by pixel-level labels. ECSSD consists of 1,000 images with pixel-level labels; the high level of
interference in both the foreground and the background makes it a challenging dataset. PASCAL-S includes 850 images
and their pixel-level labels, showcasing relatively complex scenes.
HKU-IS contains 4,447 images and their pixel-level labels, and almost all images have multiple salient objects.
SOD contains 300 images and their pixel-level labels, where the color contrast between salient objects and the
background is low. TSOD consists of 2,316 images of traffic scenes with relatively complex content, along with
their pixel-level labels.
4.1.3 Evaluation criteria
This paper evaluates the effectiveness of the proposed model using the maximum F-measure (maxF), average
F-measure (avgF), mean absolute error (MAE), and S-measure (S). Additionally, the efficiency of the model is
assessed through the number of model parameters (#Param), the number of floating-point operations (FLOPs), and the frames per second (FPS).
F-measure is an evaluation method that comprehensively considers precision and recall, which is defined as
follows:
$$
F_{\beta} = \frac{(1+\beta^{2}) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}}, \tag{14}
$$

where Precision and Recall represent precision and recall, respectively. We set $\beta^{2} = 0.3$ to emphasize the importance of precision.
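
For concreteness, the following is a minimal sketch (not the authors' evaluation code) of how the F-measure in Eq. (14) can be computed for a single binarization threshold; maxF and avgF then correspond to the maximum and the average of this score over a sweep of thresholds. The function and argument names are illustrative.

```python
import numpy as np

def f_measure(pred, gt, beta_sq=0.3, thresh=0.5):
    """F-measure of Eq. (14) for one threshold; pred and gt are 2-D arrays in [0, 1]."""
    pred_bin = pred >= thresh                      # binarize the predicted saliency map
    gt_bin = gt >= 0.5                             # ground truth assumed (near-)binary
    tp = np.logical_and(pred_bin, gt_bin).sum()    # true-positive pixels
    precision = tp / (pred_bin.sum() + 1e-8)
    recall = tp / (gt_bin.sum() + 1e-8)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + 1e-8)

# maxF / avgF over a sweep of thresholds (illustrative usage):
# scores = [f_measure(pred, gt, thresh=t) for t in np.linspace(0, 1, 256)]
# max_f, avg_f = max(scores), sum(scores) / len(scores)
```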
MAE aims to measure the difference between the predicted saliency map $S$ and the ground truth $G$, which is calculated as follows:

$$
\mathrm{MAE}(S, G) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| S_{ij} - G_{ij} \right|, \tag{15}
$$

where $H$ and $W$ represent the height and width of the saliency map, respectively, and $S_{ij}$ and $G_{ij}$ represent the pixel values at the $i$-th row and $j$-th column of $S$ and $G$.
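
A correspondingly simple sketch of Eq. (15), again with illustrative names and assuming both maps are scaled to [0, 1], is:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error of Eq. (15); pred and gt are saliency maps in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return np.abs(pred - gt).mean()                # averages over all H x W pixels
```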
$S$ is used to evaluate the structural similarity between the predicted saliency map and the ground truth and is
calculated by:

$$
S = \alpha \times S_{o} + (1 - \alpha) \times S_{r}, \tag{16}
$$

where $S_{o}$ represents the target (object-aware) structure similarity, $S_{r}$ represents the regional structure similarity, and $\alpha$ is set
to 0.5.
In this paper, #Param is measured in millions (M) and FLOPs are measured in giga (G). FLOPs are used to
measure the computational effort of the model. FPS indicates the number of images that the model can infer
per second on an NVIDIA RTX 3090 GPU. For all SOD methods, we use a 336×336 input and the same
hardware and training strategy.
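
As a rough illustration of how such efficiency figures can be obtained (a sketch assuming a PyTorch model; `efficiency_stats` and its arguments are hypothetical, and FLOPs counting, which typically relies on a separate profiler, is omitted):

```python
import time
import torch

def efficiency_stats(model, input_size=336, runs=100, device="cuda"):
    """Estimate #Param (in millions) and FPS for a 336x336 input; FLOPs would
    typically be obtained from a separate profiler and are not computed here."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6   # parameter count in M
    x = torch.randn(1, 3, input_size, input_size, device=device)
    with torch.no_grad():
        for _ in range(10):                        # warm-up to stabilize GPU clocks
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    fps = runs / (time.time() - start)             # images inferred per second
    return params_m, fps
```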

