Page 504 Liu et al. Intell Robot 2024;4(4):503-23 | http://dx.doi.org/10.20517/ir.2024.29
1. INTRODUCTION
Salient object detection (SOD) aims to detect the most distinctive objects in natural images [1]. The initial SOD
model, inspired by cognitive psychology and neuroscience, was proposed by Itti et al. in 1998 [2]. Different
from traditional methods, Liu et al. formulated SOD as a binary labeling problem to separate salient objects
from the background and proposed a set of new features, including multi-scale contrast, center-surround
histogram, and color space distribution, to describe local, regional, and global salient objects [3]. They also
built the first large-scale image database for the quantitative evaluation of visual attention algorithms, which
inspired many researchers to propose more SOD models. SOD can be used in many fields, such as
object detection [4], person re-identification [5], and especially transportation. As shown in Figure 1, SOD
is widely used in road damage detection [6], assisted driving [7–9], etc. In autonomous driving vision systems,
SOD can quickly allocate attention to important objects for scene analysis [10,11]. However, heavyweight SOD
methods are difficult to apply in industrial scenarios with low computing power because of their large
computational cost and parameter count. In autonomous or assisted driving, the onboard computer
processes all objects in the traffic scene indiscriminately. This reduces the efficiency of information processing
and prolongs the processing time in some emergencies [12]. In some scenarios, only certain
objects need to be detected, such as vehicles ahead, traffic signs, and pedestrians on the roadside. This
is precisely the advantage of SOD. However, the following difficulties remain in applying SOD in the
field of intelligent transportation: (1) Since all objects that affect driving should be regarded as salient targets,
most driving scenes contain more than one salient target, which places higher demands on the model;
(2) Traffic scenes are extremely complex, and general SOD models cannot achieve good results; (3) Traffic
scenes require a higher model processing speed, which existing SOD models cannot deliver.
How to design and implement a SOD model that balances real-time speed and detection performance remains
a critical challenge.
Traditional SOD methods mainly rely on low-level image features and heuristic priors, but the lack of guidance
from high-level semantic information usually limits their accuracy. In recent years, with the rise of
convolutional neural networks (CNNs), especially fully convolutional networks (FCNs), deep learning-based
methods have pushed SOD to a new level. However, this performance is often achieved at
the expense of high computing costs and demanding software and hardware requirements [13]. For example,
the multi-scale interactive network (MINet) [14] with a VGG-16 backbone contains 162.38 M parameters and
requires 87.1 G floating-point operations (FLOPs). Although it demonstrates good detection performance, it cannot
be deployed in low computing power environments. Therefore, it is necessary to design a lightweight SOD
method with excellent performance to serve practical application scenarios.
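To make the two cost measures above concrete, the following minimal sketch counts the parameters and multiply-accumulate operations (MACs) of a single standard convolution layer. It is an illustration only and is not how the MINet figures were obtained. The helper names `conv2d_params` and `conv2d_macs` are hypothetical, the 224x224 resolution is an assumed example, and reports differ on whether one MAC counts as one or two FLOPs.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Learnable parameters in a standard 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def conv2d_macs(in_channels, out_channels, kernel_size, out_height, out_width):
    """Multiply-accumulate operations for one forward pass (bias ignored)."""
    per_output_pixel = kernel_size * kernel_size * in_channels * out_channels
    return per_output_pixel * out_height * out_width


# First convolution of VGG-16: 3 input channels, 64 output channels, 3x3 kernel.
params = conv2d_params(3, 64, 3)

# Assumed 224x224 output map (stride 1, padding 1); the resolution is illustrative.
macs = conv2d_macs(3, 64, 3, 224, 224)

print(f"parameters: {params}")
print(f"MACs: {macs / 1e6:.1f} M")
```

Summing such per-layer counts over a whole network gives totals of the kind quoted above. The parameter count is independent of input resolution, whereas the operation count grows with the size of the feature map.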
Cross-stage cross-scale network (CSNet) [15], hierarchical visual perception module-incorporated lightweight
SOD network (HVPNet) [16], and stereoscopically attentive multi-scale network (SAMNet) [17] are three
representative lightweight SOD methods. CSNet is made lightweight through a dynamic weight decay
pruning method, while HVPNet and SAMNet achieve a lightweight model by improving the network
structure. Compared with MINet, CSNet, HVPNet, and SAMNet have only 0.14, 1.24, and 1.33 M parameters,
respectively. However, although these models are lightweight enough, their detection
performance is poor, as shown in Figure 2, making them difficult to apply in some complex scenarios. Therefore,
realizing a SOD model that balances lightweight design and detection performance is a very challenging task. The
main difficulties this work faces are as follows: (1) A lightweight network has a simple structure and can
process only a small feature domain, so it cannot comprehensively represent salient objects. Simply using
existing lightweight backbone networks (MobileNet [18,19], ShuffleNet [20,21], etc.) directly for SOD tasks does not
produce ideal results, as will be demonstrated in the experiments; (2) In complex scenes, salient objects
vary in scale. How to make the model adaptively and dynamically perceive and extract the features of salient
objects is another difficult problem we need to deal with; (3) Current mainstream lightweight SOD methods
cannot simultaneously achieve both lightweight design and high performance. Properly balancing these two

