               1. INTRODUCTION
Salient object detection (SOD) aims to detect the most distinctive objects in natural images [1]. The initial SOD model, inspired by cognitive psychology and neuroscience, was proposed by Itti et al. in 1998 [2]. Different from traditional methods, Liu et al. formulated SOD as a binary labeling problem that separates salient objects from the background, and proposed a set of new features, including multi-scale contrast, center-surround histogram, and color spatial distribution, to describe salient objects locally, regionally, and globally [3]. They also built the first large-scale image database for the quantitative evaluation of visual attention algorithms, which inspired many researchers to propose further SOD models. SOD is used in many fields, such as object detection [4] and person re-identification [5], and especially in transportation. As shown in Figure 1, SOD is widely used in road damage detection [6], assisted driving [7–9], etc. In autonomous driving vision systems, SOD can quickly allocate attention to important objects for scene analysis [10,11]. However, heavyweight SOD methods are difficult to apply in industrial scenarios with low computing power because of their huge amounts of computation and parameters. In the field of autonomous or assisted driving, the onboard computer processes all objects in the traffic scene indiscriminately, which reduces the efficiency of information processing and prolongs the response time in emergencies [12]. In some special scenarios, only particular objects need to be detected, such as vehicles ahead, traffic signs, and pedestrians on the roadside; this is precisely where SOD excels. However, applying SOD in intelligent transportation still faces the following difficulties: (1) since all objects that affect driving should be regarded as salient targets, most driving scenes contain more than one salient target, which places higher requirements on the model; (2) traffic scenes are extremely complex, and general SOD models cannot achieve good results on them; (3) traffic scenes demand high processing speed, which existing SOD models cannot meet. How to design and implement a SOD model that delivers both real-time speed and strong detection performance remains a critical challenge.
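To make the classical features concrete, the following is a minimal sketch of the multi-scale center-surround contrast idea behind these early methods. It is an illustrative toy, not Itti et al.'s or Liu et al.'s actual formulation; the function name, the Gaussian-blur surround estimate, and the chosen scales are all assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_saliency(gray, sigmas=(2, 4, 8)):
    # gray: 2-D float image in [0, 1]; returns a saliency map in [0, 1].
    sal = np.zeros_like(gray)
    for sigma in sigmas:
        surround = gaussian_filter(gray, sigma)  # blurred local "surround"
        sal += np.abs(gray - surround)           # center-surround contrast
    peak = sal.max()
    return sal / peak if peak > 0 else sal

Pixels that stand out from their blurred neighborhood at several scales accumulate high saliency, which is the core intuition later made learnable by deep models.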

Traditional SOD methods mainly rely on low-level image features and heuristic priors, but the lack of guidance from high-level semantic information usually limits their accuracy. In recent years, with the rise of convolutional neural networks (CNNs), and especially fully convolutional networks (FCNs), deep learning-based methods have pushed SOD to a new level. However, this outstanding performance is often achieved at the expense of high computing costs and demanding software and hardware requirements [13]. For example, the multi-scale interactive network (MINet) [14] with a VGG-16 backbone contains 162.38 M parameters, and its floating-point operations (FLOPs) reach 87.1 G. Although it demonstrates good detection performance, it cannot be deployed in low computing power environments. Therefore, it is necessary to design a lightweight SOD method with excellent performance to serve real application scenarios.
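For reference, figures like those quoted for MINet can be measured for any CNN. The following is a minimal sketch, assuming PyTorch and torchvision are available (the paper prescribes neither); it counts learnable parameters and uses forward hooks to roughly estimate convolutional multiply-accumulate FLOPs, applied here to the VGG-16 backbone mentioned above purely as an illustration.

import torch
import torch.nn as nn
from torchvision.models import vgg16

def count_params(model):
    # Total number of learnable parameters.
    return sum(p.numel() for p in model.parameters())

def count_conv_flops(model, input_shape=(1, 3, 224, 224)):
    # Rough FLOP estimate: multiply-accumulates of Conv2d layers only,
    # counted with forward hooks during one dummy forward pass.
    flops = 0

    def hook(module, inputs, output):
        nonlocal flops
        out_h, out_w = output.shape[2], output.shape[3]
        kernel_ops = (module.in_channels // module.groups
                      * module.kernel_size[0] * module.kernel_size[1])
        flops += 2 * out_h * out_w * module.out_channels * kernel_ops

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.Conv2d)]
    with torch.no_grad():
        model(torch.zeros(input_shape))
    for h in handles:
        h.remove()
    return flops

model = vgg16()
print(f"params: {count_params(model) / 1e6:.2f} M")
print(f"conv FLOPs: {count_conv_flops(model) / 1e9:.1f} G")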

Cross-stage cross-scale network (CSNet) [15], hierarchical visual perception module-incorporated lightweight SOD network (HVPNet) [16], and stereoscopically attentive multi-scale network (SAMNet) [17] are three representative lightweight SOD methods. CSNet achieves its lightweight design through a dynamic weight decay pruning method, while HVPNet and SAMNet achieve model lightweighting by improving the network structure. Compared with MINet, the parameters of CSNet, HVPNet, and SAMNet are only 0.14, 1.24, and 1.33 M, respectively. However, it is worth noting that although these models are lightweight enough, their detection performance is poor, as shown in Figure 2, making them difficult to apply in complex scenarios. Therefore, realizing a SOD model that is both lightweight and accurate is a very challenging task. The main difficulties this work faces are as follows: (1) lightweight networks have simple structures and can only process a small feature domain, which cannot comprehensively represent salient objects; simply using existing lightweight backbone networks (MobileNet [18,19], ShuffleNet [20,21], etc., whose core building block is sketched after this paragraph) directly for SOD tasks does not produce ideal results, as will be demonstrated in the experiments; (2) in complex scenes, salient objects vary in scale, and making the model adaptively and dynamically perceive and extract their features is another problem we must deal with; (3) current mainstream lightweight SOD methods cannot simultaneously achieve lightweight design and high performance, and properly balancing these two objectives is the central difficulty of this work.
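As context for why backbones such as MobileNet are so small, the following is a minimal sketch of the depthwise-separable convolution they are built on, assuming PyTorch. It is a generic illustration of that building block, not this paper's network; the class name and the batch-norm/ReLU arrangement follow common MobileNet-style convention.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # MobileNet-style block: a per-channel (depthwise) 3x3 convolution
    # followed by a 1x1 (pointwise) convolution. Relative to a standard
    # k x k convolution, the cost drops by roughly 1/out_ch + 1/k^2.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# Parameter comparison against a standard 3x3 convolution:
std = nn.Conv2d(128, 128, 3, padding=1, bias=False)   # 128*128*9 = 147,456
sep = DepthwiseSeparableConv(128, 128)                # ~18,048 incl. BatchNorm
print(sum(p.numel() for p in std.parameters()),
      sum(p.numel() for p in sep.parameters()))

The roughly eightfold parameter reduction in this example shows where the lightness of such backbones comes from, and also why their reduced feature capacity can struggle to represent salient objects comprehensively.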