Liu et al. Intell Robot 2024;4(4):503-23 I http://dx.doi.org/10.20517/ir.2024.29 Page 507
SOD. Refs. [14] and [25] also designed a feature conversion module to improve the effectiveness of horizontal
feature transmission. To embed semantic information into the encoding and decoding processes, Chen et al.
and Jia et al. designed different global information enhancement modules, respectively [1,26]. The Transformer
model has facilitated a further enhancement in the level of SOD. However, the early transformer-based SOD
models [27] were relatively complex and were not very suitable for high-resolution SOD tasks. Although some
lightweight transformer networks [28] have been proposed in recent years and have reduced the number of
model parameters from the billion level to around 80 M, this is still not affordable for edge applications. Current SOD
methods based on large models still have the problem of high model complexity.
In summary, research on SOD has accumulated substantial results, and detection performance has reached
a practically usable level. However, this performance is achieved under ideal laboratory conditions; the
complexity and real-time performance of existing models cannot meet the requirements of weak-computing,
high-real-time scenarios.
2.2. Model lightweighting
Lightweight models have attracted attention in various fields due to their low computing resource requirements.
There are two main methods to build lightweight models. One is to use network pruning, model quantization,
or knowledge distillation to make complex models lightweight. Network pruning reduces the size of a neural
network by removing unnecessary connections or nodes [29]. Model quantization reduces the storage space and
computing resources required by the model by reducing the number of bits of the parameters and representing
the parameters as integers or fixed-point numbers with fewer bits [30] . The knowledge distillation method
achieves model lightweighting by transferring knowledge between large models and small models [31] . The
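To make the quantization idea concrete, the following is a minimal sketch of symmetric int8 post-training quantization for a single weight tensor. The weight values are hypothetical, and real toolchains additionally calibrate activations, use per-channel scales, and fuse dequantization into inference; this only illustrates how 32-bit floats are mapped to 8-bit integers plus one scale factor.

```python
# Symmetric int8 quantization sketch: each float weight is stored as an
# 8-bit integer in [-127, 127] together with a single shared scale.

def quantize_int8(weights):
    """Map float weights to int8 values using one symmetric scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.95]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each weight now occupies 8 bits instead of 32, at the cost of a
# rounding error bounded by scale / 2 per weight.
```

The storage saving is the 4x reduction from 32-bit to 8-bit parameters; the accuracy cost is the bounded rounding error, which is why quantization can noticeably degrade performance on sensitive models.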
other approach is to consider lightweight from the network design stage, to design an efficient and lightweight
backbone network. Lightweight network design has been a research hotspot in the field of deep learning
in recent years, aiming to provide efficient neural network models for mobile devices and edge computing.
Representative methods in this category include MobileNets [18,19] , EfficientNets [32,33] , and ShuffleNets [20,21] .
The most prominent feature of MobileNets is the use of depthwise separable convolutions instead of ordinary
convolutions to achieve model lightweighting. The characteristic of EfficientNets is that they use a compound
scaling strategy to design the network, controlling the model complexity by adjusting the model depth, the
network width, and the image resolution. ShuffleNets follow the design concept of sparse connectivity and
reduce computation and parameters by using group pointwise convolution and channel shuffle. In addition,
GhostNet [34] proposed by Huawei is also an excellent lightweight network, but the design ideas and technology
used in the model are similar to those mentioned above and will not be elaborated on here.
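The parameter saving behind the depthwise separable convolutions used by MobileNets can be shown with back-of-the-envelope arithmetic (the layer sizes below are illustrative, not taken from any of the cited networks):

```python
# Parameter count: standard convolution vs. the depthwise separable
# factorization (depthwise k x k conv + pointwise 1 x 1 conv).

def standard_conv_params(k, c_in, c_out):
    """k x k kernel applied across all input channels for each output channel."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixes channels
    return depthwise + pointwise

k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)        # 294912 parameters
sep = depthwise_separable_params(k, c_in, c_out)  # 33920 parameters
# Reduction factor is roughly 1/c_out + 1/k**2, here about 8.7x.
```

The same counting argument explains ShuffleNets' group pointwise convolution: splitting the 1 x 1 convolution into g groups divides its `c_in * c_out` term by g, and channel shuffle restores information flow between groups.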
The research on lightweight models for SOD is still in its infancy. Currently, the more representative ones in-
clude SAMNet and CSNet proposed by Professor Cheng et al. at Nankai University, and ELWNet [35] proposed
by Professor Zhang et al. at Northeastern University. Among them, ELWNet is achieved through feature do-
main conversion, CSNet realizes model lightweighting based on the dynamic weight decay pruning method,
and SAMNet is achieved by optimizing the network structure.
Currently, despite numerous studies on model lightweighting, there are still three problems: (1) Pruning, quantization,
and knowledge distillation significantly degrade model performance, leaving it insufficient to meet actual
needs; (2) Lightweight backbone networks have a small feature domain and cannot cope with complex detection
scenarios; (3) Research on lightweight models for SOD is still in its infancy. To
address these issues, we propose a more efficient and lightweight SOD model - SANet.
2.3. CapsNet
CNNs hold a dominant position in solving computer vision-related problems. However, they discard much valuable
information in the pooling process, such as the pose and position of the target. Another disadvantage of

