Page 281 Zhuang et al. Intell Robot 2024;4(3):276-92 | http://dx.doi.org/10.20517/ir.2024.18
Figure 2. (A) Original IR image; (B) Histogram equalization; (C) Mean filtering; (D) Median filtering; (E) IE-CGAN; (F) IE-CGAN inversion
algorithm. IE-CGAN: Image Enhancement Conditional Generative Adversarial Network.
image, which achieves satisfactory results.
3.2. MobileNetV3-YOLOv4 target detection model
Research in target detection broadly splits into two schools: Two-Stage and One-Stage detection. Two-Stage
algorithms, represented by Faster R-CNN [29], appeared earlier but generally suffer from large model size and
slow inference. Redmon et al. proposed the pioneering One-Stage algorithm, YOLO [10], to address these
drawbacks. Our approach improves YOLOv4 to strike a balance between detection speed and detection
accuracy.
The YOLOv4 [30] model proposed by Bochkovskiy et al. improves on the previous version in many respects.
Figure 3 depicts its structure, which can be divided into three components: backbone, neck, and head. YOLOv4
adopts Cross Stage Partial Networks (CSPNet), upgrading the original backbone network Darknet53 into
CSPDarknet53. CSPDarknet53 splits the feature map of the base layer, copying one part directly to the next
stage while the other part passes through the dense block. This allows gradients to flow along both paths and
be integrated into the feature map, effectively mitigating the vanishing-gradient problem. In YOLOv4,
the Spatial Pyramid Pooling (SPP) structure is a new component added to the neck. It first divides the input
feature map into segments, then applies pooling operations with kernels of different sizes in each segment to
obtain pooled results over various sizes and receptive fields. Figure 4 demonstrates pooling at three scales as
an example: max pooling is performed on the feature map to obtain 1 × d, 4 × d, and 16 × d features, where
d denotes the feature map's dimension. These pooled results are concatenated into a fixed-length vector that
serves as the input to the next layer. Because SPP processes the input feature map at multiple scales, it captures
more comprehensive scene information and improves the detection network's adaptability to objects of
different scales. Regarding feature fusion, YOLOv4 adopts a Path
Aggregation Network (PAN), which complements Feature Pyramid Networks (FPN). In convolutional neural
networks, the deep layers respond strongly to semantic features but retain little geometric information, which
limits their usefulness for target localization. In contrast, the shallow layers preserve geometric and image
detail but carry few semantic features, making them unfit for image classification alone. FPN is a top-down feature

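The cross-stage partial connection described above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the channels are split in half, one half bypasses the stage while the other is transformed (a ReLU stands in for the real convolutional dense block), and the two halves are concatenated so that gradients can reach the base layer through the untransformed path.

```python
import numpy as np

def csp_block(x, transform):
    """Cross Stage Partial connection sketch (illustrative only).

    x: feature map of shape (H, W, C). The channels are split in half:
    one half is copied unchanged to the stage output, the other passes
    through the stage's transform. Concatenating the two halves keeps a
    direct path from the output back to the base layer, which is what
    mitigates vanishing gradients in CSPNet.
    """
    c = x.shape[-1] // 2
    part1, part2 = x[..., :c], x[..., c:]
    return np.concatenate([part1, transform(part2)], axis=-1)

# Hypothetical example: a ReLU stands in for the dense block.
fm = np.random.rand(8, 8, 64)
out = csp_block(fm, lambda t: np.maximum(t, 0.0))
print(out.shape)  # (8, 8, 64)
```

Note that the output preserves the channel count, so the block can be stacked stage after stage as in CSPDarknet53.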

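The SPP scheme described above (grids of 1, 2, and 4 segments per side, yielding 1 × d, 4 × d, and 16 × d pooled features concatenated into a fixed-length vector) can be sketched as follows. This is a NumPy illustration of the classic SPP idea under the paper's description, not the authors' code; the grid sizes and input shape are example choices.

```python
import numpy as np

def spp_pool(feature_map, levels=(1, 2, 4)):
    """Spatial Pyramid Pooling sketch.

    feature_map: array of shape (H, W, d).
    For each pyramid level n, the map is split into an n x n grid and
    max-pooled per cell, giving n*n vectors of length d. All levels are
    concatenated into one vector of length (1 + 4 + 16) * d for the
    default levels -- fixed regardless of H and W.
    """
    h, w, d = feature_map.shape
    pooled = []
    for n in levels:
        # np.array_split tolerates H or W not divisible by n.
        rows = np.array_split(np.arange(h), n)
        cols = np.array_split(np.arange(w), n)
        for r in rows:
            for c in cols:
                cell = feature_map[np.ix_(r, c)]      # (|r|, |c|, d)
                pooled.append(cell.max(axis=(0, 1)))  # (d,)
    return np.concatenate(pooled)

fm = np.random.rand(13, 13, 8)
print(spp_pool(fm).shape)  # (168,) = (1 + 4 + 16) * 8
```

Because the output length depends only on the pyramid levels and the channel dimension d, the concatenated vector can feed a fixed-size next layer regardless of the input resolution, which is what gives the detector its scale adaptability.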