Figure 2. (A) Original IR image; (B) Histogram equalization; (C) Mean filtering; (D) Median filtering; (E) IE-CGAN; (F) IE-CGAN inversion algorithm. IE-CGAN: Image Enhancement Conditional Generative Adversarial Network.


image, which achieves satisfactory results.


3.2. MobileNetV3-YOLOv4 target detection model
Research on target detection can be broadly split into two leading schools: Two-Stage and One-Stage detection. Two-Stage algorithms, represented by Faster R-CNN [29], appeared earlier but generally suffer from large models and slow inference. Redmon et al. proposed the pioneering One-Stage algorithm, YOLO [10], to address these drawbacks. Our approach improves YOLOv4 to strike a balance between detection speed and detection accuracy.

The YOLOv4 [30] model proposed by Bochkovskiy et al. has been upgraded in many aspects compared to the previous version. Figure 3 depicts its structure, which can be divided into three components: backbone, neck, and head. YOLOv4 draws on Cross Stage Partial Networks (CSPNet) and upgrades the original backbone network Darknet53 into CSPDarknet53. CSPDarknet53 splits the base layer's feature map into two parts: one copy passes through the dense block while the other is routed directly to the next stage, where the two are merged. This integrates the gradient changes into the feature map and effectively alleviates the vanishing-gradient problem.
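
To make the split-and-merge idea concrete, the following is a minimal PyTorch sketch of a CSP-style block, not the authors' code: the class name, branch names, and layer choices are our own illustrative assumptions. One branch bypasses the dense stage, the other passes through it, and concatenation gives gradients a short path back to the base layer.

import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Minimal CSP-style block (illustrative sketch, not the paper's code):
    split the base layer's feature map into two branches, transform one
    through a small dense stage, and merge the two by concatenation."""
    def __init__(self, channels, num_blocks=2):
        super().__init__()
        half = channels // 2
        self.bypass = nn.Conv2d(channels, half, 1, bias=False)  # shortcut branch
        self.entry = nn.Conv2d(channels, half, 1, bias=False)   # dense-stage input
        self.dense = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, 3, padding=1, bias=False),
                nn.BatchNorm2d(half),
                nn.LeakyReLU(0.1),
            )
            for _ in range(num_blocks)
        ])
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)  # merge transition

    def forward(self, x):
        shortcut = self.bypass(x)            # copied feature map, skips the dense stage
        deep = self.dense(self.entry(x))     # feature map sent through the dense stage
        return self.fuse(torch.cat([shortcut, deep], dim=1))

x = torch.randn(1, 64, 52, 52)
print(CSPBlock(64)(x).shape)                 # torch.Size([1, 64, 52, 52])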
               the Spatial Pyramid Pooling (SPP) structure is a new component added to the neck. It first divides the input
               feature map into segments. Then, it applies pooling operations with different sizes of pooling kernels in each
               segment to obtain pooled results for various sizes and receptive fields. Figure 4 demonstrates the pooling
               of three dimensions as an example. The maximum pooling is performed on the feature map to obtain 1 ×
               d, 4 × d, and 16 × d features, respectively, representing the feature map’s dimension. These pooled results
               are concatenated into a fixed-length vector as the input of the next layer. As SPP processes the input feature
               map at multiple scales, it can capture more comprehensive scene information and enhance the adaptability of
               the object detection network to objects of different scales. Regarding feature fusion, YOLOv4 adopts a Path
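
A hedged sketch of the fixed-length pyramid pooling the paragraph and Figure 4 describe: grids of 1 × 1, 2 × 2, and 4 × 4 yield 1 × d, 4 × d, and 16 × d features that are concatenated into one vector. The class name and grid tuple are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedLengthSPP(nn.Module):
    """Pool a d-channel feature map into 1x1, 2x2, and 4x4 grids with max
    pooling, giving 1*d, 4*d, and 16*d features, then concatenate them into
    one fixed-length vector regardless of the input's spatial size."""
    def __init__(self, grids=(1, 2, 4)):
        super().__init__()
        self.grids = grids

    def forward(self, x):                                    # x: (batch, d, H, W)
        pooled = [F.adaptive_max_pool2d(x, g).flatten(1) for g in self.grids]
        return torch.cat(pooled, dim=1)                      # (batch, 21 * d)

x = torch.randn(2, 256, 13, 13)
print(FixedLengthSPP()(x).shape)                             # torch.Size([2, 5376])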
Regarding feature fusion, YOLOv4 adopts a Path Aggregation Network (PAN), which complements Feature Pyramid Networks (FPN). In convolutional neural networks, the deep layers respond strongly to semantic features but retain little geometric information, which is insufficient for localizing targets. In contrast, the shallow layers respond strongly to image detail but carry few semantic features, making them unfit for classification. FPN is a top-down feature fusion path that propagates the deep layers' semantics to the shallow layers, and PAN adds a bottom-up path that carries the shallow layers' localization detail back to the deep layers.
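
To make the complementary top-down/bottom-up idea concrete, here is a minimal sketch over three backbone scales; it is our own simplification, not YOLOv4's exact neck, and the class name, channel widths, and fusion-by-addition choice are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNPANNeck(nn.Module):
    """Sketch of FPN + PAN fusion: a top-down (FPN) pass spreads deep
    semantic features to shallow levels, then a bottom-up (PAN) pass
    carries shallow localization detail back to the deep levels."""
    def __init__(self, in_channels=(128, 256, 512), out_channels=128):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.downsample = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
            for _ in in_channels[:-1])

    def forward(self, feats):                  # feats ordered shallow -> deep
        lat = [conv(f) for conv, f in zip(self.lateral, feats)]
        for i in range(len(lat) - 2, -1, -1):  # top-down: upsample deep, add to shallow
            lat[i] = lat[i] + F.interpolate(lat[i + 1], scale_factor=2, mode="nearest")
        for i in range(len(lat) - 1):          # bottom-up: downsample shallow, add to deep
            lat[i + 1] = lat[i + 1] + self.downsample[i](lat[i])
        return lat                             # one fused map per scale, for the heads

feats = [torch.randn(1, 128, 52, 52), torch.randn(1, 256, 26, 26),
         torch.randn(1, 512, 13, 13)]
print([f.shape[-1] for f in FPNPANNeck()(feats)])   # [52, 26, 13]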