

               through this mechanism.


Since MobileNetV3 meets our practical performance requirements, we replace CSPDarknet53 with it as the backbone to obtain the MobileNetV3-YOLOv4 model. The model structure is shown in Figure 5. MobileNetV3 reduces model size and computational requirements while maintaining high performance. Its architecture of inverted residual blocks built from depthwise separable convolutions, together with squeeze-and-excitation attention and the h-swish activation, provides efficient feature extraction, making it particularly effective for IR imaging, where challenges such as low contrast and noise interference are prevalent. Moreover, MobileNetV3's efficient integration of local, global, and input features enhances its ability to identify objects accurately across different scales, which improves performance on complex IR imaging tasks. Empirical results confirm that incorporating MobileNetV3 into YOLOv4 maintains high accuracy while reducing computational load and model size. Although MobileNetV3 did not perform as well as expected in YOLOv3, its lightweight and efficient feature learning significantly improves IR target detection in YOLOv4. This improvement is not only theoretically reasonable; its effectiveness in practical applications has also been verified through experiments.
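As a concrete illustration of the backbone swap, the following is a minimal PyTorch sketch, assuming torchvision's MobileNetV3-Large stands in for the backbone; the extract_pyramid helper and the 416 x 416 input size are our assumptions, not the authors' implementation.

    import torch
    import torchvision

    # Sketch: collect the deepest feature map at strides 8, 16, and 32 from
    # MobileNetV3-Large; these three maps would feed the SPP/PAN neck and
    # the three YOLO heads. (Backbone choice and input size are assumptions.)
    backbone = torchvision.models.mobilenet_v3_large(weights=None).features

    def extract_pyramid(x):
        in_hw = x.shape[-1]
        by_stride = {}
        for layer in backbone:
            x = layer(x)
            by_stride[in_hw // x.shape[-1]] = x  # keep deepest map per stride
        return [by_stride[s] for s in (8, 16, 32)]

    p3, p4, p5 = extract_pyramid(torch.randn(1, 3, 416, 416))
    print([t.shape for t in (p3, p4, p5)])  # 52x52, 26x26, 13x13 feature maps

Keeping only the last map seen at each stride mirrors how detection backbones are usually tapped: the deepest layer at a given resolution carries the richest features for that scale.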
The MobileNetV3 network first extracts the features of the input image. The SPP module then applies maximum pooling at several scales to the preceding feature layer and concatenates the pooled results into a new feature layer, which increases the depth of the network, preserves the preceding features, and captures more local feature information. The PAN block upsamples and downsamples the features extracted by MobileNetV3 to improve the information-extraction capability of the FPN block. The feature network and feature layers are fused by adaptive pooling over different layers, and the fused results are passed to the YOLO Head for regression and classification. The YOLO Head divides the input image into grids of corresponding sizes, and the classification results and confidence levels of the objects are finally obtained from the predefined anchor boxes.
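To make the SPP step concrete, here is a minimal PyTorch sketch, assuming the kernel sizes 5, 9, and 13 that YOLOv4's SPP block conventionally uses; the class name and defaults are ours.

    import torch
    import torch.nn as nn

    class SPP(nn.Module):
        # Spatial pyramid pooling: parallel max-pools at several kernel
        # sizes (stride 1, 'same' padding) concatenated with the input.
        def __init__(self, kernels=(5, 9, 13)):
            super().__init__()
            self.pools = nn.ModuleList(
                [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels]
            )

        def forward(self, x):
            # The input itself is kept in the concatenation, so the
            # preceding layer's features survive alongside pooled context.
            return torch.cat([x] + [p(x) for p in self.pools], dim=1)

Because every pool uses stride 1 with padding k // 2, all branches keep the input's spatial size, so the channel-wise concatenation is valid; the untouched input occupies the first slot, which is exactly the "preserves the preceding features" behavior described above.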



               4. EXPERIMENTS
The experiments were conducted on a computer running Ubuntu 18.04. The CPU was an Intel(R) Core(TM) i5-9300H at 2.40 GHz, the GPU was an NVIDIA GeForce GTX 1650, and the machine had 64 GB of memory. To test the performance of the IR image object detection model proposed in this article, we utilized the public FLIR IR dataset and the KAIST IR pedestrian dataset. First, we compared the latest detection algorithms and models on these datasets in terms of detection accuracy, speed, and model size. Second, ablation experiments were conducted on the enhanced model to assess the effectiveness of the individual improvements.
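For reference, the following is a minimal sketch of how such a speed/size comparison could be measured; the 416 x 416 input, warm-up count, run count, and the profile helper are assumptions, not the authors' evaluation code.

    import time
    import torch

    def profile(model, size=(1, 3, 416, 416), runs=50):
        # Report parameter count (model size) and mean per-image latency.
        device = next(model.parameters()).device
        x = torch.randn(*size, device=device)
        params = sum(p.numel() for p in model.parameters())
        model.eval()
        with torch.no_grad():
            for _ in range(5):      # warm-up passes before timing
                model(x)
            start = time.perf_counter()
            for _ in range(runs):
                model(x)
            latency = (time.perf_counter() - start) / runs
        print(f"params: {params / 1e6:.2f} M, "
              f"latency: {latency * 1e3:.1f} ms ({1.0 / latency:.1f} FPS)")

Parameter count tracks model size, and averaged wall-clock latency over repeated forward passes approximates detection speed; on a GPU, torch.cuda.synchronize() should be called before reading the timer.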


               4.1. Datasets
4.1.1 The FLIR IR dataset
The dataset was open-sourced by FLIR in July 2018 and has been applied to many IR image target detection training tasks. It provides two types of images: annotated thermal images and corresponding unannotated RGB images. The FLIR dataset contains 14,452 IR images, of which 10,228 come from multiple short videos and 4,224 from a single 144 s video. All of the images were taken on actual streets and highways. Figure 6 shows samples from the FLIR dataset.


4.1.2 The KAIST dataset
               The KAIST IR pedestrian dataset is a widely used benchmark for evaluating algorithms for detecting objects in
               IR images. The dataset comprises 95,328 pairs of images, each with a resolution of 640 × 512. The dataset offers
               meticulous manual annotations and well-matched visible and IR image pairs. It provides comprehensive cov-
               erage, spanning diverse traffic scenarios such as campuses, streets, and rural areas. Annotations differentiate
               between “person” for individual pedestrians and “people” for groups where individuals are more challeng-
               ing to discern. We extracted 15,684 consecutive images from the raw data to streamline model training and
performance evaluation. Experimental outcomes validate the dataset's efficacy in achieving high detection accuracy.