
Zhuang et al. Intell Robot 2024;4(3):276-92  |  http://dx.doi.org/10.20517/ir.2024.18  Page 278

               YOLO series algorithms provide efficient and accurate solutions for real-time target detection, while IR target
               recognition relies on IR images, which makes it useful in night vision and adverse weather conditions.


               2.1. Deep learning networks
               Currently, deep learning methods for target detection fall into two main categories: two-stage and one-stage
               detection algorithms. One-stage detection algorithms such as YOLO and the Single Shot MultiBox Detector (SSD)
               typically use a Fully Convolutional Network (FCN) to predict bounding boxes and class scores directly from the
               input image. While they offer fast processing speed, their accuracy in detecting small objects is relatively
               low. Two-stage detection algorithms, such as R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN, first generate
               region proposals and then classify and refine them; they capture target details more effectively but operate
               at slower detection speeds.
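               To make "predicting directly from the input image" concrete, the following is a minimal sketch of how the
               output of a YOLO-style one-stage head might be decoded into boxes. It assumes a simplified YOLOv1-like layout
               in which each cell of an S x S grid predicts a single box as (x, y, w, h, confidence), with (x, y) relative to
               the cell and (w, h) relative to the whole image. The function name decode_grid and the default threshold are
               hypothetical; real YOLO versions add anchors, several boxes per cell, class scores and non-maximum suppression.

```python
from typing import List, Tuple

# One cell's raw prediction: (x, y, w, h, conf) in the hypothetical layout above.
CellPred = Tuple[float, float, float, float, float]
# Decoded box: (x1, y1, x2, y2, score) in pixels.
Box = Tuple[float, float, float, float, float]


def decode_grid(preds: List[List[CellPred]], img_w: int, img_h: int,
                conf_thresh: float = 0.5) -> List[Box]:
    """Decode an S x S grid of per-cell predictions into image-space boxes.

    Assumed layout: (x, y) is the box centre as a fraction of its cell,
    (w, h) are fractions of the image size, conf is an objectness score.
    """
    s = len(preds)  # grid size S (assumed square)
    boxes: List[Box] = []
    for row in range(s):
        for col in range(s):
            x, y, w, h, conf = preds[row][col]
            if conf < conf_thresh:
                continue  # drop low-confidence cells; no region-proposal stage is used
            cx = (col + x) / s * img_w  # box centre x in pixels
            cy = (row + y) / s * img_h  # box centre y in pixels
            bw, bh = w * img_w, h * img_h  # box size in pixels
            boxes.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2, conf))
    return boxes


if __name__ == "__main__":
    grid = [[(0.5, 0.5, 0.1, 0.1, 0.1), (0.5, 0.5, 0.1, 0.1, 0.2)],
            [(0.5, 0.5, 0.25, 0.5, 0.9), (0.5, 0.5, 0.1, 0.1, 0.3)]]
    print(decode_grid(grid, 100, 100))
```

               Every grid cell is scored in one forward pass, which is the source of the speed advantage; the coarse grid is
               also one intuition for why small objects are harder for such detectors.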


               YOLO is a fast and efficient target detection algorithm introduced by Redmon et al. in 2016 [10]. Unlike
               traditional two-stage object detection algorithms such as R-CNN, YOLO is a single-stage detection algorithm
               capable of real-time detection while maintaining competitive accuracy. In a pedestrian detection experiment
               [13], a Scale-Aware Fast (SAF) R-CNN model was introduced that uses multiple subnetworks to detect pedestrians
               at different scales and then adaptively combines their outputs to generate the final result. Fan et al.
               proposed a data fusion CNN architecture called RoadSeg, which can extract and fuse features from RGB images
               and inferred surface normal information for accurate free space detection [14]. In another study, DS-Net was
               proposed to address the limitation that current neural networks primarily focus on single-task vision
               scenarios [15]; DS-Net is a multitask convolutional neural network designed for AR-HUD environment
               perception. Li et al. proposed a vision-based framework for target detection and recognition in autonomous
               driving, using an improved YOLOv4 model that reduced the total number of model parameters by 74% [16]. In
               another work, a U-type generative adversarial network (GAN) was first developed to fuse visible and IR images,
               and YOLOv3 combined with transfer learning was then trained on the fused images of an aerial dataset [17].
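               The adaptive combination of scale-specific subnetworks in SAF R-CNN [13] can be illustrated with a small
               sketch. It assumes two branches (large-scale and small-scale) whose class scores are blended by a sigmoid gate
               on the proposal height, so that tall, near pedestrians rely mostly on the large-scale branch and short, distant
               ones on the small-scale branch. The gate form, the function name scale_aware_fuse and the values of
               mean_height, alpha and beta are assumptions for illustration, not details taken from [13].

```python
import math
from typing import List, Sequence


def scale_aware_fuse(large_scores: Sequence[float], small_scores: Sequence[float],
                     height: float, mean_height: float = 100.0,
                     alpha: float = 1.0, beta: float = 20.0) -> List[float]:
    """Blend per-class scores from a large-scale and a small-scale branch.

    Hypothetical gate: w_large = 1 / (1 + alpha * exp(-(height - mean_height) / beta)).
    mean_height, alpha and beta are illustrative assumptions.
    """
    w_large = 1.0 / (1.0 + alpha * math.exp(-(height - mean_height) / beta))
    w_small = 1.0 - w_large  # the two weights always sum to 1
    return [w_large * l + w_small * s for l, s in zip(large_scores, small_scores)]


if __name__ == "__main__":
    # A proposal of average height weights both branches equally.
    print(scale_aware_fuse([0.8, 0.2], [0.4, 0.6], height=100.0))
```

               In this sketch the gate is fixed; in a trained detector, gate parameters of this kind would typically be
               learned jointly with the subnetworks.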


               2.2. IR target detection
               The studies mentioned above focus on extracting information from visible images. In recent years, research on
               IR technology has advanced considerably, and vehicle and pedestrian detection based on IR images is becoming an
               increasingly attractive approach.


               A novel detection method for IR point targets based on eigentargets has been proposed [18]. Han et al.
               introduced the subblock-level ratio-difference joint local contrast measure (SRDLCM), which enhances real small
               targets while suppressing complex backgrounds [19]. A pixel-level classifier was presented for fine-grained
               detection of pedestrians in night-time CCTV IR images [20]; the method achieved an F1 score above 90% on the
               test set. Nevertheless, the dataset used in this study lacked generality because it was acquired at a specific
               time and location. Cao et al. proposed a one-stage detector named ThermalDet based on a deep neural network
               [21]. A channel-wise enhancement module was used to assign weights to different channels, and a dual-pass
               fusion block was added to combine features from all other levels. This method reached a mean Average Precision
               (mAP) of 74.60% on the FLIR dataset. Another study [22] proposed an anchor-free infrared pedestrian detection
               algorithm that introduced a cross-scale feature fusion module and a hierarchical attention mapping module to
               enhance pedestrian features and suppress background noise. Adopting the anchor-free concept simplifies the
               network and improves model generalization. A CFRM_3 method [23] was proposed in another work to repeatedly
               refine the mono-spectral features with the fused multispectral features within the network, and the
               experimental results showed that CFRM_3 led to substantial accuracy improvements. Du et al. proposed a method
               for detecting weak and occluded vehicles in complex IR environments [24], in which a hard negative example
               mining block was added to the YOLOv4 model to suppress interference from complex backgrounds, improving
               accuracy. Narayanan et al. presented a method for IR pedestrian detection using the histogram of oriented
               gradients (HOG) and YOLOv3 [25]. This work was com-