pared with the technique of using the Support Vector Machine (SVM) classifier. The results showed that
YOLOv3 reached an accuracy of 73%, which was better than that of the SVM algorithm. In the article [26], two
multi-scale feature extraction and feature fusion mechanisms were designed and added to a target detection
model named CMF Net. One of the outstanding advantages of CMF Net is that its final backbone output
feature map contains both low-level visual features and high-level semantic features, which helps the network
adapt to targets of multiple scales.
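As a generic illustration of this kind of fusion (not the actual CMF Net), a coarse but semantically rich feature map can be upsampled and concatenated with a fine-grained, low-level one before a merging convolution; the channel sizes and strides below are illustrative assumptions only.

```python
# Generic multi-scale fusion sketch (not the actual CMF Net): a high-level,
# low-resolution feature map is upsampled and concatenated with a low-level,
# high-resolution one, so the fused map carries both kinds of information.
# Channel and stride numbers below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, low_ch=128, high_ch=512, out_ch=256):
        super().__init__()
        # 1x1 convolution to merge the concatenated channels
        self.merge = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # Upsample the semantically rich (but coarse) map to the spatial
        # size of the detail-rich map, then concatenate along channels.
        high_up = F.interpolate(high_feat, size=low_feat.shape[-2:], mode="nearest")
        fused = torch.cat([low_feat, high_up], dim=1)
        return self.merge(fused)

# Example: a 1/8-stride low-level map and a 1/32-stride high-level map.
low = torch.randn(1, 128, 80, 80)
high = torch.randn(1, 512, 20, 20)
print(FeatureFusion()(low, high).shape)  # torch.Size([1, 256, 80, 80])
```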
Although the above works can achieve excellent detection accuracy, there is still much room for improvement
in detection speed. The ability to process the collected road condition information in real time is paramount
for the intelligent traffic system (ITS). Zhang et al. proposed the CDNet, which implemented real-time cross-
walk detection on the Jetson Nano device [27]. In another paper [28], a high inference speed framework was
introduced to effectively tackle challenges inherent to traffic sign and traffic light detection. Similarly, a delicate
balance between accuracy and real-time performance must be struck to implement pedestrian and vehicle
detection on resource-constrained edge devices, which is also the main focus of our study.
3. METHODS
Our study can be elaborated in two aspects. On the one hand, we propose an image pre-processing method
that is better suited to IR images than the traditional approaches. On the other hand, we fuse the advantages
of the YOLOv4 algorithm and the MobileNetV3 network to build the MobileNetV3-YOLOv4 model. Extensive
experiments show that this method performs well in both accuracy and speed.
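As a minimal sketch of this idea (not the implementation used in our model), torchvision's MobileNetV3-Large can serve as a lightweight backbone whose stride-8, stride-16, and stride-32 feature maps are tapped for a YOLO-style multi-scale detection head; the use of torchvision and the 416 x 416 input size are assumptions for illustration.

```python
# Minimal sketch (not the paper's implementation) of the MobileNetV3-YOLOv4
# idea: use MobileNetV3 as a lightweight backbone and collect its stride-8,
# stride-16, and stride-32 feature maps for a YOLO-style multi-scale head.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

class MobileNetV3Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = mobilenet_v3_large(weights=None).features

    def forward(self, x):
        in_size = x.shape[-1]
        taps = {}
        for layer in self.features:
            x = layer(x)
            # Keep the last feature map produced at each of strides 8, 16, 32.
            stride = in_size // x.shape[-1]
            if stride in (8, 16, 32):
                taps[stride] = x
        return taps[8], taps[16], taps[32]

backbone = MobileNetV3Backbone()
p3, p4, p5 = backbone(torch.randn(1, 3, 416, 416))
print(p3.shape, p4.shape, p5.shape)
# e.g. [1, 40, 52, 52], [1, 112, 26, 26], [1, 960, 13, 13]
```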
3.1. IE-CGAN inversion algorithm
IR thermal imaging is a passive IR night vision technology based on the principle that all objects with tem-
peratures above absolute zero (-273.15 °C) radiate IR light. The noise in an IR image can be considered a
non-periodic random variable, which leads to the low contrast and low resolution of IR images. Therefore, it
is indispensable to pre-process the IR images before they are input for training.
The histogram equalization algorithm is a standard method in image pre-processing. The distribution of IR
image pixels is extreme, unlike that of RGB images. Consequently, IR images are usually darker or brighter,
making their contrast relatively low. The histogram equalization algorithm can extend the dynamic range of
the gray levels, enhance the contrast, and make the image clearer. However, this method also amplifies the
noise in the image, which we want to avoid. The filtering algorithm is another classical approach widely used
to eliminate image noise, but it simultaneously removes some details.
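The trade-offs of these two classical steps can be illustrated with a short OpenCV sketch; the file path is a placeholder for an 8-bit grayscale IR frame.

```python
# Classical pre-processing sketch: histogram equalization stretches the narrow
# gray-level range of an IR frame (raising contrast but also amplifying noise),
# while a Gaussian filter suppresses noise at the cost of blurring fine detail.
# "ir_frame.png" is a placeholder path for an 8-bit grayscale IR image.
import cv2

ir = cv2.imread("ir_frame.png", cv2.IMREAD_GRAYSCALE)

# Contrast enhancement: remaps gray levels to span the full 0-255 range.
equalized = cv2.equalizeHist(ir)

# Noise suppression: a 5x5 Gaussian blur, which also smooths away edges/detail.
denoised = cv2.GaussianBlur(ir, (5, 5), 0)

cv2.imwrite("ir_equalized.png", equalized)
cv2.imwrite("ir_denoised.png", denoised)
```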
The traditional methods mentioned above have obvious shortcomings, so they are not fully applicable to pre-
processing IR images. Currently, deep learning is widely applied in image processing. An attractive network
for image enhancement tasks should be equipped with the capabilities to enhance contrast and details while
suppressing the background noise. However, existing network architectures for IR image processing, such
as residual and encoder-decoder architectures, fail to produce optimal results in terms of both performance
and range of applications. In response to this challenge, Kuang et al. devised a novel conditional Generative
Adversarial Network (GAN)-based architecture [12]. Their innovation yielded visually captivating results char-
acterized by enhanced contrast and sharper details, addressing the shortcomings of previous approaches. We
have further improved their work to obtain a pre-processing method named the IE-CGAN inversion algorithm,
which is more suitable for IR images.
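As a rough illustration only, assuming the inversion step amounts to a per-pixel polarity flip of the 8-bit grayscale IR image (a white-hot/black-hot conversion), a minimal sketch is given below; the learned IE-CGAN enhancement itself is described next and is not reproduced here.

```python
# Illustrative only: assuming the "inversion" step is a per-pixel polarity flip
# of the 8-bit grayscale IR image (white-hot <-> black-hot). The learned
# IE-CGAN enhancement that follows is not reproduced here.
import cv2

ir = cv2.imread("ir_frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
inverted = 255 - ir  # equivalently: cv2.bitwise_not(ir)
cv2.imwrite("ir_inverted.png", inverted)
```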
IE-CGAN contains a generative sub-network for contrast enhancement and a discriminative sub-network for
assistance [Figure 1], where D denotes a deconvolution layer; the concatenated features are restored to the original
resolution using a deconvolution layer followed by a Tanh activation. The generative module first extracts input

