
Li et al. Intell Robot 2021;1(1):84-98 | http://dx.doi.org/10.20517/ir.2021.06



               1. INTRODUCTION
Predicting depth from a single 2D image is a fundamental task in computer vision. It has been studied for many years and has widespread real-world applications, such as visual navigation [1], object tracking [2,3], and surgery [4]. Moreover, accurate depth information is vital to the performance of autonomous driving, where expensive laser sensors are usually used. Recent advances in convolutional neural networks (CNNs) demonstrate their powerful ability to learn an image's high-dimensional features; in particular, a mapping between image features and image depth can be built. Generally, monocular depth estimation approaches can be classified into three categories: supervised [5–9], semi-supervised [10], and unsupervised [11–19]. Both supervised and semi-supervised learning rely on depth ground truth, and using a laser sensor to obtain the depth ground truth for many images is expensive and difficult. Unsupervised learning, by contrast, eliminates the dependency on depth ground truth. Therefore, more and more studies train monocular depth estimation networks with unsupervised methods on monocular images or stereo pairs. Compared with stereo pairs, a monocular dataset is a more general network input; however, it requires estimating the pose transformation between consecutive frames simultaneously. As a result, a pose estimation network is needed that outputs the relative 6-DoF pose given a sequence of frames as input.



Most unsupervised depth estimation networks [5,8,11] are constructed with typical CNN structures. On the one hand, a series of max-pooling and stride operations may reduce the network's ability to learn image features and lower the quality of the depth map. On the other hand, to improve performance, deeper convolutional layers are designed into depth CNNs; these increase the computational burden of the network and bring extra hardware cost. In most cases, the cost of the network outweighs the benefit it generates. To improve depth estimation performance without increasing the network burden, an end-to-end unsupervised monocular depth network framework is proposed in this paper. Inspired by previous work [20] on image classification, aggregated residual transformations (ResNeXt) are migrated to the depth estimation field. Building on typical depth CNNs, the ResNeXt block is embedded in the encoder network to extract more delicate image features, so a more accurate mapping between the feature map and the depth map can be built without extra network burden. In addition, the accuracy of the depth network suffers from noise (e.g., haze and rain) in complex images. To reduce the influence of noise, the 2D discrete wavelet transform [21] is applied to the SSIM loss, which can recover high-quality clear images. A sample depth prediction is shown in Figure 1.
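To make the split-transform-merge idea behind the ResNeXt block [20] concrete, the following is a minimal NumPy sketch, not the paper's implementation: each branch is reduced here to a linear bottleneck on feature vectors (real ResNeXt branches use grouped 3x3 convolutions on feature maps), and all dimensions and weight initializations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def resnext_block(x, branch_weights):
    """Aggregated residual transformation: y = x + sum_i T_i(x).

    Each branch T_i is a low-dimensional bottleneck (reduce -> ReLU ->
    expand); the branch outputs are summed and added to the identity
    shortcut, mirroring ResNeXt's split-transform-merge pattern.
    """
    out = np.zeros_like(x)
    for w_reduce, w_expand in branch_weights:
        h = np.maximum(x @ w_reduce, 0.0)  # reduce to bottleneck width + ReLU
        out += h @ w_expand                # expand back to the input width
    return x + out                         # residual (identity) connection

# Cardinality = number of parallel branches; widths are arbitrary toy values.
d, bottleneck, cardinality = 16, 4, 8
branches = [(rng.normal(scale=0.1, size=(d, bottleneck)),
             rng.normal(scale=0.1, size=(bottleneck, d)))
            for _ in range(cardinality)]

x = rng.normal(size=(2, d))                # toy batch of 2 feature vectors
y = resnext_block(x, branches)
print(y.shape)                             # (2, 16): feature width preserved
```

The key design point is that increasing cardinality (more branches) adds representational diversity at roughly constant parameter count, which is why the block can sharpen feature extraction without enlarging the network.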




               In summary, our proposed network can improve depth prediction accuracy without increasing network com-
               putational complexity. The contributions of this paper can be summarized as follows:



(1) Based on the ResNeXt block, a novel feature extraction module for the depth network is developed to improve the accuracy of depth prediction. It not only extracts high-dimensional image features but also guides the network to learn the scene more deeply and recover the depth of farther pixels.



(2) A wavelet SSIM loss is applied to the photometric loss to help the training network converge. Rather than the whole image, the loss function takes as input the patches with clearer image information computed by the DWT, which removes some noise (haze, rain, etc.) from the image.



               The rest of this paper is organized as follows. The related work on depth estimation is discussed in Section
               2. Section 3 presents an overview of the proposed network architecture and the loss function. Then, some