
Li et al. Intell Robot 2021;1(1):84-98 | http://dx.doi.org/10.20517/ir.2021.06



               1. INTRODUCTION
Predicting depth from a single 2D image is a fundamental task in computer vision. It has been studied for many years and has widespread real-world applications, such as visual navigation [1], object tracking [2,3], and surgery [4]. Moreover, accurate depth information is vital to the performance of autonomous driving, where expensive laser sensors are usually used. Recent advances in convolutional neural networks (CNNs) demonstrate their powerful ability to learn an image's high-dimensional features; in particular, a mapping between image features and image depth can be built. Generally, monocular depth estimation approaches can be classified into three categories: supervised [5–9], semi-supervised [10], and unsupervised [11–19]. Both supervised and semi-supervised learning rely on depth ground truth, and using a laser sensor to obtain the depth ground truth for many images is expensive and difficult. Unsupervised learning, by contrast, eliminates the dependency on depth ground truth. Therefore, more and more studies train monocular depth estimation networks with unsupervised methods on monocular images or stereo pairs. Compared with stereo pairs, a monocular dataset is a more general network input; however, it requires estimating the pose transformation between consecutive frames simultaneously. As a result, a pose estimation network is needed that outputs the relative 6-DoF pose given a sequence of frames as input.



Most unsupervised depth estimation networks [5,8,11] are constructed with typical CNN structures. On the one hand, a series of max-pooling and stride operations may reduce the network's ability to learn image features and lower the quality of the depth map. On the other hand, to improve performance, deeper convolutional layers are designed into depth CNNs; these increase the computational burden of the network and bring extra hardware cost. In most cases, the cost of the network outweighs the benefit it generates. To improve depth estimation performance without increasing the network burden, an end-to-end unsupervised monocular depth network framework is proposed in this paper. Inspired by previous work [20] on image classification, aggregated residual transformations (ResNeXt) are migrated to the depth estimation field. Building on typical depth CNNs, the ResNeXt block is embedded in the encoder network to extract more delicate image features, so a more accurate mapping between the feature map and the depth map can be built without extra network burden. In addition, the accuracy of the depth network suffers from noise (e.g., haze and rain) in complex images. To reduce the influence of noise, the 2D discrete wavelet transform [21] is applied to the SSIM loss, which can recover high-quality clear images. A sample depth prediction is shown in Figure 1.
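To make the split-transform-merge idea behind the ResNeXt block [20] concrete, the following is a minimal NumPy sketch, not the paper's implementation: each branch is reduced here to a linear bottleneck on feature vectors (real ResNeXt branches use grouped 3x3 convolutions on feature maps), and all dimensions and weight initializations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def resnext_block(x, branch_weights):
    """Aggregated residual transformation: y = x + sum_i T_i(x).

    Each branch T_i is a low-dimensional bottleneck (reduce -> ReLU ->
    expand); the branch outputs are summed and added to the identity
    shortcut, mirroring ResNeXt's split-transform-merge pattern.
    """
    out = np.zeros_like(x)
    for w_reduce, w_expand in branch_weights:
        h = np.maximum(x @ w_reduce, 0.0)  # reduce to bottleneck width + ReLU
        out += h @ w_expand                # expand back to the input width
    return x + out                         # residual (identity) connection

# Cardinality = number of parallel branches; widths are arbitrary toy values.
d, bottleneck, cardinality = 16, 4, 8
branches = [(rng.normal(scale=0.1, size=(d, bottleneck)),
             rng.normal(scale=0.1, size=(bottleneck, d)))
            for _ in range(cardinality)]

x = rng.normal(size=(2, d))                # toy batch of 2 feature vectors
y = resnext_block(x, branches)
print(y.shape)                             # (2, 16): feature width preserved
```

The key design point is that increasing cardinality (more branches) adds representational diversity at roughly constant parameter count, which is why the block can sharpen feature extraction without enlarging the network.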




               In summary, our proposed network can improve depth prediction accuracy without increasing network com-
               putational complexity. The contributions of this paper can be summarized as follows:



(1) Based on the ResNeXt block, a novel feature extraction module for the depth network is developed to improve the accuracy of depth prediction. It not only extracts high-dimensional image features but also guides the network to learn the scene more deeply and recover the depth of farther pixels.



(2) A wavelet SSIM loss is applied to the photometric loss to help the training network converge. Rather than the whole image, the loss function takes as input the patches with clearer image information computed by the DWT, which removes some noise (haze, rain, etc.) from the image.



               The rest of this paper is organized as follows. The related work on depth estimation is discussed in Section
               2. Section 3 presents an overview of the proposed network architecture and the loss function. Then, some