Li et al. Intell Robot 2021;1(1):84-98 | http://dx.doi.org/10.20517/ir.2021.06 Page 86

Figure 1. The input image from the KITTI dataset (top); the baseline MonoDepth2 [22] (M, ResNet50, without pre-training) depth prediction
(middle); and our result (bottom).

experiments based on different datasets are presented to verify the performance of the proposed network in
Section 4. Finally, the conclusions and future work are introduced in Section 5.

2. RELATED WORK
2.1. Supervised depth estimation
Trained on large datasets with depth ground truth, depth estimation networks have shown strong performance
in recent years. Eigen et al. [5] first demonstrated the potential of CNNs for depth prediction from a single
image, obtaining reliable depth estimates with a coarse-to-fine depth network. Liu
et al. [7] then combined CNNs with Markov random fields (MRFs) to learn intermediate features, producing
depth maps with visibly clearer local details. Laina et al. [8] changed the structure of the depth network and
proposed a residual CNN to model the mapping between a monocular image and its corresponding
depth map. Instead of using absolute depth ground truth, Chen et al. [9] trained the depth network with
relative depth labels for randomly chosen pixel pairs in the image. In addition, to obtain dense depth
maps, Kuznietsov et al. [10] proposed a semi-supervised method that uses sparse ground-truth depth for
supervised learning together with a photoconsistency loss between stereo images for unsupervised learning.
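
As a rough illustration of how a semi-supervised objective of this kind can combine the two signals, the following minimal sketch adds a masked L1 term on sparse depth samples to a weighted photometric term. It is not the formulation of Kuznietsov et al. [10]: the function names, the L1 penalty, the weight of 0.5, and the toy values are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def sparse_depth_l1(pred: Sequence[float],
                    sparse_gt: Sequence[Optional[float]]) -> float:
    """Mean absolute depth error over pixels that have a ground-truth sample.

    Pixels whose ground truth is None (no measurement) are ignored.
    """
    errors: List[float] = [abs(p - g) for p, g in zip(pred, sparse_gt)
                           if g is not None]
    return sum(errors) / len(errors) if errors else 0.0


def semi_supervised_loss(pred: Sequence[float],
                         sparse_gt: Sequence[Optional[float]],
                         photo_errors: Sequence[float],
                         photo_weight: float = 0.5) -> float:
    """Supervised sparse-depth term plus a weighted mean photometric term.

    `photo_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    photo_term = sum(photo_errors) / len(photo_errors)
    return sparse_depth_l1(pred, sparse_gt) + photo_weight * photo_term


# Toy example: four pixels, only two of which carry a depth measurement.
pred_depth = [10.0, 12.0, 15.0, 20.0]
measured_depth = [None, 11.0, None, 22.0]
photo_errors = [0.1, 0.2, 0.1, 0.4]   # per-pixel stereo reconstruction errors

supervised = sparse_depth_l1(pred_depth, measured_depth)
loss = semi_supervised_loss(pred_depth, measured_depth, photo_errors)
```

The supervised term only constrains pixels with a measurement, while the photometric term provides a training signal at every pixel, which is what allows a dense depth map to be learned from sparse labels.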

Although the works mentioned above contributed significantly to depth estimation, these methods are still
limited by their need for depth ground truth.

2.2. Unsupervised depth estimation
Based on stereo or monocular images, unsupervised learning methods focus on how to design the supervisory
signal. The typical solution is to use view synthesis as a proxy task [11,12,14–24], which removes the need for
depth ground truth.
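
To make the idea of view synthesis as a supervisory signal concrete, the sketch below works on a single scanline of a rectified stereo pair: one view is reconstructed by sampling the other at positions shifted by the predicted disparity, and the mean absolute intensity difference serves as the loss. This is a simplified toy, assuming 1D signals, linear interpolation, border clamping, and a plain L1 error; the function names and data are hypothetical and do not reproduce any cited method.

```python
from typing import List


def warp_scanline(source: List[float], disparity: List[float]) -> List[float]:
    """Synthesize a target scanline by sampling `source` at x - disparity[x].

    Uses linear interpolation and clamps sample positions to the scanline.
    """
    width = len(source)
    synthesized = []
    for x in range(width):
        pos = min(max(x - disparity[x], 0.0), width - 1.0)
        left = int(pos)                    # floor, since pos >= 0
        right = min(left + 1, width - 1)
        frac = pos - left
        synthesized.append((1.0 - frac) * source[left] + frac * source[right])
    return synthesized


def photometric_l1(target: List[float], synthesized: List[float]) -> float:
    """Mean absolute intensity difference between target and synthesized views."""
    return sum(abs(t - s) for t, s in zip(target, synthesized)) / len(target)


# Toy rectified pair: the left view equals the right view shifted by 2 pixels.
right_view = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
left_view = [0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

# The correct disparity reconstructs the left view; a wrong one does not.
good = photometric_l1(left_view, warp_scanline(right_view, [2.0] * 8))
bad = photometric_l1(left_view, warp_scanline(right_view, [0.0] * 8))
```

Because the reconstruction error is low only when the disparity is correct, minimizing it can train a depth network without any depth labels.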

2.2.1. Unsupervised depth estimation from stereo images
Using stereo images is a feasible unsupervised way to train a monocular depth network. A depth network
can be obtained by predicting the left–right pixel disparities between stereo pairs during training. It can be
