
Li et al. Intell Robot 2021;1(1):84-98 | http://dx.doi.org/10.20517/ir.2021.06      Page 86



























               Figure 1. The input image from the KITTI dataset (top); the depth prediction of the baseline MonoDepth2 [22] (M, ResNet50,
               without pre-training) (middle); and our result (bottom).


               experiments on different datasets are presented in Section 4 to verify the performance of the proposed
               network. Finally, conclusions and future work are given in Section 5.




               2. RELATED WORK
               2.1. Supervised depth estimation
               Based on vast training datasets with depth ground truth, depth estimation networks have shown strong
               performance in recent years. Eigen et al. [5] first demonstrated the potential of CNNs for depth prediction
               from a single image, obtaining reliable depth estimates with a coarse-to-fine depth network. Liu et al. [7]
               combined CNNs with Markov random fields (MRF) to learn intermediate features, yielding visually clearer
               local details in the depth map. Laina et al. [8] changed the structure of the depth network and proposed a
               residual CNN to model the mapping between a monocular image and its corresponding depth map. Instead
               of using absolute depth ground truth, Chen et al. [9] trained the depth network with relative depth labels
               between randomly selected pixel pairs in the image. In addition, to obtain dense depth maps, Kuznietsov
               et al. [10] proposed a semi-supervised method that uses sparse ground truth depth for supervised learning
               and a photoconsistency loss on stereo images for unsupervised learning.
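To make the idea of training from relative depth labels concrete, the following is a minimal NumPy sketch of a pairwise ranking loss. It is an illustration only, not the implementation of Chen et al. [9]: the function name `relative_depth_loss`, the label encoding (+1, -1, 0), and the choice of a logistic penalty for ordered pairs with a squared penalty for equal pairs are assumptions made for this example.

```python
import numpy as np


def relative_depth_loss(pred_depth, pairs, relations):
    """Pairwise ranking loss over sampled pixel pairs (illustrative sketch).

    pred_depth : (H, W) array of predicted depth values.
    pairs      : (N, 4) integer array of (row_i, col_i, row_j, col_j).
    relations  : (N,) array; +1 if pixel i is labelled farther than pixel j,
                 -1 if closer, 0 if the two are labelled as roughly equal.
                 (This encoding is an assumption of the sketch.)
    """
    pred_depth = np.asarray(pred_depth, dtype=np.float64)
    pairs = np.asarray(pairs)
    relations = np.asarray(relations, dtype=np.float64)

    # Gather the predicted depth at both pixels of every pair.
    z_i = pred_depth[pairs[:, 0], pairs[:, 1]]
    z_j = pred_depth[pairs[:, 2], pairs[:, 3]]
    diff = z_i - z_j

    # Ordered pairs: log(1 + exp(-r * diff)), small when the predicted
    # ordering agrees with the label and large when it contradicts it.
    ranked = np.logaddexp(0.0, -relations * diff)
    # Equal pairs: penalize any difference between the two predictions.
    equal = diff ** 2

    per_pair = np.where(relations == 0, equal, ranked)
    return float(per_pair.mean())
```

Only the ordering of the two pixels enters the loss, so no metric depth value is required for supervision; this is the property that lets such a loss replace absolute ground truth.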



               Although the works mentioned above contributed significantly to depth estimation, they remain limited by
               their reliance on depth ground truth.



               2.2. Unsupervised depth estimation
               Unsupervised learning methods, trained on stereo or monocular images, focus on how to design the
               supervisory signal. The typical solution is to use view synthesis as a proxy task [11,12,14–24], which
               removes the need for depth ground truth.
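As a rough illustration of view synthesis as a supervisory signal, the sketch below reconstructs one view of a stereo pair from the other using a predicted disparity map and measures the photometric error. It is not taken from any of the cited methods. It assumes a rectified grayscale stereo pair, the convention that left pixel (x, y) corresponds to right pixel (x - d, y), linear interpolation along the row, and a plain L1 error; the function names are hypothetical.

```python
import numpy as np


def warp_right_to_left(right, disparity):
    """Synthesize the left view by sampling the right image at x - d(x, y).

    right     : (H, W) grayscale right image.
    disparity : (H, W) predicted left-view disparity in pixels.
    Returns the warped image and a mask of pixels whose source column
    falls inside the image.
    """
    right = np.asarray(right, dtype=np.float64)
    disparity = np.asarray(disparity, dtype=np.float64)
    h, w = right.shape

    # Source column in the right image for every left-view pixel.
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    valid = (xs >= 0) & (xs <= w - 1)
    xs = np.clip(xs, 0, w - 1)

    # Linear interpolation between the two neighbouring columns.
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    warped = (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]
    return warped, valid


def photometric_loss(left, right, disparity):
    """Mean L1 error between the real and synthesized left view."""
    warped, valid = warp_right_to_left(right, disparity)
    err = np.abs(np.asarray(left, dtype=np.float64) - warped)
    return float(err[valid].mean()) if valid.any() else 0.0
```

A disparity map that matches the true scene geometry gives a low reconstruction error, so minimizing this error trains the depth network without any depth labels. In a real training setup the warp would be written in a differentiable framework so that gradients reach the network; NumPy is used here only to keep the sketch self-contained.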


               2.2.1. Unsupervised depth estimation from stereo images
               Using stereo images is a feasible unsupervised way to train a monocular depth network. A depth network
               can be obtained by predicting the left–right pixel disparities between stereo pairs during training. It can be