
Page 93                              Li et al. Intell Robot 2021;1(1):84-98  I http://dx.doi.org/10.20517/ir.2021.06


                                           Table 1. The standard evaluation metrics for the network

                                                Abs Rel     (1/|D|) Σ_{d∈D} |d̂ − d*| / d*
                                                Sq Rel      (1/|D|) Σ_{d∈D} ‖d̂ − d*‖² / d*
                                                RMSE        √( (1/|D|) Σ_{d∈D} ‖d̂ − d*‖² )
                                                RMSE log    √( (1/|D|) Σ_{d∈D} ‖log d̂ − log d*‖² )
                                                δ < t       % of d̂ ∈ D such that max(d̂/d*, d*/d̂) < t
                4. EXPERIMENTS
                To evaluate the effectiveness of our approach, qualitative and quantitative results are provided for
                depth and pose prediction. The KITTI dataset is the main data source used to train and test the depth networks,
                and the KITTI odometry split was used to train and test our pose network. Meanwhile, the Make3D dataset was used to
                evaluate the adaptive ability and generalization of the proposed network.
                4.1. Implementation details
                The proposed depth network has dense skip connections, which can fully learn deep abstract features. The
                network was trained from scratch without pre-trained model weights or post-processing. The sigmoid out-
                put σ of the depth map is converted to depth by D = 1/(aσ + b), where a and b constrain the depth value D between 0.1 and 100 units. In
                our experiments, MonoDepth2 [22] was configured with a standard ResNet50 encoder for the monocular depth network
                and ResNet18 for the pose network, both without pre-training. We shorten its name to MD2 for the rest of the
                paper.
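The sigmoid-to-depth mapping above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `disp_to_depth` and the closed-form choice of a and b (solved so that σ = 0 maps to the maximum depth and σ = 1 to the minimum) are our assumptions.

```python
import numpy as np

def disp_to_depth(sigmoid_out, min_depth=0.1, max_depth=100.0):
    # D = 1 / (a * sigma + b), with a and b chosen so that
    # sigma = 0 -> max_depth and sigma = 1 -> min_depth.
    b = 1.0 / max_depth            # 1/b = max_depth at sigma = 0
    a = 1.0 / min_depth - b        # 1/(a + b) = min_depth at sigma = 1
    return 1.0 / (a * sigmoid_out + b)
```

With the default bounds, a sigmoid output of 0 yields a depth of 100 units and an output of 1 yields 0.1 units, so the network's raw output always lands in the stated range.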




                The deep learning framework PyTorch [27] was used to implement our model. For comparison, the KITTI images
                were resized and downsampled to 640×192. The proposed network used the Adam [28] optimizer with β1 = 0.9 and
                β2 = 0.999 to train for 22 epochs. The batch size was set to 4 and the smoothness term λ was set to 0.001. The
                learning rate was set to 10^−4 for the first 20 epochs and reduced by a factor of 10 for the remaining epochs.
                The settings for the pose network were the same as in [22]. In addition, a single NVIDIA GeForce TITAN X
                with 12 GB of GPU memory was used in our experiments.
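The step schedule described above can be expressed as a small helper. This is a hedged sketch of the schedule only (the function name `learning_rate` is ours, not from the paper); in PyTorch the same effect is typically obtained with a step-decay scheduler attached to the Adam optimizer.

```python
def learning_rate(epoch, base_lr=1e-4, step_epoch=20, decay=0.1):
    # 1e-4 for the first 20 epochs, then reduced by a factor of 10
    # for the remaining epochs (22 epochs total in the paper's setup).
    return base_lr if epoch < step_epoch else base_lr * decay
```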


               4.2. Evaluation metrics
               To evaluate our method, we used some standard evaluation metrics, as shown in Table 1.




                                                                                           
                                                          
                |D| is the number of pixels in image D, d̂ is the predicted depth from the model, and d* is the depth ground
                truth. t represents the threshold between the depth ground truth and the predicted depth, which is set to
                1.25, 1.25², and 1.25³, respectively.
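The metrics in Table 1 can be computed directly from the per-pixel depth arrays. The sketch below is a standard implementation of these formulas, assuming valid (positive) depths; the function name `compute_errors` is our choice, not taken from the paper.

```python
import numpy as np

def compute_errors(gt, pred):
    """Standard depth metrics over flattened arrays of valid pixels."""
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()          # delta < 1.25
    a2 = (thresh < 1.25 ** 2).mean()     # delta < 1.25^2
    a3 = (thresh < 1.25 ** 3).mean()     # delta < 1.25^3

    abs_rel = np.mean(np.abs(gt - pred) / gt)           # Abs Rel
    sq_rel = np.mean((gt - pred) ** 2 / gt)             # Sq Rel
    rmse = np.sqrt(np.mean((gt - pred) ** 2))           # RMSE
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))  # RMSE log

    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```

A perfect prediction yields zero for all four error terms and 1.0 for each δ accuracy.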
                4.3. KITTI Eigen split
                The KITTI Eigen split [16] was used to train the proposed network. Before the network was trained, Zhou's [16]
                preprocessing was used to remove static images. As a result, the training dataset had 39,810 monocular triplets,
                which cover 29 different scenes. The validation dataset had 4424 images, and there were 697 testing images.
                The image depth ground truth of the KITTI dataset was captured by a Velodyne laser. Following the work in [22],
                the intrinsics of all images were the same: the principal point of the camera was set to the image center, and the focal
                length was defined as the average of all focal lengths in the KITTI dataset. In addition, the predicted depth
                results were obtained by using the per-image median ground-truth scaling proposed in [16]. When the results
                were evaluated, the maximum depth value was set to 80 m and the minimum to 0.1 m.
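The per-image median ground-truth scaling and the depth caps described above can be sketched as follows. This is an illustrative implementation under our own naming (`scale_and_clip`); it assumes the evaluation mask simply keeps ground-truth pixels inside the valid depth range.

```python
import numpy as np

def scale_and_clip(pred, gt, min_depth=0.1, max_depth=80.0):
    # Keep only ground-truth pixels inside the valid evaluation range.
    mask = (gt > min_depth) & (gt < max_depth)
    # Per-image median ground-truth scaling: align the scale-ambiguous
    # monocular prediction to the ground truth before evaluation.
    ratio = np.median(gt[mask]) / np.median(pred[mask])
    scaled = pred * ratio
    # Cap predictions to the same range used for evaluation.
    return np.clip(scaled, min_depth, max_depth)
```

Because monocular self-supervised depth is recovered only up to scale, this scaling step is applied before the metrics of Table 1 are computed.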