Table 1. The standard evaluation metrics for the network

Abs Rel     $\frac{1}{|D|}\sum_{d \in D} \frac{|d^{*}-d|}{d^{*}}$
Sq Rel      $\frac{1}{|D|}\sum_{d \in D} \frac{\|d^{*}-d\|^{2}}{d^{*}}$
RMSE        $\sqrt{\frac{1}{|D|}\sum_{d \in D} \|d^{*}-d\|^{2}}$
RMSE log    $\sqrt{\frac{1}{|D|}\sum_{d \in D} \|\log d^{*}-\log d\|^{2}}$
Accuracy    % of $d \in D$ s.t. $\max\left(\frac{d^{*}}{d}, \frac{d}{d^{*}}\right) = \delta < \text{threshold}$
4. EXPERIMENTS
To evaluate the effectiveness of our approach, qualitative and quantitative results are provided for both depth and pose prediction. The KITTI dataset is the main data source used to train and test the depth network, and the KITTI odometry split was used to train and test our pose network. In addition, the Make3D dataset was used to evaluate the adaptability and generalization of the proposed network.
4.1. Implementation details
The proposed depth network has dense skip connections, which allow it to fully learn deep abstract features. The network was trained from scratch without pre-trained model weights or post-processing. The sigmoid output $\sigma$ of the network is converted to a depth map as $D = 1/(a\sigma + b)$, where $a$ and $b$ constrain the depth values between 0.1 and 100 units. In our experiments, MonoDepth2 [22] was configured with a standard ResNet50 encoder for the monocular depth network and ResNet18 for the pose network, both without pre-training. We shorten its name to MD2 for the rest of the paper.
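For concreteness, a minimal PyTorch sketch of this conversion is shown below. It follows the convention popularized by MonoDepth2, where the sigmoid output is mapped linearly to a disparity range and then inverted; the function name disp_to_depth and the exact parameterization of $a$ and $b$ are illustrative assumptions, not necessarily our exact implementation.

```python
import torch

def disp_to_depth(sigmoid_out, min_depth=0.1, max_depth=100.0):
    """Convert a sigmoid output to depth via D = 1 / (a*sigma + b).

    a and b are chosen so that sigma = 0 maps to max_depth and
    sigma = 1 maps to min_depth, keeping D within [0.1, 100] units.
    """
    min_disp = 1.0 / max_depth                                    # b
    max_disp = 1.0 / min_depth                                    # a + b
    scaled_disp = min_disp + (max_disp - min_disp) * sigmoid_out  # a*sigma + b
    return 1.0 / scaled_disp

# Example: raw decoder logits -> depth map
depth = disp_to_depth(torch.sigmoid(torch.randn(1, 1, 192, 640)))
```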
The deep learning framework PyTorch [27] was used to implement our model. For comparison, the KITTI images were resized and downsampled to 640×192. The proposed network used the Adam [28] optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ and was trained for 22 epochs. The batch size was set to 4 and the smoothness term to 0.001. The learning rate was set to $10^{-4}$ for the first 20 epochs and reduced by a factor of 10 for the remaining epochs. The settings for the pose network were the same as in [22]. In addition, a single NVIDIA GeForce TITAN X with 12 GB of GPU memory was used in our experiments.
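A rough sketch of this optimizer and learning-rate schedule in PyTorch follows. The tiny stand-in networks and the commented loop body are placeholders for the actual depth and pose networks and loss computation, which are defined elsewhere in the paper.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the actual depth and pose networks (definitions omitted).
depth_net = nn.Conv2d(3, 1, 3, padding=1)
pose_net = nn.Conv2d(6, 6, 3, padding=1)
params = list(depth_net.parameters()) + list(pose_net.parameters())

# Adam with beta1 = 0.9, beta2 = 0.999 and an initial learning rate of 1e-4.
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))

# Keep lr = 1e-4 for the first 20 epochs, then divide it by 10.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

NUM_EPOCHS, BATCH_SIZE, SMOOTHNESS_WEIGHT = 22, 4, 0.001
for epoch in range(NUM_EPOCHS):
    # ... iterate over batches of BATCH_SIZE images resized to 640x192 and
    # minimize photometric loss + SMOOTHNESS_WEIGHT * smoothness loss ...
    scheduler.step()
```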
4.2. Evaluation metrics
To evaluate our method, we used the standard evaluation metrics shown in Table 1.
$|D|$ is the number of pixels in image $D$. $d$ is the predicted depth from the model, and $d^{*}$ is the depth ground truth. $\delta$ represents the threshold between the depth ground truth and the predicted depth, which is set to 1.25, $1.25^{2}$, and $1.25^{3}$, respectively.
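As an illustration, the metrics of Table 1 can be computed per image roughly as follows; compute_metrics is a hypothetical helper, and gt and pred are assumed to be 1-D NumPy arrays of the valid ground-truth and predicted depths.

```python
import numpy as np

def compute_metrics(gt, pred):
    """Standard depth metrics of Table 1 over the |D| valid pixels."""
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    # Accuracy under threshold: fraction of pixels with
    # max(d*/d, d/d*) = delta below 1.25, 1.25^2, 1.25^3.
    delta = np.maximum(gt / pred, pred / gt)
    a1, a2, a3 = (np.mean(delta < 1.25 ** k) for k in (1, 2, 3))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```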
4.3. KITTI Eigen split
The KITTI Eigen split [16] was used to train the proposed network. Before the network was trained, Zhou's [16] preprocessing was applied to remove static frames. As a result, the training dataset contained 39,810 monocular triplets covering 29 different scenes, the validation dataset had 4424 images, and there were 697 testing images. The depth ground truth of the KITTI dataset was captured by a Velodyne laser scanner. Following the work in [22], the intrinsics of all images were assumed to be the same: the principal point of the camera was set to the image center, and the focal length was defined as the average of all focal lengths in the KITTI dataset. In addition, the predicted depth results were rescaled using the per-image median ground-truth scaling proposed in [16]. When the results were evaluated, the maximum depth value was set to 80 m and the minimum to 0.1 m.
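A minimal sketch of this evaluation protocol, assuming per-image NumPy depth maps (the helper name scale_and_clip and the masking details are our illustration):

```python
import numpy as np

def scale_and_clip(pred, gt, min_depth=0.1, max_depth=80.0):
    """Per-image median ground-truth scaling, then clip to [0.1, 80] m.

    Monocular predictions are scale-ambiguous, so each prediction is
    rescaled by median(gt) / median(pred) before computing the metrics.
    """
    mask = (gt > min_depth) & (gt < max_depth)   # keep valid Velodyne points
    pred, gt = pred[mask], gt[mask]
    pred *= np.median(gt) / np.median(pred)
    return np.clip(pred, min_depth, max_depth), gt
```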