


               Table 4. Model capacity. "Params" is the number of parameters of the depth network, "Total params" indicates the total
               parameters for both the depth and pose networks, and M denotes millions.

                                          Method                Params   FLOPs         Total params
                                          MD2 (ResNet50) [22]   25.56M   1.0 × 10^10   61.8M
                                          ours                  25.03M   1.0 × 10^10   61.3M


                                         Table 5. Odometry results on the KITTI odometry dataset

                                         Method           Sequence 09     Sequence 10     # Frames
                                         ORB-SLAM [33]    0.014 ± 0.008   0.012 ± 0.011   -
                                         DDVO [26]        0.045 ± 0.108   0.033 ± 0.074   3
                                         Zhou* [16]       0.050 ± 0.039   0.034 ± 0.028   5→2
                                         Mahjourian [30]  0.013 ± 0.010   0.012 ± 0.011   3
                                         GeoNet [18]      0.012 ± 0.007   0.012 ± 0.009   5
                                         EPC++ (M) [19]   0.013 ± 0.007   0.012 ± 0.008   3
                                         Ranjan [24]      0.012 ± 0.007   0.012 ± 0.008   5
                                         MD2 (M)          0.018 ± 0.009   0.015 ± 0.010   2
                                         ours             0.017 ± 0.010   0.015 ± 0.010   2



               4.4.3. Network capacity
               To show that the proposed network improves accuracy without increasing network capacity, the number of
               network parameters and the number of floating-point operations (FLOPs) were computed to evaluate the
               capacity of the proposed network. The quantitative results are shown in Table 4. For a fair comparison, the
               pose networks of MD2 and ours were both set to ResNet50. Note that ResNet50 serves as our pose network
               only for this comparison; the pose network adopted in the proposed overall framework is still ResNet18.
               Compared with MD2, the proposed method improves the accuracy of the depth network without adding
               extra computational burden, as expected.
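
               As an illustration of how such figures can be obtained, the sketch below counts parameters and profiles
               operations for a stock torchvision ResNet50 used as a stand-in backbone. The 192 × 640 input resolution and
               the use of the third-party thop profiler are our assumptions for the example, not details taken from the paper.

                    import torch
                    from torchvision.models import resnet50

                    # Stand-in backbone; the stock torchvision ResNet50 classifier has about 25.6M parameters.
                    model = resnet50()
                    dummy = torch.randn(1, 3, 192, 640)   # assumed KITTI-style input resolution (H x W)

                    # Parameter count, reported in millions (the "M" unit of Table 4).
                    n_params = sum(p.numel() for p in model.parameters())
                    print(f"Params: {n_params / 1e6:.2f}M")

                    # Operation count via the thop profiler (pip install thop).
                    # thop counts multiply-accumulate operations; double the value if strict FLOPs are needed.
                    from thop import profile
                    ops, _ = profile(model, inputs=(dummy,), verbose=False)
                    print(f"Ops: {ops:.2e}")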


               4.5. Pose estimation
               Our pose model was evaluated on the standard KITTI odometry split [16]. This dataset includes 11 driving
               sequences. Sequences 00–08 were used to train our pose network without pose ground truth, while Sequences
               09 and 10 were used to evaluate the pose model. The average absolute trajectory error with standard deviation
               (in meters) was used as the evaluation metric. Godard's [22] handling strategy was followed to evaluate the
               two-frame model on five-frame snippets. Because the pose estimation results of Godard [22] are not provided
               for this configuration (M: monocular training, ResNet50 for the depth network, and ResNet18 for the pose
               network), we retrained the model and report the obtained result as MD2.
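
               As a concrete illustration of this metric, the following sketch (Python/NumPy; the function and argument
               names are ours, not from any released code) computes the average absolute trajectory error for one five-frame
               snippet after anchoring both trajectories at the same origin and fitting a single scale factor, since monocular
               predictions are scale-ambiguous.

                    import numpy as np

                    def snippet_ate(pred_xyz, gt_xyz):
                        """Average absolute trajectory error (meters) for one 5-frame snippet.

                        pred_xyz, gt_xyz: (5, 3) arrays of camera positions along the snippet.
                        """
                        # Shift the prediction so both trajectories start at the same point.
                        pred = pred_xyz + (gt_xyz[0] - pred_xyz[0])
                        # Fit one global scale factor (monocular scale ambiguity).
                        scale = np.sum(gt_xyz * pred) / np.sum(pred ** 2)
                        # Mean Euclidean position error over the snippet frames.
                        return np.linalg.norm(scale * pred - gt_xyz, axis=1).mean()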



               As shown in Table 5, our pose model takes only two adjacent frames at a time, and the output is the relative
               6-DoF pose between the two images. Even though our pose network structure is the same as that of MD2,
               our pose model obtains better performance than MD2. In addition, the results are comparable to those of
               other previous methods. Thus, the proposed depth network is observed to have a positive effect on the pose
               network.
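
               For concreteness, the sketch below outlines such a two-frame pose interface: a ResNet18-style encoder
               consumes the two stacked RGB frames and a small head regresses a 6-DoF relative pose (three axis-angle
               rotation parameters plus three translation components). The module layout and the 0.01 output scaling are
               assumptions borrowed from common Monodepth2-style implementations, not code taken from the paper.

                    import torch
                    import torch.nn as nn
                    from torchvision.models import resnet18

                    class TwoFramePoseNet(nn.Module):
                        """Regress the relative 6-DoF pose between two adjacent frames."""

                        def __init__(self):
                            super().__init__()
                            encoder = resnet18()
                            # Accept two stacked RGB frames (6 channels) instead of a single image.
                            encoder.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
                            encoder.fc = nn.Identity()          # expose the 512-d pooled feature
                            self.encoder = encoder
                            self.head = nn.Linear(512, 6)       # 3 axis-angle + 3 translation values

                        def forward(self, frame_a, frame_b):
                            x = torch.cat([frame_a, frame_b], dim=1)   # (B, 6, H, W)
                            pose = self.head(self.encoder(x))
                            # Small scaling keeps the initial relative pose close to identity.
                            return 0.01 * pose

               Here frame_a and frame_b are (B, 3, H, W) image tensors; at test time, successive two-frame estimates are
               chained into trajectories of the kind scored in Table 5.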



               5. CONCLUSIONS
               In this paper, a versatile end-to-end unsupervised learning framework for monocular depth and pose estimation
               is developed and evaluated on the KITTI dataset. Aggregated residual transformations (ResNeXt) are embedded
               in the depth network to extract high-dimensional features from the input image. In addition, the proposed
               wavelet SSIM loss is based on the 2D discrete wavelet transform (DWT). The DWT decomposes the image into
               patches of different frequency bands, which are fed to the SSIM loss; this helps the network converge and
               recover high-quality, sharp image patches. The evaluation results show that the performance of depth prediction
               is improved while the computational burden is reduced. In addition, the proposed method has great adaptive
               ability on the Make3D