
Shi et al. Art Int Surg 2024;4:247-57  https://dx.doi.org/10.20517/ais.2024.17                                                               Page 253

               Table 1. Quantitative results with the SCARED dataset
                                                 CAP  Abs Rel (% ↓)  Sq Rel (% ↓)  RMSE (mm ↓)  RMSE Log (mm ↓)
                HR-Depth [3]                     150  0.080          0.938         7.943        0.104
                MonoViT [17]                          0.074          0.865         7.517        0.097
                Lite-Mono [16]                        0.073          0.803         11.684       0.107
                AJ-Depth [15]                         0.078          0.896         7.578        0.101
                Monodepth2 [1]                        0.083          0.994         8.167        0.107
                AF-SfMLearner [9]                     0.062          0.513         5.289        0.087
                Depth anything (Zero-shot) [18]       0.106          1.376         8.695        0.146
                Ours                                  0.058          0.452         5.014        0.083
               The unit of each metric (% or millimeters, mm) is given in brackets. A downward arrow indicates that lower is better. The best results are
               in bold. SCARED: Stereo correspondence and reconstruction of endoscopic data; CAP: the cap (upper limit) applied to the depth values;
               Abs Rel: absolute relative error; Sq Rel: square relative error; RMSE: root-mean-squared error; RMSE Log: root-mean-square logarithmic error.
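
The four metrics in the table can be made concrete with a short sketch. This page does not spell out the formulas, so the code below assumes the definitions commonly used in the monocular depth estimation literature; the function name `depth_metrics`, the `min_depth` threshold, and the masking rule (ground-truth pixels above the cap are ignored, predictions are clamped to the cap) are illustrative assumptions, not the authors' evaluation code. It is written in plain Python so that it runs without dependencies.

```python
import math


def depth_metrics(gt, pred, cap=150.0, min_depth=1e-3):
    """Return Abs Rel, Sq Rel, RMSE and RMSE Log for two flat depth lists.

    Assumed conventions (not taken from the paper): only pixels whose
    ground-truth depth lies in (min_depth, cap] are evaluated, and
    predictions are clamped to [min_depth, cap] before comparison.
    """
    pairs = []
    for g, p in zip(gt, pred):
        if min_depth < g <= cap:
            pairs.append((g, min(max(p, min_depth), cap)))
    if not pairs:
        raise ValueError("no valid ground-truth pixels inside the depth cap")

    n = len(pairs)
    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in pairs) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n
    )
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
    }


# Toy example with two valid pixels (depths in mm) and one pixel beyond the cap.
example = depth_metrics(gt=[10.0, 20.0, 200.0], pred=[11.0, 18.0, 50.0])
print(example)
```

Self-supervised monocular methods are often evaluated after rescaling each prediction to the ground truth (for example, by the ratio of medians). That step is deliberately left out here because this page does not state whether it was applied.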

               Table 2. Quantitative results with the Hamlyn dataset
                                            Abs Rel (% ↓)  Sq Rel (% ↓)  RMSE (↓)  RMSE Log (↓)
                Endo-Depth-and-Motion [14]  0.185          5.424         16.1      0.225
                AF-SfMLearner [9]           0.175          4.589         14.21     0.209
                Ours                        0.165          4.081         13.497    0.201

               A downward arrow indicates that lower is better, and an upward arrow indicates that higher is better. The unit of each metric (% or
               millimeters, mm) is given in brackets. The best results are in bold. Abs Rel: absolute relative error; Sq Rel: square relative error;
               RMSE: root-mean-squared error; RMSE Log: root-mean-square logarithmic error.


               The superior performance of our model demonstrates the better generalization and robustness of the
               proposed LT-RL loss. Overall, our solution is simple yet effective, integrates easily with the conventional
               reprojection loss, and improves monocular depth estimation. Extending the method to four temporally
               adjacent frames improves the accuracy and robustness of depth estimation by providing more temporal
               context. This additional information helps capture the motion and structural details of the scene,
               leading to more accurate and consistent depth maps. We also conducted an external evaluation on the
               Hamlyn dataset, where our method marginally outperformed existing methods in depth estimation
               (Table 2). Although the improvement may seem small, such gains can be significant for subsequent
               reconstruction tasks. This demonstrates the robustness and practical value of our approach.
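
To put the size of the Hamlyn improvement in perspective, the relative error reduction over AF-SfMLearner [9] can be computed directly from the values reported in Table 2. The snippet below is only an arithmetic illustration; the helper name `relative_reduction` is introduced here and does not appear in the paper.

```python
# Values copied from Table 2 (Hamlyn dataset); lower is better for every metric.
baseline = {"Abs Rel": 0.175, "Sq Rel": 4.589, "RMSE": 14.21, "RMSE Log": 0.209}
ours = {"Abs Rel": 0.165, "Sq Rel": 4.081, "RMSE": 13.497, "RMSE Log": 0.201}


def relative_reduction(baseline_value, new_value):
    """Percentage by which new_value is lower than baseline_value."""
    return 100.0 * (baseline_value - new_value) / baseline_value


reductions = {
    name: relative_reduction(baseline[name], ours[name]) for name in baseline
}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower than AF-SfMLearner")
```

Under these numbers the reduction is roughly 4% to 11%, depending on the metric, with the largest relative gain in Sq Rel.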


               Qualitative results
               The qualitative results of the experiments are presented in Figures 3-5. In Figure 3, the depth
               predictions of our method are compared with those of the closely related Lite-Mono [16] and with the
               ground truth. The qualitative results demonstrate the superiority of our model over the competing
               methods. Notably, our model produces more continuous depth values and performs better on anatomical
               structures, especially in less textured and reflective regions, and it also performs well in areas with
               complex structures and substantial depth variations.

               Figure 4 plots the pose trajectories for a test video, comparing the trajectory predicted by Lite-Mono
               and the trajectory predicted by our model against the ground-truth (GT) pose. The ground-truth
               trajectory is drawn as a grey dashed line and the predicted trajectory as a black solid line. Our
               predicted trajectory closely follows the GT, whereas that of Lite-Mono shows a large deviation.
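
A trajectory plot of this kind is typically obtained by chaining the frame-to-frame poses predicted by the pose network into absolute camera positions. The sketch below illustrates that idea in plain Python under stated assumptions: each relative pose is a 4x4 homogeneous matrix that is right-multiplied onto the running pose, and the deviation from the GT is summarized as a position RMSE. The function names and the multiplication convention are hypothetical; this page does not describe how Figure 4 was generated.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def translation(tx, ty, tz):
    """Homogeneous transform with identity rotation and the given translation."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def accumulate_trajectory(relative_poses):
    """Chain relative poses into a list of camera positions.

    Assumed convention: pose_{t+1} = pose_t @ relative_pose_t, starting
    from the identity at the first frame.
    """
    pose = translation(0, 0, 0)
    positions = [(0, 0, 0)]
    for rel in relative_poses:
        pose = matmul4(pose, rel)
        positions.append((pose[0][3], pose[1][3], pose[2][3]))
    return positions


def position_rmse(pred_positions, gt_positions):
    """Root-mean-squared Euclidean distance between paired positions."""
    squared = [
        sum((p - g) ** 2 for p, g in zip(pp, gp))
        for pp, gp in zip(pred_positions, gt_positions)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy example: the GT moves 1 unit along x per frame, the prediction drifts in y.
gt_traj = accumulate_trajectory([translation(1, 0, 0)] * 3)
pred_traj = accumulate_trajectory([translation(1, 0.1, 0)] * 3)
print(position_rmse(pred_traj, gt_traj))
```

Because monocular pose estimates are only defined up to scale, a predicted trajectory is usually scaled or aligned to the GT before such a comparison; that alignment step is omitted here for brevity.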