Table 3. Ablation study on fewer or more frames

Frames   Abs Rel (% ↓)   Sq Rel (% ↓)   RMSE (↓)   RMSE Log (↓)
2        0.062           0.513          5.289      0.094
4        0.058           0.452          5.014      0.083
6        0.611           0.448          5.209      0.091

Quantitative comparison with 2, 4 and 6 consecutive frames, conducted on the SCARED dataset. The unit of each metric, % or millimeter (mm), is indicated in brackets. The best results are shown in bold. Abs Rel: absolute relative error; Sq Rel: squared relative error; RMSE: root-mean-squared error; RMSE Log: root-mean-squared logarithmic error; SCARED: stereo correspondence and reconstruction of endoscopic data.
regions, and handle complex anatomical structures. Ablation studies underscore the importance of using an optimal number of consecutive frames (four, in this case) to maximize depth estimation performance while mitigating occlusion. While LT-RL does not affect the inference phase, its requirement for additional frames during training increases the computational overhead. Additionally, although our method demonstrates excellent generalization on the Hamlyn dataset, the specificity of our validation datasets suggests that further research is needed to fully understand LT-RL's performance across a broader range of endoscopic and surgical scenarios.
In conclusion, we present LT-RL, which integrates longer-term temporal information to tackle occlusion artifacts in endoscopic surgery. Our extensive validation and comparison demonstrate that it is crucial to account for the small camera pose changes typical of endoscopic surgery, and that the proposed LT-RL addresses this issue successfully. External validation on the Hamlyn dataset demonstrates the robustness and generalization of the proposed method. Although LT-RL requires extra computation for the additional frames during training, it has no effect on the inference phase, as no loss calculation is needed in deployment. Our self-supervised loss is simple, flexible, and easy to adapt to any network architecture, whether convolutional or transformer-based. The excellent 3D reconstruction reflects the superior depth and pose learning and prediction of LT-RL over other models. Future work should investigate the reliability of LT-RL relative to the vanilla reprojection loss. Computational efficiency could also be improved by using a shared encoder and an equal number of input frames for both the depth and pose estimation tasks.
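Since the loss is described as simple and architecture-agnostic, the PyTorch sketch below illustrates one plausible form of a multi-frame reprojection loss over a longer temporal window, assuming a per-pixel minimum over warped source frames in the style of self-supervised depth estimation. The function names, the plain L1 photometric error, and the assumption that warping is performed upstream are illustrative choices for this sketch, not the paper's exact formulation.

```python
import torch

def photometric_error(warped, target):
    # Plain per-pixel L1 error; self-supervised depth methods typically
    # mix SSIM with L1, omitted here to keep the sketch minimal.
    return (warped - target).abs().mean(dim=1, keepdim=True)

def long_term_reprojection_loss(target, warped_sources):
    """Hypothetical sketch of a long-term reprojection loss.

    target:         (B, 3, H, W) reference frame.
    warped_sources: list of (B, 3, H, W) source frames from a longer
                    temporal window (e.g., 4 consecutive frames), each
                    already warped into the reference view using the
                    predicted depth and relative camera pose.
    """
    errors = torch.stack(
        [photometric_error(w, target) for w in warped_sources], dim=0
    )
    per_pixel_min, _ = errors.min(dim=0)  # best-matching source per pixel
    return per_pixel_min.mean()
```

Taking the per-pixel minimum, rather than the average, is what allows a pixel occluded in one source frame to still be supervised by another frame in which it is visible; this is how a longer temporal window can mitigate occlusion artifacts.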
DECLARATIONS
Authors’ contributions
Conceptualization, investigation, methodology, validation, visualization, writing - original draft, writing -
review and editing: Shi X
Conceptualization, methodology, visualization, writing - original draft, writing - review and editing: Islam
M
Conceptualization, methodology, writing - review and editing: Clarkson MJ
Conceptualization, validation, visualization, writing - review and editing: Cui B
Availability of data and materials
Our code is available at https://github.com/xiaowshi/Long-Term_Reprojection_Loss.
Financial support and sponsorship
This work was part-funded by the EPSRC grant [EP/W00805X/1].

