Page 248 Shi et al. Art Int Surg 2024;4:247-57 https://dx.doi.org/10.20517/ais.2024.17
error (RMSE) over the baselines of vanilla reprojection loss.
Conclusion: Our LT-RL self-supervised depth and pose estimation technique is a simple yet effective method to
tackle occlusion artifacts in monocular surgical video. It adds no training parameters, so it integrates flexibly with
any network architecture while significantly improving performance.
Keywords: Monocular depth estimation, self-supervised learning, reprojection loss, robotic surgery
INTRODUCTION
Depth estimation in robotic surgery is vital for surgical field mapping, instrument tracking, 3D modeling for
surgical training, and lesion inspection in virtual and augmented reality. Stereo cameras provide stronger depth
cues through stereo correspondences and multiview images, whereas monocular endoscopes cannot directly
recover depth information. Nevertheless, in image-guided procedures such as robotic and laparoscopic surgery,
the monocular endoscope is more popular due to its better accessibility and smaller incisions. Recently, several
reprojection loss-based self-supervised depth estimation techniques using monocular videos have appeared in
both general computer vision and surgical vision[1-3]. However, the small camera pose changes in the narrow
surgical environment require long-term dependency on the monocular video frames to address occlusion
artifacts during depth estimation. In this work, we propose a long-term reprojection loss (LT-RL) that considers
longer temporally adjacent frames before and after the target frame in self-supervised depth estimation.
There are several works on improving the reprojection loss for self-supervised depth estimation. Garg et al.
pioneered self-supervised depth estimation with the proxy task of stereo view synthesis based on a given
camera model using an L1 loss[4]. Monodepth[5] refined this via differentiable bilinear synthesis[6] and a
weighted combination of SSIM[7] and L1 losses. SfM-Learner[2] proposed the first fully monocular
self-supervised depth-pose framework by substituting the stereo transform (fixed stereo baseline) with a second
regression network that predicts the ego-motion of the camera. Monodepth2[1] optimized this work by
introducing a minimum reprojection loss and an edge-aware smoothness loss. The minimum reprojection loss
attempts to address occlusion artifacts by selecting, per pixel, the minimum reprojection (photometric) error
between the target frame and the first adjacent frames before and after it. However, we argue that selecting the
minimum loss by comparing only with the first adjacent frames is insufficient in the surgical environment,
where changes in camera pose are very small.
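The per-pixel minimum described above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the arrays stand in for source frames that have already been warped into the target view via predicted depth and pose, the SSIM term used in practice is omitted, and the function names are our own.

```python
import numpy as np

def photometric_error(warped, target):
    # Per-pixel L1 photometric error between a warped source frame and
    # the target frame (the SSIM term used in practice is omitted here).
    return np.abs(warped - target).mean(axis=-1)

def min_reprojection_loss(target, warped_sources):
    # Monodepth2-style minimum reprojection loss: at each pixel, keep only
    # the smallest error over the warped source frames, so a pixel occluded
    # in one source can still be supervised by the other.
    errors = np.stack([photometric_error(w, target) for w in warped_sources])
    return errors.min(axis=0).mean()
```

With the warped frames at t-1 and t+1 as `warped_sources`, a pixel hidden in one neighbour but visible in the other contributes only its smaller error, which is what suppresses single-frame occlusion artifacts.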
In this work, we design a LT-RL that considers the two adjacent frames on each side of the target frame (four
frames in total) when selecting the minimum reprojection loss. In the surgical domain, small camera pose
changes limit the ability of the reprojection error to handle pixels that are visible in the target image but not
visible in the immediate source images before and after it. Hence, the LT-RL with four temporally adjacent
frames increases the chances of tackling occlusion artifacts. Our contributions and findings can be
summarized as:
- Design a LT-RL that addresses occlusion artifacts by integrating longer temporal information during self-
supervised depth estimation from monocular video in endoscopic surgery.
- Demonstrate the flexibility of the proposed LT-RL by plugging it into the Monodepth2 network architecture.
- Validate the proposed method on the benchmark surgical depth estimation dataset of Stereo
Correspondence and Reconstruction of Endoscopic Data (SCARED) and compare it with state-of-the-art self-
supervised baselines. The results demonstrate the effectiveness of our LT-RL in both depth estimation and 3D
reconstruction.
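The benefit of widening the temporal window can be seen in a toy occlusion scenario. The sketch below is our own illustration under simplifying assumptions (pre-warped frames, L1 error only, synthetic data): a pixel visible in the target is hidden in both immediate neighbours (t-1, t+1) but revealed again at t-2 and t+2, so only the four-frame minimum recovers it.

```python
import numpy as np

def per_pixel_min_error(target, warped_sources):
    # Per-pixel minimum L1 photometric error over a set of warped source frames.
    errs = np.stack([np.abs(w - target).mean(axis=-1) for w in warped_sources])
    return errs.min(axis=0)

# Synthetic frames: all-zero target; an occluder covers pixel (1, 1)
# in both nearest neighbours but not in the second neighbours.
target = np.zeros((4, 4, 3))
near = [np.zeros((4, 4, 3)) for _ in range(2)]   # warped t-1, t+1
far = [np.zeros((4, 4, 3)) for _ in range(2)]    # warped t-2, t+2
for w in near:
    w[1, 1] = 1.0  # occluded pixel mismatches in both nearest frames

short_term = per_pixel_min_error(target, near).mean()        # +-1 window: residual error
long_term = per_pixel_min_error(target, near + far).mean()   # +-2 window (LT-RL): error vanishes
```

Here `short_term` stays positive because the occluded pixel has no valid correspondence in either immediate neighbour, while `long_term` drops to zero, mirroring the argument for the four-frame window.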

