

               error (RMSE) over the baselines of vanilla reprojection loss.

               Conclusion: Our LT-RL self-supervised depth and pose estimation technique is a simple yet effective method to
               tackle occlusion artifacts in monocular surgical video. It adds no trainable parameters, making it easy to
               integrate with any network architecture while significantly improving performance.

               Keywords: Monocular depth estimation, self-supervised learning, reprojection loss, robotic surgery




               INTRODUCTION
               Depth estimation in robotic surgery is vital for surgical field mapping, instrument tracking, 3D modeling for
               surgical training, and lesion inspection in virtual and augmented reality. Stereo cameras provide richer
               depth cues through stereo correspondences and multiview images, whereas monocular endoscopes cannot
               recover depth information directly. However, in image-guided surgery, such as robotic and laparoscopic
               surgery, the monocular endoscope is more popular due to better accessibility and smaller incisions.
               Recently, several reprojection loss-based self-supervised depth estimation techniques using monocular
               videos have emerged in both computer vision and surgical vision[1-3]. Nevertheless, the small camera pose
               changes in the narrow surgical environment require long-term dependency on the monocular video frames
               to address occlusion artifacts during depth estimation. In this work, we propose a long-term reprojection
               loss (LT-RL) that considers longer temporally adjacent frames before and after the target frame in
               self-supervised depth estimation.


               There are several works on improving the reprojection loss for self-supervised depth estimation. Garg et al.
               pioneered self-supervised depth estimation with the proxy task of stereo view synthesis based on a given
               camera model using an L1 loss[4]. Monodepth[2] refined this via differentiable bilinear synthesis[5] and a
               weighted combination of SSIM and L1 loss[6]. SfM-Learner[7] proposed the first fully monocular self-supervised
               depth-pose framework by substituting the stereo transform (fixed stereo baseline) with another regression
               network that predicts the ego-motion of the camera. Monodepth2[1] optimized this work through the
               introduction of a minimum reprojection loss and an edge-aware smoothness loss. The minimum reprojection
               loss attempts to address occlusion artifacts by selecting the minimum reprojection loss, i.e., photometric
               error, between the target frame and the first adjacent frames before and after it. However, we argue that
               selecting the minimum loss by only comparing with the first adjacent frames is not sufficient in the surgical
               environment, where changes in camera pose are very small.
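
               For concreteness, the sketch below (our illustration, not the authors' released code) shows how a
               Monodepth2-style per-pixel minimum reprojection loss can be computed in PyTorch, assuming the source
               frames have already been warped into the target view using the predicted depth and pose. The photometric
               error is the usual weighted combination of SSIM and L1, with the conventional weight alpha = 0.85.

                   import torch
                   import torch.nn.functional as F

                   def ssim_dissimilarity(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
                       # Per-pixel (1 - SSIM) / 2 dissimilarity over 3x3 local windows.
                       mu_x = F.avg_pool2d(x, 3, 1, 1)
                       mu_y = F.avg_pool2d(y, 3, 1, 1)
                       sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
                       sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
                       sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
                       ssim_n = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
                       ssim_d = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
                       return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1)

                   def photometric_error(pred, target, alpha=0.85):
                       # Weighted SSIM + L1 photometric error, following Monodepth2.
                       l1 = (pred - target).abs().mean(1, keepdim=True)
                       ssim = ssim_dissimilarity(pred, target).mean(1, keepdim=True)
                       return alpha * ssim + (1 - alpha) * l1

                   def min_reprojection_loss(target, warped_sources):
                       # Per-pixel minimum of the photometric errors over all warped
                       # source frames; Monodepth2 uses the immediate neighbors t-1, t+1.
                       errors = torch.cat([photometric_error(w, target)
                                           for w in warped_sources], dim=1)
                       return errors.min(dim=1).values.mean()

               Taking the minimum per pixel, rather than the average, lets each pixel be supervised by whichever
               source frame sees it unoccluded, which is exactly what fails when all candidate sources are nearly
               identical to the target.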

               In this work, we design an LT-RL that considers up to the second adjacent frames, i.e., four frames in total
               (two before and two after the target frame), when selecting the minimum reprojection loss. In the surgical
               domain, small camera pose changes limit the reprojection error when projecting pixels that are visible in
               the target image but not visible in the immediate source images before and after it. Hence, LT-RL with four
               temporally adjacent frames increases the chances of tackling occlusion artifacts; a minimal sketch follows
               the contribution list below. Our contributions and findings can be summarized as:


               - Design an LT-RL to address occlusion artifacts by integrating longer temporal information during self-
               supervised depth estimation using monocular video in endoscopic surgery.
               - Demonstrate the flexibility of the proposed LT-RL by plugging it into the Monodepth2 network architecture.
               - Validate the proposed method on the benchmark surgical depth estimation dataset of Stereo
               correspondence and reconstruction of endoscopic data (SCARED) and compare it with state-of-the-art self-
               supervised baselines. The results suggest the effectiveness of our LT-RL in both depth estimation and 3D
               reconstruction.
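
               As noted above, here is a minimal sketch of the LT-RL idea (ours, not the authors' implementation): the
               per-pixel minimum is simply taken over the four long-term neighbors t-2, t-1, t+1, t+2 instead of the two
               immediate ones, so no trainable parameters are added. The helper warp_to_target is a hypothetical
               placeholder, standing in for differentiable bilinear warping of a source frame into the target view using
               the predicted depth map and relative camera pose.

                   # LT-RL sketch: widen the per-pixel minimum from {t-1, t+1} to
                   # {t-2, t-1, t+1, t+2}. `warp_to_target` is a hypothetical placeholder
                   # for differentiable bilinear warping of a source frame into the
                   # target view using the predicted depth and relative pose.
                   def lt_rl_loss(frames, t, depth, poses):
                       target = frames[t]
                       warped = [warp_to_target(frames[t + k], depth, poses[k])
                                 for k in (-2, -1, 1, 2)]            # long-term neighbors
                       return min_reprojection_loss(target, warped)  # defined above

               Because the change is confined to the loss, the same substitution can be made in any reprojection-based
               self-supervised pipeline without touching the depth or pose networks.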