



Overall network architecture
The structure of the depth estimation network and pose network is shown in Figure 2. It consists of a regression network for pose and an encoder-decoder network for depth estimation. Following Monodepth2, we adopt a depth estimation network based on the commonly used UNet architecture[10] with a ResNet18[11] encoder and corresponding decoder blocks. The pose network is a separate ResNet18 regressor. Our goal is to demonstrate the effectiveness and flexibility of the proposed loss function using existing network architectures.
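As a rough illustration of such an encoder-decoder depth network and ResNet18 pose regressor, the PyTorch sketch below builds on a torchvision backbone; the class names, channel widths, and decoder layout are illustrative assumptions rather than the authors' released implementation.

import torch
import torch.nn as nn
import torchvision.models as models


class DepthNet(nn.Module):
    """UNet-style depth network: ResNet18 encoder with an upsampling skip-connection decoder."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # ResNet18 encoder stages (feature widths 64, 64, 128, 256, 512).
        self.enc = nn.ModuleList([
            nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu),
            nn.Sequential(resnet.maxpool, resnet.layer1),
            resnet.layer2, resnet.layer3, resnet.layer4,
        ])
        enc_ch = [64, 64, 128, 256, 512]   # encoder output channels per stage
        dec_ch = [16, 32, 64, 128, 256]    # decoder output channels (illustrative)
        self.dec = nn.ModuleList()
        for i in reversed(range(5)):
            in_ch = enc_ch[4] if i == 4 else dec_ch[i + 1]
            skip_ch = enc_ch[i - 1] if i > 0 else 0
            self.dec.append(nn.Sequential(
                nn.Conv2d(in_ch + skip_ch, dec_ch[i], 3, padding=1), nn.ELU()))
        self.disp_head = nn.Sequential(nn.Conv2d(dec_ch[0], 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        feats = []
        for stage in self.enc:            # collect multi-scale encoder features
            x = stage(x)
            feats.append(x)
        x = feats[-1]
        for idx, block in enumerate(self.dec):
            i = 4 - idx
            x = nn.functional.interpolate(x, scale_factor=2, mode="nearest")
            if i > 0:                     # fuse the matching encoder feature (skip connection)
                x = torch.cat([x, feats[i - 1]], dim=1)
            x = block(x)
        return self.disp_head(x)          # per-pixel disparity in (0, 1)


class PoseNet(nn.Module):
    """Separate ResNet18 regressor predicting a 6-DoF relative pose (axis-angle + translation)."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Accept a concatenated (target, source) frame pair: 6 input channels.
        resnet.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        resnet.fc = nn.Linear(resnet.fc.in_features, 6)
        self.backbone = resnet

    def forward(self, frame_pair):
        return self.backbone(frame_pair)  # [B, 6] pose vector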

There are a total of five input frames during training: one target and four source frames. The self-supervised optimization is performed using a combined loss of our LT-RL and a smoothness loss, following the baseline models of Monodepth2[1] and AF-SfMLearner[9]. The combined loss can be expressed as:
                                                                                                        (8)
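As a rough guide to how such a combination is typically written in Monodepth2-style training, the LaTeX below is a hedged sketch; the weighting symbol λ and the exact notation are assumptions, not a reproduction of the paper's Equation (8).

% Hedged sketch: combined training objective (λ is an assumed smoothness weight)
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LT-RL}} + \lambda \, \mathcal{L}_{\text{smooth}}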


To enhance the depth map, we adopted the Tikhonov regularizer R(p) during training, following Equation (6), as in AF-SfMLearner[9].

Dataset
SCARED dataset
SCARED is a sub-challenge of the MICCAI EndoVis 2019 challenge[12]. It contains seven endoscopic videos of seven different scenes, and each scene was captured from a stereo viewpoint, providing two perspectives for depth perception, but only the left view was used. The data were collected from the internal abdominal anatomy of fresh pig cadavers using a da Vinci Xi surgical system and a projector. We downscaled the images to 320 × 256 pixels (width × height), a quarter of their original size. Bi-linear interpolation was used during down-sampling to preserve as much visual information as possible. The depth capping (CAP) was set to 150 mm following[13], which means that the depth range was scaled within this threshold. The experiment was conducted with 15,351 images used for training, 1,705 images for validation, and 551 images for testing. Following previous work[13], our data split follows established methodologies: training set (keyframes 1 and 2 from datasets 1-9, and keyframes 3-4 from datasets 8-9), validation set (keyframe 3 from datasets 2-7), and test set (keyframe 4 from datasets 2-7 and keyframe 3 from dataset 1), with no overlap. This approach ensures robust model evaluation across datasets, aligning with field practices.
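As a small illustration of the preprocessing just described (bilinear downscaling to 320 × 256 and 150 mm depth capping), the Python sketch below uses OpenCV and NumPy; the function name and the nearest-neighbour resizing of the depth map are illustrative assumptions, not the authors' code.

import cv2
import numpy as np

TARGET_W, TARGET_H = 320, 256   # quarter of the original resolution (width x height)
DEPTH_CAP_MM = 150.0            # depth capping (CAP) threshold in millimetres

def preprocess_frame(image, depth_mm=None):
    """Downscale an endoscopic frame and optionally cap its ground-truth depth map."""
    # Bilinear interpolation preserves as much visual information as possible.
    image_small = cv2.resize(image, (TARGET_W, TARGET_H), interpolation=cv2.INTER_LINEAR)
    if depth_mm is None:
        return image_small, None
    # Depth maps are resized with nearest-neighbour to avoid mixing depth values (assumption).
    depth_small = cv2.resize(depth_mm, (TARGET_W, TARGET_H), interpolation=cv2.INTER_NEAREST)
    depth_capped = np.clip(depth_small, 0.0, DEPTH_CAP_MM)  # keep depth within the 150 mm cap
    return image_small, depth_capped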


Hamlyn dataset
The Hamlyn dataset (https://hamlyn.doc.ic.ac.uk/vision/) consists of 21 videos from various surgical procedures and contains complex surgical scenes with deformations, reflections, and occlusions. All 21 videos are used for external validation to investigate depth prediction under occlusion for the proposed method, following[14].

Implementation details
We adopt the official implementation (https://github.com/ShuweiShao/AF-SfMLearner) of AF-SfMLearner[9] as our backbone network and base optimizer. The network is trained for 20 epochs, employing the Adam optimizer with a batch size of 40 and a learning rate of 10⁻⁴. The overall network and training script are implemented using the PyTorch framework. The optimization is performed in a self-supervised manner using our proposed LT-RL loss formulated in Equation (8). To compare the
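A minimal sketch of this training configuration (Adam, batch size 40, learning rate 1e-4, 20 epochs) is shown below; DepthNet and PoseNet refer to the earlier sketches, while scared_train_set and compute_combined_loss are hypothetical placeholders, not the released AF-SfMLearner code.

import torch
from torch.utils.data import DataLoader

depth_net, pose_net = DepthNet(), PoseNet()                 # networks sketched earlier
params = list(depth_net.parameters()) + list(pose_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)               # learning rate 10^-4

# scared_train_set is a hypothetical PyTorch Dataset wrapping the SCARED training split.
train_loader = DataLoader(scared_train_set, batch_size=40, shuffle=True)

for epoch in range(20):                                     # 20 training epochs
    for batch in train_loader:
        optimizer.zero_grad()
        # Hypothetical helper combining the LT-RL and smoothness terms of Equation (8).
        loss = compute_combined_loss(depth_net, pose_net, batch)
        loss.backward()
        optimizer.step()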