Overall network architecture
The structure of the depth estimation network and the pose network is shown in Figure 2. The model consists of a regression network for pose and an encoder-decoder network for depth estimation. Following Monodepth2, we adopt the commonly used UNet[10] architecture for the depth estimation network, with a ResNet18[11] encoder and corresponding decoder blocks. The pose network is a separate ResNet18 regressor. Our goal is to demonstrate the effectiveness and flexibility of the proposed loss function using existing network architectures.
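As a concrete illustration, the sketch below pairs a torchvision ResNet18 encoder with a simple upsampling decoder for depth and a separate ResNet18 regressor for pose. The decoder widths, the omission of UNet skip connections, and the 6-DoF pose head are simplifications for brevity, not the exact published configuration.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class DepthNet(nn.Module):
    """UNet-style depth network: ResNet18 encoder plus upsampling decoder
    (skip connections omitted for brevity)."""
    def __init__(self):
        super().__init__()
        # Drop avgpool/fc so the encoder outputs a 512-channel feature map.
        self.encoder = nn.Sequential(*list(resnet18(weights=None).children())[:-2])
        blocks, ch = [], 512
        for out_ch in (256, 128, 64, 32, 16):  # five x2 stages undo the /32 encoder stride
            blocks += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                       nn.Conv2d(ch, out_ch, 3, padding=1), nn.ELU()]
            ch = out_ch
        blocks += [nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid()]  # disparity in (0, 1)
        self.decoder = nn.Sequential(*blocks)

    def forward(self, img):                     # img: (B, 3, 256, 320)
        return self.decoder(self.encoder(img))  # disparity: (B, 1, 256, 320)

class PoseNet(nn.Module):
    """Separate ResNet18 regressor mapping a frame pair to a 6-DoF relative pose."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(weights=None)
        # Accept a concatenated (target, source) pair, i.e., 6 input channels.
        self.backbone.conv1 = nn.Conv2d(6, 64, 7, stride=2, padding=3, bias=False)
        self.backbone.fc = nn.Linear(512, 6)    # axis-angle rotation + translation

    def forward(self, target, source):
        # Small scaling keeps initial pose estimates near identity (a common convention).
        return 0.01 * self.backbone(torch.cat([target, source], dim=1))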
There are a total of five input frames during training: one target and four source frames. The self-supervised optimization is performed using a combination of our LT-RL loss and a smoothness loss, following the baseline models of Monodepth2[9] and AF-SfMLearner[1]. The combined loss can be expressed as:
$$\mathcal{L} = \mathcal{L}_{\mathrm{LT\text{-}RL}} + \lambda\,\mathcal{L}_{\mathrm{smooth}} \tag{8}$$
To enhance the depth map, we adopt the Tikhonov regularizer R(p) during training, following Equation (6) of AF-SfMLearner[9].
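As a sketch of how the objective might be assembled in code: the LT-RL term is this paper's contribution and is not defined in this excerpt, so it enters below only as a precomputed placeholder, while the edge-aware smoothness term and the 1e-3 weight follow the common Monodepth2 convention and are assumptions here.

import torch

def smoothness_loss(disp, img):
    # Edge-aware smoothness on mean-normalized disparity (Monodepth2-style).
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def combined_loss(lt_rl_term, disp, img, smooth_weight=1e-3):
    # Equation (8) as sketched: LT-RL term plus weighted smoothness.
    return lt_rl_term + smooth_weight * smoothness_loss(disp, img)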
Dataset
SCARED dataset
SCARED is a sub-challenge of the MICCAI EndoVis 2019 challenge[12]. It contains seven endoscopic videos covering seven different scenes, each captured from a stereo viewpoint; although two perspectives are available for depth perception, only the left view was used. The data were collected from the internal abdominal anatomy of fresh pig cadavers using a da Vinci Xi surgical system and a projector. We downscaled the images to 320 × 256 pixels (width × height), a quarter of their original size, using bilinear interpolation to preserve as much visual information as possible. The depth cap (CAP) was set to 150 mm following[13], meaning that the depth range was scaled within this threshold. The experiment was conducted with 15,351 images for training, 1,705 images for validation, and 551 images for testing. Following previous work[13], our data split uses established methodology: the training set comprises keyframes 1 and 2 from datasets 1-9 plus keyframes 3 and 4 from datasets 8-9, the validation set comprises keyframe 3 from datasets 2-7, and the test set comprises keyframe 4 from datasets 2-7 and keyframe 3 from dataset 1, with no overlap between splits. This ensures a robust model evaluation across datasets, in line with field practice.
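For concreteness, the snippet below reproduces the stated preprocessing and split; the 320 × 256 bilinear resize, the 150 mm cap, and the keyframe assignments follow the text, while the function names and file handling are hypothetical.

import numpy as np
from PIL import Image

WIDTH, HEIGHT, DEPTH_CAP_MM = 320, 256, 150.0

def preprocess_frame(path):
    # Quarter-resolution bilinear downscale of a left-view frame.
    return Image.open(path).resize((WIDTH, HEIGHT), Image.BILINEAR)

def cap_depth(depth_mm):
    # Clamp ground-truth depth to the 150 mm range used for evaluation.
    return np.clip(depth_mm, 0.0, DEPTH_CAP_MM)

# (dataset, keyframe) pairs reproducing the split described above.
TRAIN = [(d, k) for d in range(1, 10) for k in (1, 2)] + \
        [(d, k) for d in (8, 9) for k in (3, 4)]
VAL = [(d, 3) for d in range(2, 8)]
TEST = [(d, 4) for d in range(2, 8)] + [(1, 3)]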
Hamlyn dataset
The Hamlyn dataset (https://hamlyn.doc.ic.ac.uk/vision/) consists of 21 videos from various surgical procedures and contains complex surgical scenes with deformations, reflections, and occlusions. All 21 videos are used for external validation to investigate the proposed method's depth prediction under occlusion, following[14].
Implementation details
We adopt the official implementation (https://github.com/ShuweiShao/AF-SfMLearner) of AF-SfMLearner[9] as our backbone network and base optimizer. The network is trained for 20 epochs,
employing the Adam optimizer with a batch size of 40 and a learning rate of 10⁻⁴. The overall network and training script are implemented in the PyTorch framework. The optimization is performed in a self-supervised manner using our proposed LT-RL loss formulated in Equation (8).
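Putting the stated hyperparameters together, a minimal training-loop sketch might look as follows; it reuses the DepthNet, PoseNet, and combined_loss sketches above and substitutes random tensors for the actual SCARED loader.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: random frames in place of the SCARED training set.
frames = torch.rand(80, 3, 256, 320)
loader = DataLoader(TensorDataset(frames), batch_size=40, shuffle=True)

depth_net, pose_net = DepthNet(), PoseNet()  # from the architecture sketch above
# Both networks are optimized jointly, as in the real pipeline.
optimizer = torch.optim.Adam(
    list(depth_net.parameters()) + list(pose_net.parameters()), lr=1e-4)

for epoch in range(20):
    for (batch,) in loader:
        disp = depth_net(batch)
        # Placeholder objective: smoothness only; the full Equation (8)
        # would add the LT-RL term computed from warped source frames.
        loss = combined_loss(torch.tensor(0.0), disp, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()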
To compare the

