Figure 2. Four typical examples of the SCARED data. In each row, as the robot moves the endoscope, diverse views and the corresponding robot kinematics are recorded in sequence. SCARED: Stereo Correspondence And Reconstruction of Endoscopic Data.


is used to calculate high-quality depth maps for each frame. As a result, the dataset provides endoscopic videos with ground-truth depth maps and robot kinematics. Typical examples of the SCARED data are illustrated in Figure 2. In addition, the robot kinematics information is utilized to restore the metric scale of the estimated depth.
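The paper does not detail the scale-restoration step at this point, but a common approach is to compare the metric camera translation reported by the robot kinematics with the translation of the up-to-scale estimated trajectory. The Python sketch below illustrates this idea under that assumption; the function restore_scale and its arguments are hypothetical names, not from the paper.

```python
import numpy as np

def restore_scale(est_positions: np.ndarray, kin_positions: np.ndarray) -> float:
    """Recover a global metric scale for an up-to-scale camera trajectory.

    est_positions: (N, 3) camera positions from monocular estimation (arbitrary scale).
    kin_positions: (N, 3) camera positions from robot kinematics (metric units).
    """
    # Total path length between consecutive frames in each trajectory.
    est_len = np.linalg.norm(np.diff(est_positions, axis=0), axis=1).sum()
    kin_len = np.linalg.norm(np.diff(kin_positions, axis=0), axis=1).sum()
    # The ratio of metric to estimated path length gives the scale factor,
    # which can then multiply the estimated depths.
    return kin_len / est_len
```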


In our implementation, we used the network architecture proposed in Mannequin Challenge [27] with pre-trained weights as the monocular depth network for coarse depth adaptation. Twenty fine-tuning epochs were used in the surgical scene-specific adaptation. We set the geometric consistency check parameter to 4. For the NeRF-based optimization, we followed the settings in NeRF [15]. Specifically, we sampled 64 points along each ray and used a batch of 1,024 rays during training. We added random Gaussian noise with zero mean and unit variance to the density to regularize the network. Additionally, positional encoding was utilized to capture high-frequency details. Using the Adam optimizer with an initial learning rate of 5e-4, which decayed exponentially to 5e-5, we trained our NeRF on each surgical scene for 200k iterations. All experiments were conducted on a single NVIDIA RTX 2080 Ti GPU.
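To make these settings concrete, the following PyTorch-style sketch shows the three ingredients just mentioned: a frequency-based positional encoding in the style of NeRF [15], Gaussian noise added to the raw density before activation, and an Adam optimizer whose learning rate decays exponentially from 5e-4 to 5e-5. This is a minimal illustration under our own naming (e.g., num_freqs, density_raw), not the authors' code.

```python
import torch

def positional_encoding(x: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """One common variant: map each coordinate to [x, sin(2^k x), cos(2^k x)]
    for k = 0..num_freqs-1, so the MLP can capture high-frequency detail."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * x))
        feats.append(torch.cos((2.0 ** k) * x))
    return torch.cat(feats, dim=-1)

# Stand-in for the NeRF MLP: 3-D points encoded with num_freqs=10 give
# 3 + 3 * 2 * 10 = 63 input features; the 4 outputs are (RGB, density).
model = torch.nn.Linear(63, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# Exponential decay reaching 5e-5 after total_steps steps:
# lr(t) = 5e-4 * (5e-5 / 5e-4) ** (t / total_steps).
total_steps = 200_000
gamma = (5e-5 / 5e-4) ** (1.0 / total_steps)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

# During training, zero-mean unit-variance noise perturbs the raw density
# before the activation, regularizing the network (as in NeRF).
density_raw = torch.randn(1024, 64)  # (rays per batch, samples per ray)
density = torch.relu(density_raw + torch.randn_like(density_raw))
```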


3.2 Performance metrics
Table 1 lists the depth evaluation metrics [28] used in our experiments, where $d$ and $d^*$ denote the estimated depth value and the corresponding ground truth, respectively, $D$ represents the estimated depth map, and $\delta \in \{1.25^1, 1.25^2\}$. Additionally, since the comparison methods cannot accurately predict depth maps with an absolute scale from monocular images, we employ the ground-truth median scaling method [29] to scale the predicted depth. The scaling is performed as follows:


$$D_{\text{scaled}} = D \cdot s = D \cdot \frac{\operatorname{median}(G)}{\operatorname{median}(D)} \tag{8}$$

where $D_{\text{scaled}}$ denotes the scaled predicted depth, $s$ represents the scale factor calculated by the median scaling method, and $G$ is the ground-truth depth.
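As a minimal NumPy sketch, the snippet below applies Equation (8) and then evaluates one of the threshold metrics from Table 1; the function names (median_scale, threshold_accuracy) and the synthetic depth maps are ours, for illustration only.

```python
import numpy as np

def median_scale(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Equation (8): D_scaled = D * median(G) / median(D)."""
    return pred * (np.median(gt) / np.median(pred))

def threshold_accuracy(pred: np.ndarray, gt: np.ndarray, thresh: float) -> float:
    """Fraction of pixels with max(d / d*, d* / d) < thresh."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))

# Synthetic example: a prediction that is correct up to a global scale of 0.5.
rng = np.random.default_rng(0)
gt_depth = rng.uniform(1.0, 10.0, size=(256, 320))
pred_depth = 0.5 * gt_depth * rng.uniform(0.95, 1.05, size=gt_depth.shape)

scaled = median_scale(pred_depth, gt_depth)
acc1 = threshold_accuracy(scaled, gt_depth, 1.25)       # delta < 1.25^1
acc2 = threshold_accuracy(scaled, gt_depth, 1.25 ** 2)  # delta < 1.25^2
```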