
color image $C$ and per-view depth value $D$ of a camera ray $\mathbf{r}(t)$ can be computed using volume rendering as:
$$
C(\mathbf{r}(t)) = \sum_{i=1}^{N} w_i \cdot \mathbf{c}(\mathbf{x}_i, \mathbf{d}), \qquad
D(\mathbf{r}(t)) = \sum_{i=1}^{N} w_i \cdot t_i, \tag{1}
$$
where $w_i = \left(1 - \exp\!\left(-\sigma(\mathbf{x}_i)\,\Delta t_i\right)\right) \exp\!\left(-\sum_{j=1}^{i-1} \sigma(\mathbf{x}_j)\,\Delta t_j\right)$ and $\Delta t_i = t_{i+1} - t_i$.
               In this way, the 3D structure of the surgical scene can be encoded as a continuous implicit function, which
               enables memory-efficient geometric representation with infinite resolution.
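To make the discretization in Equation (1) concrete, the following sketch shows how the weights $w_i$ and the per-ray color and depth could be accumulated. It is a minimal PyTorch illustration with assumed tensor shapes, not the authors' implementation.

```python
import torch

def render_color_and_depth(sigma, rgb, t_vals):
    """Discrete volume rendering along a batch of rays (Eq. 1).

    sigma:  (R, N)    densities sigma(x_i) at the N samples of each ray
    rgb:    (R, N, 3) view-dependent colors c(x_i, d)
    t_vals: (R, N)    sample distances t_i along each ray
    Returns the per-ray color C(r), shape (R, 3), and depth D(r), shape (R,).
    """
    # Delta t_i = t_{i+1} - t_i; the last interval is padded with a large value.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # Per-interval opacity and transmittance T_i = exp(-sum_{j<i} sigma(x_j) Delta t_j).
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.exp(-sigma * deltas), dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = alpha * trans                          # w_i in Eq. (1)
    color = (weights[..., None] * rgb).sum(dim=1)    # C(r) = sum_i w_i * c(x_i, d)
    depth = (weights * t_vals).sum(dim=1)            # D(r) = sum_i w_i * t_i
    return color, depth
```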


               2.3 Coarse depth adaptation and scale recovery with kinematics
We use SfM to reconstruct the surgical scene as a set of 3D points $X$ and camera poses $T = \{T_i \in SE(3) \mid i = 1, \cdots, N\}$ for the input images extracted from the unlabeled endoscopic video. To eliminate extreme outliers in the sparse reconstruction, point cloud filtering is utilized. For each endoscopic image pair, the rigid transformation matrix ${}^{c}T_{i}^{i+1}$ from image $i$ to $i+1$ can be computed from the camera poses $T$, where the left superscript $\{c\}$ denotes a pose described in the image coordinate frame. As the endoscope is attached to a robot, the camera poses under the robot coordinate system can be calculated from the kinematics information and serve as a reference for recovering the absolute scale. Therefore, the relative pose ${}^{b}T_{i}^{i+1}$ under the robot base frame $\{b\}$ is computed. The absolute scale $s$ between the reconstructed structure and the real world can be estimated by:
$$
s = \exp\!\left(\frac{1}{N-1} \sum_{i=0}^{N-2} \log_{10} \frac{\left\|{}^{b}\mathbf{t}_{i}^{i+1}\right\|_2}{\left\|{}^{c}\mathbf{t}_{i}^{i+1}\right\|_2}\right), \tag{2}
$$
where $\mathbf{t}_{i}^{i+1}$ is the translation vector of the relative camera pose $T_{i}^{i+1}$. Although we can compute a scale value for each frame pair, the noise in the kinematics data and the instability of the poses in $T$ introduce severe noise into each estimate. To filter the scale, we employ a logarithmic moving average with a multiplicative error model. Based on the computed scale factor, we adjust the sparse 3D structure and camera poses to match the real-world values. Afterward, the scaled 3D point cloud $X$ is projected onto each image plane with the corresponding scaled camera pose $T'$. The re-projected depth values are concatenated as the sparse depth supervision $D'$, where regions with no projected points are set to zero.
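As an illustration of this step, the sketch below estimates the sequence-level scale of Equation (2) from the kinematic and SfM translation norms and smooths per-frame scales with a logarithmic moving average. The smoothing factor and the exact averaging rule are assumptions for illustration, not values reported in the paper.

```python
import numpy as np

def absolute_scale(t_kin, t_sfm):
    """Estimate the absolute scale from relative translations (Eq. 2).

    t_kin: (N-1, 3) relative translations from robot kinematics (metric units)
    t_sfm: (N-1, 3) relative translations recovered by SfM (arbitrary scale)
    """
    ratios = np.linalg.norm(t_kin, axis=1) / np.linalg.norm(t_sfm, axis=1)
    # The exp / log10 combination follows Eq. (2) as printed.
    return np.exp(np.mean(np.log10(ratios)))

def log_moving_average(per_frame_scales, beta=0.9):
    """Filter noisy per-frame scales in log space (multiplicative error model).

    beta is an assumed smoothing factor, not a value from the paper.
    """
    s = np.log(per_frame_scales[0])
    for value in per_frame_scales[1:]:
        s = beta * s + (1.0 - beta) * np.log(value)
    return np.exp(s)
```

The resulting scale would then multiply the SfM points and camera translations before they are re-projected to form the sparse supervision $D'$.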
To obtain scene-specific coarse depth from the current endoscopic data, we propose adapting a depth estimation network. This network is fine-tuned using the sparse depth supervision $D'$. However, due to the scale ambiguity in the predicted depth map, we utilize the scale-invariant log loss [22] for training the depth network. The scale-invariant log loss is defined as:

$$
\mathcal{L} = \sqrt{\frac{1}{K} \sum_{i=1}^{K} d_i^{2} - \frac{\lambda}{K^{2}} \left(\sum_{i=1}^{K} d_i\right)^{2}}, \tag{3}
$$
where $d_i = \log \hat{D}_i - \log D'_i$, $\hat{D}_i$ represents the coarse depth value predicted by the proposed depth network, and $D'_i$ is the value of the corresponding sparse depth supervision $D'$. $K$ denotes the number of pixels with valid supervision values, and $\lambda$ is a weighting factor.
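A compact sketch of Equation (3), masked to the $K$ pixels with valid sparse supervision, is given below; the value of the weighting factor is an assumed placeholder rather than the paper's setting.

```python
import torch

def scale_invariant_log_loss(pred_depth, sparse_depth, lam=0.85):
    """Scale-invariant log loss (Eq. 3).

    pred_depth:   coarse depth map predicted by the network, shape (H, W)
    sparse_depth: sparse supervision D', zero where no 3D point projects
    lam:          weighting factor lambda (0.85 is an assumed placeholder)
    """
    mask = sparse_depth > 0                      # the K pixels with valid supervision
    d = torch.log(pred_depth[mask]) - torch.log(sparse_depth[mask])
    k = d.numel()
    return torch.sqrt((d ** 2).sum() / k - lam * (d.sum() ** 2) / k ** 2)
```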
               2.4 NeRF-based optimization for absolute depth
According to Equation (2), we can determine the absolute scale between the reconstruction and real-world values using the robot kinematics information. To incorporate this calculated scale into dense monocular reconstruction, we propose guiding the NeRF sampling process with our coarse depth estimation and scale information. First, we align the scale of the coarse depth map $\hat{D}$ based on the depth supervision $D'$. Moreover, we compute the confidence map of $\hat{D}$ by a geometric consistency check. The depth $\hat{D}$ is first projected onto