
all other views using the following equations:

$$ p_{i \rightarrow j},\; D_{i \rightarrow j} \sim K \cdot T_{i \rightarrow j} \cdot D_i(p_i) \cdot K^{-1} \cdot h(p_i) \quad (4) $$

$$ D' = D_j\left(p_{i \rightarrow j}\right) \quad (5) $$

where $K$ represents the endoscope intrinsic matrix, and $p$ denotes a pixel in the image. Subsequently, we calculate the depth reprojection error between $D'$ and $D_{i \rightarrow j}$. The confidence map $E_i$ for each view is defined as the average value of the top-$k$ minimum cross-view depth reprojection errors.
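To make the reprojection concrete, the sketch below is a minimal NumPy illustration (not the authors' released code) of Equations (4) and (5): it warps the coarse depth of view $i$ into view $j$, measures the cross-view depth reprojection error, and averages the $k$ smallest errors per pixel into a confidence map. All function and variable names here are our own assumptions.

```python
import numpy as np

def reproject_depth(depth_i, K, T_ij):
    """Eq. 4: back-project each pixel with D_i(p) * K^{-1} * h(p), transform
    by the relative pose T_ij, and re-project with K. Returns the reprojected
    pixel coordinates p_{i->j} and the projected depth D_{i->j}."""
    H, W = depth_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix_h = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # h(p), 3xN
    cam_i = np.linalg.inv(K) @ pix_h * depth_i.reshape(1, -1)             # 3-D points in view i
    cam_j = T_ij[:3, :3] @ cam_i + T_ij[:3, 3:4]                          # points expressed in view j
    proj = K @ cam_j
    p_ij = (proj[:2] / proj[2:3]).T.reshape(H, W, 2)                      # p_{i->j}
    d_ij = cam_j[2].reshape(H, W)                                         # D_{i->j}
    return p_ij, d_ij

def reprojection_error(depth_j, p_ij, d_ij):
    """Eq. 5 plus the error term: sample D_j at p_{i->j} (nearest neighbour
    here for brevity) and compare against the projected depth D_{i->j}."""
    H, W = depth_j.shape
    u = np.clip(np.round(p_ij[..., 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(p_ij[..., 1]).astype(int), 0, H - 1)
    return np.abs(depth_j[v, u] - d_ij)

def confidence_map(errors, k):
    """Average the k smallest cross-view reprojection errors per pixel,
    given a list of per-view error maps."""
    errs = np.sort(np.stack(errors, axis=0), axis=0)[:k]
    return errs.mean(axis=0)
```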
Next, during ray marching, we sample points using a Gaussian distribution guided by the prior from the scaled coarse depth. Assuming the coarse depth value for a pixel $p$ to be $z_p = D_i(p)$, we sample the candidates using the distribution $\mathcal{N}(z_p, \sigma_p^2)$, where $\sigma_p = z_p \cdot E_i(p)$. This sampling method ensures that the points are concentrated around tissue surfaces.
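A minimal PyTorch sketch of this depth-guided sampling is given below; it is our own illustration under the stated distribution, and the function name, the `n_samples` parameter, and the `near` clamp are assumptions rather than details from the paper.

```python
import torch

def sample_depth_candidates(coarse_depth, confidence, n_samples, near=1e-3):
    """Draw per-ray sample depths t ~ N(z_p, sigma_p^2) with sigma_p = z_p * E(p),
    so that samples cluster around the coarse tissue surface.
    coarse_depth, confidence: (N_rays,) tensors; returns (N_rays, n_samples)."""
    z_p = coarse_depth.unsqueeze(-1)                       # mean: scaled coarse depth
    sigma_p = (coarse_depth * confidence).unsqueeze(-1)    # std: depth-scaled confidence
    t = z_p + sigma_p * torch.randn(coarse_depth.shape[0], n_samples,
                                    device=coarse_depth.device)
    t = t.clamp_min(near)                                  # keep samples in front of the camera
    return t.sort(dim=-1).values                           # sorted for volume rendering
```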
To estimate the absolute depth of endoscopic frames, we can optimize the network parameters $\Theta$ by supervising the rendered color images. To be more specific, the loss function utilized to train the network is defined as follows:

$$ \mathcal{L}(r(\Theta)) = \lVert C(r(\Theta)) - I_i(p) \rVert_2^2 \quad (6) $$

where $p$ represents the location of the pixel that $r(\Theta)$ shoots toward, and $I_i$ corresponds to the input endoscopic image.
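Equation (6) is the standard NeRF-style photometric objective; a minimal PyTorch version (our own sketch, with assumed tensor shapes, not the authors' code) is:

```python
import torch

def rendering_loss(rendered_rgb: torch.Tensor, target_rgb: torch.Tensor) -> torch.Tensor:
    """Eq. 6: squared L2 distance between the colour rendered for each ray,
    C(r(Theta)), and the input endoscopic image I_i at the pixel the ray
    passes through. Both tensors have shape (N_rays, 3)."""
    return ((rendered_rgb - target_rgb) ** 2).sum(dim=-1).mean()
```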

               2.5 Volumetric reconstruction on fine depth
               To further refine depth accuracy, we use the view synthesis results of NeRF to calculate the per-pixel error for
               the predicted structure. If the rendering at a specific pixel does not match the input endoscopic image well, a
high error is assigned to the depth prediction of that pixel. The error map $R_i(p)$ for the pixel $p$ in the $i$-th view is expressed as:

$$ R_i(p) = \lVert I_i(p) - C(p) \rVert_1 / 255 \quad (7) $$
The error map is then used to guide a filter that refines the estimated depth. We apply an off-the-shelf post-filtering approach [23] to obtain the fine output, which improves the absolute depth estimates, particularly in regions where the renderings are inaccurate.
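The per-pixel error of Equation (7) reduces to a normalised L1 difference between the input frame and the NeRF rendering; a small sketch is shown below, assuming 8-bit RGB inputs and our own naming. The subsequent post-filtering step comes from the off-the-shelf method of [23] and is not reproduced here.

```python
import numpy as np

def render_error_map(input_rgb: np.ndarray, rendered_rgb: np.ndarray) -> np.ndarray:
    """Eq. 7: per-pixel L1 error between the endoscopic frame I_i and the
    NeRF rendering C, normalised by 255 for 8-bit images. Inputs are (H, W, 3)
    uint8 arrays; output is an (H, W) float error map."""
    diff = np.abs(input_rgb.astype(np.float32) - rendered_rgb.astype(np.float32))
    return diff.sum(axis=-1) / 255.0
```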

Afterward, these fine depth maps are fused to create a surface reconstruction. We use TSDF [24] to build a volumetric representation of the tissue surface. Since the predicted depth maps and the endoscope poses are scaled to the real world, all data are made scale-aware and scale-consistent before fusion. The surgical scene is represented by a discrete voxel grid, and for each voxel, a weighted signed distance to the closest surface is recorded. The TSDF is updated in a straightforward manner, using sequential averaging for each voxel and the predicted depth for each pixel in every image. Finally, the whole 3D structure is reconstructed by the marching cubes method [25] from the volumetric representation.
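As an illustration of this fusion step, the sketch below integrates scale-consistent depth maps and poses with Open3D's TSDF volume and meshes the result with marching cubes. The paper cites TSDF [24] and marching cubes [25] but does not prescribe a particular library, so the use of Open3D, the voxel size, and the truncation values here are placeholders of our own choosing.

```python
import numpy as np
import open3d as o3d

def fuse_depth_maps(colors, depths, poses, K, width, height,
                    voxel_size=0.002, sdf_trunc=0.01):
    """Integrate metric depth maps into a TSDF volume (weighted sequential
    averaging per voxel) and extract a surface mesh with marching cubes.
    colors: list of (H, W, 3) uint8 images; depths: list of (H, W) float32
    maps in metres; poses: list of 4x4 camera-to-world matrices; K: 3x3 intrinsics."""
    intrinsic = o3d.camera.PinholeCameraIntrinsic(
        width, height, K[0, 0], K[1, 1], K[0, 2], K[1, 2])
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_size, sdf_trunc=sdf_trunc,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
    for color, depth, pose in zip(colors, depths, poses):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(color),
            o3d.geometry.Image(depth),
            depth_scale=1.0, depth_trunc=0.3, convert_rgb_to_intensity=False)
        extrinsic = np.linalg.inv(pose)            # world-to-camera matrix expected by Open3D
        volume.integrate(rgbd, intrinsic, extrinsic)
    return volume.extract_triangle_mesh()          # marching cubes over the voxel grid
```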



               3. RESULTS
               3.1 Dataset and implementation details
               We evaluate our scale-aware monocular reconstruction pipeline on the publicly available SCARED dataset [26] .
               This dataset consists of seven training datasets and two test datasets captured by a da Vinci Xi surgical robot.
               Each dataset is collected from a porcine model and contains four or five keyframes. Each keyframe includes a
               video with kinematic information about the endoscope. From each dataset, we randomly select one keyframe
               and extract a set of 40 to 80 images that cover the entire surgical scene. During the data collection process, the
               robot manipulates an endoscope to observe the interior scenes of the porcine abdominal anatomy. A projector