all other views using the following equations:
$$p_{i \to j},\ D_{i \to j} \sim K \cdot T_{i \to j} \cdot D_i(p) \cdot K^{-1} \cdot h(p) \quad (4)$$

$$D' = D_j(p_{i \to j}) \quad (5)$$
where $K$ represents the endoscope intrinsic matrix, $p$ denotes a pixel in the image, and $h(p)$ is its homogeneous coordinate. Subsequently, we calculate the depth reprojection error between $D'$ and $D_{i \to j}$. The confidence map $E$ for each view is defined as the average of the top-$k$ smallest cross-view depth reprojection errors.
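As an illustration of this cross-view consistency check, the NumPy sketch below warps the coarse depth of a source view into a target view following Eqs. (4) and (5) and averages the top-$k$ smallest reprojection errors into a confidence map. The function names, the nearest-neighbor rounding of warped pixels, and the default `top_k` value are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def reproject_depth(depth_i, K, T_ij):
    """Warp the scaled coarse depth of view i into view j (Eqs. 4-5).

    depth_i : (H, W) coarse depth D_i
    K       : (3, 3) endoscope intrinsic matrix
    T_ij    : (4, 4) relative endoscope pose from view i to view j
    Returns warped pixel coordinates p_{i->j} (2, N) and reprojected depths D_{i->j} (N,).
    """
    H, W = depth_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    h_p = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # homogeneous pixels h(p)
    pts_i = (np.linalg.inv(K) @ h_p) * depth_i.reshape(1, -1)           # back-project with D_i(p)
    pts_j = (T_ij @ np.vstack([pts_i, np.ones((1, pts_i.shape[1]))]))[:3]
    proj = K @ pts_j                                                     # project into view j
    d_ij = proj[2]
    p_ij = np.round(proj[:2] / np.clip(d_ij, 1e-6, None)).astype(int)   # nearest-neighbor p_{i->j}
    return p_ij, d_ij

def confidence_map(depth_j, warped_views, top_k=2):
    """E(p) for view j: average of the top_k smallest cross-view reprojection errors."""
    H, W = depth_j.shape
    errors = []
    for p_ij, d_ij in warped_views:                                      # one entry per source view i
        err = np.full((H, W), np.inf)
        x, y = p_ij
        ok = (x >= 0) & (x < W) & (y >= 0) & (y < H)
        err[y[ok], x[ok]] = np.abs(depth_j[y[ok], x[ok]] - d_ij[ok])     # |D' - D_{i->j}|
        errors.append(err)
    errors = np.sort(np.stack(errors, axis=-1), axis=-1)[..., :top_k]
    errors = np.where(np.isinf(errors), np.nan, errors)                  # pixels hit by no source view
    return np.nanmean(errors, axis=-1)
```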
Next, during ray marching, we sample points using a Gaussian distribution guided by the prior from the scaled
coarse depth. Assuming the coarse depth value for a pixel $p$ to be $z_p = D(p)$, we sample the candidates using the distribution $\mathcal{N}(z_p, \sigma_p^2)$, where $\sigma_p = z_p \cdot E(p)$. This sampling method ensures that the points are concentrated around tissue surfaces.
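For concreteness, a short PyTorch sketch of this depth-guided sampling is given below; the number of samples per ray and the lower clamp on $\sigma_p$ are illustrative choices rather than values reported in the paper.

```python
import torch

def sample_depth_guided(z_p, conf, n_samples=32, sigma_min=1e-3):
    """Draw candidate depths along each ray from N(z_p, sigma_p^2) with sigma_p = z_p * E(p).

    z_p  : (N,) scaled coarse depth per ray
    conf : (N,) confidence map values E(p)
    Returns (N, n_samples) candidate depths, sorted along each ray.
    """
    sigma = torch.clamp(z_p * conf, min=sigma_min)                 # sigma_p = z_p * E(p)
    t = z_p[:, None] + sigma[:, None] * torch.randn(z_p.shape[0], n_samples, device=z_p.device)
    t = torch.clamp(t, min=1e-3)                                   # keep samples in front of the camera
    return torch.sort(t, dim=-1).values                            # volume rendering expects ordered depths
```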
To estimate the absolute depth of endoscopic frames, we optimize the network parameters by supervising the rendered colors with the input images. More specifically, the loss function used to train the network is defined as follows:

$$\mathcal{L}(r(p)) = \lVert C(r(p)) - I(p) \rVert_2^2 \quad (6)$$

where $p$ represents the location of the pixel that the ray $r(p)$ shoots toward, and $I$ corresponds to the input endoscopic image.
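A minimal sketch of this per-ray supervision is shown below, assuming a `render_color` callable that volume-renders a batch of rays into colors $C(r(p))$; the interface is hypothetical and only illustrates Eq. (6).

```python
import torch

def photometric_loss(render_color, rays, pixels, image):
    """rays: batch of rays r(p); pixels: (N, 2) integer pixel locations (x, y); image: (H, W, 3) frame I."""
    pred = render_color(rays)                                      # rendered colors C(r(p)), shape (N, 3)
    target = image[pixels[:, 1], pixels[:, 0]]                     # input colors I(p)
    return ((pred - target) ** 2).sum(dim=-1).mean()               # squared L2 norm of Eq. (6), averaged over rays
```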
2.5 Volumetric reconstruction on fine depth
To further refine depth accuracy, we use the view synthesis results of NeRF to calculate the per-pixel error for
the predicted structure. If the rendering at a specific pixel does not match the input endoscopic image well, a
high error is assigned to the depth prediction of that pixel. The error map $R(p)$ for the pixel $p$ in each view is expressed as:
$$R(p) = \lVert I(p) - C(p) \rVert_1 / 255 \quad (7)$$
The error map is then used to refine the estimated depth through filtering. We apply an off-the-shelf post-filtering
approach [23] to obtain the fine output, which enhances absolute depth estimates, particularly in regions where
the renderings are not accurate.
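To make Eq. (7) concrete, the sketch below computes the rendering error map for one view, assuming 8-bit RGB frames; the function name is hypothetical.

```python
import numpy as np

def rendering_error_map(image, rendered):
    """image, rendered: (H, W, 3) uint8 input frame I and NeRF rendering C."""
    diff = np.abs(image.astype(np.float32) - rendered.astype(np.float32))
    return diff.sum(axis=-1) / 255.0                               # per-pixel L1 error of Eq. (7)
```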
Afterward, these fine depth maps are fused to create a surface reconstruction. We use a truncated signed distance function (TSDF) [24] to build a volumetric representation of the tissue surface. Since the predicted depth maps and the endoscope poses are scaled to the real world, all data are scale-aware and scale-consistent before fusion. The surgical scene is represented by a discrete voxel grid, and for each voxel, a weighted signed distance to the closest surface is recorded. The TSDF is updated in a straightforward manner, using sequential averaging for each voxel and the predicted depth for each pixel in every image. Finally, the whole 3D structure is reconstructed by the marching
cubes method [25] from the volumetric representation.
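One way to realize this fusion step is with Open3D's ScalableTSDFVolume, whose mesh extraction applies marching cubes to the voxel grid; the paper does not name a specific library, and the voxel size and truncation distance below are illustrative assumptions.

```python
import numpy as np
import open3d as o3d

def fuse_depth_maps(colors, depths, K, poses, voxel_length=0.004, sdf_trunc=0.02):
    """colors: list of (H, W, 3) uint8 frames; depths: list of (H, W) float32 metric depth maps;
    K: (3, 3) intrinsics; poses: list of (4, 4) camera-to-world matrices at real-world scale."""
    H, W = depths[0].shape
    intrinsic = o3d.camera.PinholeCameraIntrinsic(W, H, K[0, 0], K[1, 1], K[0, 2], K[1, 2])
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_length, sdf_trunc=sdf_trunc,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
    for color, depth, pose in zip(colors, depths, poses):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(color), o3d.geometry.Image(depth),
            depth_scale=1.0, depth_trunc=3.0, convert_rgb_to_intensity=False)
        volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))     # integrate expects world-to-camera extrinsics
    return volume.extract_triangle_mesh()                          # marching cubes over the voxel grid
```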
3. RESULTS
3.1 Dataset and implementation details
We evaluate our scale-aware monocular reconstruction pipeline on the publicly available SCARED dataset [26] .
This dataset consists of seven training datasets and two test datasets captured by a da Vinci Xi surgical robot.
Each dataset is collected from a porcine model and contains four or five keyframes. Each keyframe includes a
video with kinematic information about the endoscope. From each dataset, we randomly select one keyframe
and extract a set of 40 to 80 images that cover the entire surgical scene. During the data collection process, the
robot manipulates an endoscope to observe the interior scenes of the porcine abdominal anatomy. A projector

