
color image $C$ and per-view depth value $D$ of a camera ray $\mathbf{r}(t)$ can be computed using volume rendering as:
$$
C(\mathbf{r}(t)) = \sum_{i=1}^{N} w_i \cdot \mathbf{c}(\mathbf{x}_i, \mathbf{d}), \qquad
D(\mathbf{r}(t)) = \sum_{i=1}^{N} w_i \cdot t_i, \tag{1}
$$
where $w_i = \left(1 - \exp\!\left(-\sigma(\mathbf{x}_i)\,\Delta t_i\right)\right) \exp\!\left(-\sum_{j=1}^{i-1} \sigma(\mathbf{x}_j)\,\Delta t_j\right)$ and $\Delta t_i = t_{i+1} - t_i$.
               In this way, the 3D structure of the surgical scene can be encoded as a continuous implicit function, which
               enables memory-efficient geometric representation with infinite resolution.
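To make the discretization in Equation (1) concrete, the following sketch shows how the weights $w_i$ and the per-ray color and depth could be accumulated. It is a minimal PyTorch illustration with assumed tensor shapes, not the authors' implementation.

```python
import torch

def render_color_and_depth(sigma, rgb, t_vals):
    """Discrete volume rendering along a batch of rays (Eq. 1).

    sigma:  (R, N)    densities sigma(x_i) at the N samples of each ray
    rgb:    (R, N, 3) view-dependent colors c(x_i, d)
    t_vals: (R, N)    sample distances t_i along each ray
    Returns the per-ray color C(r), shape (R, 3), and depth D(r), shape (R,).
    """
    # Delta t_i = t_{i+1} - t_i; the last interval is padded with a large value.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # Per-interval opacity and transmittance T_i = exp(-sum_{j<i} sigma(x_j) Delta t_j).
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.exp(-sigma * deltas), dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = alpha * trans                          # w_i in Eq. (1)
    color = (weights[..., None] * rgb).sum(dim=1)    # C(r) = sum_i w_i * c(x_i, d)
    depth = (weights * t_vals).sum(dim=1)            # D(r) = sum_i w_i * t_i
    return color, depth
```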


               2.3 Coarse depth adaptation and scale recovery with kinematics
We use SfM to reconstruct the surgical scene as a set of 3D points $X$ and camera poses $T = \{T_i \in SE(3) \mid i = 1, \cdots, N\}$ for the input images extracted from the unlabeled endoscopic video. To eliminate extreme outliers in the sparse reconstruction, point cloud filtering is utilized. For each endoscopic image pair, the rigid transformation matrix ${}^{c}T_{i}^{i+1}$ from image $i$ to $i+1$ can be computed from the camera poses $T$, where the left superscript $\{c\}$ denotes a pose described in the image coordinate frame. As the endoscope is attached to a robot, the camera poses under the robot coordinate system can be calculated from the kinematics information and serve as a reference for recovering the absolute scale. Therefore, the relative pose ${}^{b}T_{i}^{i+1}$ under the robot base frame $\{b\}$ is computed. The absolute scale $s$ between the reconstructed structure and the real world can be estimated by:
$$
s = \exp\!\left(\frac{1}{N-1} \sum_{i=0}^{N-2} \log_{10} \frac{\left\|{}^{b}\mathbf{t}_{i}^{i+1}\right\|_2}{\left\|{}^{c}\mathbf{t}_{i}^{i+1}\right\|_2}\right), \tag{2}
$$
where $\mathbf{t}_{i}^{i+1}$ is the translation vector of the relative camera pose $T_{i}^{i+1}$. Although we can compute a scale value for each frame pair, the noise in the kinematics data and the instability of the poses in $T$ introduce severe noise into each estimate. To filter the scale, we employ a logarithmic moving average with a multiplicative error model. Based on the computed scale factor, we adjust the sparse 3D structure and camera poses to match the real-world values. Afterward, the scaled 3D point cloud $X$ is projected onto each image plane with the corresponding scaled camera pose $T'$. The re-projected depth values are concatenated as the sparse depth supervision $D'$, where regions with no projected points are set to zero.
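As an illustration of this step, the sketch below estimates the sequence-level scale of Equation (2) from the kinematic and SfM translation norms and smooths per-frame scales with a logarithmic moving average. The smoothing factor and the exact averaging rule are assumptions for illustration, not values reported in the paper.

```python
import numpy as np

def absolute_scale(t_kin, t_sfm):
    """Estimate the absolute scale from relative translations (Eq. 2).

    t_kin: (N-1, 3) relative translations from robot kinematics (metric units)
    t_sfm: (N-1, 3) relative translations recovered by SfM (arbitrary scale)
    """
    ratios = np.linalg.norm(t_kin, axis=1) / np.linalg.norm(t_sfm, axis=1)
    # The exp / log10 combination follows Eq. (2) as printed.
    return np.exp(np.mean(np.log10(ratios)))

def log_moving_average(per_frame_scales, beta=0.9):
    """Filter noisy per-frame scales in log space (multiplicative error model).

    beta is an assumed smoothing factor, not a value from the paper.
    """
    s = np.log(per_frame_scales[0])
    for value in per_frame_scales[1:]:
        s = beta * s + (1.0 - beta) * np.log(value)
    return np.exp(s)
```

The resulting scale would then multiply the SfM points and camera translations before they are re-projected to form the sparse supervision $D'$.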
To obtain scene-specific coarse depth from the current endoscopic data, we propose adapting a depth estimation network. This network is fine-tuned using the sparse depth supervision $D'$. However, due to the scale ambiguity in the predicted depth map, we utilize the scale-invariant log loss [22] for training the depth network. The scale-invariant log loss is defined as:

$$
\mathcal{L} = \sqrt{\frac{1}{K} \sum_{i=1}^{K} d_i^{2} - \frac{\lambda}{K^{2}} \left(\sum_{i=1}^{K} d_i\right)^{2}}, \tag{3}
$$
where $d_i = \log \hat{D}_i - \log D'_i$, $\hat{D}_i$ represents the coarse depth value predicted by the proposed depth network, and $D'_i$ is the value of the corresponding sparse depth supervision $D'$. $K$ denotes the number of pixels with valid supervision values, and $\lambda$ is a weighting factor.
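A compact sketch of Equation (3), masked to the $K$ pixels with valid sparse supervision, is given below; the value of the weighting factor is an assumed placeholder rather than the paper's setting.

```python
import torch

def scale_invariant_log_loss(pred_depth, sparse_depth, lam=0.85):
    """Scale-invariant log loss (Eq. 3).

    pred_depth:   coarse depth map predicted by the network, shape (H, W)
    sparse_depth: sparse supervision D', zero where no 3D point projects
    lam:          weighting factor lambda (0.85 is an assumed placeholder)
    """
    mask = sparse_depth > 0                      # the K pixels with valid supervision
    d = torch.log(pred_depth[mask]) - torch.log(sparse_depth[mask])
    k = d.numel()
    return torch.sqrt((d ** 2).sum() / k - lam * (d.sum() ** 2) / k ** 2)
```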
               2.4 NeRF-based optimization for absolute depth
According to Equation (2), we can determine the absolute scale between the reconstruction and real-world values using the robot kinematics information. To incorporate this calculated scale into dense monocular reconstruction, we propose guiding the NeRF sampling process with our coarse depth estimation and scale information. First, we align the scale of the coarse depth map $\hat{D}$ based on the depth supervision $D'$. Moreover, we compute the confidence map of $\hat{D}$ by a geometric consistency check. The depth $\hat{D}$ is first projected onto