Page 188 Wei et al. Art Int Surg 2024;4:187-98 | http://dx.doi.org/10.20517/ais.2024.12
Conclusion: Combining multi-modal image data with NeRF-based optimization is a promising approach to achieving scale-aware 3D reconstruction of monocular endoscopic scenes.
Keywords: Scale-aware reconstruction, NeRF-based optimization, multi-modal data learning, surgical navigation,
robotic surgery
1. INTRODUCTION
Reconstructing scale-aware 3D structures from monocular endoscopes is a fundamental task for some emerging surgical robotic systems, such as flexible robots [1–3]. It is also a prerequisite for applications such as multi-modal image registration and automatic navigation based on real-scale 3D modeling of human anatomies [4–6].
However, relying solely on monocular images is insufficient to accurately recover 3D structures with absolute
scale in the surgical scene. Several methods for scene reconstruction from monocular endoscopes have been
explored. Traditional multi-view stereo methods [7] can simultaneously recover 3D point clouds and camera
poses in scenes with rich features. However, these methods cannot directly reconstruct structures with real
scale, requiring the global scale to be estimated manually and then refined with the iterative closest point (ICP) registration algorithm [8]. Recent deep learning-based methods [8–10] have exploited large numbers of surgical images, under certain requirements such as static tissue surfaces or ground-truth depth labels, to train convolutional neural networks (CNNs) for relative depth estimation and subsequent reconstruction. However, in our experiments, these methods predicted only relative depth despite the large training data and produced 3D reconstructions without an accurate scale.
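The global-scale recovery step described above can be sketched as a least-squares alignment between a scale-ambiguous depth map (e.g., from multi-view stereo or a monocular CNN) and a metric reference. This is a minimal illustrative sketch, not the cited methods' implementation; the function name and synthetic data are our own, and in practice the recovered scale would be further refined with ICP registration.

```python
import numpy as np

def align_global_scale(pred_depth, ref_depth):
    """Closed-form least-squares global scale s minimizing ||s * pred - ref||^2
    over valid (positive-depth) pixels. `pred_depth` is scale-ambiguous;
    `ref_depth` carries the absolute (metric) scale."""
    mask = (pred_depth > 0) & (ref_depth > 0)
    p, r = pred_depth[mask], ref_depth[mask]
    return float(np.dot(p, r) / np.dot(p, p))

# Synthetic check: a depth prediction that is off by a factor of 2.5.
rng = np.random.default_rng(0)
metric = rng.uniform(10.0, 50.0, size=(64, 64))  # hypothetical metric depth (mm)
relative = metric / 2.5                          # scale-ambiguous prediction
s = align_global_scale(relative, metric)
print(round(s, 3))  # → 2.5
```

In a real pipeline no dense metric reference exists; the scale must instead come from sparse known-scale cues (here, robot kinematics), which is precisely the gap the proposed method targets.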
Surgical robotic systems provide richer information beyond images, such as robot kinematics, which describes how robotic instruments are mechanically controlled. This kinematics information can enhance perception in a multi-modal learning style [11]. Despite much work on recognition-related tasks using robotic information [12–14], joint modeling of kinematics and visual data for monocular 3D reconstruction has rarely been studied to date, due to several challenges. First, acquiring large surgical datasets with static scenes for learning-based methods is difficult. Second, generating accurate ground-truth depth labels for real endoscopic images is hard. Third, robot kinematics and endoscopic videos constitute multi-modal data, and how to efficiently integrate the kinematics into the images for 3D reconstruction remains underexplored.
Neural radiance fields (NeRF) have emerged as a promising technology [15,16] for high-quality novel view synthesis and 3D reconstruction. These methods use neural implicit fields to represent continuous scenes. Several NeRF variants [17,18] have incorporated sparse 3D points from structure from motion (SfM) techniques to guide ray termination and optimize the neural implicit field for view synthesis. However, these approaches have primarily focused on relative depth estimation in natural scenes. For urban environments, urban radiance fields (URF) [19] apply NeRF-based view synthesis and visual reconstruction, leveraging sparse multi-view images together with LiDAR data. In the field of medicine, a recent work, EndoNeRF [20], presented a pipeline for single-view 3D reconstruction of dynamic surgical scenes, specifically addressing the challenges of reconstructing deformable tissues.
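The idea of using sparse SfM points to guide ray termination can be made concrete with a depth-supervised loss: alongside the standard photometric term, rays that pass through a triangulated 3D point receive a penalty on their expected termination depth. The sketch below is a generic illustration of this family of losses (in the spirit of the depth-supervised NeRF variants cited above), not the exact objective of any one method; the function names and the weight λ are our own.

```python
import numpy as np

def expected_depth(weights, t_vals):
    """Expected ray-termination depth under NeRF volume rendering:
    d = sum_i w_i * t_i, with w_i the compositing weights of samples t_i."""
    return (weights * t_vals).sum(axis=-1)

def depth_supervised_loss(rgb_pred, rgb_gt, weights, t_vals, sparse_depth, lam=0.1):
    """Photometric MSE plus a sparse depth term on rays that hit an SfM point.
    `sparse_depth` holds NaN where no 3D point projects onto the ray."""
    l_rgb = np.mean((rgb_pred - rgb_gt) ** 2)
    d_pred = expected_depth(weights, t_vals)
    has_depth = ~np.isnan(sparse_depth)
    l_depth = (np.mean((d_pred[has_depth] - sparse_depth[has_depth]) ** 2)
               if has_depth.any() else 0.0)
    return l_rgb + lam * l_depth

# One ray with two samples: weights (0.5, 0.5) at depths 1 and 3.
w = np.array([[0.5, 0.5]])
t = np.array([[1.0, 3.0]])
print(expected_depth(w, t))  # → [2.]
```

Because only a sparse subset of rays carries a depth constraint, this kind of supervision shapes the implicit field's geometry without requiring dense ground-truth depth, which is exactly what is unavailable in endoscopic scenes.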
In this paper, we propose a novel approach, KV-EndoNeRF, for reconstructing surgical scenes with an accurate
scale using kinematics and visual data. Our contributions can be summarized as follows: Firstly, we introduce
a NeRF-based pipeline specifically designed for scale-aware reconstruction from multi-modal data, addressing
the challenging problem of reconstructing 3D scenes with scale from a monocular endoscope. Secondly, we
incorporate scale information extracted from robot kinematics and coarse depth information learned from
SfM into the NeRF optimization process, improving the accuracy of the reconstruction. Finally, we evaluate

