


Conclusion: Combining multi-modal image data with NeRF-based optimization is a promising approach to achieving scale-aware 3D reconstruction of monocular endoscopic scenes.


               Keywords: Scale-aware reconstruction, NeRF-based optimization, multi-modal data learning, surgical navigation,
               robotic surgery





               1. INTRODUCTION
               Reconstructing scale-aware 3D structures from monocular endoscopes is a fundamental task for some emerg-
               ing surgical robotic systems, such as flexible robots [1–3] . It is also a prerequisite for applications such as multi-
               modal image registration and automatic navigation based on real-scale 3D modeling of human anatomies [4–6] .
               However, relying solely on monocular images is insufficient to accurately recover 3D structures with absolute
               scale in the surgical scene. Several methods for scene reconstruction from monocular endoscopes have been
               explored. Traditional multi-view stereo methods [7]  can simultaneously recover 3D point clouds and camera
               poses in scenes with rich features. However, these methods cannot directly reconstruct structures with real
scale, requiring manual estimation of the global scale followed by refinement using the iterative closest point (ICP) registration algorithm [8]. Recent deep learning-based methods [8–10] have exploited large numbers of
               surgical images with certain requirements, such as static tissue surfaces or ground truth depth labels, to train
convolutional neural networks (CNNs) for relative depth estimation and subsequent reconstruction. However, in our experiments, these methods predicted only relative depth despite large amounts of training data, yielding 3D reconstructions without an accurate scale.
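
This scale ambiguity is inherent to monocular geometry: scaling the scene and the camera translations by the same factor leaves every image unchanged. As a minimal illustration of the classical remedy described above (not the authors' implementation), the Python sketch below estimates a similarity transform between a scale-ambiguous point cloud and a few metric reference points using the closed-form Umeyama method; the arrays `src` and `dst` and the function name are hypothetical, and in the traditional pipeline ICP would then iteratively refine this initial alignment.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Closed-form similarity transform (scale s, rotation R, translation t)
    mapping src points onto dst points: dst ~ s * R @ src + t.
    src, dst: (N, 3) arrays of corresponding points (hypothetical inputs)."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Reflection correction keeps R a proper rotation (det = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src   # global metric scale
    t = mu_dst - s * R @ mu_src
    return s, R, t
```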


               Surgical robotic systems provide richer information beyond images, such as robot kinematics, which describes
how robotic instruments are mechanically controlled. This kinematic information can enhance perception in a multi-modal learning setting [11]. Despite much work on recognition-related tasks using robotic infor-
               mation [12–14] , joint modeling of kinematics and visual data for monocular 3D reconstruction has been rarely
               studied to date due to several challenges. First, acquiring large surgical datasets with static scenes for learning-
               based methods is difficult. Second, generating accurate ground truth depth labels of real endoscopic images is
               hard. Third, for 3D reconstruction, robot kinematics and endoscopic videos represent multi-modal data, and
how to efficiently fuse kinematics data with image data remains underexplored.
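
One simple way kinematics can supply metric scale, sketched below as an assumption rather than as any published method: forward kinematics yields endoscope poses with metric translations, so the ratio of kinematic to SfM camera-motion magnitudes gives a global scale estimate for the visual reconstruction. The function name, inputs, and the time synchronization of the two streams are hypothetical.

```python
import numpy as np

def scale_from_kinematics(sfm_positions, kin_positions):
    """Estimate the global metric scale of an SfM camera trajectory.

    sfm_positions: (N, 3) camera centers from SfM (arbitrary scale).
    kin_positions: (N, 3) endoscope positions from forward kinematics
                   (meters), time-synchronized with the SfM frames.
    """
    # Per-step camera displacements in each modality
    d_sfm = np.linalg.norm(np.diff(sfm_positions, axis=0), axis=1)
    d_kin = np.linalg.norm(np.diff(kin_positions, axis=0), axis=1)
    # Ignore near-static steps where the ratio is ill-conditioned
    valid = d_sfm > 1e-6
    # Robust global scale: median of per-step metric/SfM ratios
    return np.median(d_kin[valid] / d_sfm[valid])
```

Multiplying SfM translations and depths by the returned factor converts the reconstruction to metric units, which can then serve as coarse supervision.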


Neural radiance fields (NeRF) have emerged as a promising technology [15,16] for high-quality novel view synthesis
               and 3D reconstruction. These methods utilize neural implicit fields to represent continuous scenes. Several
               variants of NeRF [17,18]  have incorporated sparse 3D points from structure from motion (SfM) techniques to
               guide ray termination and optimize the neural implicit field for view synthesis. However, these approaches
               have primarily focused on relative depth estimation in natural scenes. In the context of urban environments,
               urban radiance fields (URF) [19]  have been introduced to apply NeRF-based view synthesis and visual recon-
               struction. URF leverages sparse multi-view images along with LiDAR data to reconstruct urban scenes. In the
               field of medicine, a recent work called EndoNeRF [20]  has presented a pipeline for achieving single-view 3D
               reconstruction of dynamic surgical scenes. This methodology specifically addresses the challenges of recon-
               structing surgical scenes that involve deformable tissues.
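
To make the depth-guided NeRF idea concrete, the sketch below volume-renders color and expected termination depth along one ray and adds a penalty pulling that depth toward a sparse SfM depth, in the spirit of the depth-guided variants cited above [17,18]. This is an illustrative simplification rather than the implementation of any cited method; tensor names and the weight `lam` are assumptions.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Volume-render one ray.

    sigmas: (S,) densities at the S samples along the ray.
    colors: (S, 3) RGB values predicted at those samples.
    t_vals: (S,) sample distances from the camera origin."""
    deltas = np.append(np.diff(t_vals), 1e10)               # inter-sample gaps
    alphas = 1.0 - np.exp(-sigmas * deltas)                 # per-sample opacity
    trans = np.cumprod(np.append(1.0, 1.0 - alphas))[:-1]   # transmittance T_i
    weights = trans * alphas                                # termination weights
    rgb = (weights[:, None] * colors).sum(0)                # rendered color
    depth = (weights * t_vals).sum()                        # expected depth E[t]
    return rgb, depth, weights

def depth_guided_loss(rgb, gt_rgb, depth, sfm_depth, lam=0.1):
    """Photometric loss plus a term pulling rendered depth toward a sparse
    SfM depth (applied only where SfM provides a point; lam is assumed)."""
    loss = ((rgb - gt_rgb) ** 2).sum()
    if sfm_depth is not None:
        loss += lam * (depth - sfm_depth) ** 2
    return loss
```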

               In this paper, we propose a novel approach, KV-EndoNeRF, for reconstructing surgical scenes with an accurate
               scale using kinematics and visual data. Our contributions can be summarized as follows: Firstly, we introduce
               a NeRF-based pipeline specifically designed for scale-aware reconstruction from multi-modal data, addressing
               the challenging problem of reconstructing 3D scenes with scale from a monocular endoscope. Secondly, we
               incorporate scale information extracted from robot kinematics and coarse depth information learned from
               SfM into the NeRF optimization process, improving the accuracy of the reconstruction. Finally, we evaluate