Wei et al. Art Int Surg 2024;4:187-98 | http://dx.doi.org/10.20517/ais.2024.12
3D reconstruction and the real world. Given the widespread adoption of robotic surgery, it is imperative to
integrate robotic kinematics as a multi-modal data source in the visual reconstruction process.
In Ear-Nose-Throat (ENT) surgery [6] or colonoscopy [32], surgeons manipulate flexible endoscopes or instruments to observe anatomies or perform specific operations. Considering the narrow space of the surgical site,
it is crucial for the surgeon or the robot to have an accurate understanding of the 3D structures with real-
scale representation of the environment. Therefore, our proposed method can be applied to ENT surgery and
colonoscopy. When a limited number of monocular images are obtained from the endoscope, the NeRF-based
method can reconstruct the 3D geometry of the tissue surface. For the kinematics data, an external tracking
system, such as an electromagnetic (EM) tracker or fiber Bragg grating (FBG) sensors, can be embedded into the surgical robot. In this case, our
proposed 3D reconstruction method seamlessly integrates into current surgical robotic systems.
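To illustrate how kinematics can supply the metric scale that a monocular pipeline lacks, the sketch below fits a single global scale factor by least squares, aligning per-frame relative translation magnitudes from robot kinematics (metric) with the up-to-scale translations recovered by SfM. This is a simplified illustration of the general idea, not the exact estimator used in our pipeline; the frame pairing is assumed to be given.

```python
import math

def estimate_scale(kin_translations, sfm_translations):
    """Least-squares global scale factor s minimizing
    sum over frames of (||t_kin|| - s * ||t_sfm||)^2, whose closed form is
    s = sum(||t_kin|| * ||t_sfm||) / sum(||t_sfm||^2)."""
    num, den = 0.0, 0.0
    for t_kin, t_sfm in zip(kin_translations, sfm_translations):
        n_kin = math.sqrt(sum(c * c for c in t_kin))  # metric magnitude
        n_sfm = math.sqrt(sum(c * c for c in t_sfm))  # up-to-scale magnitude
        num += n_kin * n_sfm
        den += n_sfm * n_sfm
    return num / den

# Toy example: the SfM reconstruction is a consistent 0.5x shrink
# of the metric kinematics trajectory, so the recovered scale is 2.
kin = [(2.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
sfm = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(estimate_scale(kin, sfm))  # → 2.0
```

Once such a factor is known, every up-to-scale depth or translation from the monocular side can be multiplied by it to obtain metric quantities.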
While some existing methods employ external sensors, such as stereo cameras [33,34], to recover real-scale 3D structures, their practical implementation is hindered by their high cost. Additionally, in certain scenarios like
ENT surgery and colonoscopy, the limited operating space poses challenges for using stereo cameras. Alternative approaches involve the use of optical tracking [35] or electromagnetic systems [36] to register the endoscope
with CT/MRI data. However, these devices are typically treated as independent sources of information for
multi-modal data registration. In contrast, our method integrates robotic information into a comprehensive
framework, enabling the reconstruction of scale-aware structures from monocular endoscopes. Moreover,
compared to learning-based monocular reconstruction approaches [37], our proposed NeRF-based method
does not require large amounts of domain-specific training data and can render novel endoscopic views for
surgeons to observe the surgical scenarios. Additionally, while other SLAM-based reconstruction methods [38]
can only recover sparse 3D point clouds without accurate scaling, our framework can obtain dense 3D structures with an absolute scale to represent tissue surfaces.
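To make the last point concrete, once per-pixel absolute depth is available, a dense metric surface can be obtained by back-projecting every valid pixel through the pinhole camera model. The sketch below is a minimal illustration of that step; the intrinsics (fx, fy, cx, cy) and the scale factor are placeholder values, not parameters from our experiments.

```python
def backproject_depth(depth, fx, fy, cx, cy, scale):
    """Back-project an H x W depth map (nested lists), rescaled to
    metric units by the kinematics-derived factor `scale`, into a
    dense list of 3D points in the camera frame (pinhole model)."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            z = d * scale
            if z <= 0:  # skip invalid (zero/negative) depth readings
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Toy 2x2 depth map, unit focal lengths, principal point at pixel (0, 0),
# and a recovered scale of 2: every pixel becomes a metric 3D point.
pts = backproject_depth([[1.0, 1.0], [1.0, 1.0]],
                        fx=1.0, fy=1.0, cx=0.0, cy=0.0, scale=2.0)
print(len(pts), pts[0])  # → 4 (0.0, 0.0, 2.0)
```

Fusing such per-frame point sets across registered camera poses is what yields the dense, scale-aware tissue surface, in contrast to the sparse, scale-ambiguous landmarks of feature-based SLAM.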
However, our method does have some limitations that should be addressed in future work. Firstly, the current
approach relies on two separate processes to extract scale data from robot kinematics and monocular images,
which is complex and time-consuming. To overcome this, we aim to develop an end-to-end learning method
that can efficiently distill information from different modalities. Secondly, the use of the NeRF technique to
represent the 3D geometry requires significant computational resources and training time, making real-time
rendering and reconstruction challenging. To tackle this issue, we plan to investigate more efficient neural rep-
resentations, such as 3D gaussian, which can be integrated into our method to enhance efficiency for real-time
application. Furthermore, while the kinematics information provided by rigid robots is relatively accurate and
has minimal noise, flexible surgical robots can only provide rough and inaccurate kinematics data. Currently,
our framework does not account for errors in robot kinematics during scale recovery. In future work, we
intend to design an optimization module that can jointly utilize the translation and rotation components of
the poses from robot kinematics and visual data. Additionally, we aim to collect more multi-modal data from
different surgical scenes to thoroughly evaluate the performance of our method.
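As a purely illustrative sketch of how scale recovery might be made tolerant to noisy kinematics (this is not the joint optimization module proposed above), a robust estimator can replace a least-squares fit with the median of per-frame scale ratios, so that a few corrupted kinematics readings from a flexible robot do not bias the recovered scale:

```python
import math
import statistics

def robust_scale(kin_translations, sfm_translations):
    """Median of per-frame ratios ||t_kin|| / ||t_sfm||.
    Outlier kinematics frames shift a mean-based estimate but
    leave the median largely unaffected."""
    ratios = []
    for t_kin, t_sfm in zip(kin_translations, sfm_translations):
        n_kin = math.hypot(*t_kin)
        n_sfm = math.hypot(*t_sfm)
        if n_sfm > 1e-9:  # skip near-static frames with no baseline
            ratios.append(n_kin / n_sfm)
    return statistics.median(ratios)

# Four clean frames at true scale 2.0 plus one corrupted kinematics
# reading (20x too large): the median still recovers 2.0.
kin = [(2, 0, 0), (0, 2, 0), (2, 0, 0), (0, 2, 0), (20, 0, 0)]
sfm = [(1, 0, 0), (0, 1, 0), (1, 0, 0), (0, 1, 0), (1, 0, 0)]
print(robust_scale(kin, sfm))  # → 2.0
```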
5. CONCLUSION
In this paper, we introduce a novel NeRF-based pipeline that enables scale-aware monocular reconstruction
with limited robotic endoscope data. It requires neither large-scale medical image datasets nor ground-truth labels for network training. We first integrate the scale information extracted from kinematics and learning-based coarse
depth supervised by SfM into the optimization process of NeRF, resulting in absolute depth estimation. Then,
3D models with a real scale of tissue surfaces are reconstructed by fusing refined absolute depth maps. We also
evaluate the pipeline on SCARED data to demonstrate its accuracy and efficiency. In the future, more robotic
endoscope data will be collected to validate our pipeline. The reconstructed scale-aware 3D structures will be
utilized for automatic navigation systems in various robotic surgeries, including ENT surgery.

