to contend with significant occlusions such as those persistent in laparoscopic video from robotic-assisted minimally invasive surgery (RAMIS)[236].
3D reconstruction
Going a step beyond recognition, segmentation, and depth estimation, 3D reconstruction aims to generate explicit geometric information about a scene. In contrast to the depth estimation described above, where the distance between the object and the camera is encoded as a per-pixel value, 3D reconstructed scenes are represented either using discrete representations (point clouds or meshes) or continuous representations (neural fields). In the visible light domain, 3D reconstruction refers to the intraoperative 3D reconstruction of surgical scenes, including anatomical tissues and surgical instruments. While it has traditionally been employed to reconstruct static tissues and organs, novel techniques have recently been introduced for the 3D reconstruction of deformable tissues and for updating preoperative 3D models based on intraoperative anatomical changes. Since most preoperative imaging modalities, such as CT and MRI, are 3D, intraoperative 3D reconstruction enables 3D-3D registration[41,210,238-241]. This makes real-time visible
light imaging-based 3D reconstruction a key geometric understanding task that can aid surgical navigation,
surgeon-centered augmented reality, and virtual reality[236].
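To make the discrete case concrete, the sketch below back-projects a per-pixel depth map into a point cloud using the pinhole camera model. The depth values and intrinsics (fx, fy, cx, cy) are illustrative placeholders, not parameters from any cited system; in practice the intrinsics come from camera calibration.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map into an (N, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Placeholder depth map (metres) and intrinsics, for illustration only.
depth = np.random.uniform(0.05, 0.15, size=(480, 640))
cloud = depth_to_point_cloud(depth, fx=540.0, fy=540.0, cx=320.0, cy=240.0)
print(cloud.shape)  # (N, 3)
```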
3D reconstruction methods often use multiple images, acquired either simultaneously or over time, to reconstruct a 3D model of the scene. Conventional reconstruction methods that estimate 3D structures from multiple 2D images include Structure from Motion (SfM)[242] and Simultaneous Localization and Mapping (SLAM)[243-247]. Similar to stereo depth estimation techniques, these methods fundamentally rely on motion parallax, the apparent displacement of objects across different image/camera viewpoints, to accurately estimate the 3D structure of the scene. One of the necessary tasks in estimating structure from motion
is finding the correspondence between the different 2D images. Geometric information processing plays a
key role in detecting and tracking features to establish correspondence. Such feature detection techniques
include the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF). Alternatively, as
visible light imaging-based surgical procedures are often equipped with stereo cameras, the use of depth
estimation for the reconstruction of surgical scenes has also been reported[248]. Utilizing camera pose information, SLAM-based methods allow surgical scene reconstruction by fusing the depth information in 3D space[244-246]. Although SfM and SLAM have shown promising performance in the
natural computer vision domain, their application in the surgical domain has been limited, in part due to
the paucity of features in the limited field of view. Additionally, these techniques assume the scene to be static and rigid, an assumption that does not hold for surgical scenes, where tissues and organs undergo deformation. Low-light imaging conditions, the presence of bodily fluids, occlusions caused by instrument movements, and specular reflections further degrade 3D reconstruction quality.
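As an illustration of the correspondence step, the sketch below uses OpenCV's SIFT implementation to match features between two frames and recover the relative camera motion from the resulting parallax. The frame paths and intrinsics are assumed placeholders, and a full SfM/SLAM pipeline would add triangulation, bundle adjustment, and (for SLAM) loop closure, none of which are shown here.

```python
import cv2
import numpy as np

# Two endoscopic frames from different viewpoints (paths are placeholders).
img1 = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Detect scale-invariant keypoints and compute SIFT descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep matches that pass Lowe's ratio test.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# 2D correspondences between the two views.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Assumed pinhole intrinsics; real values come from camera calibration.
K = np.array([[540.0, 0.0, 320.0],
              [0.0, 540.0, 240.0],
              [0.0, 0.0, 1.0]])

# Recover relative camera motion from the correspondences (RANSAC rejects
# outliers); triangulating pts1/pts2 with the two poses would then yield
# the sparse 3D structure of the scene.
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
```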
Discrete representation methods benefit from their sparsity, which improves the efficiency of surface reconstruction. However, the same property also makes them less robust in handling complex, high-dimensional changes (non-topological deformations and color changes) that are the norm in surgical scenes due to instrument-tissue interactions[249]. To address deformations in tissue structures to an extent, sparse warp fields have been introduced in SuPer[250] and E-DSSR[251]. Novel techniques are also being explored for updating preoperative CT models based on the soft tissue deformations and ablations observed through intraoperative endoscopic imaging[236]. Unlike discrete representation methods[248,249,253], emerging methods employ continuous representations, introduced with the Neural Radiance Field (NeRF)[252], to reconstruct deformable tissues. Using a space-time input, the complex geometry and appearance are implicitly modeled to achieve high-quality 3D reconstruction[248,249].
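For intuition, a minimal sketch of such a continuous space-time representation is given below, assuming PyTorch. It maps an (x, y, z, t) query to color and volume density with a single MLP; published systems such as EndoNeRF additionally use positional encoding, separate deformation and canonical fields, and differentiable volume rendering, none of which are shown here.

```python
import torch
import torch.nn as nn

class SpaceTimeRadianceField(nn.Module):
    """Toy continuous scene representation: an MLP mapping a space-time
    query (x, y, z, t) to color and volume density. Layer sizes are
    illustrative only."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # outputs (r, g, b, sigma)
        )

    def forward(self, xyzt: torch.Tensor):
        out = self.mlp(xyzt)
        rgb = torch.sigmoid(out[..., :3])  # color constrained to [0, 1]
        sigma = torch.relu(out[..., 3:])   # non-negative volume density
        return rgb, sigma

# Query the field at 1,024 random space-time points.
field = SpaceTimeRadianceField()
rgb, sigma = field(torch.rand(1024, 4))
```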
EndoNeRF[248] employed two neural fields, where one is trained for tissue deformation and the other is trained for canonical density and appearance. It represents the deformable surgical scene as canonical

