

to contend with significant occlusions such as those persistent in laparoscopic video from robotic-assisted minimally invasive surgery (RAMIS)[236].
3D reconstruction
Going a step beyond recognition, segmentation, and depth estimation, 3D reconstruction aims to generate explicit geometric information about a scene. In contrast to the depth estimation explained above, where the distance between the object and the camera is stored as a per-pixel value, 3D reconstructed scenes are represented either with discrete representations (point clouds or meshes) or continuous representations (neural fields). In the visible light domain, 3D reconstruction refers to the intraoperative 3D reconstruction of surgical scenes, including anatomical tissues and surgical instruments. While it has traditionally been employed to reconstruct static tissues and organs, novel techniques have recently been introduced for the 3D reconstruction of deformable tissues and for updating preoperative 3D models based on intraoperative anatomical changes. Since most preoperative imaging modalities, such as CT and MRI, are 3D, intraoperative 3D reconstruction enables 3D-3D registration[41,210,238-241]. This makes real-time visible light imaging-based 3D reconstruction a key geometric understanding task that can aid surgical navigation, surgeon-centered augmented reality, and virtual reality[236].
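To make the contrast between per-pixel depth and an explicit discrete representation concrete, the following minimal NumPy sketch back-projects a depth map into a point cloud under a standard pinhole camera model. The intrinsics and the synthetic depth values are illustrative placeholders, not taken from any system cited here.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an H x W depth map (metres) into an N x 3 point
    cloud using the pinhole camera model. fx, fy are focal lengths in
    pixels; (cx, cy) is the principal point."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy          # Y = (v - cy) * Z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth

# Example: a synthetic 480 x 640 depth map at a uniform 5 cm
depth = np.full((480, 640), 0.05)
cloud = depth_to_point_cloud(depth, fx=520.0, fy=520.0, cx=320.0, cy=240.0)
print(cloud.shape)  # (307200, 3)
```

A point cloud produced this way is the typical input to the 3D-3D registration step mentioned above, since it lives in the same metric space as a surface extracted from preoperative CT or MRI.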

3D reconstruction methods often use multiple images, acquired either simultaneously or at different times, to reconstruct a 3D model of the scene. Conventional reconstruction methods that estimate 3D structure from multiple 2D images include Structure from Motion (SfM)[242] and Simultaneous Localization and Mapping (SLAM)[243-247]. Similar to stereo depth estimation techniques, these methods fundamentally rely on motion parallax, the apparent displacement of objects across different image/camera viewpoints, to accurately estimate the 3D structure of the scene. One of the necessary tasks in estimating structure from motion is finding correspondences between the different 2D images. Geometric information processing plays a key role in detecting and tracking features to establish correspondence; such feature detection techniques include the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF). Alternatively, as visible light imaging-based surgical procedures are often equipped with stereo cameras, the use of depth estimation for the reconstruction of surgical scenes has also been reported[248]. Utilizing camera pose information, SLAM-based methods enable surgical scene reconstruction by fusing depth information in 3D space[244-246]. Although SfM and SLAM have shown promising performance in the natural computer vision domain, their application in the surgical domain has been limited, in part due to the paucity of distinctive features within the limited field of view. Additionally, these techniques assume the scene to be static and rigid, which is not ideal for surgical scenes, where tissues and organs undergo deformation. Low-light imaging conditions, the presence of bodily fluids, occlusions from instrument movement, and specular reflections further degrade 3D reconstruction quality.
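As an illustration of the correspondence step that SfM and SLAM pipelines depend on, the sketch below matches SIFT features between two frames using OpenCV with a brute-force matcher and Lowe's ratio test. The frame filenames and the 0.75 ratio threshold are illustrative assumptions, not parameters from any surgical pipeline cited here.

```python
import cv2

# Load two frames of an endoscopic sequence (paths are placeholders)
img1 = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute descriptors in both frames
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors with k-nearest neighbours, then apply Lowe's
# ratio test to discard ambiguous correspondences
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(des1, des2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

# The surviving matches are the 2D-2D correspondences that SfM/SLAM
# pipelines triangulate into 3D structure
pts1 = [kp1[m.queryIdx].pt for m in good]
pts2 = [kp2[m.trainIdx].pt for m in good]
print(f"{len(good)} putative correspondences")
```

The scarcity of such matches on smooth, specular, deforming tissue is precisely why these classical pipelines struggle in the surgical domain.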


Discrete representation methods benefit from their sparsity, which improves the efficiency of surface generation. However, the same property makes them less robust to complex, high-dimensional changes such as non-topological deformations and color changes, which are the norm in surgical scenes due to instrument-tissue interactions[249]. To address tissue deformations to an extent, sparse warp fields[250] have been introduced in SuPer[251] and E-DSSR[236]. Novel techniques are also being explored for updating preoperative CT models based on the soft tissue deformations and ablations observed through intraoperative endoscopic imaging[252]. Unlike discrete representation methods[248,249,253], emerging methods now employ continuous representations, introduced with the Neural Radiance Field (NeRF), to reconstruct deformable tissues. Using spatio-temporal inputs, the complex geometry and appearance are implicitly modeled to achieve high-quality 3D reconstruction[248,249]. EndoNeRF[248] employed two neural fields, one trained for tissue deformation and the other for canonical density and appearance, representing the deformable surgical scene as a canonical field warped by a time-dependent deformation field.
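The following PyTorch sketch illustrates the general two-field design described above; it is a simplified illustration, not the authors' actual architecture. A deformation MLP warps a sampled point at time t into the canonical frame, and a canonical MLP returns density and color there. Layer widths are arbitrary, positional encoding of the inputs is omitted for brevity, and all names are assumptions.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Maps a 3D point and a timestamp to a displacement that warps
    the point into the canonical frame."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xyz, t):
        return self.mlp(torch.cat([xyz, t], dim=-1))  # displacement

class CanonicalField(nn.Module):
    """Maps a canonical-frame point and a view direction to volume
    density and RGB color, as in a standard radiance field."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (density, r, g, b)
        )

    def forward(self, xyz_canonical, view_dir):
        out = self.mlp(torch.cat([xyz_canonical, view_dir], dim=-1))
        sigma = torch.relu(out[..., :1])    # non-negative density
        rgb = torch.sigmoid(out[..., 1:])   # colors in [0, 1]
        return sigma, rgb

# Query a batch of ray samples at time t: warp first, then shade
deform, canonical = DeformationField(), CanonicalField()
xyz = torch.rand(1024, 3)           # sampled points along rays
t = torch.full((1024, 1), 0.3)      # normalized timestamp
view = torch.randn(1024, 3)
view = view / view.norm(dim=-1, keepdim=True)
sigma, rgb = canonical(xyz + deform(xyz, t), view)
```

In a full system, these per-sample densities and colors would be composited along each ray by volume rendering and the two networks optimized jointly against the observed endoscopic frames.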