

Table 2. Summarization of methods

Segmentation and detection: EFFNet[117], Bamba et al.[118], Cerón et al.[119], Wang et al.[120], Zhang et al.[121], Yang et al.[122], CaRTS[2], TC-CaRTS[123], Colleoni et al.[124], daVinci GAN[125], AdaptiveSAM[145], SurgicalSAM[146], Stenmark et al.[179], Cheng et al.[180], Zhao et al.[181,186], Robu et al.[182], Lee et al.[183], Seenivasan et al.[95], García-Peraza-Herrera et al.[184], Jo et al.[185], Alshirbaji et al.[187]

Depth estimation: Hannah[194], Marr and Poggio[192], Arnold[196], Okutomi and Kanade[197], Szeliski and Coughlan[198], Bouguet and Perona[188], Iddan and Yahav[189], Torralba and Oliva[191], Mueller-Richter et al.[201], Stoyanov et al.[195], Nayar et al.[190], Lo et al.[199], Sinha et al.[203], Liu et al.[206], Bogdanova et al.[202], Sinha et al.[200], Visentini-Scarzanella et al.[207], Mahmood et al.[211], Zhan et al.[209], Liu et al.[204], Guo et al.[212], Wong and Soatto[216], Chen et al.[213], Li et al.[205], Liu et al.[210], Schreiber et al.[214], Tong et al.[215], Widya et al.[217], Ozyoruk et al.[67], Allan et al.[68], Hwang et al.[219], Szeliski[193], Oda et al.[208], Shao et al.[218], Li et al.[220], Tukra and Giannarou[222], Ali and Pandey[71], Masuda et al.[221], Zhao et al.[223], Han et al.[224], Yang et al.[225], Zhang et al.[226]

3D reconstruction: DynamicFusion[250], Lin et al.[243], Song et al.[244], Zhou and Jagadeesan[245], Widya et al.[242], Li et al.[251], Wei et al.[247], EMDQ-SLAM[246], E-DSSR[236], EndoNeRF[248], EndoSurf[249], Nguyen et al.[253], Mangulabnan et al.[252]

Pose estimation: Hein et al.[254], Félix et al.[255], TAToo[256], Allan et al.[259], Kadkhodamohammadi et al.[265], Padoy[266], Kadkhodamohammadi et al.[267]

Applications: FIVRS[272], Ishida et al.[273], Ishida et al.[274], Sahu et al.[275], Twin-S[7], Shi et al.[278], Poletti et al.[279], Aubert et al.[280], Hernigou et al.[281]
adjacent pixels. Point-based representations provide accurate positions for each point. They also allow explicit connections to be established between points to form boundaries, improving rendering efficiency and providing more geometric constraints for interaction and interpretation. Thus, polygon mesh representation has been widely used in 3D surface reconstruction and digital modeling. However, it is computationally less efficient, and the sparsity of the points still limits accuracy. Latent space representation, while offering the possibility of dimensionality reduction (ruled encoding) for efficiency and high-level semantic incorporation (neural encoding), lacks accuracy due to information loss and reliability due to limited interpretability and generalizability. Functional representations offer a different approach, representing geometric information through mathematical constraints or mappings. Ruled functions have the advantages of processing efficiency, easy interactivity, and quantized interpretability; however, they lack the ability to represent complex surfaces. Neural fields[48,49], on the other hand, leverage neural networks' universal approximation ability and demonstrate remarkable representation capabilities. This enables efficient, continuous 3D representation of complex and dynamic surgical scenes, which has made them a popular topic. However, the use of black-box networks sacrifices interactivity and interpretability. Since no single geometric representation is optimal for all cases, current practice requires the user to choose a data representation based on these trade-offs and the task-specific priorities. For tasks that require robust performance in all aspects, future work could explore novel data representations that fuse sparse representations and neural fields, achieving better surface representation with a lower computational load.
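To make the neural field idea concrete, the following is a minimal sketch, assuming PyTorch; all class and variable names here are illustrative and do not come from any of the cited works. It encodes a surface as the zero level set of a learned signed distance function, fitted to sparse surface samples with an eikonal regularizer:

```python
# Minimal sketch of a coordinate-based neural field (illustrative only;
# not the implementation of any surveyed method). Assumes PyTorch.
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Lifts xyz coordinates to sinusoids so an MLP can fit fine detail."""

    def __init__(self, num_freqs: int = 6):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 3)
        angles = x[..., None] * self.freqs                # (N, 3, F)
        enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return enc.flatten(start_dim=-2)                  # (N, 6 * F)


class NeuralSDF(nn.Module):
    """MLP f(x) -> signed distance; the surface is the zero level set."""

    def __init__(self, num_freqs: int = 6, hidden: int = 128):
        super().__init__()
        self.enc = PositionalEncoding(num_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(6 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 3)
        return self.mlp(self.enc(x)).squeeze(-1)          # (N,) SDF values


# Fit the continuous field to sparse surface samples (e.g., a reconstructed
# point cloud); an eikonal term keeps the field a valid distance function.
field = NeuralSDF()
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
surface_pts = torch.rand(1024, 3) * 2 - 1  # placeholder for real scan points
for _ in range(200):
    opt.zero_grad()
    rand_pts = (torch.rand(1024, 3) * 2 - 1).requires_grad_(True)
    grad = torch.autograd.grad(field(rand_pts).sum(), rand_pts,
                               create_graph=True)[0]
    eikonal = ((grad.norm(dim=-1) - 1.0) ** 2).mean()     # enforce |∇f| ≈ 1
    loss = field(surface_pts).abs().mean() + 0.1 * eikonal
    loss.backward()
    opt.step()
```

Once trained, such a field can be queried continuously at any resolution, e.g., by evaluating it on a dense grid and extracting the zero level set with marching cubes; this continuity is what makes neural fields attractive for deformable surgical scenes, at the cost of the interpretability issues noted above.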

Geometric scene understanding has made immense strides in the general computer vision domain. However, its progress in the surgical domain is hindered by multiple factors. Firstly, limited data availability and annotations have become a major roadblock to adapting advanced but data-hungry architectures such as ViT[233] from the computer vision domain. This significantly impacts the accuracy and reliability of segmentation and detection, which are prerequisites for the success of DT. Although self-supervised techniques based on stereo matching exist that may exempt depth estimation from the lack of annotations, the stability[287] of training needs careful attention.
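The sketch below makes this self-supervised signal concrete, assuming PyTorch and a rectified stereo pair; function and variable names are illustrative only and are not taken from reference [287] or any other cited work. A predicted left-view disparity warps the right image into the left view, and the photometric reconstruction error supervises the disparity network without any depth annotations:

```python
# Minimal sketch of self-supervised depth from stereo matching, assuming
# PyTorch and a rectified stereo pair. Names are illustrative only.
import torch
import torch.nn.functional as F


def warp_right_to_left(right: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Sample the right image at x - d(x) to synthesize the left view.

    right: (B, 3, H, W) rectified right image; disp: (B, 1, H, W) in pixels.
    """
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=right.dtype),
        torch.arange(w, dtype=right.dtype),
        indexing="ij",
    )
    xs = xs.expand(b, h, w) - disp.squeeze(1)    # shift by predicted disparity
    grid = torch.stack(                           # normalize coords to [-1, 1]
        [2 * xs / (w - 1) - 1, 2 * ys.expand(b, h, w) / (h - 1) - 1], dim=-1
    )
    return F.grid_sample(right, grid, align_corners=True,
                         padding_mode="border")


def photometric_loss(left, right, disp):
    """L1 error between the left image and the warped right image; this is
    the label-free training signal for the disparity/depth predictor."""
    return (left - warp_right_to_left(right, disp)).abs().mean()


# Toy usage: in practice `disp` is the output of a depth/disparity network.
left = torch.rand(1, 3, 64, 80)
right = torch.roll(left, shifts=-4, dims=-1)      # synthetic 4-pixel baseline
disp = torch.full((1, 1, 64, 80), 4.0, requires_grad=True)
photometric_loss(left, right, disp).backward()    # gradients reach `disp`
```

The instability arises in part because textureless and specular regions, common in endoscopic views, leave the photometric error ambiguous over many disparities; smoothness priors and left-right consistency terms are typical mitigations.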
While efforts have been made to bridge the gap in data scale between computer vision and the surgical domain through synthetic data generation and sim-to-real generalization techniques[288,289], this direction also poses challenges due to the limited interpretability of neural networks. Secondly, the complexity of the surgical scene, including non-static organs and deformable tissues, poses another major challenge when updating the DT models relies solely on pose estimation, under the assumption that the 3D models are rigid. Although dynamic 3D reconstruction methods exist[248,249], they