Page 126 Ding et al. Art Int Surg 2024;4:109-38 https://dx.doi.org/10.20517/ais.2024.16
Table 2. Summary of methods

Segmentation and detection: EFFNet [117], Bamba et al. [118], Cerón et al. [119], Wang et al. [120], Zhang et al. [121], Yang et al. [122], CaRTS [2], TC-CaRTS [123], Colleoni et al. [124], daVinci GAN [125], AdaptiveSAM [145], SurgicalSAM [146], Stenmark et al. [179], Cheng et al. [180], Zhao et al. [181,186], Robu et al. [182], Lee et al. [183], Seenivasan et al. [95], García-Peraza-Herrera et al. [184], Jo et al. [185], Alshirbaji et al. [187]

Depth estimation: Hannah [194], Marr and Poggio [192], Arnold [196], Okutomi and Kanade [197], Szeliski and Coughlan [198], Bouguet and Perona [188], Iddan and Yahav [189], Torralba and Oliva [191], Mueller-Richter et al. [201], Stoyanov et al. [195], Nayar et al. [190], Lo et al. [199], Sinha et al. [203], Liu et al. [206], Bogdanova et al. [202], Sinha et al. [200], Visentini-Scarzanella et al. [207], Mahmood et al. [211], Zhan et al. [209], Liu et al. [204], Guo et al. [212], Wong and Soatto [216], Chen et al. [213], Li et al. [205], Liu et al. [210], Schreiber et al. [214], Tong et al. [215], Widya et al. [217], Ozyoruk et al. [67], Allan et al. [68], Hwang et al. [219], Szeliski [193], Oda et al. [208], Shao et al. [218], Li et al. [220], Tukra and Giannarou [222], Ali and Pandey [71], Masuda et al. [221], Zhao et al. [223], Han et al. [224], Yang et al. [225], Zhang et al. [226]

3D reconstruction: DynamicFusion [250], Lin et al. [243], Song et al. [244], Zhou and Jagadeesan [245], Widya et al. [242], Li et al. [251], Wei et al. [247], EMDQ-SLAM [246], E-DSSR [236], EndoNeRF [248], EndoSurf [249], Nguyen et al. [253], Mangulabnan et al. [252]

Pose estimation: Hein et al. [254], Félix et al. [255], Tatoo [256], Allan et al. [259], Kadkhodamohammadi et al. [265], Padoy [266], Kadkhodamohammadi et al. [267]

Applications: FIVRS [272], Ishida et al. [273], Ishida et al. [274], Sahu et al. [275], Twin-S [7], Shi et al. [278], Poletti et al. [279], Aubert et al. [280], Hernigou et al. [281]
adjacent pixels. Point-based representations provide accurate positions for each point and allow explicit connections to be established between points to form boundaries, improving rendering efficiency and providing more geometric constraints for interaction and interpretation. Thus, polygon mesh representation has been widely used in 3D surface reconstruction and digital modeling. However, it is computationally less efficient, and the sparsity of the points still limits accuracy. Latent space representation, while offering dimensionality reduction (ruled encoding) for efficiency and high-level semantic incorporation (neural encoding), lacks accuracy due to information loss and lacks reliability due to limited interpretability and generalizability. Functional representations offer a new approach to representing geometric information through mathematical constraints or mappings. Ruled functions have the advantages of processing efficiency, easy interactivity, and quantized interpretability, but they lack the ability to represent complex surfaces. On the other hand, neural fields [48,49], leveraging the universal approximation ability of neural networks, demonstrate remarkable representation capabilities. This enables efficient, continuous 3D representation of complex and dynamic surgical scenes, which makes them a popular topic. However, the use of black-box networks sacrifices interactivity and interpretability. Since no single geometric representation is optimal for all cases, the user currently has to choose a data representation based on these trade-offs and the task-specific priority. For tasks that require robust performance in all aspects, future work could explore novel data representations that fuse sparse representations and neural fields, to achieve better surface representation with a lower computational load.
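To make the contrast between point-based and functional (ruled) representations concrete, the following minimal NumPy sketch (a hypothetical illustration, not code from any surveyed work) encodes the same unit sphere both ways: a sampled point cloud gives exact positions at a sparse set of points, while a signed-distance function is continuous and queryable anywhere, but only exists for shapes with a closed-form expression.

```python
import numpy as np

# 1) Explicit point-based representation: sample surface points of a unit sphere.
#    Each point has an accurate position, but the surface is only known sparsely.
rng = np.random.default_rng(0)
v = rng.normal(size=(1000, 3))
points = v / np.linalg.norm(v, axis=1, keepdims=True)  # (N, 3) point cloud

# 2) Functional (ruled) representation: a signed-distance function (SDF).
#    Compact and continuous, but limited to surfaces with a closed form.
def sphere_sdf(x):
    """Signed distance to the unit sphere: negative inside, positive outside."""
    return np.linalg.norm(x, axis=-1) - 1.0

# The implicit surface can be queried at any location and resolution.
queries = np.array([[0.0, 0.0, 0.0],   # center: distance -1
                    [2.0, 0.0, 0.0]])  # outside: distance +1
print(sphere_sdf(queries))  # [-1.  1.]
```

A neural field replaces the hand-written `sphere_sdf` with a coordinate-based network, trading this closed-form interpretability for the ability to fit complex, deformable surfaces.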
Geometric scene understanding has made immense strides in the general computer vision domain. However, its progress in the surgical domain is hindered by multiple factors. Firstly, limited data availability and annotations have become a major roadblock in adapting advanced but data-hungry architectures such as ViT from the computer vision domain. This significantly impacts the accuracy and reliability of segmentation and detection, which are prerequisites for the success of DT. Although self-supervised techniques based on stereo matching [233] may free depth estimation from the lack of annotations, the stability of their training needs careful attention. While efforts have been made to bridge the gap in data scale between computer vision and the surgical domain through synthetic data generation [287] and sim-to-real generalization techniques [288,289], this direction also poses challenges due to the limited interpretability of neural networks. Secondly, the complexity of the surgical scene, including non-static organs and deformable tissues, poses another major challenge when updating the DT models relies solely on pose estimation, under the assumption that 3D models are rigid. Although dynamic 3D reconstruction methods exist [248,249], they

