Table 1. Summary of datasets

Segmentation and detection: EndoVis [50,51], Lap. I2I Translation [65], Sinus-Surgery-C/L [58], CholecSeg8k [59], CaDIS [53], RobustMIS [56], Kvasir-Instrument [60], AutoLaparo [61], SurgToolLoc [64], SAR-RARP50 [52], SurT [54], STIR [55], SegSTRONG-C [57]
Depth estimation: EndoSLAM [67], SCARED [68], Rectified Hamlyn [70], JHU Nasal Cavity [41], Arthronet [71], AIST Colonoscopy [72], JHU Colonoscopy [73]
3D reconstruction: JIGSAW [75], Lin et al. [74], Hamlyn [69], Rectified Hamlyn [70], EndoSLAM [67], JHU Nasal Cavity [41], JHU Colonoscopy [73]
Pose estimation: SurgRIPE [76], Laparoscopic Non-robotic Dataset [77], 3dStool [78], dVPose [79], Edinburgh Simulated Surgical Tools Dataset [80], Synthetic Surgical Needle 6DoF Pose Datasets [81], POV-Surgery [83], i3PosNet [84]
segmentation methods started using convolutional neural networks (CNNs) and adopted an end-to-end
training regime.
In the general vision domain, segmentation was initially explored as a semantic segmentation task in the 2D space. As the target labels were drawn from a fixed number of candidate classes, the output could be represented in the same multidimensional-array format as the input, making the encoder-decoder-based FCN [91] a perfect fit for the task. Subsequent works [92-95] improved performance by refining the feature aggregation techniques within the encoder-decoder architecture.
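To make the input-output correspondence concrete, the following is a minimal FCN-style encoder-decoder sketch in PyTorch; the layer sizes and class count are illustrative assumptions, not the original FCN configuration:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal encoder-decoder for semantic segmentation (illustrative only)."""
    def __init__(self, num_classes: int = 21):
        super().__init__()
        # Encoder: downsample spatially while increasing channel depth.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsample back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output is a (B, num_classes, H, W) array: same spatial layout as the input.
        return self.decoder(self.encoder(x))

logits = TinyFCN()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 21, 64, 64]) -- per-pixel class scores
```

Because the prediction is a dense array with the same spatial layout as the input, the whole model trains end-to-end with a per-pixel classification loss, which is exactly why the encoder-decoder design fits semantic segmentation so naturally.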
Object detection methods, unlike semantic segmentation, must generate rectangular bounding boxes for an arbitrary number of detected objects, each encoded as a vector of continuous center location and size values that represent the object's spatial extent. This output representation, which accommodates an arbitrary number of detected objects, distinguishes instances from each other. As a consequence, the simple encoder-decoder architecture no longer meets the requirement. Instead, the region-proposal detection pipeline represented by the R-CNN series [96-98], together with YOLO [99]/SSD [100], led to the rapid emergence and development of two different paradigms of CNN-based object detection methods: the two-stage paradigm [101-103] and the one-stage paradigm [104-108].
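The variable-length output format can be sketched with an off-the-shelf two-stage detector from torchvision; the model choice, input size, and printed fields below are our illustrative assumptions, not a method evaluated in this survey:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two-stage detector: a region-proposal network feeds per-region heads.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)          # dummy RGB frame in [0, 1]
with torch.no_grad():
    (pred,) = model([image])             # one dict per input image

# The number of detections varies per image. torchvision returns each box
# as corner coordinates (x1, y1, x2, y2); the center/size encoding mentioned
# above is an equivalent parameterization used inside many detectors.
print(pred["boxes"].shape, pred["labels"].shape, pred["scores"].shape)
```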
The development paths of detection and segmentation converged with the need for instance-level segmentation. Instance segmentation requires a method not only to generate class labels for pixels but also to distinguish instances from each other. Instance segmentation methods [109-112] that adopt the two-stage paradigm follow the "detect then segment" pipeline led by Mask R-CNN [109]. Single-stage methods [113-116] do not need to generate bounding boxes first. The EndoVis instrument segmentation and scene segmentation challenges [50,51] provide surgery-specific benchmarks, and methods [117-122] targeting surgical videos/images have been proposed. These methods adopt architectures from the general computer vision domain, and some explore feature fusion mechanisms and data-inspired training techniques to better suit training on surgical data. Meanwhile, given the high-stakes nature of surgical procedures, the robustness of segmentation and detection methods is also an aspect that receives attention [2,56,123-125].
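A minimal sketch of the "detect then segment" output, using torchvision's Mask R-CNN implementation; the weights and confidence threshold are assumptions for illustration:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# "Detect then segment": boxes are predicted first, then a mask head
# predicts a per-instance mask inside each detected box.
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)
with torch.no_grad():
    (pred,) = model([image])

keep = pred["scores"] > 0.5              # assumed confidence threshold
masks = pred["masks"][keep]              # (N, 1, H, W) per-instance soft masks
boxes = pred["boxes"][keep]              # (N, 4) matching detections
print(masks.shape, boxes.shape)
```

Coupling each mask to a detected box is what lets the two-stage pipeline separate instances, which a plain per-pixel class map cannot do.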
With the rise of vision transformers (ViT), ViT-based methods for segmentation and detection have received more attention. While numerous task-specific architectures have been introduced for semantic segmentation [126,127] and object detection [128-132], universal architectures/methods have also been explored [133,134]. The transformer architecture's strong ability to handle large-scale vision and language data, together with industry efforts to collect immense amounts of data, led to the rise of foundation models. CLIP [135] ignited the development of vision-language models [136-138] for segmentation and detection that are pre-trained on large-scale paired data. These models enable few-shot or zero-shot learning for specific visual tasks.
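A minimal zero-shot classification sketch with CLIP via the Hugging Face transformers API; the checkpoint name and prompt texts are assumptions for illustration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed public checkpoint; any CLIP variant pre-trained on paired
# image-text data works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))     # stand-in for a surgical frame
texts = ["a photo of a surgical grasper", "a photo of a needle driver"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

# Zero-shot: the class set is defined by the prompts, not by training labels.
print(logits.softmax(dim=-1))
```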
However, the switch from CNNs to ViT in the medical domain [139,140] has not been as successful as in

