
Table 1. Summarization of datasets

Category                     Dataset
Segmentation and detection   EndoVis [50,51], Lap. I2I Translation [65], Sinus-Surgery-C/L [58], CholecSeg8k [59], CaDIS [53], RobustMIS [56], Kvasir-Instrument [60], AutoLaparo [61], SurgToolLoc [64], SAR-RARP50 [52], SurT [54], STIR [55], SegSTRONG-C [57]
Depth estimation             EndoSLAM [67], SCARED [68], Rectified Hamlyn [70], JHU Nasal Cavity [41], Arthronet [71], AIST Colonoscopy [72], JHU Colonoscopy [73]
3D reconstruction            JIGSAW [75], Lin et al. [74], Hamlyn [69], Rectified Hamlyn [70], EndoSLAM [67], JHU Nasal Cavity [41], JHU Colonoscopy [73]
Pose estimation              SurgRIPE [76], Laparoscopic Non-robotic Dataset [77], 3dStool [78], dVPose [79], Edinburgh Simulated Surgical Tools Dataset [80], Synthetic Surgical Needle 6DoF Pose Datasets [81], POV-Surgery [83], i3PosNet [84]

               segmentation methods started using convolutional neural networks (CNNs) and adopted an end-to-end
               training regime.

In the general vision domain, segmentation was initially explored as a semantic segmentation task in 2D space. As the target labels were generated from a fixed number of candidate classes, the output could be represented in the same multidimensional-array format as the input, making the encoder-decoder-based FCN [91] a perfect fit for the task. Subsequent works [92-95] improved performance by refining the feature aggregation techniques within the encoder-decoder architecture.
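
As a brief illustration of this input-output symmetry, the following PyTorch sketch (our own, not from the cited works; the layer widths and class count are arbitrary) downsamples a frame with a convolutional encoder, then upsamples back to a dense grid of per-pixel class logits of the same spatial shape as the input:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes: int = 8):  # e.g., instrument part classes
        super().__init__()
        # Encoder: downsample 4x while growing the channel dimension.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: upsample back to input resolution; the output is a dense
        # multidimensional array of class logits, one vector per pixel.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, num_classes, 2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

frame = torch.randn(1, 3, 256, 448)   # stand-in for one surgical video frame
logits = TinyFCN()(frame)             # -> (1, num_classes, 256, 448)
labels = logits.argmax(dim=1)         # per-pixel class predictions
```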

Object detection methods, unlike semantic segmentation, require generating rectangular bounding boxes for an arbitrary number of detected objects, each in the format of a vector of continuous center location and size that represents the object's spatial extent. Because this output representation must encompass an arbitrary number of detected objects and distinguish instances from each other, the simple encoder-decoder architecture no longer meets the requirement. Instead, detection pipelines based on region proposals, represented by the R-CNN series [96-98], and proposal-free pipelines, represented by YOLO [99]/SSD [100], led to the rapid emergence and development of two different paradigms of CNN-based object detection methods: the two-stage paradigm [101-103] and the one-stage paradigm [104-108].
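
As a toy illustration of this output representation (our own sketch, with invented values), each detection below is a continuous (cx, cy, w, h) vector, and a frame may carry any number of such vectors:

```python
import torch

def cxcywh_to_xyxy(boxes: torch.Tensor) -> torch.Tensor:
    """Convert center-size box vectors (cx, cy, w, h) to corner form (x1, y1, x2, y2)."""
    cx, cy, w, h = boxes.unbind(-1)
    return torch.stack((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), dim=-1)

# Two detected instruments in one frame; another frame could yield any number
# of rows, which is why a fixed-size dense output map does not suffice.
detections = torch.tensor([[100.0,  80.0, 40.0, 24.0],
                           [210.0, 150.0, 60.0, 30.0]])
corners = cxcywh_to_xyxy(detections)   # -> tensor of shape (2, 4)
```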

The development paths for detection and segmentation converged with the need for instance-level segmentation. Instance segmentation requires a method to not only generate class labels for pixels but also distinguish instances from each other. Instance segmentation methods [109-112] that adopt the two-stage paradigm follow the "detect then segment" pipeline led by Mask R-CNN [109], while single-stage methods [113-116] are free from generating bounding boxes first.
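
A minimal sketch of the "detect then segment" flow, using torchvision's off-the-shelf Mask R-CNN (our example, not one of the cited surgical methods; the pretrained weights cover COCO classes, so real surgical use would require fine-tuning):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
frame = torch.rand(3, 480, 640)        # stand-in for a surgical frame in [0, 1]
with torch.no_grad():
    (pred,) = model([frame])           # one output dict per input image

# Stage 1 output: boxes that distinguish instances; stage 2 adds a
# per-instance segmentation mask for each detected box.
keep = pred["scores"] > 0.5
boxes = pred["boxes"][keep]            # (N, 4) corner coordinates
masks = pred["masks"][keep]            # (N, 1, H, W) per-instance masks
```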
The EndoVis instrument segmentation and scene segmentation challenges [50,51] provide surgery-specific benchmarks, and methods [117-122] targeting surgical videos/images have been proposed. These methods adopt architectures from the general computer vision domain, and some explore feature fusion mechanisms and data-inspired training techniques to better suit training on surgical data. Meanwhile, due to the high-stakes nature of surgical procedures, the robustness of segmentation and detection methods also receives attention [2,56,123-125].

With the rise of vision transformers (ViT), ViT-based methods for segmentation and detection have received more attention. While numerous task-specific architectures have been introduced for semantic segmentation [126,127] and object detection [128-132], universal architectures/methods have also been explored [133,134]. The transformer architecture's strong ability to handle large-scale vision and language data, together with industry efforts to collect immense amounts of data, led to the rise of foundation models. CLIP [135], pre-trained on large-scale paired image-text data, ignited the development of vision-language models [136-138] for segmentation and detection. These models enable few-shot or zero-shot learning for specific visual tasks.
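
For instance, here is a sketch of CLIP-style zero-shot recognition via the Hugging Face transformers API (our example; the surgical prompt strings and file name are invented purely for illustration):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a surgical grasper", "a photo of surgical scissors"]
image = Image.open("frame.png")        # hypothetical surgical frame on disk
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

# Image and text are embedded in a shared space; the similarity scores act as
# class logits, so no task-specific training is needed (zero-shot).
probs = model(**inputs).logits_per_image.softmax(dim=-1)
```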
However, the switch from CNNs to ViT in the medical domain [139,140] has not been as successful as in