
general computer vision. Sophisticated architecture and training design, or strong prior knowledge, are
required for transformers to achieve performance comparable to CNNs. Transformers' success relies on
large-scale datasets and the ability to transfer learning from data in similar domains. Thus, the lack of
annotated surgical videos and the domain gap between surgical video and natural video, or among different
surgeries, hamper the development of ViT-based methods in surgical scenes. The Segment Anything Model
(SAM)[66], trained on 1 billion masks and 11 million images, is capable of generating instance-level
segmentation based on spatial or semantic prompts in an open-vocabulary manner. SAM has ushered
segmentation and detection into a new era. Instead of training a model from scratch, prompting or
fine-tuning SAM[141-144] has become the backbone of most tasks with a limited number of annotations. SAM
provides a promising direction for segmentation and detection in surgical scenes. AdaptiveSAM[145]
designed an adapter, and SurgicalSAM[146] trained class-specific prompts for fine-tuning SAM. However, the
domain gap between the training data of SAM and surgical videos remains an obstacle. A study[147] shows
that SAM has surprising zero-shot generalization ability but cannot precisely capture surgical scenes, and
its robustness is still questionable, as it suffers significant performance degradation under many
corruptions.
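
For illustration, the following is a minimal sketch of zero-shot point prompting with the publicly
released segment-anything package; the checkpoint path, frame file, and click coordinates are
placeholders rather than settings from the surveyed works.

```python
# Minimal sketch: zero-shot prompting of a pretrained SAM with a single
# foreground point, using Meta's open-source segment-anything package.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a ViT-B SAM checkpoint (placeholder path; downloaded separately).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# Read one surgical video frame (placeholder file); SAM expects RGB uint8.
frame = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# A single positive click on the instrument (label 1 = foreground).
point = np.array([[320, 240]])  # illustrative pixel coordinates
label = np.array([1])
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring proposal
```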

There are also methods designed for 3D segmentation or detection[148,149]. However, either their input
representation is not visible-light imaging, such as point clouds[150] or images including depth[151], or
they require a 3D reconstruction from monocular[152] or multi-view[153] 2D images as a prerequisite for
success. These are beyond the scope of this section.
Video object segmentation and tracking
               Video object segmentation and tracking
With the introduction of strong computational resources and large-scale video-based annotation, the success
of image-based segmentation and detection has led to the rise of video object segmentation and tracking.
Video object segmentation and tracking aim to segment or track objects across a video sequence, starting
from initial segmentation or detection results, while maintaining their identities. The initial state can
be given by an image-based segmentation or detection algorithm, or through manual annotation. By providing
geometric understanding similar to that of instance-based segmentation and detection methods, video object
segmentation and tracking are vital for updating geometric understanding in a dynamic scene.


The development of video object segmentation algorithms mainly focuses on the extraction and aggregation
of spatial and temporal features for segmentation propagation and refinement. Temporal features were first
explored with test-time online fine-tuning techniques[154,155] and recurrent approaches[156,157]. However,
tracking under occlusions is challenging without spatial context. To address this, space-time
memory[158-160] incorporates both spatial and temporal context with a running memory maintained for
information aggregation. Exploiting the transformer architecture's robust ability to deal with sequential
data such as videos, video ViT-based methods[161,162] have also shown favorable results. Cutie[163]
incorporates a memory mechanism, a transformer architecture, and feature fusion, achieving state-of-the-art
results in video object segmentation.
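
For illustration, the following is a minimal sketch of the memory-read operation underlying space-time
memory approaches: key features of the current frame attend over keys of past frames, and the resulting
affinities aggregate mask-aware value features. The tensor shapes and dot-product affinity are
simplifications, not the exact formulation of any single cited method.

```python
# Minimal sketch of the memory read at the core of space-time memory
# networks; an illustrative simplification of the published designs.
import torch
import torch.nn.functional as F

def memory_read(query_key, memory_key, memory_value):
    """query_key:    (B, C_k, H*W)    key features of the current frame
       memory_key:   (B, C_k, T*H*W)  keys of past frames kept in memory
       memory_value: (B, C_v, T*H*W)  values (mask-aware features) in memory
       returns:      (B, C_v, H*W)    value readout for the current frame"""
    # Pairwise affinity between every query location and every memory location.
    affinity = torch.einsum("bck,bcm->bkm", query_key, memory_key)
    affinity = F.softmax(affinity / query_key.shape[1] ** 0.5, dim=-1)
    # Aggregate memory values with the normalized affinities.
    return torch.einsum("bkm,bcm->bck", affinity, memory_value)

# Toy shapes: batch 1, 64-d keys, 512-d values, 2 memory frames of a 30x40 grid.
q = torch.randn(1, 64, 1200)
mk = torch.randn(1, 64, 2400)
mv = torch.randn(1, 512, 2400)
readout = memory_read(q, mk, mv)  # (1, 512, 1200)
```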

The progression of video object tracking has witnessed the emergence of various task variants, such as
video object detection[164-167], single object tracking[168-170], and multi-object tracking[171-173].
Traditional algorithms rely on sophisticated pipelines with handcrafted feature descriptors and optical
flow for appearance modeling, motion modeling based on smoothness assumptions, object interaction and
occlusion handling, probabilistic inference, and deterministic optimization[174]. Applying neural networks
simplifies the pipeline. One line of effort focuses on modeling temporal feature flow or temporal feature
aggregation[164-170] for better feature-level correspondence between frames; the other focuses on
improving object-level correspondence in multi-object scenarios[171-173].
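
For illustration, the following is a minimal sketch of the object-level association step common to
tracking-by-detection pipelines: current detections are matched to existing tracks by maximizing box IoU
with the Hungarian algorithm. The IoU threshold and (x1, y1, x2, y2) box format are illustrative choices,
not taken from the cited works.

```python
# Minimal sketch of IoU-based track-detection association via the
# Hungarian algorithm, in the spirit of tracking-by-detection pipelines.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, min_iou=0.3):
    """Return (track_idx, det_idx) pairs whose IoU exceeds min_iou."""
    if not tracks or not detections:
        return []
    # Cost matrix: 1 - IoU, so minimizing cost maximizes overlap.
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]

# Toy example: two existing tracks matched against two new detections.
tracks = [[100, 100, 200, 200], [300, 50, 380, 160]]
dets = [[310, 60, 385, 170], [105, 98, 205, 210]]
print(associate(tracks, dets))  # [(0, 1), (1, 0)]
```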