general computer vision. Sophisticated architecture and training design, or strong prior knowledge, are required for transformers to achieve performance comparable to CNNs. Transformers’ success relies on large-scale datasets and the ability to transfer learned representations to data from similar domains. Thus, the lack of annotated surgical videos and the domain gap between surgical video and natural video, or among different surgeries, hamper the development of ViT-based methods in surgical scenes. The Segment Anything Model (SAM) [66], trained on 1 billion masks and 11 million images, is capable of generating instance-level segmentation
based on spatial or semantic prompts in an open-vocabulary manner. SAM has ushered segmentation and detection into a new era: instead of training a model from scratch, prompting or fine-tuning SAM [141-144] has become the backbone for most tasks with a limited number of annotations. SAM provides a promising direction for segmentation and detection in surgical scenes. AdaptiveSAM [145] designed an adaptor, and SurgicalSAM [146] trained class-specific prompts for fine-tuning SAM. However, the
domain gap between the training data of SAM and surgical videos remains an obstacle. A study [147] shows that SAM has surprising zero-shot generalization ability but cannot precisely capture surgical scenes, and its robustness is still questionable, as it suffers significant performance degradation under many corruptions.
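To make the prompting workflow concrete, below is a minimal sketch of point-prompted inference with the publicly released segment_anything package. The checkpoint filename, the placeholder frame, and the click coordinates are illustrative assumptions, not details from the surgical adaptations surveyed above.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM checkpoint (ViT-H backbone; the path is an assumption).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# `frame` stands in for an HxWx3 RGB surgical video frame (uint8).
frame = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(frame)

# A single foreground click, e.g., on an instrument tip (hypothetical point).
point_coords = np.array([[320, 240]])
point_labels = np.array([1])  # 1 = foreground, 0 = background

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,  # return several candidate masks
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```

Adaptation approaches such as AdaptiveSAM and SurgicalSAM modify parts of this pipeline, e.g., learning class-specific prompt embeddings instead of relying on user clicks.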
There are also methods designed for 3D segmentation or detection [148,149]. However, either their input representation is not visible-light imaging, such as point clouds [150] or images including depth [151], or they require a 3D reconstruction from monocular [152] or multi-view [153] 2D images as a prerequisite for success. These are all beyond the scope of this section.
Video object segmentation and tracking
With the introduction of strong computation resources and large-scale video-based annotation, the success
of image-based segmentation and detection has led to the rise of video object segmentation and tracking.
Video object segmentation and tracking aim to segment or track objects across the video sequence, starting from initial segmentation or detection results, while maintaining their identities. The initial status can be given by an image-based segmentation or detection algorithm, or through manual annotation. Providing geometric understanding similar to that of instance-based segmentation and detection methods, video object segmentation and tracking are vital for updating geometric understanding in a dynamic scene.
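Schematically, the task reduces to propagating the initial status frame by frame while keeping identities fixed. The sketch below assumes a generic `segment` model (a hypothetical callable standing in for any propagation network):

```python
def track_video(frames, initial_mask, segment):
    """Propagate an initial object mask through a video (schematic sketch).

    `segment` is any propagation model, e.g., a memory-based network; it
    predicts the current frame's mask from the frame and past predictions.
    """
    masks = [initial_mask]
    for frame in frames[1:]:
        masks.append(segment(frame, masks))  # identity is carried by position
    return masks
```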
The development of video object segmentation algorithms mainly focuses on exploring the extraction and
aggregation of spatial and temporal features for segmentation propagation and refinement. Temporal
features are first explored with test-time online fine-tuning techniques [154,155] and recurrent approaches [156,157] .
However, tracking under occlusions is challenging without spatial context. To address this, space-time
memory [158-160] incorporates both spatial and temporal context with a running memory maintained for
information aggregation. Exploiting the transformer architecture’s robust ability to deal with sequential
data, such as videos, video ViT-based methods [161,162] have also showcased favorable results. Cutie [163] incorporates a memory mechanism, transformer architecture, and feature fusion, achieving state-of-the-art results in video object segmentation.
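As an illustration of the space-time memory idea behind these methods, the sketch below implements a memory read as dot-product attention between the current frame's key features and the keys/values stored from past frames; the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def memory_read(query_key, memory_keys, memory_values):
    """Space-time memory read as dot-product attention (minimal sketch).

    query_key:     (C, H, W)    key features of the current frame
    memory_keys:   (C, T, H, W) keys of T past frames kept in memory
    memory_values: (D, T, H, W) values carrying mask/appearance information
    Returns:       (D, H, W)    memory readout aligned to the current frame
    """
    C, H, W = query_key.shape
    D = memory_values.shape[0]
    q = query_key.reshape(C, H * W)        # (C, N), N query locations
    k = memory_keys.reshape(C, -1)         # (C, M), M = T*H*W memory locations
    v = memory_values.reshape(D, -1)       # (D, M)
    affinity = k.t() @ q / (C ** 0.5)      # (M, N) scaled similarity
    weights = F.softmax(affinity, dim=0)   # normalize over memory locations
    return (v @ weights).reshape(D, H, W)  # weighted sum of memory values
```

Because the memory grows as frames are processed, objects that disappear behind occlusions can be re-identified from their earlier appearances.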
The progression of video object tracking has witnessed the emergence of various task variants, such as video
object detection [164-167], single object tracking [168-170], and multi-object tracking [171-173]. Traditional algorithms rely on sophisticated pipelines with handcrafted feature descriptors and optical flow for appearance modeling, motion modeling based on smoothness assumptions, object interaction and occlusion handling, probabilistic inference, and deterministic optimization [174]. Applying neural networks simplifies the pipeline.
One line of effort focuses on modeling temporal feature flow or temporal feature aggregation [164-170] for better feature-level correspondence between frames. The other focuses on improving object-level correspondence in multi-object scenarios [171-173].
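Object-level correspondence is typically cast as a frame-to-frame assignment problem. The sketch below shows IoU-based association with the Hungarian algorithm, in the spirit of classic tracking-by-detection rather than any specific method surveyed here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match existing track boxes to new detection boxes by maximizing IoU."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols)
               if 1.0 - cost[r, c] >= iou_threshold]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

In a full tracker, unmatched detections spawn new tracks, while unmatched tracks are aged out or kept alive through short occlusions.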

